linux/kernel
Steven Rostedt 43fa5460fe sched: Try not to migrate higher priority RT tasks
When first working on the RT scheduler design, we concentrated on
keeping all CPUs running RT tasks instead of having multiple RT
tasks on a single CPU waiting for the migration thread to move
them. Instead we take a more proactive stance and push or pull RT
tasks from one CPU to another on wakeup or scheduling.

When an RT task wakes up on a CPU that is running another RT task,
instead of preempting it and killing the cache of the running RT
task, we look to see if we can migrate the RT task that is waking
up, even if the RT task waking up is of higher priority.

This may sound a bit odd, but RT tasks should be limited in
migration by the user anyway. But in practice, people do not do
this, which causes high prio RT tasks to bounce around the CPUs.
This becomes even worse when we have priority inheritance, because
a high prio task can block on a lower prio task and boost its
priority. When the lower prio task wakes up the high prio task, if
it happens to be on the same CPU it will migrate off of it.

But in reality, the above does not happen much either, because the
wake up of the lower prio task, which has already been boosted, if
it was on the same CPU as the higher prio task, it would then
migrate off of it. But anyway, we do not want to migrate them
either.

To examine the scheduling, I created a test program and examined it
under kernelshark. The test program created CPU * 2 threads, where
each thread had a different priority. The program takes different
options. The options used in this change log was to have priority
inheritance mutexes or not.

All threads did the following loop:

static void grab_lock(long id, int iter, int l)
{
	ftrace_write("thread %ld iter %d, taking lock %d\n",
		     id, iter, l);
	pthread_mutex_lock(&locks[l]);
	ftrace_write("thread %ld iter %d, took lock %d\n",
		     id, iter, l);
	busy_loop(nr_tasks - id);
	ftrace_write("thread %ld iter %d, unlock lock %d\n",
		     id, iter, l);
	pthread_mutex_unlock(&locks[l]);
}

void *start_task(void *id)
{
	[...]
	while (!done) {
		for (l = 0; l < nr_locks; l++) {
			grab_lock(id, i, l);
			ftrace_write("thread %ld iter %d sleeping\n",
				     id, i);
			ms_sleep(id);
		}
		i++;
	}
	[...]
}

The busy_loop(ms) keeps the CPU spinning for ms milliseconds. The
ms_sleep(ms) sleeps for ms milliseconds. The ftrace_write() writes
to the ftrace buffer to help analyze via ftrace.

The higher the id, the higher the prio, the shorter it does the
busy loop, but the longer it spins. This is usually the case with
RT tasks, the lower priority tasks usually run longer than higher
priority tasks.

At the end of the test, it records the number of loops each thread
took, as well as the number of voluntary preemptions, non-voluntary
preemptions, and number of migrations each thread took, taking the
information from /proc/$$/sched and /proc/$$/status.

Running this on a 4 CPU processor, the results without changes to
the kernel looked like this:

Task        vol    nonvol   migrated     iterations
----        ---    ------   --------     ----------
  0:         53      3220       1470             98
  1:        562       773        724             98
  2:        752       933       1375             98
  3:        749        39        697             98
  4:        758         5        515             98
  5:        764         2        679             99
  6:        761         2        535             99
  7:        757         3        346             99

total:     5156       4977      6341            787

Each thread regardless of priority migrated a few hundred times.
The higher priority tasks, were a little better but still took
quite an impact.

By letting higher priority tasks bump the lower prio task from the
CPU, things changed a bit:

Task        vol    nonvol   migrated     iterations
----        ---    ------   --------     ----------
  0:         37      2835       1937             98
  1:        666      1821       1865             98
  2:        654      1003       1385             98
  3:        664       635        973             99
  4:        698       197        352             99
  5:        703       101        159             99
  6:        708         1         75             99
  7:        713         1          2             99

total:     4843       6594      6748            789

The total # of migrations did not change (several runs showed the
difference all within the noise). But we now see a dramatic
improvement to the higher priority tasks. (kernelshark showed that
the watchdog timer bumped the highest priority task to give it the
2 count. This was actually consistent with every run).

Notice that the # of iterations did not change either.

The above was with priority inheritance mutexes. That is, when the
higher prority task blocked on a lower priority task, the lower
priority task would inherit the higher priority task (which shows
why task 6 was bumped so many times). When not using priority
inheritance mutexes, the current kernel shows this:

Task        vol    nonvol   migrated     iterations
----        ---    ------   --------     ----------
  0:         56      3101       1892             95
  1:        594       713        937             95
  2:        625       188        618             95
  3:        628         4        491             96
  4:        640         7        468             96
  5:        631         2        501             96
  6:        641         1        466             96
  7:        643         2        497             96

total:     4458       4018      5870            765

Not much changed with or without priority inheritance mutexes. But
if we let the high priority task bump lower priority tasks on
wakeup we see:

Task        vol    nonvol   migrated     iterations
----        ---    ------   --------     ----------
  0:        115      3439       2782             98
  1:        633      1354       1583             99
  2:        652       919       1218             99
  3:        645       713        934             99
  4:        690         3          3             99
  5:        694         1          4             99
  6:        720         3          4             99
  7:        747         0          1            100

Which shows a even bigger change. The big difference between task 3
and task 4 is because we have only 4 CPUs on the machine, causing
the 4 highest prio tasks to always have preference.

Although I did not measure cache misses, and I'm sure there would
be little to measure since the test was not data intensive, I could
imagine large improvements for higher priority tasks when dealing
with lower priority tasks. Thus, I'm satisfied with making the
change and agreeing with what Gregory Haskins argued a few years
ago when we first had this discussion.

One final note. All tasks in the above tests were RT tasks. Any RT
task will always preempt a non RT task that is running on the CPU
the RT task wants to run on.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Gregory Haskins <ghaskins@novell.com>
LKML-Reference: <20100921024138.605460343@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-09-21 13:57:12 +02:00
..
debug Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-09-08 11:13:42 -07:00
gcov gcov: fix null-pointer dereference for certain module types 2010-09-09 18:57:23 -07:00
irq irq: Add new IRQ flag IRQF_NO_SUSPEND 2010-07-29 13:24:57 +02:00
power Merge branch 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6 2010-09-11 15:50:53 -07:00
time time: Workaround gcc loop optimization that causes 64bit div errors 2010-08-13 12:03:24 -07:00
trace Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-09-10 07:31:24 -07:00
.gitignore
Kconfig.freezer
Kconfig.hz
Kconfig.locks mutex: Better control mutex adaptive spinning config 2009-12-03 11:50:11 +01:00
Kconfig.preempt
Makefile Merge branch 'for-linus' of git://git.infradead.org/users/eparis/notify 2010-08-10 11:39:13 -07:00
acct.c pass a struct path to vfs_statfs 2010-08-09 16:48:42 -04:00
async.c async: use workqueue for worker pool 2010-07-14 11:29:46 +02:00
audit.c Merge branch 'for-linus' of git://git.infradead.org/users/eparis/notify 2010-08-10 11:39:13 -07:00
audit.h Audit: split audit watch Kconfig 2010-07-28 09:58:19 -04:00
audit_tree.c fanotify: use both marks when possible 2010-07-28 10:18:55 -04:00
audit_watch.c Revert "fsnotify: store struct file not struct path" 2010-08-12 14:23:04 -07:00
auditfilter.c audit: do not get and put just to free a watch 2010-07-28 09:58:17 -04:00
auditsc.c vfs: add helpers to get root and pwd 2010-08-11 00:28:20 -04:00
backtracetest.c
bounds.c kbuild: move bounds.h to include/generated 2009-12-12 13:08:14 +01:00
capability.c sched: Remove remaining USER_SCHED code 2010-04-02 20:12:00 +02:00
cgroup.c cgroups: fix API thinko 2010-09-09 18:57:23 -07:00
cgroup_freezer.c Freezer / cgroup freezer: Update stale locking comments 2010-05-10 23:18:47 +02:00
compat.c compat: Make compat_alloc_user_space() incorporate the access_ok() 2010-09-14 16:08:45 -07:00
configs.c
cpu.c sched: adjust when cpu_active and cpuset configurations are updated during cpu on/offlining 2010-06-08 21:40:36 +02:00
cpuset.c Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-08-06 09:39:22 -07:00
cred.c Add a dummy printk function for the maintenance of unused printks 2010-08-12 09:51:35 -07:00
delayacct.c
dma.c
early_res.c kmemleak: Add support for NO_BOOTMEM configurations 2010-07-19 11:54:15 +01:00
elfcore.c elf coredump: add extended numbering support 2010-03-06 11:26:46 -08:00
exec_domain.c sys_personality: remove the bogus checks in sys_personality()->__set_personality() path 2010-08-09 20:45:05 -07:00
exit.c Fix unprotected access to task credentials in waitid() 2010-08-17 18:07:43 -07:00
extable.c
fork.c mm: make the vma list be doubly linked 2010-08-21 08:49:21 -07:00
freezer.c
futex.c futex: futex_find_get_task remove credentails check 2010-06-30 15:43:44 -07:00
futex_compat.c futex: Protect pid lookup in compat code with RCU 2009-12-09 14:22:14 +01:00
groups.c kernel/groups.c: fix integer overflow in groups_search 2010-09-09 18:57:24 -07:00
hrtimer.c gcc-4.6: kernel/*: Fix unused but set warnings 2010-09-05 14:36:58 +02:00
hung_task.c softlockup: Fix hung_task_check_count sysctl 2009-11-27 06:21:57 +01:00
hw_breakpoint.c Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-08-06 09:30:52 -07:00
itimer.c
kallsyms.c kdb: core for kgdb back end (2 of 2) 2010-05-20 21:04:21 -05:00
kexec.c kexec: return -EFAULT on copy_to_user() failures 2010-08-11 08:59:22 -07:00
kfifo.c kfifo: implement missing __kfifo_skip_r() 2010-08-20 09:34:54 -07:00
kmod.c Make do_execve() take a const filename pointer 2010-08-17 18:07:43 -07:00
kprobes.c kprobes: Move enable/disable_kprobe() out from debugfs code 2010-05-08 18:08:30 +02:00
ksysfs.c sysfs: add struct file* to bin_attr callbacks 2010-05-21 09:37:31 -07:00
kthread.c kthread: implement kthread_data() 2010-06-29 10:07:09 +02:00
latencytop.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
lockdep.c sched_clock: Add local_clock() API and improve documentation 2010-06-09 10:34:49 +02:00
lockdep_internals.h lockdep: No need to disable preemption in debug atomic ops 2010-05-04 05:38:16 +02:00
lockdep_proc.c lockstat: Make lockstat counting per cpu 2010-04-06 00:15:37 +02:00
lockdep_states.h
module.c module: cleanup comments, remove noinline 2010-08-05 12:59:13 +09:30
mutex-debug.c
mutex-debug.h locking: Implement new raw_spinlock 2009-12-14 23:55:32 +01:00
mutex.c mutex: Fix annotations to include it in kernel-locking docbook 2010-09-03 08:19:51 +02:00
mutex.h
notifier.c sched: Use lockdep-based checking on rcu_dereference() 2010-02-25 10:34:26 +01:00
ns_cgroup.c
nsproxy.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
padata.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 2010-08-04 15:23:14 -07:00
panic.c lib/bug.c: add oops end marker to WARN implementation 2010-08-11 08:59:22 -07:00
params.c param: locking for kernel parameters 2010-08-11 23:04:20 +09:30
perf_event.c perf: Fix CPU hotplug 2010-09-09 20:38:52 +02:00
pid.c pids: alloc_pidmap: remove the unnecessary boundary checks 2010-08-11 08:59:20 -07:00
pid_namespace.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
pm_qos_params.c PM QoS: Correct pr_debug() misuse and improve parameter checks 2010-09-11 00:53:05 +02:00
posix-cpu-timers.c Merge branch 'writable_limits' of git://decibel.fi.muni.cz/~xslaby/linux 2010-08-10 12:07:51 -07:00
posix-timers.c posix_timer: Move copy_to_user(created_timer_id) down in timer_create() 2010-07-23 15:08:12 +02:00
printk.c gcc-4.6: printk: use stable variable to dump kmsg buffer 2010-08-09 20:45:06 -07:00
profile.c numa: in-kernel profiling: use cpu_to_mem() for per cpu allocations 2010-05-27 09:12:57 -07:00
ptrace.c ptrace: optimize exit_ptrace() for the likely case 2010-08-11 08:59:19 -07:00
range.c kernel/range: remove unused definition of ARRAY_SIZE() 2010-08-09 20:45:06 -07:00
rcupdate.c tree/tiny rcu: Add debug RCU head objects 2010-06-14 16:37:26 -07:00
rcutiny.c tree/tiny rcu: Add debug RCU head objects 2010-06-14 16:37:26 -07:00
rcutiny_plugin.h rcu: slim down rcutiny by removing rcu_scheduler_active and friends 2010-05-10 11:08:34 -07:00
rcutorture.c sched_clock: Add local_clock() API and improve documentation 2010-06-09 10:34:49 +02:00
rcutree.c tree/tiny rcu: Add debug RCU head objects 2010-06-14 16:37:26 -07:00
rcutree.h rcu: reduce the number of spurious RCU_SOFTIRQ invocations 2010-05-10 11:08:35 -07:00
rcutree_plugin.h rcu: remove all rcu head initializations, except on_stack initializations 2010-05-11 16:10:47 -07:00
rcutree_trace.c rcu: reduce the number of spurious RCU_SOFTIRQ invocations 2010-05-10 11:08:35 -07:00
relay.c kernel/: convert cpu notifier to return encapsulate errno value 2010-05-27 09:12:48 -07:00
res_counter.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
resource.c resource: shared I/O region support 2010-05-11 12:01:10 -07:00
rtmutex-debug.c sched: Convert pi_lock to raw_spinlock 2009-12-14 23:55:33 +01:00
rtmutex-debug.h
rtmutex-tester.c
rtmutex.c rtmutes: Convert rtmutex.lock to raw_spinlock 2009-12-14 23:55:33 +01:00
rtmutex.h
rtmutex_common.h
rwsem.c
sched.c Merge commit 'v2.6.36-rc5' into sched/core 2010-09-21 13:56:49 +02:00
sched_clock.c sched_clock: Add local_clock() API and improve documentation 2010-06-09 10:34:49 +02:00
sched_cpupri.c sched: No need for bootmem special cases 2010-07-17 12:06:22 +02:00
sched_cpupri.h sched: No need for bootmem special cases 2010-07-17 12:06:22 +02:00
sched_debug.c sched: Use correct macro to display sched_child_runs_first in /proc/sched_debug 2010-07-21 21:46:12 +02:00
sched_fair.c sched: Increment cache_nice_tries only on periodic lb 2010-09-21 13:57:11 +02:00
sched_features.h sched: Remove ASYM_GRAN feature 2010-03-11 18:32:53 +01:00
sched_idletask.c sched: Cure load average vs NO_HZ woes 2010-04-23 11:02:02 +02:00
sched_rt.c sched: Try not to migrate higher priority RT tasks 2010-09-21 13:57:12 +02:00
sched_stats.h sched: Remove the obsolete exit_state/signal hacks 2010-06-18 10:46:56 +02:00
seccomp.c
semaphore.c
signal.c CRED: Fix RCU warning due to previous patch fixing __task_cred()'s checks 2010-08-04 11:17:10 -07:00
smp.c kernel/: convert cpu notifier to return encapsulate errno value 2010-05-27 09:12:48 -07:00
softirq.c kernel/: fix BUG_ON checks for cpu notifier callbacks direct call 2010-06-04 15:21:45 -07:00
spinlock.c locking: Cleanup the name space completely 2009-12-14 23:55:33 +01:00
srcu.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
stacktrace.c
stop_machine.c stop_machine: struct cpu_stopper, remove alignment padding on 64 bits 2010-08-09 20:45:06 -07:00
sys.c pid: make setpgid() system call use RCU read-side critical section 2010-08-31 17:00:18 -07:00
sys_ni.c fanotify: sys_fanotify_mark declartion 2010-07-28 09:58:55 -04:00
sysctl.c gcc-4.6: kernel/*: Fix unused but set warnings 2010-09-05 14:36:58 +02:00
sysctl_binary.c sysctl: don't use own implementation of hex_to_bin() 2010-05-25 08:07:05 -07:00
sysctl_check.c ipv4 05/05: add sysctl to accept packets with local source addresses 2009-12-03 12:14:38 -08:00
taskstats.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
test_kprobes.c
time.c time: Kill off CONFIG_GENERIC_TIME 2010-07-27 12:40:54 +02:00
timeconst.pl
timer.c kernel/timer.c: fix kernel-doc function parameter warning 2010-08-10 15:33:09 -07:00
tracepoint.c tracing: Let tracepoints have data passed to tracepoint callbacks 2010-05-14 09:50:34 -04:00
tsacct.c mm: clean up mm_counter 2010-03-06 11:26:23 -08:00
uid16.c
up.c
user-return-notifier.c core: Clean up user return notifers use of per_cpu 2009-12-02 10:22:59 +01:00
user.c sched: Remove a stale comment 2010-05-10 08:48:39 +02:00
user_namespace.c user_ns: Introduce user_nsmap_uid and user_ns_map_gid. 2010-06-16 14:55:34 -07:00
utsname.c
utsname_sysctl.c
wait.c
watchdog.c lockup_detector: Sync touch_*_watchdog back to old semantics 2010-09-01 10:02:28 +02:00
workqueue.c workqueue: add documentation 2010-09-13 10:26:52 +02:00
workqueue_sched.h workqueue: implement concurrency managed dynamic worker pool 2010-06-29 10:07:14 +02:00