Same adustments that have been added to the ondemand recently.
Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Dave Jones <davej@redhat.com>
Dave,
Attached is an update of my patch against the cpufreq fixes branch.
Before applying the patch I compiled and booted the tree to see if the panic
was still there -- to my surprise it was not. This is because of the conversion
of cpufreq_cpu_governor to a char[].
While the panic is kaput, the problem of stale data continues and my patch is
still valid. It is possible to end up with the wrong governor after hotplug
events because CPUFREQ_DEFAULT_GOVERNOR is statically linked to a default,
while the cpu siblings may have had a different governor assigned by a user.
ie) the patch is still needed in order to keep the governors assigned
properly when hotplugging devices
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Dave Jones <davej@redhat.com>
ondemand and conservative governors are messing up time units in the
code path where NO_HZ is not enabled and ignore_nice is set. The walltime
idletime stored is in jiffies and nice time calculation is happening in
microseconds.
The problem was reported and diagnosed by Alexander here.
http://marc.info/?l=linux-kernel&m=125752550404513&w=2
The patch below fixes this thinko.
Reported-by: Alexander Miller <Miller@fmi.uni-stuttgart.de>
Tested-by: Alexander Miller <Miller@fmi.uni-stuttgart.de>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Dave Jones <davej@redhat.com>
Currently on governer backup/restore path we storing governor's pointer.
This is wrong because one may unload governor's module after cpu goes
offline. As result use-after-free will take place on restored cpu.
It is not easy to exploit this bug, but still we have to close this
issue ASAP. Issue was introduced by following commit
084f349394
##TESTCASE##
#!/bin/sh -x
modprobe acpi_cpufreq
# Any non default governor, in may case it is "ondemand"
modprobe cpufreq_ondemand
echo ondemand > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
rmmod acpi_cpufreq
rmmod cpufreq_ondemand
modprobe acpi_cpufreq # << use-after-free here.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Dave Jones <davej@redhat.com>
This patch updates percpu related symbols in cpufreq such that percpu
symbols are unique and don't clash with local symbols. This serves
two purposes of decreasing the possibility of global percpu symbol
collision and allowing dropping per_cpu__ prefix from percpu symbols.
* drivers/cpufreq/cpufreq.c: s/policy_cpu/cpufreq_policy_cpu/
* drivers/cpufreq/freq_table.c: s/show_table/cpufreq_show_table/
* arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c: s/drv_data/acfreq_data/
s/old_perf/acfreq_old_perf/
Partly based on Rusty Russell's "alloc_percpu: rename percpu vars
which cause name clashes" patch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/davej/cpufreq:
[CPUFREQ] Fix NULL ptr regression in powernow-k8
[CPUFREQ] Create a blacklist for processors that should not load the acpi-cpufreq module.
[CPUFREQ] Powernow-k8: Enable more than 2 low P-states
[CPUFREQ] remove rwsem lock from CPUFREQ_GOV_STOP call (second call site)
[CPUFREQ] ondemand - Use global sysfs dir for tuning settings
[CPUFREQ] Introduce global, not per core: /sys/devices/system/cpu/cpufreq
[CPUFREQ] Bail out of cpufreq_add_dev if the link for a managed CPU got created
[CPUFREQ] Factor out policy setting from cpufreq_add_dev
[CPUFREQ] Factor out interface creation from cpufreq_add_dev
[CPUFREQ] Factor out symlink creation from cpufreq_add_dev
[CPUFREQ] cleanup up -ENOMEM handling in cpufreq_add_dev
[CPUFREQ] Reduce scope of cpu_sys_dev in cpufreq_add_dev
[CPUFREQ] update Doc for cpuinfo_cur_freq and scaling_cur_freq
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (46 commits)
powerpc64: convert to dynamic percpu allocator
sparc64: use embedding percpu first chunk allocator
percpu: kill lpage first chunk allocator
x86,percpu: use embedding for 64bit NUMA and page for 32bit NUMA
percpu: update embedding first chunk allocator to handle sparse units
percpu: use group information to allocate vmap areas sparsely
vmalloc: implement pcpu_get_vm_areas()
vmalloc: separate out insert_vmalloc_vm()
percpu: add chunk->base_addr
percpu: add pcpu_unit_offsets[]
percpu: introduce pcpu_alloc_info and pcpu_group_info
percpu: move pcpu_lpage_build_unit_map() and pcpul_lpage_dump_cfg() upward
percpu: add @align to pcpu_fc_alloc_fn_t
percpu: make @dyn_size mandatory for pcpu_setup_first_chunk()
percpu: drop @static_size from first chunk allocators
percpu: generalize first chunk allocator selection
percpu: build first chunk allocators selectively
percpu: rename 4k first chunk allocator to page
percpu: improve boot messages
percpu: fix pcpu_reclaim() locking
...
Fix trivial conflict as by Tejun Heo in kernel/sched.c
remove rwsem lock from CPUFREQ_GOV_STOP call (second call site)
commit 42a06f2166
Missed a call site for CPUFREQ_GOV_STOP to remove the rwlock taken around the
teardown. To make a long story short, the rwlock write-lock causes a circular
dependency with cancel_delayed_work_sync(), because the timer handler takes the
read lock.
Note that all callers to __cpufreq_set_policy are taking the rwsem. All sysfs
callers (writers) hold the write rwsem at the earliest sysfs calling stage.
However, the rwlock write-lock is not needed upon governor stop.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
CC: rjw@sisk.pl
CC: mingo@elte.hu
CC: Shaohua Li <shaohua.li@intel.com>
CC: Pekka Enberg <penberg@cs.helsinki.fi>
CC: Dave Young <hidave.darkstar@gmail.com>
CC: "Rafael J. Wysocki" <rjw@sisk.pl>
CC: Rusty Russell <rusty@rustcorp.com.au>
CC: trenn@suse.de
CC: sven.wegener@stealer.net
CC: cpufreq@vger.kernel.org
Signed-off-by: Dave Jones <davej@redhat.com>
Ondemand has only global variables for userspace tunings via sysfs.
But they were exposed per CPU which wrongly implies to the user that
his settings are applied per cpu. Also locking sysfs against concurrent
access won't be necessary anymore after deprecation time.
This means the ondemand config dir is moved:
/sys/devices/system/cpu/cpu*/cpufreq/ondemand ->
/sys/devices/system/cpu/cpufreq/ondemand
The old files will still exist, but reading or writing to them will
result in one (printk_once) deprecation msg to syslog per file.
Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Dave Jones <davej@redhat.com>
Currently everything in the cpufreq layer is per core based.
This does not reflect reality, for example ondemand on conservative
governors have global sysfs variables.
Introduce a global cpufreq directory and add the kobject to the governor
struct, so that governors can easily access it.
The directory is initialized in the cpufreq_core_init initcall and thus will
always be created if cpufreq is compiled in, even if no cpufreq driver is
active later.
Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Dave Jones <davej@redhat.com>
Doing:
echo 0 >cpu1/online
echo 1 >cpu1/online
on a managed CPU will result in:
Jul 22 15:15:37 linux kernel: [ 80.013864] WARNING: at fs/sysfs/dir.c:487 sysfs_add_one+0xcf/0xe6()
Jul 22 15:15:37 linux kernel: [ 80.013866] Hardware name: To Be Filled By O.E.M.
Jul 22 15:15:37 linux kernel: [ 80.013868] sysfs: cannot create duplicate filename '/devices/system/cpu/cpu1/cpufreq'
Jul 22 15:15:37 linux kernel: [ 80.013870] Modules linked in: powernow_k8
Jul 22 15:15:37 linux kernel: [ 80.013874] Pid: 5750, comm: bash Not tainted 2.6.31-rc2 #40
Jul 22 15:15:37 linux kernel: [ 80.013876] Call Trace:
Jul 22 15:15:37 linux kernel: [ 80.013879] [<ffffffff8112ebda>] ? sysfs_add_one+0xcf/0xe6
Jul 22 15:15:37 linux kernel: [ 80.013884] [<ffffffff81041926>] warn_slowpath_common+0x77/0xa4
Jul 22 15:15:37 linux kernel: [ 80.013888] [<ffffffff810419a0>] warn_slowpath_fmt+0x3c/0x3e
Jul 22 15:15:37 linux kernel: [ 80.013891] [<ffffffff8112ebda>] sysfs_add_one+0xcf/0xe6
Jul 22 15:15:37 linux kernel: [ 80.013894] [<ffffffff8112f213>] create_dir+0x58/0x87
Jul 22 15:15:37 linux kernel: [ 80.013898] [<ffffffff8112f27a>] sysfs_create_dir+0x38/0x4f
Jul 22 15:15:37 linux kernel: [ 80.013902] [<ffffffff811ffb8a>] kobject_add_internal+0x11f/0x1de
Jul 22 15:15:37 linux kernel: [ 80.013905] [<ffffffff811ffd21>] kobject_add_varg+0x41/0x4e
Jul 22 15:15:37 linux kernel: [ 80.013908] [<ffffffff811ffd7a>] kobject_init_and_add+0x4c/0x57
Jul 22 15:15:37 linux kernel: [ 80.013913] [<ffffffff810667bc>] ? mark_lock+0x22/0x228
Jul 22 15:15:37 linux kernel: [ 80.013918] [<ffffffff813e8a3b>] cpufreq_add_dev_interface+0x40/0x1e4
...
This bug slipped in by git commit:
150b06f7f223cfd0f808737a5243cceca8ea47fa
When splitting up cpufreq_add_dev, the whole cpufreq_add_dev function
is not left anymore, only cpufreq_add_dev_policy.
This patch should reconstruct the identical functionality again as it
was before the split.
CC: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Dave Jones <davej@redhat.com>
Commit 4bc5d34135 is broken and causes regressions:
(1) cpufreq_driver->resume() and ->suspend() were only called on
__powerpc__, but you could set them on all architectures. In fact,
->resume() was defined and used before the PPC-related commit
42d4dc3f4e complained about in 4bc5d34135.
(2) Therfore, the resume functions in acpi_cpufreq and speedstep-smi
would never be called.
(3) This means speedstep-smi would be unusuable after suspend or resume.
The _real_ problem was calling cpufreq_driver->get() with interrupts
off, but it re-enabling interrupts on some platforms. Why is ->get()
necessary?
Some systems like to change the CPU frequency behind our
back, especially during BIOS-intensive operations like suspend or
resume. If such systems also use a CPU frequency-dependant timing loop,
delays might be off by large factors. Therefore, we need to ascertain
as soon as possible that the CPU frequency is indeed at the speed we
think it is. You can do this two ways: either setting it anew, or trying
to get it. The latter is what was done, the former also has the same IRQ
issue.
So, let's try something different: defer the checking to after interrupts
are re-enabled, by calling cpufreq_update_policy() (via schedule_work()).
Timings may be off until this later stage, so let's watch out for
resume regressions caused by the deferred handling of frequency changes
behind the kernel's back.
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Dave Jones <davej@redhat.com>
Conflicts:
arch/sparc/kernel/smp_64.c
arch/x86/kernel/cpu/perf_counter.c
arch/x86/kernel/setup_percpu.c
drivers/cpufreq/cpufreq_ondemand.c
mm/percpu.c
Conflicts in core and arch percpu codes are mostly from commit
ed78e1e078dd44249f88b1dd8c76dafb39567161 which substituted many
num_possible_cpus() with nr_cpu_ids. As for-next branch has moved all
the first chunk allocators into mm/percpu.c, the changes are moved
from arch code to mm/percpu.c.
Signed-off-by: Tejun Heo <tj@kernel.org>
The suspend code runs with interrupts disabled, and the powerpc workaround we
do in the cpufreq suspend hook calls the drivers ->get method.
powernow-k8's ->get does an smp_call_function_single
which needs interrupts enabled
cpufreq's suspend/resume code was added in 42d4dc3f4e to work around
a hardware problem on ppc powerbooks. If we make all this code
conditional on powerpc, we avoid the issue above.
Signed-off-by: Dave Jones <davej@redhat.com>
The first offline/online cycle is successful, the second not.
Doing:
echo 0 >cpu1/online
echo 1 >cpu1/online
echo 0 >cpu1/online
The last command will trigger:
Jul 22 14:39:50 linux kernel: [ 593.210125] ------------[ cut here ]------------
Jul 22 14:39:50 linux kernel: [ 593.210139] WARNING: at lib/kref.c:43 kref_get+0x23/0x2b()
Jul 22 14:39:50 linux kernel: [ 593.210144] Hardware name: To Be Filled By O.E.M.
Jul 22 14:39:50 linux kernel: [ 593.210148] Modules linked in: powernow_k8
Jul 22 14:39:50 linux kernel: [ 593.210158] Pid: 378, comm: kondemand/2 Tainted: G W 2.6.31-rc2 #38
Jul 22 14:39:50 linux kernel: [ 593.210163] Call Trace:
Jul 22 14:39:50 linux kernel: [ 593.210171] [<ffffffff812008e8>] ? kref_get+0x23/0x2b
Jul 22 14:39:50 linux kernel: [ 593.210181] [<ffffffff81041926>] warn_slowpath_common+0x77/0xa4
Jul 22 14:39:50 linux kernel: [ 593.210190] [<ffffffff81041962>] warn_slowpath_null+0xf/0x11
Jul 22 14:39:50 linux kernel: [ 593.210198] [<ffffffff812008e8>] kref_get+0x23/0x2b
Jul 22 14:39:50 linux kernel: [ 593.210206] [<ffffffff811ffa19>] kobject_get+0x1a/0x22
Jul 22 14:39:50 linux kernel: [ 593.210214] [<ffffffff813e815d>] cpufreq_cpu_get+0x8a/0xcb
Jul 22 14:39:50 linux kernel: [ 593.210222] [<ffffffff813e87d1>] __cpufreq_driver_getavg+0x1d/0x67
Jul 22 14:39:50 linux kernel: [ 593.210231] [<ffffffff813ea18f>] do_dbs_timer+0x158/0x27f
Jul 22 14:39:50 linux kernel: [ 593.210240] [<ffffffff810529ea>] worker_thread+0x200/0x313
...
The output continues on every do_dbs_timer ondemand freq checking poll.
This regression was introduced by git commit:
3f4a782b5c
The policy is released when the cpufreq device is removed in:
__cpufreq_remove_dev():
/* if this isn't the CPU which is the parent of the kobj, we
* only need to unlink, put and exit
*/
Not creating the symlink is not sever at all.
As long as:
sysfs_remove_link(&sys_dev->kobj, "cpufreq");
handles it gracefully that the symlink did not exist.
Possibly no error should be returned at all, because ondemand
governor would still provide the same functionality.
Userspace in userspace gov case might be confused if the link
is missing.
Resolves http://bugzilla.kernel.org/show_bug.cgi?id=13903
CC: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Dave Jones <davej@redhat.com>
Suspend/Resume fails on multi socket, multi core systems because the cpufreq
code erroneously sets the per_cpu policy_cpu value when a logical cpu is
offline.
This most notably results in missing sysfs files that are used to set the
cpu frequencies of the various cpus.
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Dave Jones <davej@redhat.com>
Commit ee88415caf
introduced this regression when it removed enable bit in cpu_dbs_info_s.
That added a possibility of dbs_cpufreq_notifier getting called for a
CPU that is not yet managed by conservative governor. That will happen
as the transition notifier is set as soon as one CPU switches to
conservative governor and other CPUs can get a NULL pointer dereference
without the enable bit check. Add the enable bit back again.
Reported-by: Lermytte Christophe <Christophe.Lermytte@thomson.net>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Dave Jones <davej@redhat.com>
OK, I've tried to clean it up the best I could, but please test this with
concurrent cpu hotplug and cpufreq add/remove in loops. I'm sure we will make
other interesting findings.
This is step one of fixing the overall locking dependency mess in cpufreq.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
CC: rjw@sisk.pl
CC: mingo@elte.hu
CC: Shaohua Li <shaohua.li@intel.com>
CC: Pekka Enberg <penberg@cs.helsinki.fi>
CC: Dave Young <hidave.darkstar@gmail.com>
CC: "Rafael J. Wysocki" <rjw@sisk.pl>
CC: Rusty Russell <rusty@rustcorp.com.au>
CC: sven.wegener@stealer.net
CC: cpufreq@vger.kernel.org
CC: Thomas Renninger <trenn@suse.de>
Signed-off-by: Dave Jones <davej@redhat.com>
Redesign the locking inside conservative driver. Make dbs_mutex handle all the
global state changes inside the driver and invent a new percpu mutex
to serialize percpu timer and frequency limit change.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Dave Jones <davej@redhat.com>
Redesign the locking inside ondemand driver. Make dbs_mutex handle all the
global state changes inside the driver and invent a new percpu mutex
to serialize percpu timer and frequency limit change.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Dave Jones <davej@redhat.com>
Commit b14893a62c although it was very
much needed to properly cleanup ondemand timer, opened-up a can of worms
related to locking dependencies in cpufreq.
Patch here defines the need for dbs_mutex and cleans up its usage in
ondemand governor. This also resolves the lockdep warnings reported here
http://lkml.indiana.edu/hypermail/linux/kernel/0906.1/01925.htmlhttp://lkml.indiana.edu/hypermail/linux/kernel/0907.0/00820.html
and few others..
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Dave Jones <davej@redhat.com>
Percpu variable definition is about to be updated such that all percpu
symbols including the static ones must be unique. Update percpu
variable definitions accordingly.
* as,cfq: rename ioc_count uniquely
* cpufreq: rename cpu_dbs_info uniquely
* xen: move nesting_count out of xen_evtchn_do_upcall() and rename it
* mm: move ratelimits out of balance_dirty_pages_ratelimited_nr() and
rename it
* ipv4,6: rename cookie_scratch uniquely
* x86 perf_counter: rename prev_left to pmc_prev_left, irq_entry to
pmc_irq_entry and nmi_entry to pmc_nmi_entry
* perf_counter: rename disable_count to perf_disable_count
* ftrace: rename test_event_disable to ftrace_test_event_disable
* kmemleak: rename test_pointer to kmemleak_test_pointer
* mce: rename next_interval to mce_next_interval
[ Impact: percpu usage cleanups, no duplicate static percpu var names ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: linux-mm <linux-mm@kvack.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Update the documentation accordingly.
Cleanup and use printk_once.
Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Dave Jones <davej@redhat.com>
With this patch you have following minimal sampling rate restrictions:
Kernel restrictions:
If CONFIG_NO_HZ is set, the limit is 10ms fixed.
If CONFIG_NO_HZ is not set or no_hz=off boot parameter is used, the
limits depend on the CONFIG_HZ option:
HZ=1000: min=20000us (20ms)
HZ=250: min=80000us (80ms)
HZ=100: min=200000us (200ms)
HW restrictions:
Do not sample/poll more often than HW latency * 100 exported by the low
level cpufreq HW driver
The higher value of above restrictions is the minimal sampling rate
that can be set (and can be seen via ondemand/sampling_rate_min sysfs file)
Default sampling rate still is HW latency * 1000, but this will now end
up in lower values on latest (Intel and AMD) hardware as these can switch
really fast and sampling rate mostly was limited to the 80ms or 200ms
(depending on whether HZ=250 or HZ=1000 is used).
Signed-off-by: Thomas Renninger <trenn@suse.de>
Cc: Pallipadi Venkatesh <venkatesh.pallipadi@intel.com>
Signed-off-by: Dave Jones <davej@redhat.com>
These are defined as static cpumask_var_t so if MAXSMP is not used,
they are cleared already. Avoid surprises when MAXSMP is enabled.
Signed-off-by: Yinghai Lu <yinghai.lu@kernel.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
* Rafael J. Wysocki (rjw@sisk.pl) wrote:
> This message has been generated automatically as a part of a report
> of regressions introduced between 2.6.28 and 2.6.29.
>
> The following bug entry is on the current list of known regressions
> introduced between 2.6.28 and 2.6.29. Please verify if it still should
> be listed and let me know (either way).
>
>
> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=13186
> Subject : cpufreq timer teardown problem
> Submitter : Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> Date : 2009-04-23 14:00 (24 days old)
> References : http://marc.info/?l=linux-kernel&m=124049523515036&w=4
> Handled-By : Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> Patch : http://patchwork.kernel.org/patch/19754/
> http://patchwork.kernel.org/patch/19753/
>
(updated changelog)
cpufreq fix timer teardown in ondemand governor
The problem is that dbs_timer_exit() uses cancel_delayed_work() when it should
use cancel_delayed_work_sync(). cancel_delayed_work() does not wait for the
workqueue handler to exit.
The ondemand governor does not seem to be affected because the
"if (!dbs_info->enable)" check at the beginning of the workqueue handler returns
immediately without rescheduling the work. The conservative governor in
2.6.30-rc has the same check as the ondemand governor, which makes things
usually run smoothly. However, if the governor is quickly stopped and then
started, this could lead to the following race :
dbs_enable could be reenabled and multiple do_dbs_timer handlers would run.
This is why a synchronized teardown is required.
The following patch applies to, at least, 2.6.28.x, 2.6.29.1, 2.6.30-rc2.
Depends on patch
cpufreq: remove rwsem lock from CPUFREQ_GOV_STOP call
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: gregkh@suse.de
CC: stable@kernel.org
CC: cpufreq@vger.kernel.org
CC: Ingo Molnar <mingo@elte.hu>
CC: rjw@sisk.pl
CC: Ben Slusky <sluskyb@paranoiacs.org>
Signed-off-by: Dave Jones <davej@redhat.com>
* Rafael J. Wysocki (rjw@sisk.pl) wrote:
> This message has been generated automatically as a part of a report
> of regressions introduced between 2.6.28 and 2.6.29.
>
> The following bug entry is on the current list of known regressions
> introduced between 2.6.28 and 2.6.29. Please verify if it still should
> be listed and let me know (either way).
>
>
> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=13186
> Subject : cpufreq timer teardown problem
> Submitter : Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> Date : 2009-04-23 14:00 (24 days old)
> References : http://marc.info/?l=linux-kernel&m=124049523515036&w=4
> Handled-By : Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> Patch : http://patchwork.kernel.org/patch/19754/
> http://patchwork.kernel.org/patch/19753/
>
(re-send with updated changelog)
cpufreq fix timer teardown in conservative governor
The problem is that dbs_timer_exit() uses cancel_delayed_work() when it should
use cancel_delayed_work_sync(). cancel_delayed_work() does not wait for the
workqueue handler to exit.
The ondemand governor does not seem to be affected because the
"if (!dbs_info->enable)" check at the beginning of the workqueue handler returns
immediately without rescheduling the work. The conservative governor in
2.6.30-rc has the same check as the ondemand governor, which makes things
usually run smoothly. However, if the governor is quickly stopped and then
started, this could lead to the following race :
dbs_enable could be reenabled and multiple do_dbs_timer handlers would run.
This is why a synchronized teardown is required.
Depends on patch
cpufreq: remove rwsem lock from CPUFREQ_GOV_STOP call
The following patch applies to 2.6.30-rc2. Stable kernels have a similar
issue which should also be fixed, but the code changed between 2.6.29
and 2.6.30, so this patch only applies to 2.6.30-rc.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: gregkh@suse.de
CC: stable@kernel.org
CC: cpufreq@vger.kernel.org
CC: Ingo Molnar <mingo@elte.hu>
CC: rjw@sisk.pl
CC: Ben Slusky <sluskyb@paranoiacs.org>
Signed-off-by: Dave Jones <davej@redhat.com>
* Rafael J. Wysocki (rjw@sisk.pl) wrote:
> This message has been generated automatically as a part of a report
> of regressions introduced between 2.6.28 and 2.6.29.
>
> The following bug entry is on the current list of known regressions
> introduced between 2.6.28 and 2.6.29. Please verify if it still should
> be listed and let me know (either way).
>
>
> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=13186
> Subject : cpufreq timer teardown problem
> Submitter : Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> Date : 2009-04-23 14:00 (24 days old)
> References : http://marc.info/?l=linux-kernel&m=124049523515036&w=4
> Handled-By : Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> Patch : http://patchwork.kernel.org/patch/19754/
> http://patchwork.kernel.org/patch/19753/
The patches linked above depend on the following patch to remove
circular locking dependency :
cpufreq: remove rwsem lock from CPUFREQ_GOV_STOP call
(the following issue was faced when using cancel_delayed_work_sync() in the
timer teardown (which fixes a race).
* KOSAKI Motohiro (kosaki.motohiro@jp.fujitsu.com) wrote:
> Hi
>
> my box output following warnings.
> it seems regression by commit 7ccc7608b836e58fbacf65ee4f8eefa288e86fac.
>
> A: work -> do_dbs_timer() -> cpu_policy_rwsem
> B: store() -> cpu_policy_rwsem -> cpufreq_governor_dbs() -> work
>
>
Hrm, I think it must be due to my attempt to fix the timer teardown race
in ondemand governor mixed with new locking behavior in 2.6.30-rc.
The rwlock seems to be taken around the whole call to
cpufreq_governor_dbs(), when it should be only taken around accesses to
the locked data, and especially *not* around the call to
dbs_timer_exit().
Reverting my fix attempt would put the teardown race back in place
(replacing the cancel_delayed_work_sync by cancel_delayed_work).
Instead, a proper fix would imply modifying this critical section :
cpufreq.c: __cpufreq_remove_dev()
...
if (cpufreq_driver->target)
__cpufreq_governor(data, CPUFREQ_GOV_STOP);
unlock_policy_rwsem_write(cpu);
To make sure the __cpufreq_governor() callback is not called with rwsem
held. This would allow execution of cancel_delayed_work_sync() without
being nested within the rwsem.
Applies on top of the 2.6.30-rc5 tree.
Required to remove circular dep in teardown of both conservative and
ondemande governors so they can use cancel_delayed_work_sync().
CPUFREQ_GOV_STOP does not modify the policy, therefore this locking seemed
unneeded.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Greg KH <greg@kroah.com>
CC: Ingo Molnar <mingo@elte.hu>
CC: "Rafael J. Wysocki" <rjw@sisk.pl>
CC: Ben Slusky <sluskyb@paranoiacs.org>
CC: Chris Wright <chrisw@sous-sol.org>
CC: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Jones <davej@redhat.com>
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/davej/cpufreq: (35 commits)
[CPUFREQ] Prevent p4-clockmod from auto-binding to the ondemand governor.
[CPUFREQ] Make cpufreq-nforce2 less obnoxious
[CPUFREQ] p4-clockmod reports wrong frequency.
[CPUFREQ] powernow-k8: Use a common exit path.
[CPUFREQ] Change link order of x86 cpufreq modules
[CPUFREQ] conservative: remove 10x from def_sampling_rate
[CPUFREQ] conservative: fixup governor to function more like ondemand logic
[CPUFREQ] conservative: fix dbs_cpufreq_notifier so freq is not locked
[CPUFREQ] conservative: amend author's email address
[CPUFREQ] Use swap() in longhaul.c
[CPUFREQ] checkpatch cleanups for acpi-cpufreq
[CPUFREQ] powernow-k8: Only print error message once, not per core.
[CPUFREQ] ondemand/conservative: sanitize sampling_rate restrictions
[CPUFREQ] ondemand/conservative: deprecate sampling_rate{min,max}
[CPUFREQ] powernow-k8: Always compile powernow-k8 driver with ACPI support
[CPUFREQ] Introduce /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_transition_latency
[CPUFREQ] checkpatch cleanups for powernow-k8
[CPUFREQ] checkpatch cleanups for ondemand governor.
[CPUFREQ] checkpatch cleanups for powernow-k7
[CPUFREQ] checkpatch cleanups for speedstep related drivers.
...
This reverts commit e088e4c9cd.
Removing the sysfs interface for p4-clockmod was flagged as a
regression in bug 12826.
Course of action:
- Find out the remaining causes of overheating, and fix them
if possible. ACPI should be doing the right thing automatically.
If it isn't, we need to fix that.
- mark p4-clockmod ui as deprecated
- try again with the removal in six months.
It's not really feasible to printk about the deprecation, because
it needs to happen at all the sysfs entry points, which means adding
a lot of strcmp("p4-clockmod".. calls to the core, which.. bleuch.
Signed-off-by: Dave Jones <davej@redhat.com>
AMD users get particular hit by this issue (bug 8081) as it caps at
typically 90 seconds as the minimum period for a frequency change.
Harsh eh? Years ago I borked this buy puting the 10x in the wrong
place...I fix that by removing it altogether.
Signed-off-by: Alexander Clouter <alex@digriz.org.uk>
Signed-off-by: Dave Jones <davej@redhat.com>
As conservative is based off ondemand the codebases occasionally need to be
resync'd. This patch, although ugly, does this.
Signed-off-by: Alexander Clouter <alex@digriz.org.uk>
Signed-off-by: Dave Jones <davej@redhat.com>
When someone added the dbs_cpufreq_notifier section to the governor the
code ended up causing the frequency to only fall. This is because
requested_freq is tinkered with and that should only modified if it has
an invlaid value due to changes in the available frequency ranges
This should fix#10055.
Signed-off-by: Alexander Clouter <alex@digriz.org.uk>
Signed-off-by: Dave Jones <davej@redhat.com>
Limit sampling rate to transition_latency * 100 or kernel limits.
If sampling_rate is tried to be set too low, set the lowest allowed value.
Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Dave Jones <davej@redhat.com>
The same info can be obtained via the transition_latency sysfs file
Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Dave Jones <davej@redhat.com>
It's not only useful for the ondemand and conservative governors, but
also for userspace daemons to know about the HW transition latency of
the CPU.
It is especially useful for userspace to know about this value when
the ondemand or conservative governors are run. The sampling rate
control value depends on it and for userspace being able to set sane
tuning values there it has to know about the transition latency.
Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Dave Jones <davej@redhat.com>
ondemand micro-accounting of idle time changes broke ignore_nice_load
sysfs setting due to a thinko in the code.
The bug entry:
http://bugzilla.kernel.org/show_bug.cgi?id=12310
Reported-by: Jim Bray <jimsantelmo@gmail.com>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Dave Jones <davej@redhat.com>
* 'cpus4096-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
[IA64] fix typo in cpumask_of_pcibus()
x86: fix x86_32 builds for summit and es7000 arch's
cpumask: use work_on_cpu in acpi-cpufreq.c for read_measured_perf_ctrs
cpumask: use work_on_cpu in acpi-cpufreq.c for drv_read and drv_write
cpumask: use cpumask_var_t in acpi-cpufreq.c
cpumask: use work_on_cpu in acpi/cstate.c
cpumask: convert struct cpufreq_policy to cpumask_var_t
cpumask: replace CPUMASK_ALLOC etc with cpumask_var_t
x86: cleanup remaining cpumask_t ops in smpboot code
cpumask: update pci_bus_show_cpuaffinity to use new cpumask API
cpumask: update local_cpus_show to use new cpumask API
ia64: cpumask fix for is_affinity_mask_valid()
It is always "an" if there is a vowel _spoken_ (not written).
So it is:
"an hour" (spoken vowel)
but
"a uniform" (spoken 'j')
Signed-off-by: Frederik Schwarzer <schwarzerf@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Impact: use new cpumask API to reduce memory usage
This is part of an effort to reduce structure sizes for machines
configured with large NR_CPUS. cpumask_t gets replaced by
cpumask_var_t, which is either struct cpumask[1] (small NR_CPUS) or
struct cpumask * (large NR_CPUS).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Acked-by: Dave Jones <davej@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Previously driver resume would always set the current policy min/max with
the cpuinfo min/max, defined by user_policy.min/max. Resulting in a reset
of policy settings when policy.min/max != cpuinfo.min/max when coming out
of suspend. Now user_policy is saved as the policy instead of cpuinfo to
preserve what the user actually set.
Signed-off-by: Mike Chan <mike@android.com>
Signed-off-by: Dave Jones <davej@redhat.com>
p4-clockmod has a long history of abuse. It pretends to be a CPU
frequency scaling driver, even though it doesn't actually change
the CPU frequency, but instead just modulates the frequency with
wait-states.
The biggest misconception is that when running at the lower 'frequency'
p4-clockmod is saving power. This isn't the case, as workloads running
slower take longer to complete, preventing the CPU from entering deep C states.
However p4-clockmod does have a purpose. It can prevent overheating.
Having it hooked up to the cpufreq interfaces is the wrong way to achieve
cooling however. It should instead be hooked up to ACPI.
This diff introduces a means for a cpufreq driver to register with the
cpufreq core, but not present a sysfs interface.
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Signed-off-by: Dave Jones <davej@redhat.com>
We don't need to export the governors for use as the default governor,
because the default governor will be built-in anyway and we can access
the symbol directly.
This also fixes the following sparse warnings:
drivers/cpufreq/cpufreq_conservative.c:578:25: warning: symbol 'cpufreq_gov_conservative' was not declared. Should it be static?
drivers/cpufreq/cpufreq_ondemand.c:582:25: warning: symbol 'cpufreq_gov_ondemand' was not declared. Should it be static?
drivers/cpufreq/cpufreq_performance.c:39:25: warning: symbol 'cpufreq_gov_performance' was not declared. Should it be static?
drivers/cpufreq/cpufreq_powersave.c:38:25: warning: symbol 'cpufreq_gov_powersave' was not declared. Should it be static?
drivers/cpufreq/cpufreq_userspace.c:190:25: warning: symbol 'cpufreq_gov_userspace' was not declared. Should it be static?
Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Signed-off-by: Dave Jones <davej@redhat.com>
Use get_cpu_idle_time_us() to get micro-accounted idle information.
This enables ondemand to get more accurate idle and busy timings
than the jiffy based calculation. As a result, we can decrease
the ondemand safety gaurd band from 80-10 to 95-3.
Results in more aggressive power savings.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Dave Jones <davej@redhat.com>
Use a parameter for down differential, instead of hardcoded 10%. Follow-on
patch changes the down-differential dynamically, based on whether
we are using idle micro-accounting or not.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Dave Jones <davej@redhat.com>
Preparatory changes for doing idle micro-accounting in ondemand governor.
get_cpu_idle_time() gets extra parameter and returns idle time and also the
wall time that corresponds to the idle time measurement.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Dave Jones <davej@redhat.com>
Change the load calculation algorithm in ondemand to work well with software
coordination of frequency across the dependent cpus.
Multiply individual CPU utilization with the average freq of that logical CPU
during the measurement interval (using getavg call). And find the max CPU
utilization number in terms of CPU freq. That number is then used to
get to the target freq for next sampling interval.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Dave Jones <davej@redhat.com>
Add a cpu parameter to __cpufreq_driver_getavg(). This is needed for software
cpufreq coordination where policy->cpu may not be same as the CPU on which we
want to getavg frequency.
A follow-on patch will use this parameter to getavg freq from all cpus
in policy->cpus.
Change since last patch. Fix the offline/online and suspend/resume
oops reported by Youquan Song <youquan.song@intel.com>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Dave Jones <davej@redhat.com>
Venki Pallipadi made a similar change to the ondemand governor a while
back (in commit 28287033e1). It seems to
work just as well in the conservative governor, leading to fewer wakeups
as reported by powertop.
Signed-off-by: Ben Slusky <sluskyb@paranoiacs.org>
Signed-off-by: Dave Jones <davej@redhat.com>
After calling cpufreq_cpu_get, error handling code should call
cpufreq_cpu_put.
The semantic match that finds this problem is as follows:
(http://www.emn.fr/x-info/coccinelle/)
// <smpl>
@r@
expression x,E;
statement S;
position p1,p2,p3;
@@
(
if ((x = cpufreq_cpu_get@p1(...)) == NULL || ...) S
|
x = cpufreq_cpu_get@p1(...)
... when != x
if (x == NULL || ...) S
)
<...
if@p3 (...) { ... when != cpufreq_cpu_put(x)
when != if (x) { ... cpufreq_cpu_put(x); ...}
return@p2 ...;
}
...>
(
return x;
|
return 0;
|
x = E
|
E = x
|
cpufreq_cpu_put(x)
)
@exists@
position r.p1,r.p2,r.p3;
expression x;
int ret != 0;
statement S;
@@
* x = cpufreq_cpu_get@p1(...)
<...
* if@p3 (...)
S
...>
* return@p2 \(NULL\|ret\);
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Dave Jones <davej@redhat.com>
Add error handling for cpufreq_register_governor() error
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: cpufreq@lists.linux.org.uk
Signed-off-by: Dave Jones <davej@redhat.com>
drivers/cpufreq/cpufreq_conservative.c:336:15: warning: symbol 'freq_step' shadows an earlier one
Just rename the local variable.
Signed-off-by: Dave Jones <davej@redhat.com>
Ingo Molnar provided a fix to not call _PPC at processor driver
initialization time in "[PATCH] ACPI: fix cpufreq regression" (git
commit e4233dec74)
But it can still happen that _PPC is called at processor driver
initialization time.
This patch should make sure that this is not possible anymore.
Signed-off-by: Thomas Renninger <trenn@suse.de>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'cpus4096-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (31 commits)
NR_CPUS: Replace NR_CPUS in speedstep-centrino.c
cpumask: Provide a generic set of CPUMASK_ALLOC macros, FIXUP
NR_CPUS: Replace NR_CPUS in cpufreq userspace routines
NR_CPUS: Replace per_cpu(..., smp_processor_id()) with __get_cpu_var
NR_CPUS: Replace NR_CPUS in arch/x86/kernel/genapic_flat_64.c
NR_CPUS: Replace NR_CPUS in arch/x86/kernel/genx2apic_uv_x.c
NR_CPUS: Replace NR_CPUS in arch/x86/kernel/cpu/proc.c
NR_CPUS: Replace NR_CPUS in arch/x86/kernel/cpu/mcheck/mce_64.c
cpumask: Optimize cpumask_of_cpu in lib/smp_processor_id.c, fix
cpumask: Use optimized CPUMASK_ALLOC macros in the centrino_target
cpumask: Provide a generic set of CPUMASK_ALLOC macros
cpumask: Optimize cpumask_of_cpu in lib/smp_processor_id.c
cpumask: Optimize cpumask_of_cpu in kernel/time/tick-common.c
cpumask: Optimize cpumask_of_cpu in drivers/misc/sgi-xp/xpc_main.c
cpumask: Optimize cpumask_of_cpu in arch/x86/kernel/ldt.c
cpumask: Optimize cpumask_of_cpu in arch/x86/kernel/io_apic_64.c
cpumask: Replace cpumask_of_cpu with cpumask_of_cpu_ptr
Revert "cpumask: introduce new APIs"
cpumask: make for_each_cpu_mask a bit smaller
net: Pass reference to cpumask variable in net/sunrpc/svc.c
...
Fix up trivial conflicts in drivers/cpufreq/cpufreq.c manually
Format string bug. Not exploitable, as this is only writable by root,
but worth fixing all the same.
Spotted-by: Ilja van Sprundel <ilja@netric.org>
Signed-off-by: Dave Jones <davej@redhat.com>
If cpu specific cpufreq driver(i.e. longrun) has "setpolicy" function,
governor object isn't set into cpufreq_policy object at "__cpufreq_set_policy"
function in driver/cpufreq/cpufreq.c .
This causes a null object access at "store_scaling_setspeed" and
"show_scaling_setspeed" function in driver/cpufreq/cpufreq.c when reading or
writing through /sys interface (ex. cat
/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed)
Addresses:
http://bugzilla.kernel.org/show_bug.cgi?id=10654https://bugzilla.redhat.com/show_bug.cgi?id=443354
Signed-off-by: CHIKAMA Masaki <masaki.chikama@gmail.com>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Chuck Ebbert <cebbert@redhat.com>
Acked-by: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In drivers/cpufreq/cpufreq.c the function cpufreq_add_dev() takes the
error exit 'err_out_unregister' from different places once with the
'cpu_policy_rwsem' lock held, once with the lock released:
| if (ret)
| goto err_out_unregister;
| }
|
| policy->governor = NULL; /* to assure that the starting sequence is
| * run in cpufreq_set_policy */
|
| /* set default policy */
| ret = __cpufreq_set_policy(policy, &new_policy);
| policy->user_policy.policy = policy->policy;
| policy->user_policy.governor = policy->governor;
|
| unlock_policy_rwsem_write(cpu);
|
| if (ret) {
| dprintk("setting policy failed\n");
| goto err_out_unregister;
| }
This leads to the following error message in case of a failing
__cpufreq_set_policy() call:
=====================================
[ BUG: bad unlock balance detected! ]
-------------------------------------
swapper/1 is trying to release lock (&per_cpu(cpu_policy_rwsem, cpu)) at:
[<c01b4564>] unlock_policy_rwsem_write+0x30/0x40
but there are no more locks to release!
other info that might help us debug this:
1 lock held by swapper/1:
#0: (sysdev_drivers_lock){--..}, at: [<c018fd18>] sysdev_driver_register+0x74/0x130
stack backtrace:
[<c002f588>] (dump_stack+0x0/0x14) from [<c00692fc>] (print_unlock_inbalance_bug+0xc8/0x104)
[<c0069234>] (print_unlock_inbalance_bug+0x0/0x104) from [<c006b7ac>] (lock_release_non_nested+0xc4/0x19c)
r6:00000028 r5:c3c1ab80 r4:c01b4564
[<c006b6e8>] (lock_release_non_nested+0x0/0x19c) from [<c006b9e0>] (lock_release+0x15c/0x18c)
r8:60000013 r7:00000001 r6:c01b4564 r5:c0541bb4 r4:c3c1ab80
[<c006b884>] (lock_release+0x0/0x18c) from [<c0061ba0>] (up_write+0x24/0x30)
r8:c0541b80 r7:00000000 r6:ffffffea r5:c3c34828 r4:c0541b8c
[<c0061b7c>] (up_write+0x0/0x30) from [<c01b4564>] (unlock_policy_rwsem_write+0x30/0x40)
r4:c3c34884
[<c01b4534>] (unlock_policy_rwsem_write+0x0/0x40) from [<c01b4c40>] (cpufreq_add_dev+0x324/0x398)
[<c01b491c>] (cpufreq_add_dev+0x0/0x398) from [<c018fd64>] (sysdev_driver_register+0xc0/0x130)
[<c018fca4>] (sysdev_driver_register+0x0/0x130) from [<c01b3574>] (cpufreq_register_driver+0xbc/0x174)
Signed-off-by: Lothar WaĆmann <LW@KARO-electronics.de>
Signed-off-by: Dave Jones <davej@redhat.com>
Change references from for_each_cpu_mask to for_each_cpu_mask_nr
where appropriate
Reviewed-by: Paul Jackson <pj@sgi.com>
Reviewed-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Change cpufreq_policy and cpufreq_governor pointer tables
from arrays to per_cpu variables in the cpufreq subsystem.
Also some minor complaints from checkpatch.pl fixed.
Based on:
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86.git
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Dave Jones <davej@redhat.com>
Sometimes old_index != stat->last_index, see cpufreq_update_policy, bios can
change cpu setting in resume. In my test, after resume cpu is in lowest
speed, but the stat info shows cpu is in full speed. This patch makes the
stat info correct after a resume.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Jones <davej@redhat.com>
Allow use of the powersave cpufreq governor as the default one for EMBEDDED
configs.
Signed-off-by: Alessandro Guido <alessandro.guido@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Jones <davej@redhat.com>
Currently, affected_cpus shows which CPUs need to have their frequency
coordinated in software. When hardware coordination is in use, the contents
of this file appear the same as when no coordination is required. This can
lead to some confusion among user-space programs, for example, that do not
know that extra coordination is required to force a CPU core to a particular
speed to control power consumption.
To fix this, create a "related_cpus" attribute that always displays the
coordination map regardless of whatever coordination strategy the cpufreq
driver uses (sw or hw). If the cpufreq driver does not provide a value, fall
back to policy->cpus.
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Jones <davej@redhat.com>
Fix show_trans_table when it overflows PAGE_SIZE.
* Not all snprintf calls were protected against being passed a negative
length.
* When show_trans_table overflows, len might be > PAGE_SIZE. In that case,
returns PAGE_SIZE.
Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net>
Signed-off-by: Dave Jones <davej@codemonkey.org.uk>
If cpufreq_register_notifier is called before pure initcalls,
init_cpufreq_transition_notifier_list will overwrite whatever it did,
causing notifiers to be ignored.
Print some noise to the kernel log if that happens.
Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net>
Signed-off-by: Dave Jones <davej@codemonkey.org.uk>
Fix the following warnings:
WARNING: vmlinux.o(.text+0xfe6711): Section mismatch in reference from the function cpufreq_unregister_driver() to the variable .cpuinit.data:cpufreq_cpu_notifier
WARNING: vmlinux.o(.text+0xfe68af): Section mismatch in reference from the function cpufreq_register_driver() to the variable .cpuinit.data:cpufreq_cpu_notifier
WARNING: vmlinux.o(.exit.text+0xc4fa): Section mismatch in reference from the function cpufreq_stats_exit() to the variable .cpuinit.data:cpufreq_stat_cpu_notifier
The warnings were casued by references to unregister_hotcpu_notifier()
from normal functions or exit functions.
This is flagged by modpost as a potential error because
it does not know that for the non HOTPLUG_CPU
scenario the unregister_hotcpu_notifier() is a nop.
Silence the warning by replacing the __initdata
annotation with a __refdata annotation.
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Dave Jones <davej@codemonkey.org.uk>
The cpufreq core should not take an extra kobject reference count for no
reason, and then refuse to release it. This has been reported as
keeping machines from properly powering down all the way.
Signed-off-by: Balaji Rao <balajirrao@gmail.com>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Yi Yang <yi.y.yang@intel.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Frans Pop <elendil@planet.nl>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
cpufreq support can't be built as a module. Fix the related configuration
help message.
Signed-off-by: Stefano Brivio <stefano.brivio@polimi.it>
Signed-off-by: Dave Jones <davej@redhat.com>
Eliminate cpufreq_userspace scaling_setspeed deadlock.
Luming Yu recently uncovered yet another cpufreq related deadlock.
One thread that continuously switches the governors and the other thread that
repeatedly cats the contents of cpufreq directory causes both these threads to
go into a deadlock.
Detailed examination of the deadlock showed the exact flow before the deadlock
as:
Thread 1 Thread 2
________ ________
cats files under /sys/devices/.../cpufreq/
Set governor to userspace
Adds a new sysfs entry for
scaling_setspeed
cats files under /sys/devices/.../cpufreq/
Set governor to performance
Holds cpufreq_rw_sem in write
mode
Sends a STOP notify to
userspace governor
cat /sys/devices/.../cpufreq/scaling_setspeed
Gets a handle on the above sysfs entry with
sysfs_get_active
Blocks while trying to get cpufreq_rw_sem
in read mode
Remove a sysfs entry for
scaling_setspeed
Blocks on sysfs_deactivate
while waiting for earlier
get_active (on other thread)
to drain
At this point both threads go into deadlock and any other thread that tries to
do anything with sysfs cpufreq will also block.
There seems to be no easy way to avoid this deadlock as long as
cpufreq_userspace adds/removes the sysfs entry under same kobject as cpufreq.
Below patch moves scaling_setspeed to cpufreq.c, keeping it always and calling
back the governor on read/write. This is the cleanest fix I could think of,
even though adding two callbacks in governor structure just for this seems
unnecessary.
Note that the change makes scaling_setspeed under /sys/.../cpufreq permanent
and returns <unsupported> when governor is not userspace.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Dave Jones <davej@redhat.com>
In freq_table.c, show_available_freqs()'s comment is oberviously wrong.
Change the comment to a new one to avoid confusion.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Jones <davej@redhat.com>
There is no need for kobject_unregister() anymore, thanks to Kay's
kobject cleanup changes, so replace all instances of it with
kobject_put().
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Stop using kobject_register, as this way we can control the sending of
the uevent properly, after everything is properly initialized.
Cc: Dominik Brodowski <linux@brodo.de>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Jacob Shin <jacob.shin@amd.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
When the cpufreq driver starts up at boot time, it calls into the default
governor which might not be initialised yet. This hurts when the
governor's worker function relies on memory that is not yet set up by its
init function.
This migrates all governors from module_init() to fs_initcall() when being
the default, as was already done in cpufreq_performance when it was the
only possible choice. The performance governor is always initialized early
because it might be used as fallback even when not being the default.
Fixes at least one actual oops where ondemand is the default governor and
cpufreq_governor_dbs() uses the uninitialised kondemand_wq work-queue
during boot-time.
Signed-off-by: Johannes Weiner <hannes@saeurebad.de>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cpufreq_stats_free_table() mustn't be __cpuexit since it's called by the
__cpuinit cpufreq_stat_cpu_callback().
This patch fixes the following section mismatch reported by
Chris Clayton:
WARNING: vmlinux.o(.init.text+0x143dd): Section mismatch: reference to .exit.text:cpufreq_stats_free_table (between 'cpufreq_stat_cpu_callback' and 'cpufreq_stats_init')
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Cc: Chris Clayton <chris2553@googlemail.com>
Acked-by: Dave Jones <davej@codemonkey.org.uk>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ingo hit some BUG_ONs that were probably caused by these missing unlocks
causing an unbalance. He couldn't reproduce the bug reliably, so it's
unknown that it's definitly fixing the problem he hit, but it's a fairly
good chance, and this fixes an obvious bug.
[ Dave: "Ingo followed up that he hit some lockdep related output with
this applied, so it may not be right. I'll look at it after
xmas if no-one has it figured out before then."
Akpm: "It looks pretty correct to me though." ]
Signed-off-by: Dave Jones <davej@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make cpufreq_conservative handle out-of-sync events properly
Currently, the cpufreq_conservative governor doesn't get notified when the
actual frequency the cpu is running at differs from what cpufreq thought it
was. As a result the cpu may stay at the maximum frequency after a s2ram /
resume cycle even though the system is idle.
Signed-off-by: Elias Oltmanns <eo@nebensachen.de>
Signed-off-by: Dave Jones <davej@redhat.com>
* master.kernel.org:/pub/scm/linux/kernel/git/gregkh/driver-2.6: (75 commits)
PM: merge device power-management source files
sysfs: add copyrights
kobject: update the copyrights
kset: add some kerneldoc to help describe what these strange things are
Driver core: rename ktype_edd and ktype_efivar
Driver core: rename ktype_driver
Driver core: rename ktype_device
Driver core: rename ktype_class
driver core: remove subsystem_init()
sysfs: move sysfs file poll implementation to sysfs_open_dirent
sysfs: implement sysfs_open_dirent
sysfs: move sysfs_dirent->s_children into sysfs_dirent->s_dir
sysfs: make sysfs_root a regular directory dirent
sysfs: open code sysfs_attach_dentry()
sysfs: make s_elem an anonymous union
sysfs: make bin attr open get active reference of parent too
sysfs: kill unnecessary NULL pointer check in sysfs_release()
sysfs: kill unnecessary sysfs_get() in open paths
sysfs: reposition sysfs_dirent->s_mode.
sysfs: kill sysfs_update_file()
...
A number of different drivers incorrect access the kobject name field
directly. This is not correct as the name might not be in the array.
Use the proper accessor function instead.
I don't see any reason to take an expensive lock in cpufreq_quick_get()
Reading policy->cur is a single atomic operation and after
the lock is dropped again the state could change any time anyways.
So don't take the lock in the first place.
This also makes this function interrupt safe which is useful
for some code of mine.
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: "Pallipadi, Venkatesh" <venkatesh.pallipadi@intel.com>
Signed-off-by: Dave Jones <davej@redhat.com>
* Stop referencing the callback directly from the __init and __exit
functions of this driver, and instead explicitly call
cpufreq_update_policy() et al. This enables the callback function
to be marked as __cpuinit (and the notifier_block __cpuinitdata),
thereby saving space when HOTPLUG_CPU=n. This also enables us to
use other tricks to replace __cpuinit{data} in future.
* cpufreq_stats_free_table() is only called from __cpuinit or __exit
marked functions, making it an ideal candidate for __cpuexit.
* Fix missing space in the module description
Signed-off-by: Satyam Sharma <satyam@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Jones <davej@redhat.com>
The notifier_block is already __cpuinitdata, thereby allowing us to safely
mark the callback function as __cpuinit also, thereby saving space when
HOTPLUG_CPU=n.
Signed-off-by: Satyam Sharma <satyam@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Jones <davej@redhat.com>
Depending on the transition latency of the HW for cpufreq switches, the
ondemand or conservative governor cannot be used with certain cpufreq
drivers. Still the ondemand should be the default governor on a wide range
of systems. This patch allows this and lets the governor fallback to the
performance governor at cpufreq driver load time, if the driver does not
support fast enough frequency switching.
Main benefit is that on e.g. installation or other systems without
userspace support a working dynamic cpufreq support can be achieved on most
systems by simply loading the cpufreq driver. This is especially essential
for recent x86(_64) laptop hardware which may rely on working dynamic
cpufreq OS support.
Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Bryan Wu <bryan.wu@analog.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Bryan Wu <bryan.wu@analog.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Jones <davej@redhat.com>
Negative side effect: needs NR_CPUs pointer array of memory in
CONFIG_HOTPLUG_CPU case.
Still needs userspace track keeping and rewriting of governors if governors
change while a CPU is not active (always the governor at CPU remove time is
restored).
Move of policy->user_policy.governor assignment is just a minor cleanup.
http://bugzilla.kernel.org/show_bug.cgi?id=8671
Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Mattia Dongili <malattia@linux.it>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Jones <davej@redhat.com>
There is a frequency scaling issue that I encountered with the performance
governor in combination with CPU hotplug.
In cpufreq.c CPU frequency is reduced to its minimum before the CPU gets
unregistered and set offline. Does that have a particular reason?
Since the (k8-)governor does not monitor CPU frequency that setting also
applies then to the remaining CPU as well and lets the system run on the
lowest frequency although performance is chose as the policy.
Signed-off-by: Peter Oruba <peter.oruba@amd.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Jones <davej@redhat.com>
* master.kernel.org:/pub/scm/linux/kernel/git/davej/cpufreq:
[CPUFREQ] Fix sysfs_create_file return value handling
[CPUFREQ] ondemand: fix tickless accounting and software coordination bug
[CPUFREQ] ondemand: add a check to avoid negative load calculation
[CPUFREQ] Keep userspace governor quiet when it is not being used
[CPUFREQ] Longhaul - Proper register access
[CPUFREQ] Kconfig powernow-k8 driver should depend on ACPI P-States driver
[CPUFREQ] Longhaul - Replace ACPI functions with direct I/O
[CPUFREQ] Longhaul - Remove duplicate multipliers
[CPUFREQ] Longhaul - Embedded "conservative"
[CPUFREQ] acpi-cpufreq: Proper ReadModifyWrite of PERF_CTL MSR
[CPUFREQ] check return value of sysfs_create_file
[CPUFREQ] Longhaul - Check ACPI "BM DMA in progress" bit
[CPUFREQ] Longhaul - Move old_ratio to correct place
[CPUFREQ] Longhaul - VT8237 support
[CPUFREQ] Longhaul - Use all kinds of support
[CPUFREQ] powernow-k8: clarify number of cores.
sysfs is now completely out of driver/module lifetime game. After
deletion, a sysfs node doesn't access anything outside sysfs proper,
so there's no reason to hold onto the attribute owners. Note that
often the wrong modules were accounted for as owners leading to
accessing removed modules.
This patch kills now unnecessary attribute->owner. Note that with
this change, userland holding a sysfs node does not prevent the
backing module from being unloaded.
For more info regarding lifetime rule cleanup, please read the
following message.
http://article.gmane.org/gmane.linux.kernel/510293
(tweaked by Greg to not delete the field just yet, to make it easier to
merge things properly.)
Signed-off-by: Tejun Heo <htejun@gmail.com>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Commit 0a4b2ccc55 in cpufreq.git
eliminates the build warnings but does not pass on the error code of
sysfs_create_file to the function calling cpufreq_add_dev. Instead some
previous value of ret would be returned.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Dave Jones <davej@redhat.com>
With tickless kernel and software coordination os P-states, ondemand
can look at wrong idle statistics. This can happen when ondemand sampling
is happening on CPU 0 and due to software coordination sampling also looks at
utilization of CPU 1. If CPU 1 is in tickless state at that moment, its idle
statistics will not be uptodate and CPU 0 thinks CPU 1 is idle for less
amount of time than it actually is.
This can be resolved by looking at all the busy times of CPUs, which is
accurate, even with tickless, and use that to determine idle time in a
round about way (total time - busy time).
Thanks to Arjan for originally reporting the ondemand bug on
Lenovo T61.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Dave Jones <davej@redhat.com>
Due to rounding and inexact jiffy accounting, idle_ticks can sometimes
be higher than total_ticks. Make sure those cases are handled as
zero load case.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Dave Jones <davej@redhat.com>
Userspace governor registers a frequency change notifier at init time, even
when no CPU is set to userspace governor. Make it register only when
atleast one CPU is using userspace.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Dave Jones <davej@redhat.com>
Eliminate build warning (sysfs_create_file return value must be checked)
Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Dave Jones <davej@redhat.com>
Since nonboot CPUs are now disabled after tasks and devices have been
frozen and the CPU hotplug infrastructure is used for this purpose, we need
special CPU hotplug notifications that will help the CPU-hotplug-aware
subsystems distinguish normal CPU hotplug events from CPU hotplug events
related to a system-wide suspend or resume operation in progress. This
patch introduces such notifications and causes them to be used during
suspend and resume transitions. It also changes all of the
CPU-hotplug-aware subsystems to take these notifications into consideration
(for now they are handled in the same way as the corresponding "normal"
ones).
[oleg@tv-sign.ru: cleanups]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a new deferrable delayed work init. This can be used to schedule work
that are 'unimportant' when CPU is idle and can be called later, when CPU
eventually comes out of idle.
Use this init in cpufreq ondemand governor.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Adds proper lines to help output of kconfig so people can find the module names.
Also fixed some broken leading spaces versus tabs.
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Dave Jones <davej@redhat.com>
Remove deprecated /proc/acpi/processor/performance write support
Writing to /proc/acpi/processor/xy/performance interferes with sysfs
cpufreq interface. Also removes buggy cpufreq_set_policy exported symbol.
Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Dave Jones <davej@redhat.com>
References:
https://bugzilla.novell.com/show_bug.cgi?id=231107https://bugzilla.novell.com/show_bug.cgi?id=264077
Fix limited cpufreq when booted on battery
If booted on battery:
cpufreq_set_policy (evil) is invoked which calls verify_within_limits.
max_freq gets lowered and therefore users_policy.max, which
is used to restore higher freqs via update_policy later is set to the
already limited frequency -> you can never go up again, even BIOS
allows higher freqs later.
Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Dave Jones <davej@redhat.com>
Ingo reported it on lkml in the thread
"2.6.21-rc5: maxcpus=1 crash in cpufreq: kernel BUG at drivers/cpufreq/cpufreq.c:82!"
This check added to remove_dev is symmetric to one in add_dev and handles
callbacks for offline cpus cleanly.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit aeeddc1435, which was
half-baked and broken. It just resulted in compile errors, since
cpufreq_register_driver() still changes the 'driver_data' by setting
bits in the flags field. So claiming it is 'const' _really_ doesn't
work.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Looks like dbs_timer() is very careful wrt per_cpu(cpu_dbs_info),
and it doesn't need the help of WORK_STRUCT_NOAUTOREL.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-By: David Howells <dhowells@redhat.com>
Signed-off-by: Dave Jones <davej@redhat.com>
After Al Viro (finally) succeeded in removing the sched.h #include in module.h
recently, it makes sense again to remove other superfluous sched.h includes.
There are quite a lot of files which include it but don't actually need
anything defined in there. Presumably these includes were once needed for
macros that used to live in sched.h, but moved to other header files in the
course of cleaning it up.
To ease the pain, this time I did not fiddle with any header files and only
removed #includes from .c-files, which tend to cause less trouble.
Compile tested against 2.6.20-rc2 and 2.6.20-rc2-mm2 (with offsets) on alpha,
arm, i386, ia64, mips, powerpc, and x86_64 with allnoconfig, defconfig,
allmodconfig, and allyesconfig as well as a few randconfigs on x86_64 and all
configs in arch/arm/configs on arm. I also checked that no new warnings were
introduced by the patch (actually, some warnings are removed that were emitted
by unnecessarily included header files).
Signed-off-by: Tim Schmielau <tim@physik3.uni-rostock.de>
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
CPU_FREQ_TABLE enables helper code and gets select'ed when it's required.
Building it as a module when it's not required doesn't seem to make much sense.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Jones <davej@redhat.com>
Eliminate flush_workqueue in cpufreq_governor(STOP) callpath. Using flush
there has a deadlock potential as in
http://uwsg.iu.edu/hypermail/linux/kernel/0611.3/1223.html
Also, cleanup the locking issues with do_dbs_timer delayed_work callback. As
it changes the CPU frequency using __cpufreq_target, it needs to have
policy_rwsem in write mode, which also protects it from hot plug.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Jones <davej@redhat.com>
Restructure the delayed_work callback in ondemand.
This eliminates the need for smp_processor_id in the callback function and
also helps in proper locking and avoiding flush_workqueue when stopping the
governor (done in subsequent patch).
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Jones <davej@redhat.com>
Yet another attempt to resolve cpufreq and hotplug locking issues.
Patchset has 3 patches:
* Rewrite the lock infrastructure of cpufreq using a per cpu rwsem.
* Minor restructuring of work callback in ondemand driver.
* Use the new cpufreq rwsem infrastructure in ondemand work.
This patch:
Convert policy->lock to rwsem and move it to per_cpu area.
This rwsem will protect against both changing/accessing policy
related parameters and CPU hot plug/unplug.
[malattia@linux.it: fix oops in kref_put()]
Cc: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Mattia Dongili <malattia@linux.it>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Jones <davej@redhat.com>
The hotplug CPU locking in cpufreq is horrendous. No-one seems to care
enough to fix it, so just remove it so that the 99.9% of the real world
users of this code can use cpufreq without being bothered by warnings.
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Jones <davej@redhat.com>
Trivial patch to check sysfs_create_link return values.
Fail gracefully if needed.
Signed-off-by: Ahmed Darwish <darwish.07@gmail.com>
Signed-off-by: Dave Jones <davej@redhat.com>
Fixes the oops in cpufreq_stats with acpi_cpufreq driver. The issue was
that the frequency was reported as 0 in acpi-cpufreq.c. The bug is due to
different indicies for freq_table and ACPI perf table.
Also adds a check in cpufreq_stats to check for error return from
freq_table_get_index() and avoid using the error return value.
Patch fixes the issue reported at
http://www.ussg.iu.edu/hypermail/linux/kernel/0611.2/0629.html
and also other similar issue here
http://bugme.osdl.org/show_bug.cgi?id=7383 comment 53
Signed-off-by: Dhaval Giani <dhaval.giani@gmail.com>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Dave Jones <davej@redhat.com>
There was lots of #ifdef noise in the kernel due to hotcpu_notifier(fn,
prio) not correctly marking 'fn' as used in the !HOTPLUG_CPU case, and thus
generating compiler warnings of unused symbols, hence forcing people to add
#ifdefs.
the compiler can skip truly unused functions just fine:
text data bss dec hex filename
1624412 728710 3674856 6027978 5bfaca vmlinux.before
1624412 728710 3674856 6027978 5bfaca vmlinux.after
[akpm@osdl.org: topology.c fix]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Pass the work_struct pointer to the work function rather than context data.
The work function can use container_of() to work out the data.
For the cases where the container of the work_struct may go away the moment the
pending bit is cleared, it is made possible to defer the release of the
structure by deferring the clearing of the pending bit.
To make this work, an extra flag is introduced into the management side of the
work_struct. This governs auto-release of the structure upon execution.
Ordinarily, the work queue executor would release the work_struct for further
scheduling or deallocation by clearing the pending bit prior to jumping to the
work function. This means that, unless the driver makes some guarantee itself
that the work_struct won't go away, the work function may not access anything
else in the work_struct or its container lest they be deallocated.. This is a
problem if the auxiliary data is taken away (as done by the last patch).
However, if the pending bit is *not* cleared before jumping to the work
function, then the work function *may* access the work_struct and its container
with no problems. But then the work function must itself release the
work_struct by calling work_release().
In most cases, automatic release is fine, so this is the default. Special
initiators exist for the non-auto-release case (ending in _NAR).
Signed-Off-By: David Howells <dhowells@redhat.com>
This is a quick hack to overcome the fact that SRCU currently does not
allow static initializers, and we need to sometimes initialize those
things before any other initializers (even "core" ones) can do so.
Currently we don't allow this at all for modules, and the only user that
needs is right now is cpufreq. As reported by Thomas Gleixner:
"Commit b4dfdbb3c7 ("[PATCH] cpufreq:
make the transition_notifier chain use SRCU breaks cpu frequency
notification users, which register the callback > on core_init
level."
Cc: Thomas Gleixner <tglx@timesys.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Andrew Morton <akpm@osdl.org>,
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Clean up cpufreq subsystem to fix coding style issues and to improve
the readability.
Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Dave Jones <davej@redhat.com>
Enable ondemand governor and acpi-cpufreq to use IA32_APERF and IA32_MPERF MSR
to get active frequency feedback for the last sampling interval. This will
make ondemand take right frequency decisions when hardware coordination of
frequency is going on.
Without APERF/MPERF, ondemand can take wrong decision at times due
to underlying hardware coordination or TM2.
Example:
* CPU 0 and CPU 1 are hardware cooridnated.
* CPU 1 running at highest frequency.
* CPU 0 was running at highest freq. Now ondemand reduces it to
some intermediate frequency based on utilization.
* Due to underlying hardware coordination with other CPU 1, CPU 0 continues to
run at highest frequency (as long as other CPU is at highest).
* When ondemand samples CPU 0 again next time, without actual frequency
feedback from APERF/MPERF, it will think that previous frequency change
was successful and can go to wrong target frequency. This is because it
thinks that utilization it has got this sampling interval is when running at
intermediate frequency, rather than actual highest frequency.
More information about IA32_APERF IA32_MPERF MSR:
Refer to IA-32 IntelĀ® Architecture Software Developer's Manual at
http://developer.intel.com
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Dave Jones <davej@redhat.com>