Commit graph

14779 commits

Author SHA1 Message Date
Stephane Eranian
2481c5fa6d perf: Disable PERF_SAMPLE_BRANCH_* when not supported
PERF_SAMPLE_BRANCH_* is disabled for:

 - SW events (sw counters, tracepoints)
 - HW breakpoints
 - ALL but Intel x86 architecture
 - AMD64 processors

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1328826068-11713-10-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-05 14:55:42 +01:00
Stephane Eranian
3e702ff6d1 perf/x86: Add LBR software filter support for Intel CPUs
This patch adds an internal sofware filter to complement
the (optional) LBR hardware filter.

The software filter is necessary:

 - as a substitute when there is no HW LBR filter (e.g., Atom, Core)
 - to complement HW LBR filter in case of errata (e.g., Nehalem/Westmere)
 - to provide finer grain filtering (e.g., all processors)

Sometimes the LBR HW filter cannot distinguish between two types
of branches. For instance, to capture syscall as CALLS, it is necessary
to enable the LBR_FAR filter which will also capture JMP instructions.
Thus, a second pass is necessary to filter those out, this is what the
SW filter can do.

The SW filter is built on top of the internal x86 disassembler. It
is a best effort filter especially for user level code. It is subject
to the availability of the text page of the program.

The SW filter is enabled on all Intel processors. It is bypassed
when the user is capturing all branches at all priv levels.

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1328826068-11713-9-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-05 14:55:42 +01:00
Stephane Eranian
60ce0fbd07 perf/x86: Implement PERF_SAMPLE_BRANCH for Intel CPUs
This patch implements PERF_SAMPLE_BRANCH support for Intel
x86processors. It connects PERF_SAMPLE_BRANCH to the actual LBR.

The patch adds the hooks in the PMU irq handler to save the LBR
on counter overflow for both regular and PEBS modes.

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1328826068-11713-8-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-05 14:55:41 +01:00
Stephane Eranian
88c9a65e13 perf/x86: Disable LBR support for older Intel Atom processors
The patch adds a restriction for Intel Atom LBR support. Only
steppings 10 (PineView) and more recent are supported. Older models
do not have a functional LBR. Their LBR does not freeze on PMU
interrupt which makes LBR unusable in the context of perf_events.

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1328826068-11713-7-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-05 14:55:41 +01:00
Stephane Eranian
c5cc2cd906 perf/x86: Add Intel LBR mappings for PERF_SAMPLE_BRANCH filters
This patch adds the mappings from the generic PERF_SAMPLE_BRANCH_*
filters to the actual Intel x86LBR filters, whenever they exist.

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1328826068-11713-6-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-05 14:55:41 +01:00
Stephane Eranian
ff3fb511ba perf/x86: Sync branch stack sampling with precise_sampling
If precise sampling is enabled on Intel x86 then perf_event uses PEBS.
To correct for the off-by-one error of PEBS, perf_event uses LBR when
precise_sample > 1.

On Intel x86 PERF_SAMPLE_BRANCH_STACK is implemented using LBR,
therefore both features must be coordinated as they may not
configure LBR the same way.

For PEBS, LBR needs to capture all branches at the priv level of
the associated event.

This patch checks that the branch type and priv level of BRANCH_STACK
is compatible with that of the PEBS LBR requirement, thereby allowing:

   $ perf record -b any,u -e instructions:upp ....

But:

   $ perf record -b any_call,u -e instructions:upp

Is not possible.

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1328826068-11713-5-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-05 14:55:40 +01:00
Stephane Eranian
b36817e886 perf/x86: Add Intel LBR sharing logic
The Intel LBR on some recent processor is capable
of filtering branches by type. The filter is configurable
via the LBR_SELECT MSR register.

There are limitation on how this register can be used.

On Nehalem/Westmere, the LBR_SELECT is shared by the two HT threads
when HT is on. It is private to each core when HT is off.

On SandyBridge, the LBR_SELECT register is private to each thread
when HT is on. It is private to each core when HT is off.

The kernel must manage the sharing of LBR_SELECT. It allows
multiple users on the same logical CPU to use LBR_SELECT as
long as they program it with the same value. Across sibling
CPUs (HT threads), the same restriction applies on NHM/WSM.

This patch implements this sharing logic by leveraging the
mechanism put in place for managing the offcore_response
shared MSR.

We modify __intel_shared_reg_get_constraints() to cause
x86_get_event_constraint() to be called because LBR may
be associated with events that may be counter constrained.

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1328826068-11713-4-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-05 14:55:40 +01:00
Stephane Eranian
225ce53910 perf/x86: Add Intel LBR MSR definitions
This patch adds the LBR definitions for NHM/WSM/SNB and Core.
It also adds the definitions for the architected LBR MSR:
LBR_SELECT, LBRT_TOS.

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1328826068-11713-3-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-05 14:55:39 +01:00
Stephane Eranian
bce38cd53e perf: Add generic taken branch sampling support
This patch adds the ability to sample taken branches to the
perf_event interface.

The ability to capture taken branches is very useful for all
sorts of analysis. For instance, basic block profiling, call
counts, statistical call graph.

This new capability requires hardware assist and as such may
not be available on all HW platforms. On Intel x86 it is
implemented on top of the Last Branch Record (LBR) facility.

To enable taken branches sampling, the PERF_SAMPLE_BRANCH_STACK
bit must be set in attr->sample_type.

Sampled taken branches may be filtered by type and/or priv
levels.

The patch adds a new field, called branch_sample_type, to the
perf_event_attr structure. It contains a bitmask of filters
to apply to the sampled taken branches.

Filters may be implemented in HW. If the HW filter does not exist
or is not good enough, some arch may also implement a SW filter.

The following generic filters are currently defined:
- PERF_SAMPLE_USER
  only branches whose targets are at the user level

- PERF_SAMPLE_KERNEL
  only branches whose targets are at the kernel level

- PERF_SAMPLE_HV
  only branches whose targets are at the hypervisor level

- PERF_SAMPLE_ANY
  any type of branches (subject to priv levels filters)

- PERF_SAMPLE_ANY_CALL
  any call branches (may incl. syscall on some arch)

- PERF_SAMPLE_ANY_RET
  any return branches (may incl. syscall returns on some arch)

- PERF_SAMPLE_IND_CALL
  indirect call branches

Obviously filter may be combined. The priv level bits are optional.
If not provided, the priv level of the associated event are used. It
is possible to collect branches at a priv level different from the
associated event. Use of kernel, hv priv levels is subject to permissions
and availability (hv).

The number of taken branch records present in each sample may vary based
on HW, the type of sampled branches, the executed code. Therefore
each sample contains the number of taken branches it contains.

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1328826068-11713-2-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-05 14:55:39 +01:00
Ingo Molnar
737f24bda7 Merge branch 'perf/urgent' into perf/core
Conflicts:
	tools/perf/builtin-record.c
	tools/perf/builtin-top.c
	tools/perf/perf.h
	tools/perf/util/top.h

Merge reason: resolve these cherry-picking conflicts.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-05 09:20:08 +01:00
Alex Shi
901b04450a x86/numa: Improve internode cache alignment
Currently cache alignment among nodes in the kernel is still 128
bytes on x86 NUMA machines - we got that X86_INTERNODE_CACHE_SHIFT
default from old P4 processors.

But now most modern x86 CPUs use the same size: 64 bytes from L1 to
last level L3. so let's remove the incorrect setting, and directly
use the L1 cache size to do SMP cache line alignment.

This patch saves some memory space on kernel data, and it also
improves the cache locality of kernel data.

The System.map is quite different with/without this change:

	before patch			after patch
  ...
  000000000000b000 d tlb_vector_|  000000000000b000 d tlb_vector
  000000000000b080 d cpu_loops_p|  000000000000b040 d cpu_loops_
  ...

Signed-off-by: Alex Shi <alex.shi@intel.com>
Cc: asit.k.mallick@intel.com
Link: http://lkml.kernel.org/r/1330774047-18597-1-git-send-email-alex.shi@intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-05 09:19:20 +01:00
Paul Gortmaker
187f1882b5 BUG: headers with BUG/BUG_ON etc. need linux/bug.h
If a header file is making use of BUG, BUG_ON, BUILD_BUG_ON, or any
other BUG variant in a static inline (i.e. not in a #define) then
that header really should be including <linux/bug.h> and not just
expecting it to be implicitly present.

We can make this change risk-free, since if the files using these
headers didn't have exposure to linux/bug.h already, they would have
been causing compile failures/warnings.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2012-03-04 17:54:34 -05:00
Rafael J. Wysocki
643161ace2 Merge branch 'pm-sleep'
* pm-sleep:
  PM / Freezer: Remove references to TIF_FREEZE in comments
  PM / Sleep: Add more wakeup source initialization routines
  PM / Hibernate: Enable usermodehelpers in hibernate() error path
  PM / Sleep: Make __pm_stay_awake() delete wakeup source timers
  PM / Sleep: Fix race conditions related to wakeup source timer function
  PM / Sleep: Fix possible infinite loop during wakeup source destruction
  PM / Hibernate: print physical addresses consistently with other parts of kernel
  PM: Add comment describing relationships between PM callbacks to pm.h
  PM / Sleep: Drop suspend_stats_update()
  PM / Sleep: Make enter_state() in kernel/power/suspend.c static
  PM / Sleep: Unify kerneldoc comments in kernel/power/suspend.c
  PM / Sleep: Remove unnecessary label from suspend_freeze_processes()
  PM / Sleep: Do not check wakeup too often in try_to_freeze_tasks()
  PM / Sleep: Initialize wakeup source locks in wakeup_source_add()
  PM / Hibernate: Refactor and simplify freezer_test_done
  PM / Hibernate: Thaw kernel threads in hibernation_snapshot() in error/test path
  PM / Freezer / Docs: Document the beauty of freeze/thaw semantics
  PM / Suspend: Avoid code duplication in suspend statistics update
  PM / Sleep: Introduce generic callbacks for new device PM phases
  PM / Sleep: Introduce "late suspend" and "early resume" of devices
2012-03-04 23:11:14 +01:00
Jiri Kosina
e37aade316 x86, memblock: Move mem_hole_size() to .init
mem_hole_size() is being called only from __init-marked functions, and as
such should be moved to .init section as well. Fixes this warning:

WARNING: vmlinux.o(.text+0x35511): Section mismatch in reference from the function mem_hole_size() to the function .init.text:absent_pages_in_range()

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Link: http://lkml.kernel.org/r/alpine.LNX.2.00.1202281614450.31150@pobox.suse.cz
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2012-03-03 15:51:20 -08:00
Myron Stowe
63ab387ca0 x86/PCI: add spinlock held check to 'pcibios_fwaddrmap_lookup()'
'pcibios_fwaddrmap_lookup()' is used to maintain FW-assigned BIOS BAR
values for reinstatement when normal resource assignment attempts
fail and must be called with the 'pcibios_fwaddrmap_lock' spinlock
held.

This patch adds a WARN_ON notification if the spinlock is not currently
held by the caller.

Signed-off-by: Myron Stowe <myron.stowe@redhat.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
2012-03-02 12:03:58 -08:00
Joerg Roedel
1018faa6cf perf/x86/kvm: Fix Host-Only/Guest-Only counting with SVM disabled
It turned out that a performance counter on AMD does not
count at all when the GO or HO bit is set in the control
register and SVM is disabled in EFER.

This patch works around this issue by masking out the HO bit
in the performance counter control register when SVM is not
enabled.

The GO bit is not touched because it is only set when the
user wants to count in guest-mode only. So when SVM is
disabled the counter should not run at all and the
not-counting is the intended behaviour.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Avi Kivity <avi@redhat.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Robert Richter <robert.richter@amd.com>
Cc: stable@vger.kernel.org # v3.2
Link: http://lkml.kernel.org/r/1330523852-19566-1-git-send-email-joerg.roedel@amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-02 12:16:39 +01:00
Jonathan Nieder
a97f4f5e52 x86/PCI: do not tie MSI MS-7253 use_crs quirk to BIOS version
Carlos was getting

	WARNING: at drivers/pci/pci.c:118 pci_ioremap_bar+0x24/0x52()

when probing his sound card, and sound did not work.  After adding
pci=use_crs to the kernel command line, no more trouble.

Ok, we can add a quirk.  dmidecode output reveals that this is an MSI
MS-7253, for which we already have a quirk, but the short-sighted
author tied the quirk to a single BIOS version, making it not kick in
on Carlos's machine with BIOS V1.2.  If a later BIOS update makes it
no longer necessary to look at the _CRS info it will still be
harmless, so let's stop trying to guess which versions have and don't
have accurate _CRS tables.

Addresses https://bugtrack.alsa-project.org/alsa-bug/view.php?id=5533
Also see <https://bugzilla.kernel.org/show_bug.cgi?id=42619>.

Reported-by: Carlos Luna <caralu74@gmail.com>
Reviewed-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
2012-03-01 10:56:37 -08:00
Thomas Gleixner
bd2f55361f sched/rt: Use schedule_preempt_disabled()
Coccinelle based conversion.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-24swm5zut3h9c4a6s46x8rws@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-01 10:28:03 +01:00
Paul Gortmaker
50af5ead3b bug.h: add include of it to various implicit C users
With bug.h currently living right in linux/kernel.h there
are files that use BUG_ON and friends but are not including
the header explicitly.  Fix them up so we can remove the
presence in kernel.h file.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2012-02-29 17:15:08 -05:00
H. Peter Anvin
a51f404775 x86, build: Fix portability issues when cross-building
It would appear that we never actually generated a correct CRC when
building on a bigendian machine.  Depending on the word size, we would
either generate an all-zero CRC (64-bit machine) or a byte-swapped
CRC (32-bit machine.)  Fix the types used so we don't arbitrarily use
a 64-bit word to hold 32-bit numbers, and pass the CRC through
put_unaligned_le32() like all the other numbers.

Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Matt Fleming <matt@console-pimps.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Nick Bowler <nbowler@elliptictech.com>
Link: http://lkml.kernel.org/r/20120229111322.9eb4b23ff1672e8853ad3b3b@canb.auug.org.au
2012-02-28 23:40:56 -08:00
H. Peter Anvin
b8d43cb504 x86, tools: Remove unneeded header files from tools/build.c
We include <sys/sysmacros.h> and <asm/boot.h>, but none of those
header files actually provide anything this file needs.  Furthermore,
it breaks cross-compilation, so just remove them.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Matt Fleming <matt@console-pimps.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Nick Bowler <nbowler@elliptictech.com>
Link: http://lkml.kernel.org/r/20120229111322.9eb4b23ff1672e8853ad3b3b@canb.auug.org.au
2012-02-28 23:40:15 -08:00
Paul Gortmaker
f649e9388c x86: relocate get/set debugreg fcns to include/asm/debugreg.
Since we already have a debugreg.h header file, move the
assoc. get/set functions to it.  In addition to it being the
logical home for them, it has a secondary advantage.  The
functions that are moved use BUG().  So we really need to
have linux/bug.h in scope.  But asm/processor.h is used about
600 times, vs. only about 15 for debugreg.h -- so adding bug.h
to the latter reduces the amount of time we'll be processing
it during a compile.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: "H. Peter Anvin" <hpa@zytor.com>
2012-02-28 17:48:04 -05:00
Jonathan Nieder
8411371709 x86/PCI: use host bridge _CRS info on MSI MS-7253
In the spirit of commit 29cf7a30f8 ("x86/PCI: use host bridge _CRS
info on ASUS M2V-MX SE"), this DMI quirk turns on "pci_use_crs" by
default on a board that needs it.

This fixes boot failures and oopses introduced in 3e3da00c01
("x86/pci: AMD one chain system to use pci read out res").  The quirk
is quite targetted (to a specific board and BIOS version) for two
reasons:

 (1) to emphasize that this method of tackling the problem one quirk
     at a time is a little insane

 (2) to give BIOS vendors an opportunity to use simpler tables and
     allow us to return to generic behavior (whatever that happens to
     be) with a later BIOS update

In other words, I am not at all happy with having quirks like this.
But it is even worse for the kernel not to work out of the box on
these machines, so...

Reference: https://bugzilla.kernel.org/show_bug.cgi?id=42619
Reported-by: Svante Signell <svante.signell@telia.com>
Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
2012-02-28 11:09:09 -08:00
Matt Fleming
92f42c50f2 x86, efi: Fix endian issues and unaligned accesses
We may need to convert the endianness of the data we read from/write
to 'buf', so let's use {get,put}_unaligned_le32() to do that. Failure
to do so can result in accessing invalid memory, leading to a
segfault.  Stephen Rothwell noticed this bug while cross-building an
x86_64 allmodconfig kernel on PowerPC.

We need to read from and write to 'buf' a byte at a time otherwise
it's possible we'll perform an unaligned access, which can lead to bus
errors when cross-building an x86 kernel on risc architectures.

Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Nick Bowler <nbowler@elliptictech.com>
Tested-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Link: http://lkml.kernel.org/r/1330436245-24875-6-git-send-email-matt@console-pimps.org
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2012-02-28 10:23:02 -08:00
Matt Fleming
d40f833630 x86, boot: Restrict CFLAGS for hostprogs
Currently tools/build has access to all the kernel headers in
$(srctree). This is unnecessary and could potentially allow
tools/build to erroneously include kernel headers when it should only
be including userspace-exported headers.

Unfortunately, mkcpustr still needs access to some of the asm kernel
headers, so explicitly special case that hostprog.

Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Link: http://lkml.kernel.org/r/1330436245-24875-5-git-send-email-matt@console-pimps.org
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2012-02-28 10:22:59 -08:00
Matt Fleming
12871c5683 x86, mkpiggy: Don't open code put_unaligned_le32()
Use the new headers in tools/include instead of rolling our own
put_unaligned_le32() implementation.

Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Link: http://lkml.kernel.org/r/1330436245-24875-4-git-send-email-matt@console-pimps.org
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2012-02-28 10:22:57 -08:00
Matt Fleming
55f9709cd0 x86, relocs: Don't open code put_unaligned_le32()
Use the new headers in tools/include instead of rolling our own
put_unaligned_le32() implementation.

Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Link: http://lkml.kernel.org/r/1330436245-24875-3-git-send-email-matt@console-pimps.org
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2012-02-28 10:22:55 -08:00
Ingo Molnar
e24b90b282 Merge branch 'tip/x86/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into x86/asm 2012-02-28 10:28:24 +01:00
Ingo Molnar
458ce2910a Merge branch 'linus' into x86/asm
Sync up the latest NMI fixes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-28 10:27:36 +01:00
Linus Torvalds
e25bda5642 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mce/AMD: Fix UP build error
  x86: Specify a size for the cmp in the NMI handler
  x86/nmi: Test saved %cs in NMI to determine nested NMI case
  x86/amd: Fix L1i and L2 cache sharing information for AMD family 15h processors
  x86/microcode: Remove noisy AMD microcode warning
2012-02-27 07:55:51 -08:00
Mark Wielaard
928282e432 x86-64: Fix CFI data for common_interrupt()
Commit eab9e6137f ("x86-64: Fix CFI data for interrupt frames")
introduced a DW_CFA_def_cfa_expression in the SAVE_ARGS_IRQ
macro. To later define the CFA using a simple register+offset
rule both register and offset need to be supplied. Just using
CFI_DEF_CFA_REGISTER leaves the offset undefined. So use
CFI_DEF_CFA with reg+off explicitly at the end of
common_interrupt.

Signed-off-by: Mark Wielaard <mjw@redhat.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Link: http://lkml.kernel.org/r/1330079527-30711-1-git-send-email-mjw@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-27 10:46:14 +01:00
Jan Beulich
d93c4071b7 x86/time: Eliminate unused irq0_irqs counter
As of v2.6.38 this counter is being maintained without ever being
read.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Link: http://lkml.kernel.org/r/4F4787930200007800074A10@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-27 08:46:25 +01:00
Jan Beulich
f0ba662a6e x86: Properly _init-annotate NMI selftest code
After all, this code is being run once at boot only (if
configured in at all).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Link: http://lkml.kernel.org/r/4F478C010200007800074A3D@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-27 08:43:37 +01:00
Linus Torvalds
500dd2370e Two fixes to fix a memory corruption bug when WC pages never get
converted back to WB but end up being recycled in the general memory
 pool as WC.
 
 Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQEcBAABAgAGBQJPStrRAAoJEFjIrFwIi8fJovAH/RBUJdeDw8x5ki2yDhAz/80S
 +yZKiGaaUYYCB0Fo/BIwVhBQeDabGz8rJCdOv40tRpRCiRD7JIfMo5tCS6QIFF7P
 UvhVuJcqltxIoRjz7nGX8iSUl48JKy9vqmqWXIucG3rYQ7YOkadwVTbhsg4a9U6P
 fcqexzUuXb4fr6CNBBpL3LqHfDaKNovgESHlAmzrcaRGbOADp9LVlWkR6kwiTnIA
 e5yU/DEW9Ej6wJM90Mx9Rg3y22hBZEL1p5NJjaiMrOY2LzX7bE4+mTgtk+a4FNGD
 8WJZm/WWhdsWrKlj8vCKOuJkIgQYJURVMySEGdzM91P1FpJ3edJxIM3qlA958vc=
 =jggO
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-fixes-3.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen

Two fixes to fix a memory corruption bug when WC pages never get
converted back to WB but end up being recycled in the general memory
pool as WC.

There is a better way of fixing this, but there is not enough time to do
the full benchmarking to pick one of the right options - so picking the
one that favors stability for right now.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

* tag 'stable/for-linus-fixes-3.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
  xen/pat: Disable PAT support for now.
  xen/setup: Remove redundant filtering of PTE masks.
2012-02-26 21:03:16 -08:00
Siddhesh Poyarekar
42dfc43ee5 x86_64: Record stack pointer before task execution begins
task->thread.usersp is unusable immediately after a binary is exec()'d
until it undergoes a context switch cycle. The start_thread() function
called during execve() saves the stack pointer into pt_regs and into
old_rsp, but fails to record it into task->thread.usersp.

Because of this, KSTK_ESP(task) returns an incorrect value for a
64-bit program until the task is switched out and back in since
switch_to swaps %rsp values in and out into task->thread.usersp.

Signed-off-by: Siddhesh Poyarekar <siddhesh.poyarekar@gmail.com>
Link: http://lkml.kernel.org/r/1330273075-2949-1-git-send-email-siddhesh.poyarekar@gmail.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2012-02-26 12:59:04 -08:00
Jussi Kivilinna
8940426489 crypto: twofish-x86_64/i586 - set alignmask to zero
x86 has fast unaligned accesses, so twofish-x86_64/i586 does not need to enforce
alignment.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-02-25 17:20:24 +08:00
Jussi Kivilinna
919e2c3249 crypto: blowfish-x86_64 - set alignmask to zero
x86 has fast unaligned accesses, so blowfish-x86_64 does not need to enforce
alignment.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-02-25 17:20:24 +08:00
Jussi Kivilinna
435d3e51af crypto: serpent-sse2 - combine ablk_*_init functions
Driver name in ablk_*_init functions can be constructed runtime. Therefore
use single function ablk_init to reduce object size.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-02-25 17:20:23 +08:00
Jussi Kivilinna
d433208cfc crypto: blowfish-x86_64 - use crypto_[un]register_algs
Combine all crypto_alg to be registered and use new crypto_[un]register_algs
functions. Simplifies init/exit code and reduce object size.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-02-25 17:20:23 +08:00
Jussi Kivilinna
53709ddee3 crypto: twofish-x86_64-3way - use crypto_[un]register_algs
Combine all crypto_alg to be registered and use new crypto_[un]register_algs
functions. Simplifies init/exit code and reduce object size.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-02-25 17:20:22 +08:00
Jussi Kivilinna
35474c3bb7 crypto: serpent-sse2 - use crypto_[un]register_algs
Combine all crypto_alg to be registered and use new crypto_[un]register_algs
functions. Simplifies init/exit code and reduce object size.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-02-25 17:20:22 +08:00
Yinghai Lu
73e3b590f3 PCI: Use class for quirk for pci_fixup_video
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
2012-02-24 14:34:46 -08:00
Yinghai Lu
4082cf2d7b PCI: Use class quirk for intel fix_transparent_bridge
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
2012-02-24 14:34:45 -08:00
Yinghai Lu
c484b2418b PCI: Use class for quirk for via_no_dac
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
2012-02-24 14:34:43 -08:00
Steven Rostedt
79fb4ad63e x86: Fix the NMI nesting comments
Some of the comments for the nesting NMI algorithm were stale and
had some references to some prototypes that were first tried.

I also updated the comments to be a little easier to understand
the flow of the code. It definitely needs the documentation.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-02-24 15:55:13 -05:00
Jan Beulich
69466466ce x86-64: Improve insn scheduling in SAVE_ARGS_IRQ
In one case, use an address register that was computed earlier (and
with a simpler instruction), thus reducing the risk of a stall.

In the second case, eliminate a branch by using a conditional move (as
is already done in call_softirq and xen_do_hypervisor_callback).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Link: http://lkml.kernel.org/r/4F4788A50200007800074A26@nat28.tlf.novell.com
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-02-24 11:46:28 -08:00
Jan Beulich
6261091302 x86-64: Fix CFI annotations for NMI nesting code
The saving and restoring of %rdx wasn't annotated at all, and the
jumping over sections where state gets partly restored wasn't handled
either.

Further, by folding the pushing of the previous frame in repeat_nmi
into that which so far was immediately preceding restart_nmi (after
moving the restore of %rdx ahead of that, since it doesn't get used
anymore when pushing prior frames), annotations of the replicated
frame creations can be made consistent too.

v2: Fully fold repeat_nmi into the normal code flow (adding a single
    redundant instruction to the "normal" code path), thus retaining
    the special protection of all instructions between repeat_nmi and
    end_repeat_nmi.

Link: http://lkml.kernel.org/r/4F478B630200007800074A31@nat28.tlf.novell.com

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-02-24 14:05:14 -05:00
Ingo Molnar
11b91d6fe7 Symbolic defines for architectural MCACOD constants
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.10 (GNU/Linux)
 
 iQIcBAABAgAGBQJPRpJQAAoJEKurIx+X31iBpdUQAJlKRxgGXcaJOykfzIys4kaF
 PtyqUaV4nmlDCkwoP0MBn3192IAxPRX+PN7u6VCmniiB7H8nGtXHiAK7yhJ1UhDB
 zDCnNmJ5hliICZ5nORolr9zpWgoZ/RkIest//OZUdO7hjnpEqZ2MP7wCHkpA6/Xu
 M6OeU/R671pTea9JluS+8AiVl+szxyVoFzbDJ4xXl5Dr2xYCP7tZfNkD1odf0LNk
 cun5jUov6+UmPTJGm/ve61/eJeOjvrEYDBrWpYtCh7cZkSbLC+Qg9XrHNNQxBEIx
 7wFvdhEXX5SiKV2p5DkMzDf6yDlLnf/qerbY3UhUqybbGVXrTmuUI1NAJY5nteIJ
 InXRJWeEuzpqSqbozE7V9kovmxMO8O6LPLhe9R/cAJAkt5qxL1sQc9FgWZ4c7VCb
 tFmyVzRmjijdw9EE13U+5GTBvHr28WaOKaveI5uMZakfE93Pt1rHbpxWHSyq05gM
 3nr+Vm4VjVeQRww+lsRJ3TZ/F/ThwMRmdwS0QXjuncUkVcU5iOioNsWFeupjPsbg
 DjjaVLQd4SYXBzvOcsEEAOGtmY1jiLYJnxGTAkM/gnN9uNyLSWCA5YnnrYx74yds
 HXpm155BiCtir1A5do+QbcrMaTig/zt12tpnHnoYiifKNYhO5AqLbsKy5ZtDWGXz
 7yYTV5WNwUzQfy7S2f8B
 =Gu3A
 -----END PGP SIGNATURE-----

Merge tag 'mce-recovery-for-tip' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras into x86/mce

Add symbolic defines for architectural MCACOD constants
2012-02-24 16:26:39 +01:00
Ingo Molnar
c5905afb0e static keys: Introduce 'struct static_key', static_key_true()/false() and static_key_slow_[inc|dec]()
So here's a boot tested patch on top of Jason's series that does
all the cleanups I talked about and turns jump labels into a
more intuitive to use facility. It should also address the
various misconceptions and confusions that surround jump labels.

Typical usage scenarios:

        #include <linux/static_key.h>

        struct static_key key = STATIC_KEY_INIT_TRUE;

        if (static_key_false(&key))
                do unlikely code
        else
                do likely code

Or:

        if (static_key_true(&key))
                do likely code
        else
                do unlikely code

The static key is modified via:

        static_key_slow_inc(&key);
        ...
        static_key_slow_dec(&key);

The 'slow' prefix makes it abundantly clear that this is an
expensive operation.

I've updated all in-kernel code to use this everywhere. Note
that I (intentionally) have not pushed through the rename
blindly through to the lowest levels: the actual jump-label
patching arch facility should be named like that, so we want to
decouple jump labels from the static-key facility a bit.

On non-jump-label enabled architectures static keys default to
likely()/unlikely() branches.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Jason Baron <jbaron@redhat.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: a.p.zijlstra@chello.nl
Cc: mathieu.desnoyers@efficios.com
Cc: davem@davemloft.net
Cc: ddaney.cavm@gmail.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20120222085809.GA26397@elte.hu
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-24 10:05:59 +01:00
Olof Johansson
1adbfa3511 x86, efi: Allow basic init with mixed 32/64-bit efi/kernel
Traditionally the kernel has refused to setup EFI at all if there's been
a mismatch in 32/64-bit mode between EFI and the kernel.

On some platforms that boot natively through EFI (Chrome OS being one),
we still need to get at least some of the static data such as memory
configuration out of EFI. Runtime services aren't as critical, and
it's a significant amount of work to implement switching between the
operating modes to call between kernel and firmware for thise cases. So
I'm ignoring it for now.

v5:
* Fixed some printk strings based on feedback
* Renamed 32/64-bit specific types to not have _ prefix
* Fixed bug in printout of efi runtime disablement

v4:
* Some of the earlier cleanup was accidentally reverted by this patch, fixed.
* Reworded some messages to not have to line wrap printk strings

v3:
* Reorganized to a series of patches to make it easier to review, and
  do some of the cleanups I had left out before.

v2:
* Added graceful error handling for 32-bit kernel that gets passed
  EFI data above 4GB.
* Removed some warnings that were missed in first version.

Signed-off-by: Olof Johansson <olof@lixom.net>
Link: http://lkml.kernel.org/r/1329081869-20779-6-git-send-email-olof@lixom.net
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2012-02-23 18:54:51 -08:00
Olof Johansson
140bf275d3 x86, efi: Add basic error handling
It's not perfect, but way better than before. Mark efi_enabled as false in
case of error and at least stop dereferencing pointers that are known to
be invalid.

The only significant missing piece is the lack of undoing the
memblock_reserve of the memory that efi marks as in use. On the other
hand, it's not a large amount of memory, and leaving it unavailable for
system use should be the safer choice anyway.

Signed-off-by: Olof Johansson <olof@lixom.net>
Link: http://lkml.kernel.org/r/1329081869-20779-5-git-send-email-olof@lixom.net
Acked-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2012-02-23 18:54:39 -08:00
Olof Johansson
a6a46f415d x86, efi: Cleanup config table walking
Trivial cleanup, move guid and table pointers to local copies to
make the code cleaner.

Signed-off-by: Olof Johansson <olof@lixom.net>
Link: http://lkml.kernel.org/r/1329081869-20779-4-git-send-email-olof@lixom.net
Acked-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2012-02-23 18:54:33 -08:00
Olof Johansson
e3cb3f5a35 x86, efi: Convert printk to pr_*()
Alright, I guess I'll go through and convert them, even though
there's no net gain to speak of.

v4:
* Switched to pr_fmt and removed some redundant use of "EFI" in
  messages.

Signed-off-by: Olof Johansson <olof@lixom.net>
Link: http://lkml.kernel.org/r/1329081869-20779-3-git-send-email-olof@lixom.net
Cc: Joe Perches <joe@perches.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2012-02-23 18:54:22 -08:00
Olof Johansson
83e7ee6657 x86, efi: Refactor efi_init() a bit
Break out some of the init steps into helper functions.

Only change to execution flow is the removal of the warning when the
kernel memdesc structure differ in size from what firmware specifies
since it's a bogus warning (it's a valid difference per spec).

v4:
* Removed memdesc warning as per above

Signed-off-by: Olof Johansson <olof@lixom.net>
Link: http://lkml.kernel.org/r/1329081869-20779-2-git-send-email-olof@lixom.net
Acked-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2012-02-23 18:53:56 -08:00
Grant Likely
b4e518547d irq_domain/x86: Convert x86 (embedded) to use common irq_domain
This patch removes the x86-specific definition of irq_domain and replaces
it with the common implementation.

Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Rob Herring <rob.herring@calxeda.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
2012-02-23 14:37:47 -07:00
Alan Cox
823806ff6b x86/mrst/pci: avoid SoC fixups on non-SoC platforms
The PCI fixups get executed based upon whether they are linked in. We need
to avoid executing them if we boot a dual SoC/PC type kernel on a PC class
system.

Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
2012-02-23 12:33:27 -08:00
Jacob Pan
8ed3087280 x86/mrst/pci: v4l/atomisp: treat atomisp as real pci device
ATOMISP on Medfield is a real PCI device which should be handled differently
than the fake PCI devices on south complex. PCI type 1 access is used for
accessing config space this also has other impact such as PM D3 delay. There
shouldn't be any need for reading base address from IUNIT via msg bus.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
2012-02-23 12:32:05 -08:00
Jacob Pan
990a30c50c x86/mrst/pci: assign d3_delay to 0 for Langwell devices
Langwell devices are not true pci devices, they are not subject to the 10 ms
d3 to d0 delay required by pci spec. This patch assigns d3_delay to 0 for all
langwell pci devices.

We can also power off devices that are not really used by the OS

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
2012-02-23 12:31:36 -08:00
Yinghai Lu
1cc1c96c16 PCI: fix memleak when ACPI _CRS is not used.
warning:
unreferenced object 0xffff8801f6914200 (size 512):
  comm "swapper/0", pid 1, jiffies 4294893643 (age 2664.644s)
  hex dump (first 32 bytes):
    00 00 c0 fe 00 00 00 00 ff ff ff ff 00 00 00 00  ................
    60 58 2f f6 03 88 ff ff 00 02 00 00 00 00 00 00  `X/.............
  backtrace:
    [<ffffffff81c2408c>] kmemleak_alloc+0x26/0x43
    [<ffffffff8113764f>] __kmalloc+0x121/0x183
    [<ffffffff81ca8d93>] get_current_resources+0x5a/0xc6
    [<ffffffff81c5bedd>] pci_acpi_scan_root+0x13c/0x21c
    [<ffffffff81c2a745>] acpi_pci_root_add+0x1e1/0x421
    [<ffffffff81408f50>] acpi_device_probe+0x50/0x190
    [<ffffffff8149edc7>] really_probe+0x99/0x126
    [<ffffffff8149ef83>] driver_probe_device+0x3b/0x56
    [<ffffffff8149effd>] __driver_attach+0x5f/0x82
    [<ffffffff8149d860>] bus_for_each_dev+0x5c/0x88
    [<ffffffff8149eb87>] driver_attach+0x1e/0x20
    [<ffffffff8149e7cc>] bus_add_driver+0xca/0x21d
    [<ffffffff8149f47b>] driver_register+0x91/0xfe
    [<ffffffff81409d09>] acpi_bus_register_driver+0x43/0x45
    [<ffffffff8278bdc9>] acpi_pci_root_init+0x20/0x28
    [<ffffffff810001e7>] do_one_initcall+0x57/0x134

The system has _CRS for root buses, but they are not used because the machine
date is before the cutoff date for _CRS usage.

Try to free those unused resource arrays and names.

Reviewed-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
2012-02-23 12:01:33 -08:00
Naoya Horiguchi
fadd85f16a x86/mce: Fix return value of mce_chrdev_read() when erst is disabled
Current kernel MCE code reads ERST at the first reading of /dev/mcelog
(maybe in starting mcelogd,) even if the system does not support ERST,
which results in a fake "no such device" message (as described in [1].)
This problem is not critical, but can confuse system admins.
This patch fixes it by filtering the return value from lower (ACPI) layer.

 [1] http://thread.gmane.org/gmane.linux.kernel/1060250

Reported by: Jon Masters <jonathan@jonmasters.org>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Huang Ying <ying.huang@intel.com>
Link: https://lkml.org/lkml/2012/1/23/299
Signed-off-by: Tony Luck <tony.luck@intel.com>
2012-02-22 13:14:16 -08:00
Greg Kroah-Hartman
d6126ef5f3 x86/mce: Convert static array of pointers to per-cpu variables
When I previously fixed up the mce_device code, I used a static array of
the pointers.  It was (rightfully) pointed out to me that I should be
using the per_cpu code instead.

This patch converts the code over to that structure, moving the variable
back into the per_cpu area, like it used to be for 3.2 and earlier.

Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Link: https://lkml.org/lkml/2012/1/27/165
Signed-off-by: Tony Luck <tony.luck@intel.com>
2012-02-22 12:58:06 -08:00
Luck, Tony
140f190bc3 x86: Remove some noise from boot log when starting cpus
Printing the "start_ip" for every secondary cpu is very noisy on a large
system - and doesn't add any value. Drop this message.

Console log before:
Booting Node   0, Processors  #1
smpboot cpu 1: start_ip = 96000
 #2
smpboot cpu 2: start_ip = 96000
 #3
smpboot cpu 3: start_ip = 96000
 #4
smpboot cpu 4: start_ip = 96000
       ...
 #31
smpboot cpu 31: start_ip = 96000
Brought up 32 CPUs

Console log after:
Booting Node   0, Processors  #1 #2 #3 #4 #5 #6 #7 Ok.
Booting Node   1, Processors  #8 #9 #10 #11 #12 #13 #14 #15 Ok.
Booting Node   0, Processors  #16 #17 #18 #19 #20 #21 #22 #23 Ok.
Booting Node   1, Processors  #24 #25 #26 #27 #28 #29 #30 #31
Brought up 32 CPUs

Acked-by: Borislav Petkov <bp@amd64.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/4f452eb42507460426@agluck-desktop.sc.intel.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2012-02-22 10:11:05 -08:00
Borislav Petkov
3f806e5098 x86/mce/AMD: Fix UP build error
141168c36c ("x86: Simplify code by removing a !SMP #ifdefs
from 'struct cpuinfo_x86'") removed a bunch of CONFIG_SMP ifdefs
around code touching struct cpuinfo_x86 members but also caused
the following build error with Randy's randconfigs:

mce_amd.c:(.cpuinit.text+0x4723): undefined reference to `cpu_llc_shared_map'

Restore the #ifdef in threshold_create_bank() which creates
symlinks on the non-BSP CPUs.

There's a better patch series being worked on by Kevin Winchester
which will solve this in a cleaner fashion, but that series is
too ambitious for v3.3 merging - so we first queue up this trivial
fix and then do the rest for v3.4.

Signed-off-by: Borislav Petkov <bp@alien8.de>
Acked-by: Kevin Winchester <kjwinchester@gmail.com>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: Nick Bowler <nbowler@elliptictech.com>
Link: http://lkml.kernel.org/r/20120203191801.GA2846@x1.osrc.amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-22 13:36:30 +01:00
Suresh Siddha
b0e5c77903 x86/tsc: Reduce the TSC sync check time for core-siblings
For each logical CPU that is coming online, we spend 20msec for
checking the TSC synchronization. And as this is done
sequentially for each logical CPU boot, this time gets added up
depending on the number of logical CPU's supported by the
platform.

Minimize this by using the socket topology information.

If the target CPU coming online doesn't have any of its
core-siblings online, a timeout of 20msec will be used for the
TSC-warp measurement loop. Otherwise a smaller timeout of 2msec
will be used, as we have some information about this socket
already (and this information grows as we have more and more
logical-siblings in that socket).

Ideally we should be able to skip the TSC sync check on the
other core-siblings, if the first logical CPU in a socket passed
the sync test. But as the TSC is per-logical CPU and can
potentially be modified wrongly by the bios before the OS boot,
TSC sync test for smaller duration should be able to catch such
errors. Also this will catch the condition where all the cores
in the socket doesn't get reset at the same time.

For example, with this modification, time spent in TSC sync
checks on a 4 socket 10-core with HT system gets reduced from
1580msec to 212msec.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jack Steiner <steiner@sgi.com>
Cc: venki@google.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1328581940.29790.20.camel@sbsiddha-desk.sc.intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-22 11:49:40 +01:00
H. Peter Anvin
513c4ec6e4 x86, cpufeature: Add CPU features from Intel document 319433-012A
Add CPU features from the Intel Archicture Instruction Set Extensions
Programming Reference version 012A (Feb 2012), document number 319433-012A.

Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-02-21 17:25:50 -08:00
Linus Torvalds
1361b83a13 i387: Split up <asm/i387.h> into exported and internal interfaces
While various modules include <asm/i387.h> to get access to things we
actually *intend* for them to use, most of that header file was really
pretty low-level internal stuff that we really don't want to expose to
others.

So split the header file into two: the small exported interfaces remain
in <asm/i387.h>, while the internal definitions that are only used by
core architecture code are now in <asm/fpu-internal.h>.

The guiding principle for this was to expose functions that we export to
modules, and leave them in <asm/i387.h>, while stuff that is used by
task switching or was marked GPL-only is in <asm/fpu-internal.h>.

The fpu-internal.h file could be further split up too, especially since
arch/x86/kvm/ uses some of the remaining stuff for its module.  But that
kvm usage should probably be abstracted out a bit, and at least now the
internal FPU accessor functions are much more contained.  Even if it
isn't perhaps as contained as it _could_ be.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1202211340330.5354@i5.linux-foundation.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-02-21 14:12:54 -08:00
Linus Torvalds
8546c00892 i387: Uninline the generic FP helpers that we expose to kernel modules
Instead of exporting the very low-level internals of the FPU state
save/restore code (ie things like 'fpu_owner_task'), we should export
the higher-level interfaces.

Inlining these things is pointless anyway: sure, sometimes the end
result is small, but while 'stts()' can result in just three x86
instructions, those are not cheap instructions (writing %cr0 is a
serializing instruction and a very slow one at that).

So the overhead of a function call is not noticeable, and we really
don't want random modules mucking about with our internal state save
logic anyway.

So this unexports 'fpu_owner_task', and instead uninlines and exports
the actual functions that modules can use: fpu_kernel_begin/end() and
unlazy_fpu().

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1202211339590.5354@i5.linux-foundation.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-02-21 14:12:46 -08:00
Linus Torvalds
27e74da980 i387: export 'fpu_owner_task' per-cpu variable
(And define it properly for x86-32, which had its 'current_task'
declaration in separate from x86-64)

Bitten by my dislike for modules on the machines I use, and the fact
that apparently nobody else actually wanted to test the patches I sent
out.

Snif. Nobody else cares.

Anyway, we probably should uninline the 'kernel_fpu_begin()' function
that is what modules actually use and that references this, but this is
the minimal fix for now.

Reported-by: Josh Boyer <jwboyer@gmail.com>
Reported-and-tested-by: Jongman Heo <jongman.heo@samsung.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-20 19:34:10 -08:00
Steven Rostedt
a38449ef59 x86: Specify a size for the cmp in the NMI handler
Linus noticed that the cmp used to check if the code segment is
__KERNEL_CS or not did not specify a size. Perhaps it does not matter
as H. Peter Anvin noted that user space can not set the bottom two
bits of the %cs register. But it's best not to let the assembly choose
and change things between different versions of gas, but instead just
pick the size.

Four bytes are used to compare the saved code segment against
__KERNEL_CS. Perhaps this might mess up Xen, but we can fix that when
the time comes.

Also I noticed that there was another non-specified cmp that checks
the special stack variable if it is 1 or 0. This too probably doesn't
matter what cmp is used, but this patch uses cmpl just to make it non
ambiguous.

Link: http://lkml.kernel.org/r/CA+55aFxfAn9MWRgS3O5k2tqN5ys1XrhSFVO5_9ZAoZKDVgNfGA@mail.gmail.com

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-02-20 19:45:26 -05:00
Linus Torvalds
7e16838d94 i387: support lazy restore of FPU state
This makes us recognize when we try to restore FPU state that matches
what we already have in the FPU on this CPU, and avoids the restore
entirely if so.

To do this, we add two new data fields:

 - a percpu 'fpu_owner_task' variable that gets written any time we
   update the "has_fpu" field, and thus acts as a kind of back-pointer
   to the task that owns the CPU.  The exception is when we save the FPU
   state as part of a context switch - if the save can keep the FPU
   state around, we leave the 'fpu_owner_task' variable pointing at the
   task whose FP state still remains on the CPU.

 - a per-thread 'last_cpu' field, that indicates which CPU that thread
   used its FPU on last.  We update this on every context switch
   (writing an invalid CPU number if the last context switch didn't
   leave the FPU in a lazily usable state), so we know that *that*
   thread has done nothing else with the FPU since.

These two fields together can be used when next switching back to the
task to see if the CPU still matches: if 'fpu_owner_task' matches the
task we are switching to, we know that no other task (or kernel FPU
usage) touched the FPU on this CPU in the meantime, and if the current
CPU number matches the 'last_cpu' field, we know that this thread did no
other FP work on any other CPU, so the FPU state on the CPU must match
what was saved on last context switch.

In that case, we can avoid the 'f[x]rstor' entirely, and just clear the
CR0.TS bit.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-20 10:58:54 -08:00
Linus Torvalds
80ab6f1e8c i387: use 'restore_fpu_checking()' directly in task switching code
This inlines what is usually just a couple of instructions, but more
importantly it also fixes the theoretical error case (can that FPU
restore really ever fail? Maybe we should remove the checking).

We can't start sending signals from within the scheduler, we're much too
deep in the kernel and are holding the runqueue lock etc.  So don't
bother even trying.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-20 10:58:28 -08:00
Linus Torvalds
cea20ca3f3 i387: fix up some fpu_counter confusion
This makes sure we clear the FPU usage counter for newly created tasks,
just so that we start off in a known state (for example, don't try to
preload the FPU state on the first task switch etc).

It also fixes a thinko in when we increment the fpu_counter at task
switch time, introduced by commit 34ddc81a23 ("i387: re-introduce FPU
state preloading at context switch time").  We should increment the
*new* task fpu_counter, not the old task, and only if we decide to use
that state (whether lazily or preloaded).

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-20 10:24:09 -08:00
Konrad Rzeszutek Wilk
8eaffa67b4 xen/pat: Disable PAT support for now.
[Pls also look at https://lkml.org/lkml/2012/2/10/228]

Using of PAT to change pages from WB to WC works quite nicely.
Changing it back to WB - not so much. The crux of the matter is
that the code that does this (__page_change_att_set_clr) has only
limited information so when it tries to the change it gets
the "raw" unfiltered information instead of the properly filtered one -
and the "raw" one tell it that PSE bit is on (while infact it
is not).  As a result when the PTE is set to be WB from WC, we get
tons of:

:WARNING: at arch/x86/xen/mmu.c:475 xen_make_pte+0x67/0xa0()
:Hardware name: HP xw4400 Workstation
.. snip..
:Pid: 27, comm: kswapd0 Tainted: G        W    3.2.2-1.fc16.x86_64 #1
:Call Trace:
: [<ffffffff8106dd1f>] warn_slowpath_common+0x7f/0xc0
: [<ffffffff8106dd7a>] warn_slowpath_null+0x1a/0x20
: [<ffffffff81005a17>] xen_make_pte+0x67/0xa0
: [<ffffffff810051bd>] __raw_callee_save_xen_make_pte+0x11/0x1e
: [<ffffffff81040e15>] ? __change_page_attr_set_clr+0x9d5/0xc00
: [<ffffffff8114c2e8>] ? __purge_vmap_area_lazy+0x158/0x1d0
: [<ffffffff8114cca5>] ? vm_unmap_aliases+0x175/0x190
: [<ffffffff81041168>] change_page_attr_set_clr+0x128/0x4c0
: [<ffffffff81041542>] set_pages_array_wb+0x42/0xa0
: [<ffffffff8100a9b2>] ? check_events+0x12/0x20
: [<ffffffffa0074d4c>] ttm_pages_put+0x1c/0x70 [ttm]
: [<ffffffffa0074e98>] ttm_page_pool_free+0xf8/0x180 [ttm]
: [<ffffffffa0074f78>] ttm_pool_mm_shrink+0x58/0x90 [ttm]
: [<ffffffff8112ba04>] shrink_slab+0x154/0x310
: [<ffffffff8112f17a>] balance_pgdat+0x4fa/0x6c0
: [<ffffffff8112f4b8>] kswapd+0x178/0x3d0
: [<ffffffff815df134>] ? __schedule+0x3d4/0x8c0
: [<ffffffff81090410>] ? remove_wait_queue+0x50/0x50
: [<ffffffff8112f340>] ? balance_pgdat+0x6c0/0x6c0
: [<ffffffff8108fb6c>] kthread+0x8c/0xa0

for every page. The proper fix for this is has been posted
and is https://lkml.org/lkml/2012/2/10/228
"x86/cpa: Use pte_attrs instead of pte_flags on CPA/set_p.._wb/wc operations."
along with a detailed description of the problem and solution.

But since that posting has gone nowhere I am proposing
this band-aid solution so that at least users don't get
the page corruption (the pages that are WC don't get changed to WB
and end up being recycled for filesystem or other things causing
mysterious crashes).

The negative impact of this patch is that users of WC flag
(which are InfiniBand, radeon, nouveau drivers) won't be able
to set that flag - so they are going to see performance degradation.
But stability is more important here.

Fixes RH BZ# 742032, 787403, and 745574
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-02-20 10:41:35 -05:00
Konrad Rzeszutek Wilk
416d721474 xen/setup: Remove redundant filtering of PTE masks.
commit 7347b4082e "xen: Allow
unprivileged Xen domains to create iomap pages" added a redundant
line in the early bootup code to filter out the PTE. That
filtering is already done a bit earlier so this extra processing
is not required.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-02-20 10:40:54 -05:00
Linus Torvalds
986cb48c5a x86-32/irq: Don't switch to irq stack for a user-mode irq
If the irq happens in user mode, our kernel stack is empty
(apart from the pt_regs themselves, of course), so there's no
need or advantage to switch.

And it really doesn't save any stack space, quite the reverse:
it means that a nested interrupt cannot switch irq stacks. So
instead of saving kernel stack space, it actually causes the
potential for *more* stack usage.

Also simplify the preemption count copy when we do switch
stacks: just copy the whole preemption count, rather than just
the softirq parts of it.  There is no advantage to the partial
copy: it is more effort to get a less correct result.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1202191139260.10000@i5.linux-foundation.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-20 09:30:18 +01:00
Steven Rostedt
45d5a1683c x86/nmi: Test saved %cs in NMI to determine nested NMI case
Currently, the NMI handler tests if it is nested by checking the
special variable saved on the stack (set during NMI handling)
and whether the saved stack is the NMI stack as well (to prevent
the race when the variable is set to zero).

But userspace may set their %rsp to any value as long as they do
not derefence it, and it may make it point to the NMI stack,
which will prevent NMIs from triggering while the userspace app
is running. (I tested this, and it is indeed the case)

Add another check to determine nested NMIs by looking at the
saved %cs (code segment register) and making sure that it is the
kernel code segment.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: <stable@kernel.org>
Link: http://lkml.kernel.org/r/1329687817.1561.27.camel@acer.local.home
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-20 09:09:57 +01:00
Dimitri Sivanich
b0deca2e02 x86/UV: Lower UV rtc clocksource rating
Lower the rating of the UV rtc clocksource to just below that of
the tsc, to improve performance.

Reading the tsc clocksource has lower latency than reading the
rtc, so favor it in situations where it is synchronized and
stable.  When the tsc is unsynchronized, the rtc needs to be the
chosen clocksource.

Signed-off-by: Dimitri Sivanich <sivanich@sgi.com>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Jack Steiner <steiner@sgi.com>
Link: http://lkml.kernel.org/r/20120217141641.GA28063@sgi.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-20 09:07:56 +01:00
Linus Torvalds
34ddc81a23 i387: re-introduce FPU state preloading at context switch time
After all the FPU state cleanups and finally finding the problem that
caused all our FPU save/restore problems, this re-introduces the
preloading of FPU state that was removed in commit b3b0870ef3 ("i387:
do not preload FPU state at task switch time").

However, instead of simply reverting the removal, this reimplements
preloading with several fixes, most notably

 - properly abstracted as a true FPU state switch, rather than as
   open-coded save and restore with various hacks.

   In particular, implementing it as a proper FPU state switch allows us
   to optimize the CR0.TS flag accesses: there is no reason to set the
   TS bit only to then almost immediately clear it again.  CR0 accesses
   are quite slow and expensive, don't flip the bit back and forth for
   no good reason.

 - Make sure that the same model works for both x86-32 and x86-64, so
   that there are no gratuitous differences between the two due to the
   way they save and restore segment state differently due to
   architectural differences that really don't matter to the FPU state.

 - Avoid exposing the "preload" state to the context switch routines,
   and in particular allow the concept of lazy state restore: if nothing
   else has used the FPU in the meantime, and the process is still on
   the same CPU, we can avoid restoring state from memory entirely, just
   re-expose the state that is still in the FPU unit.

   That optimized lazy restore isn't actually implemented here, but the
   infrastructure is set up for it.  Of course, older CPU's that use
   'fnsave' to save the state cannot take advantage of this, since the
   state saving also trashes the state.

In other words, there is now an actual _design_ to the FPU state saving,
rather than just random historical baggage.  Hopefully it's easier to
follow as a result.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-18 14:03:48 -08:00
Linus Torvalds
f94edacf99 i387: move TS_USEDFPU flag from thread_info to task_struct
This moves the bit that indicates whether a thread has ownership of the
FPU from the TS_USEDFPU bit in thread_info->status to a word of its own
(called 'has_fpu') in task_struct->thread.has_fpu.

This fixes two independent bugs at the same time:

 - changing 'thread_info->status' from the scheduler causes nasty
   problems for the other users of that variable, since it is defined to
   be thread-synchronous (that's what the "TS_" part of the naming was
   supposed to indicate).

   So perfectly valid code could (and did) do

	ti->status |= TS_RESTORE_SIGMASK;

   and the compiler was free to do that as separate load, or and store
   instructions.  Which can cause problems with preemption, since a task
   switch could happen in between, and change the TS_USEDFPU bit. The
   change to TS_USEDFPU would be overwritten by the final store.

   In practice, this seldom happened, though, because the 'status' field
   was seldom used more than once, so gcc would generally tend to
   generate code that used a read-modify-write instruction and thus
   happened to avoid this problem - RMW instructions are naturally low
   fat and preemption-safe.

 - On x86-32, the current_thread_info() pointer would, during interrupts
   and softirqs, point to a *copy* of the real thread_info, because
   x86-32 uses %esp to calculate the thread_info address, and thus the
   separate irq (and softirq) stacks would cause these kinds of odd
   thread_info copy aliases.

   This is normally not a problem, since interrupts aren't supposed to
   look at thread information anyway (what thread is running at
   interrupt time really isn't very well-defined), but it confused the
   heck out of irq_fpu_usable() and the code that tried to squirrel
   away the FPU state.

   (It also caused untold confusion for us poor kernel developers).

It also turns out that using 'task_struct' is actually much more natural
for most of the call sites that care about the FPU state, since they
tend to work with the task struct for other reasons anyway (ie
scheduling).  And the FPU data that we are going to save/restore is
found there too.

Thanks to Arjan Van De Ven <arjan@linux.intel.com> for pointing us to
the %esp issue.

Cc: Arjan van de Ven <arjan@linux.intel.com>
Reported-and-tested-by: Raphael Prevost <raphael@buro.asia>
Acked-and-tested-by: Suresh Siddha <suresh.b.siddha@intel.com>
Tested-by: Peter Anvin <hpa@zytor.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-18 10:19:41 -08:00
Ingo Molnar
09bda4432a Merge branch 'tip/perf/core' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into perf/core 2012-02-17 12:55:07 +01:00
Linus Torvalds
4903062b54 i387: move AMD K7/K8 fpu fxsave/fxrstor workaround from save to restore
The AMD K7/K8 CPUs don't save/restore FDP/FIP/FOP unless an exception is
pending.  In order to not leak FIP state from one process to another, we
need to do a floating point load after the fxsave of the old process,
and before the fxrstor of the new FPU state.  That resets the state to
the (uninteresting) kernel load, rather than some potentially sensitive
user information.

We used to do this directly after the FPU state save, but that is
actually very inconvenient, since it

 (a) corrupts what is potentially perfectly good FPU state that we might
     want to lazy avoid restoring later and

 (b) on x86-64 it resulted in a very annoying ordering constraint, where
     "__unlazy_fpu()" in the task switch needs to be delayed until after
     the DS segment has been reloaded just to get the new DS value.

Coupling it to the fxrstor instead of the fxsave automatically avoids
both of these issues, and also ensures that we only do it when actually
necessary (the FP state after a save may never actually get used).  It's
simply a much more natural place for the leaked state cleanup.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-16 19:11:15 -08:00
Linus Torvalds
b3b0870ef3 i387: do not preload FPU state at task switch time
Yes, taking the trap to re-load the FPU/MMX state is expensive, but so
is spending several days looking for a bug in the state save/restore
code.  And the preload code has some rather subtle interactions with
both paravirtualization support and segment state restore, so it's not
nearly as simple as it should be.

Also, now that we no longer necessarily depend on a single bit (ie
TS_USEDFPU) for keeping track of the state of the FPU, we migth be able
to do better.  If we are really switching between two processes that
keep touching the FP state, save/restore is inevitable, but in the case
of having one process that does most of the FPU usage, we may actually
be able to do much better than the preloading.

In particular, we may be able to keep track of which CPU the process ran
on last, and also per CPU keep track of which process' FP state that CPU
has.  For modern CPU's that don't destroy the FPU contents on save time,
that would allow us to do a lazy restore by just re-enabling the
existing FPU state - with no restore cost at all!

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-16 15:45:23 -08:00
Linus Torvalds
6d59d7a9f5 i387: don't ever touch TS_USEDFPU directly, use helper functions
This creates three helper functions that do the TS_USEDFPU accesses, and
makes everybody that used to do it by hand use those helpers instead.

In addition, there's a couple of helper functions for the "change both
CR0.TS and TS_USEDFPU at the same time" case, and the places that do
that together have been changed to use those.  That means that we have
fewer random places that open-code this situation.

The intent is partly to clarify the code without actually changing any
semantics yet (since we clearly still have some hard to reproduce bug in
this area), but also to make it much easier to use another approach
entirely to caching the CR0.TS bit for software accesses.

Right now we use a bit in the thread-info 'status' variable (this patch
does not change that), but we might want to make it a full field of its
own or even make it a per-cpu variable.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-16 13:33:12 -08:00
Linus Torvalds
b6c66418dc i387: move TS_USEDFPU clearing out of __save_init_fpu and into callers
Touching TS_USEDFPU without touching CR0.TS is confusing, so don't do
it.  By moving it into the callers, we always do the TS_USEDFPU next to
the CR0.TS accesses in the source code, and it's much easier to see how
the two go hand in hand.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-16 12:22:48 -08:00
Linus Torvalds
15d8791cae i387: fix x86-64 preemption-unsafe user stack save/restore
Commit 5b1cbac377 ("i387: make irq_fpu_usable() tests more robust")
added a sanity check to the #NM handler to verify that we never cause
the "Device Not Available" exception in kernel mode.

However, that check actually pinpointed a (fundamental) race where we do
cause that exception as part of the signal stack FPU state save/restore
code.

Because we use the floating point instructions themselves to save and
restore state directly from user mode, we cannot do that atomically with
testing the TS_USEDFPU bit: the user mode access itself may cause a page
fault, which causes a task switch, which saves and restores the FP/MMX
state from the kernel buffers.

This kind of "recursive" FP state save is fine per se, but it means that
when the signal stack save/restore gets restarted, it will now take the
'#NM' exception we originally tried to avoid.  With preemption this can
happen even without the page fault - but because of the user access, we
cannot just disable preemption around the save/restore instruction.

There are various ways to solve this, including using the
"enable/disable_page_fault()" helpers to not allow page faults at all
during the sequence, and fall back to copying things by hand without the
use of the native FP state save/restore instructions.

However, the simplest thing to do is to just allow the #NM from kernel
space, but fix the race in setting and clearing CR0.TS that this all
exposed: the TS bit changes and the TS_USEDFPU bit absolutely have to be
atomic wrt scheduling, so while the actual state save/restore can be
interrupted and restarted, the act of actually clearing/setting CR0.TS
and the TS_USEDFPU bit together must not.

Instead of just adding random "preempt_disable/enable()" calls to what
is already excessively ugly code, this introduces some helper functions
that mostly mirror the "kernel_fpu_begin/end()" functionality, just for
the user state instead.

Those helper functions should probably eventually replace the other
ad-hoc CR0.TS and TS_USEDFPU tests too, but I'll need to think about it
some more: the task switching functionality in particular needs to
expose the difference between the 'prev' and 'next' threads, while the
new helper functions intentionally were written to only work with
'current'.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-16 09:15:04 -08:00
Linus Torvalds
c38e234562 i387: fix sense of sanity check
The check for save_init_fpu() (introduced in commit 5b1cbac377: "i387:
make irq_fpu_usable() tests more robust") was the wrong way around, but
I hadn't noticed, because my "tests" were bogus: the FPU exceptions are
disabled by default, so even doing a divide by zero never actually
triggers this code at all unless you do extra work to enable them.

So if anybody did enable them, they'd get one spurious warning.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-15 08:05:18 -08:00
Linus Torvalds
694ce18ec3 Two fixes for VCPU offlining; One to fix the string format exposed
by the xen-pci[front|back] to conform to the one used in majority of
 PCI drivers; Two fixes to make the code more resilient to invalid
 configurations.
 
 Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQEcBAABAgAGBQJPOeReAAoJEFjIrFwIi8fJn9QIANP48kzrGg0uO4bjSf2h/z7G
 pp3ISdtVLk7pwMov2POBqskoXSq8E0yQAfNN8se183wqNXo3Dm4rU1DIG7HQFBk9
 sdcyfHI8x7pat9JClRhGxpQ23Ig9f1iWkShweCcZCO782vfxZyNd65i6t87X7uLq
 7SPtG1XH2RixTX7tHtKKBqdzZ0OMXOEkJ33dgCmyrn+wzohbKrFj5mg+NdOgmzEo
 VgsHPVtuq7orDROe+F9d91eAg0TILQ13th8xfWZ59lQATXu/zAlaueYt87tpy1pb
 oVQvumsn8Xev+7hct9My9Tw45D4m8YOSFLG2HcekkW2WtNmGhTTbIyMh9PsLugk=
 =NDYK
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-fixes-3.3-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen

Two fixes for VCPU offlining; One to fix the string format exposed
by the xen-pci[front|back] to conform to the one used in majority of
PCI drivers; Two fixes to make the code more resilient to invalid
configurations.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

* tag 'stable/for-linus-fixes-3.3-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
  xenbus_dev: add missing error check to watch handling
  xen/pci[front|back]: Use %d instead of %1x for displaying PCI devfn.
  xen pvhvm: do not remap pirqs onto evtchns if !xen_have_vector_callback
  xen/smp: Fix CPU online/offline bug triggering a BUG: scheduling while atomic.
  xen/bootup: During bootup suppress XENBUS: Unable to read cpu state
2012-02-14 15:20:11 -08:00
Bjorn Helgaas
316d86fe86 x86/PCI: don't fall back to defaults if _CRS has no apertures
Host bridges that lead to things like the Uncore need not have any
I/O port or MMIO apertures.  For example, in this case:

    ACPI: PCI Root Bridge [UNC1] (domain 0000 [bus ff])
    PCI: root bus ff: using default resources
    PCI host bridge to bus 0000:ff
    pci_bus 0000:ff: root bus resource [io  0x0000-0xffff]
    pci_bus 0000:ff: root bus resource [mem 0x00000000-0x3fffffffffff]

we should not pretend those default resources are available on bus ff.

CC: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
2012-02-14 08:44:49 -08:00
Myron Stowe
6535943fbf x86/PCI: Convert maintaining FW-assigned BIOS BAR values to use a list
This patch converts the underlying maintenance aspects of FW-assigned
BIOS BAR values from a statically allocated array within struct pci_dev
to a list of temporary, stand alone, entries.

Signed-off-by: Myron Stowe <myron.stowe@redhat.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
2012-02-14 08:44:46 -08:00
Myron Stowe
925845bd49 x86/PCI: Infrastructure to maintain a list of FW-assigned BIOS BAR values
Commit 58c84eda07 introduced functionality to try and reinstate the
original BIOS BAR addresses of a PCI device when normal resource
assignment attempts fail.  To keep track of the BIOS BAR addresses,
struct pci_dev was augmented with an array to hold the BAR addresses
of the PCI device: 'resource_size_t fw_addr[DEVICE_COUNT_RESOURCE]'.

The reinstatement of BAR addresses is an uncommon event leaving the
'fw_addr' array unused under normal circumstances.  This functionality
is also currently architecture specific with an implementation limited
to x86.  As the use of struct pci_dev is so prevalent, having the
'fw_addr' array residing within such seems somewhat wasteful.

This patch introduces a stand alone data structure and interfacing
routines for maintaining a list of FW-assigned BIOS BAR value entries.

Signed-off-by: Myron Stowe <myron.stowe@redhat.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
2012-02-14 08:44:46 -08:00
Jesper Juhl
6e77fe8c11 crypto: serpent-sse2 - remove dead code from serpent_sse2_glue.c::serpent_sse2_init()
We cannot reach the line after 'return err'. Remove it.

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-02-14 16:34:19 +08:00
Jesper Juhl
8d21190e22 crypto: twofish-x86 - Remove dead code from twofish_glue_3way.c::init()
We can never reach the line just after the 'return 0'
statement. Remove it.

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-02-14 16:34:18 +08:00
Ben Hutchings
5467bdda4a x86/cpu: Clean up modalias feature matching
We currently include commas on both sides of the feature ID in a
modalias, but this prevents the lowest numbered feature of a CPU from
being matched.  Since all feature IDs have the same length, we do not
need to worry about substring matches, so omit commas from the
modalias entirely.

Avoid generating multiple adjacent wildcards when there is no
feature ID to match.

Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Acked-by: Thomas Renninger <trenn@suse.de>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-02-13 15:24:26 -08:00
Ben Hutchings
70142a9dd1 x86/cpu: Fix overrun check in arch_print_cpu_modalias()
snprintf() does not return a negative value when truncating.

Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Acked-by: Thomas Renninger <trenn@suse.de>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-02-13 15:24:26 -08:00
Linus Torvalds
5b1cbac377 i387: make irq_fpu_usable() tests more robust
Some code - especially the crypto layer - wants to use the x86
FP/MMX/AVX register set in what may be interrupt (typically softirq)
context.

That *can* be ok, but the tests for when it was ok were somewhat
suspect.  We cannot touch the thread-specific status bits either, so
we'd better check that we're not going to try to save FP state or
anything like that.

Now, it may be that the TS bit is always cleared *before* we set the
USEDFPU bit (and only set when we had already cleared the USEDFP
before), so the TS bit test may actually have been sufficient, but it
certainly was not obviously so.

So this explicitly verifies that we will not touch the TS_USEDFPU bit,
and adds a few related sanity-checks.  Because it seems that somehow
AES-NI is corrupting user FP state.  The cause is not clear, and this
patch doesn't fix it, but while debugging it I really wanted the code to
be more obviously correct and robust.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-13 13:56:14 -08:00
Linus Torvalds
be98c2cdb1 i387: math_state_restore() isn't called from asm
It was marked asmlinkage for some really old and stale legacy reasons.
Fix that and the equally stale comment.

Noticed when debugging the irq_fpu_usable() bugs.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-13 13:47:25 -08:00
Steven Rostedt
484546509c x86/tracing: Denote the power and cpuidle tracepoints as _rcuidle()
The power and cpuidle tracepoints are called within a rcu_idle_exit()
section, and must be denoted with the _rcuidle() version of the tracepoint.

Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-02-13 09:14:43 -05:00
Yinghai Lu
21c3fcf3e3 x86/debug: Fix/improve the show_msr=<cpus> debug print out
Found out that show_msr=<cpus> is broken, when I asked a
user to use it to capture debug info about broken MTRR's
whose MTRR settings are probably different between CPUs.

Only the first CPUs MSRs are printed, but that is not
enough to track down the suspected bug.

For years we called print_cpu_msr from print_cpu_info(),
but this commit:

| commit 2eaad1fddd
| Author: Mike Travis <travis@sgi.com>
| Date:   Thu Dec 10 17:19:36 2009 -0800
|
|    x86: Limit the number of processor bootup messages

removed the print_cpu_info() call from all APs.

Put it back - it will only print MSRs when the user
specifically requests them via show_msr=<cpus>.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Mike Travis <travis@sgi.com>
Link: http://lkml.kernel.org/r/1329069237-11483-1-git-send-email-yinghai@kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-12 19:12:21 +01:00
Masami Hiramatsu
f8d98f1095 x86: Fix to decode grouped AVX with VEX pp bits
Fix to decode grouped AVX with VEX pp bits which should be
handled as same as last-prefixes. This fixes below warnings
in posttest with CONFIG_CRYPTO_SHA1_SSSE3=y.

 Warning: arch/x86/tools/test_get_len found difference at <sha1_transform_avx>:ffffffff810d5fc0
 Warning: ffffffff810d6069:	c5 f9 73 de 04       	vpsrldq $0x4,%xmm6,%xmm0
 Warning: objdump says 5 bytes, but insn_get_length() says 4
 ...

With this change, test_get_len can decode it correctly.

 $ arch/x86/tools/test_get_len -v -y
 ffffffff810d6069:       c5 f9 73 de 04          vpsrldq $0x4,%xmm6,%xmm0
 Succeed: decoded and checked 1 instructions

Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: yrl.pp-manager.tt@hitachi.com
Link: http://lkml.kernel.org/r/20120210053340.30429.73410.stgit@localhost.localdomain
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-11 15:11:35 +01:00
Linus Torvalds
ce2814f227 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Fix double start/stop in x86_pmu_start()
  perf evsel: Fix an issue where perf report fails to show the proper percentage
  perf tools: Fix prefix matching for kernel maps
  perf tools: Fix perf stack to non executable on x86_64
  perf: Remove deprecated WARN_ON_ONCE()
2012-02-10 09:05:07 -08:00
Andreas Herrmann
32c3233885 x86/amd: Fix L1i and L2 cache sharing information for AMD family 15h processors
For L1 instruction cache and L2 cache the shared CPU information
is wrong. On current AMD family 15h CPUs those caches are shared
between both cores of a compute unit.

This fixes https://bugzilla.kernel.org/show_bug.cgi?id=42607

Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Petkov Borislav <Borislav.Petkov@amd.com>
Cc: Dave Jones <davej@redhat.com>
Cc: <stable@kernel.org>
Link: http://lkml.kernel.org/r/20120208195229.GA17523@alberich.amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-09 09:38:15 +01:00
Stephane Eranian
f39d47ff81 perf: Fix double start/stop in x86_pmu_start()
The following patch fixes a bug introduced by the following
commit:

        e050e3f0a7 ("perf: Fix broken interrupt rate throttling")

The patch caused the following warning to pop up depending on
the sampling frequency adjustments:

  ------------[ cut here ]------------
  WARNING: at arch/x86/kernel/cpu/perf_event.c:995 x86_pmu_start+0x79/0xd4()

It was caused by the following call sequence:

perf_adjust_freq_unthr_context.part() {
     stop()
     if (delta > 0) {
          perf_adjust_period() {
              if (period > 8*...) {
                  stop()
                  ...
                  start()
              }
          }
      }
      start()
}

Which caused a double start and a double stop, thus triggering
the assert in x86_pmu_start().

The patch fixes the problem by avoiding the double calls. We
pass a new argument to perf_adjust_period() to indicate whether
or not the event is already stopped. We can't just remove the
start/stop from that function because it's called from
__perf_event_overflow where the event needs to be reloaded via a
stop/start back-toback call.

The patch reintroduces the assertion in x86_pmu_start() which
was removed by commit:

	84f2b9b ("perf: Remove deprecated WARN_ON_ONCE()")

In this second version, we've added calls to disable/enable PMU
during unthrottling or frequency adjustment based on bug report
of spurious NMI interrupts from Eric Dumazet.

Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Stephane Eranian <eranian@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: markus@trippelsdorf.de
Cc: paulus@samba.org
Link: http://lkml.kernel.org/r/20120207133956.GA4932@quad
[ Minor edits to the changelog and to the code ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-07 16:58:56 +01:00
Borislav Petkov
c98fdeaa92 x86/sched/perf/AMD: Set sched_clock_stable
Stephane Eranian reported that doing a scheduler latency
measurements with perf on AMD doesn't work out as expected due
to the fact that the sched_clock() granularity is too coarse,
i.e. done in jiffies due to the sched_clock_stable not set,
which, if set, would mean that we get to use the TSC as sample
source which would give us much higher precision.

However, there's no reason not to set sched_clock_stable on AMD
because all families from F10h and upwards do have an invariant
TSC and have the CPUID flag to prove (CPUID_8000_0007_EDX[8]).

Make it so, #1.

Signed-off-by: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@amd64.org>
Cc: Venki Pallipadi <venki@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Link: http://lkml.kernel.org/r/20120206132546.GA30854@quad
[ Should any non-standard system break the TSC, we should
  mark them so explicitly, in their platform init handler, or
  in a DMI quirk. ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-07 13:12:08 +01:00
Prarit Bhargava
c1d2f1bccf x86/microcode: Remove noisy AMD microcode warning
AMD processors will never support /dev/cpu/microcode updating so
just silently fail instead of printing out a warning for every
cpu.

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Cc: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/1328552935-965-1-git-send-email-prarit@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-07 10:53:42 +01:00
Jan Beulich
7931d49305 x86/spinlocks: Eliminate TICKET_MASK
The definition of it being questionable already (unnecessarily
including a cast), and it being used in a single place that can
be written shorter without it, remove this #define.

Along the same lines, simplify __ticket_spin_is_locked()'s main
expression, which was the more convoluted way because of needs
that went away with the recent type changes by Jeremy.

This is pure cleanup, no functional change intended.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/4F2C06020200007800071066@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-07 10:09:54 +01:00
Linus Torvalds
14fdbf7eb4 Merge branch 'kvm-updates/3.3' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Fixing a regression with the PMU MSRs when PMU virtualization is
disabled, a guest-internal DoS with the SYSCALL instruction, and a dirty
memory logging race that may cause live migration to fail.

* 'kvm-updates/3.3' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: do not #GP on perf MSR writes when vPMU is disabled
  KVM: x86: fix missing checks in syscall emulation
  KVM: x86: extend "struct x86_emulate_ops" with "get_cpuid"
  KVM: Fix __set_bit() race in mark_page_dirty() during dirty logging
2012-02-06 16:26:58 -08:00
Arnaldo Carvalho de Melo
5ddf146f70 Merge branch 'perf/urgent' into perf/core
So that we can get the perf bench exec stack fixes and then apply the
remaining fix for the files added after what is in perf/urgent.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2012-02-06 19:11:02 -02:00
Jim Cromie
8ad95f0958 scx200_32: replace printks with pr_<level>s
update scx200_32.c to use pr_<level>, also 2 whitespaces.

Signed-off-by: Jim Cromie <jim.cromie@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2012-02-03 23:24:58 +01:00
Jim Cromie
0ac2526064 scx200_32: use PCI_VDEVICE
Replace PCI_DEVICE with PCI_VDEVICE to shorten device table.

Signed-off-by: Jim Cromie <jim.cromie@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2012-02-03 23:24:09 +01:00
Stefano Stabellini
207d543f47 xen pvhvm: do not remap pirqs onto evtchns if !xen_have_vector_callback
CC: stable@kernel.org #2.6.37 and onwards
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-02-03 16:06:27 -05:00
Konrad Rzeszutek Wilk
41bd956de3 xen/smp: Fix CPU online/offline bug triggering a BUG: scheduling while atomic.
When a user offlines a VCPU and then onlines it, we get:

NMI watchdog disabled (cpu2): hardware events not enabled
BUG: scheduling while atomic: swapper/2/0/0x00000002
Modules linked in: dm_multipath dm_mod xen_evtchn iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi scsi_mod libcrc32c crc32c radeon fbco
 ttm bitblit softcursor drm_kms_helper xen_blkfront xen_netfront xen_fbfront fb_sys_fops sysimgblt sysfillrect syscopyarea xen_kbdfront xenfs [last unloaded:

Pid: 0, comm: swapper/2 Tainted: G           O 3.2.0phase15.1-00003-gd6f7f5b-dirty #4
Call Trace:
 [<ffffffff81070571>] __schedule_bug+0x61/0x70
 [<ffffffff8158eb78>] __schedule+0x798/0x850
 [<ffffffff8158ed6a>] schedule+0x3a/0x50
 [<ffffffff810349be>] cpu_idle+0xbe/0xe0
 [<ffffffff81583599>] cpu_bringup_and_idle+0xe/0x10

The reason for this should be obvious from this call-chain:
cpu_bringup_and_idle:
 \- cpu_bringup
  |   \-[preempt_disable]
  |
  |- cpu_idle
       \- play_dead [assuming the user offlined the VCPU]
       |     \
       |     +- (xen_play_dead)
       |          \- HYPERVISOR_VCPU_off [so VCPU is dead, once user
       |          |                       onlines it starts from here]
       |          \- cpu_bringup [preempt_disable]
       |
       +- preempt_enable_no_reschedule()
       +- schedule()
       \- preempt_enable()

So we have two preempt_disble() and one preempt_enable(). Calling
preempt_enable() after the cpu_bringup() in the xen_play_dead
fixes the imbalance.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-02-03 16:05:42 -05:00
Stephane Eranian
84f2b9b2ed perf: Remove deprecated WARN_ON_ONCE()
With the new throttling/unthrottling code introduced with
commit:

  e050e3f0a7 ("perf: Fix broken interrupt rate throttling")

we occasionally hit two WARN_ON_ONCE() checks in:

  - intel_pmu_pebs_enable()
  - intel_pmu_lbr_enable()
  - x86_pmu_start()

The assertions are no longer problematic. There is a valid
path where they can trigger but it is harmless.

The assertion can be triggered with:

  $ perf record -e instructions:pp ....

Leading to paths:

  intel_pmu_pebs_enable
  intel_pmu_enable_event
  x86_perf_event_set_period
  x86_pmu_start
  perf_adjust_freq_unthr_context
  perf_event_task_tick
  scheduler_tick

And:

  intel_pmu_lbr_enable
  intel_pmu_enable_event
  x86_perf_event_set_period
  x86_pmu_start
  perf_adjust_freq_unthr_context.
  perf_event_task_tick
  scheduler_tick

cpuc->enabled is always on because when we get to
perf_adjust_freq_unthr_context() the PMU is not totally
disabled. Furthermore when we need to adjust a period,
we only stop the event we need to change and not the
entire PMU. Thus, when we re-enable, cpuc->enabled is
already set. Note that when we stop the event, both
pebs and lbr are stopped if necessary (and possible).

Signed-off-by: Stephane Eranian <eranian@google.com>
Cc: peterz@infradead.org
Link: http://lkml.kernel.org/r/20120202110401.GA30911@quad
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-03 08:24:40 +01:00
Greg Kroah-Hartman
bd1d462e13 Merge 3.3-rc2 into the driver-core-next branch.
This was done to resolve a merge and build problem with the
drivers/acpi/processor_driver.c file.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-02-02 11:24:44 -08:00
Linus Torvalds
2f2fde9272 Merge branches 'core-urgent-for-linus', 'perf-urgent-for-linus', 'sched-urgent-for-linus' and 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  bugs, x86: Fix printk levels for panic, softlockups and stack dumps

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf top: Fix number of samples displayed
  perf tools: Fix strlen() bug in perf_event__synthesize_event_type()
  perf tools: Fix broken build by defining _GNU_SOURCE in Makefile
  x86/dumpstack: Remove unneeded check in dump_trace()
  perf: Fix broken interrupt rate throttling

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/rt: Fix task stack corruption under __ARCH_WANT_INTERRUPTS_ON_CTXSW
  sched: Fix ancient race in do_exit()
  sched/nohz: Fix nohz cpu idle load balancing state with cpu hotplug
  sched/s390: Fix compile error in sched/core.c
  sched: Fix rq->nr_uninterruptible update race

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/reboot: Remove VersaLogic Menlow reboot quirk
  x86/reboot: Skip DMI checks if reboot set by user
  x86: Properly parenthesize cmpxchg() macro arguments
2012-02-02 11:11:13 -08:00
Gleb Natapov
5753785fa9 KVM: do not #GP on perf MSR writes when vPMU is disabled
Return to behaviour perf MSR had before introducing vPMU in case vPMU
is disabled. Some guests access those registers unconditionally and do
not expect it to fail.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-02-01 11:44:46 +02:00
Stephan Bärwolf
c2226fc9e8 KVM: x86: fix missing checks in syscall emulation
On hosts without this patch, 32bit guests will crash (and 64bit guests
may behave in a wrong way) for example by simply executing following
nasm-demo-application:

    [bits 32]
    global _start
    SECTION .text
    _start: syscall

(I tested it with winxp and linux - both always crashed)

    Disassembly of section .text:

    00000000 <_start>:
       0:   0f 05                   syscall

The reason seems a missing "invalid opcode"-trap (int6) for the
syscall opcode "0f05", which is not available on Intel CPUs
within non-longmodes, as also on some AMD CPUs within legacy-mode.
(depending on CPU vendor, MSR_EFER and cpuid)

Because previous mentioned OSs may not engage corresponding
syscall target-registers (STAR, LSTAR, CSTAR), they remain
NULL and (non trapping) syscalls are leading to multiple
faults and finally crashs.

Depending on the architecture (AMD or Intel) pretended by
guests, various checks according to vendor's documentation
are implemented to overcome the current issue and behave
like the CPUs physical counterparts.

[mtosatti: cleanup/beautify code]

Signed-off-by: Stephan Baerwolf <stephan.baerwolf@tu-ilmenau.de>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-02-01 11:43:40 +02:00
Stephan Bärwolf
bdb42f5afe KVM: x86: extend "struct x86_emulate_ops" with "get_cpuid"
In order to be able to proceed checks on CPU-specific properties
within the emulator, function "get_cpuid" is introduced.
With "get_cpuid" it is possible to virtually call the guests
"cpuid"-opcode without changing the VM's context.

[mtosatti: cleanup/beautify code]

Signed-off-by: Stephan Baerwolf <stephan.baerwolf@tu-ilmenau.de>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-02-01 11:43:33 +02:00
He Chunhui
35f1790e6c x86, boot: Fix port argument to inl() function
"u32 port" in inl() should be "u16 port".

[ hpa: it's a bug, but it doesn't produce incorrect code, so no need
  to put this into urgent or stable. ]

Signed-off-by: He Chunhui <hchunhui@mail.ustc.edu.cn>
Link: http://lkml.kernel.org/r/32892299.2931391328028508117.JavaMail.coremail@mailweb
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-01-31 12:05:54 -08:00
Ingo Molnar
bb1693f89a Merge branch 'perf/urgent' into perf/core
We cherry-picked 3 commits into perf/urgent, merge them back to allow
conflict-free work on those files.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-01-31 13:02:37 +01:00
Michael D Labriola
e6d36a653b x86/reboot: Remove VersaLogic Menlow reboot quirk
This commit removes the reboot quirk originally added by commit
e19e074 ("x86: Fix reboot problem on VersaLogic Menlow boards").

Testing with a VersaLogic Ocelot (VL-EPMs-21a rev 1.00 w/ BIOS
6.5.102) revealed the following regarding the reboot hang
problem:

- v2.6.37 reboot=bios was needed.

- v2.6.38-rc1: behavior changed, reboot=acpi is needed,
  reboot=kbd and reboot=bios results in system hang.

- v2.6.38: VersaLogic patch (e19e074 "x86: Fix reboot problem on
  VersaLogic Menlow boards") was applied prior to v2.6.38-rc7.  This
  patch sets a quirk for VersaLogic Menlow boards that forces the use
  of reboot=bios, which doesn't work anymore.

- v3.2: It seems that commit 660e34c ("x86: Reorder reboot method
  preferences") changed the default reboot method to acpi prior to
  v3.0-rc1, which means the default behavior is appropriate for the
  Ocelot.  No VersaLogic quirk is required.

The Ocelot board used for testing can successfully reboot w/out
having to pass any reboot= arguments for all 3 current versions
of the BIOS.

Signed-off-by: Michael D Labriola <michael.d.labriola@gmail.com>
Cc: Matthew Garrett <mjg@redhat.com>
Cc: Michael D Labriola <mlabriol@gdeb.com>
Cc: Kushal Koolwal <kushalkoolwal@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/87vcnub9hu.fsf@gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-01-30 10:52:33 +01:00
Michael D Labriola
5955633e91 x86/reboot: Skip DMI checks if reboot set by user
Skip DMI checks for vendor specific reboot quirks if the user
passed in a reboot= arg on the command line - we should never
override user choices.

Signed-off-by: Michael D Labriola <michael.d.labriola@gmail.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Michael D Labriola <mlabriol@gdeb.com>
Cc: Matthew Garrett <mjg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/87wr8ab9od.fsf@gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-01-30 10:52:32 +01:00
Rafael J. Wysocki
cf579dfb82 PM / Sleep: Introduce "late suspend" and "early resume" of devices
The current device suspend/resume phases during system-wide power
transitions appear to be insufficient for some platforms that want
to use the same callback routines for saving device states and
related operations during runtime suspend/resume as well as during
system suspend/resume.  In principle, they could point their
.suspend_noirq() and .resume_noirq() to the same callback routines
as their .runtime_suspend() and .runtime_resume(), respectively,
but at least some of them require device interrupts to be enabled
while the code in those routines is running.

It also makes sense to have device suspend-resume callbacks that will
be executed with runtime PM disabled and with device interrupts
enabled in case someone needs to run some special code in that
context during system-wide power transitions.

Apart from this, .suspend_noirq() and .resume_noirq() were introduced
as a workaround for drivers using shared interrupts and failing to
prevent their interrupt handlers from accessing suspended hardware.
It appears to be better not to use them for other porposes, or we may
have to deal with some serious confusion (which seems to be happening
already).

For the above reasons, introduce new device suspend/resume phases,
"late suspend" and "early resume" (and analogously for hibernation)
whose callback will be executed with runtime PM disabled and with
device interrupts enabled and whose callback pointers generally may
point to runtime suspend/resume routines.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Reviewed-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Reviewed-by: Kevin Hilman <khilman@ti.com>
2012-01-29 20:38:29 +01:00
Linus Torvalds
6c334f4f6a Merge branch 'stable/for-linus-fixes-3.3' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen
* 'stable/for-linus-fixes-3.3' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
  xen/granttable: Disable grant v2 for HVM domains.
  x86: xen: size struct xen_spinlock to always fit in arch_spinlock_t
2012-01-28 18:15:33 -08:00
Dan Carpenter
d0caf29250 x86/dumpstack: Remove unneeded check in dump_trace()
Smatch complains that we have some inconsistent NULL checking.

If "task" were NULL then it would lead to a NULL dereference
later. We can remove this test because earlier on in the
function we have:

 if (!task)
	task = current;

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Clemens Ladisch <clemens@ladisch.de>
Link: http://lkml.kernel.org/r/20120128105246.GA25092@elgon.mountain
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-01-28 13:09:06 +01:00
Konrad Rzeszutek Wilk
6c02b7b161 Merge commit 'v3.3-rc1' into stable/for-linus-fixes-3.3
* commit 'v3.3-rc1': (9775 commits)
  Linux 3.3-rc1
  x86, syscall: Need __ARCH_WANT_SYS_IPC for 32 bits
  qnx4: don't leak ->BitMap on late failure exits
  qnx4: reduce the insane nesting in qnx4_checkroot()
  qnx4: di_fname is an array, for crying out loud...
  KEYS: Permit key_serial() to be called with a const key pointer
  keys: fix user_defined key sparse messages
  ima: fix cred sparse warning
  uml: fix compile for x86-64
  MPILIB: Add a missing ENOMEM check
  tpm: fix (ACPI S3) suspend regression
  nvme: fix merge error due to change of 'make_request_fn' fn type
  xen: using EXPORT_SYMBOL requires including export.h
  gpio: tps65910: Use correct offset for gpio initialization
  acpi/apei/einj: Add extensions to EINJ from rev 5.0 of acpi spec
  intel_idle: Split up and provide per CPU initialization func
  ACPI processor: Remove unneeded variable passed by acpi_processor_hotadd_init V2
  tg3: Fix single-vector MSI-X code
  openvswitch: Fix multipart datapath dumps.
  ipv6: fix per device IP snmp counters
  ...
2012-01-27 11:14:02 -05:00
Ingo Molnar
44a6839711 Merge branch 'perf/fast' into perf/core
Merge reason: Lets ready it for v3.4

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-01-27 12:08:09 +01:00
Thomas Renninger
fad12ac8c8 CPU: Introduce ARCH_HAS_CPU_AUTOPROBE and X86 parts
This patch is based on Andi Kleen's work:
Implement autoprobing/loading of modules serving CPU
specific features (x86cpu autoloading).

And Kay Siever's work to get rid of sysdev cpu structures
and making use of struct device instead.

Before, the cpuid driver had to be loaded to get the x86cpu
autoloading feature. With this patch autoloading works through
the /sys/devices/system/cpu object

Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Len Brown <lenb@kernel.org>
Acked-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Renninger <trenn@suse.de>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-26 16:49:08 -08:00
Andi Kleen
78ff123b05 x86: autoload microcode driver on Intel and AMD systems v2
Don't try to describe the actual models for now.

v2: Fix typo: X86_VENDOR_ANY -> X86_FAMILY_ANY (trenn)

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Renninger <trenn@suse.de>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-26 16:49:07 -08:00
Thomas Renninger
2f1e097e24 X86: Introduce HW-Pstate scattered cpuid feature
It is rather similar to CPB (boot capability) feature
and exists since fam10h (can be looked up in AMD's BKDG).

The feature is needed for powernow-k8 to cleanup init functions and to
provide proper autoloading matching with the new x86cpu modalias
feature.

Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Borislav Petkov <bp@amd64.org>
Signed-off-by: Thomas Renninger <trenn@suse.de>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-26 16:49:06 -08:00
Andi Kleen
3bd391f056 crypto: Add support for x86 cpuid auto loading for x86 crypto drivers
Add support for auto-loading of crypto drivers based on cpuid features.
This enables auto-loading of the VIA and Intel specific drivers
for AES, hashing and CRCs.

Requires the earlier infrastructure patch to add x86 modinfo.
I kept it all in a single patch for now.

I dropped the printks when the driver cpuid doesn't match (imho
drivers never should print anything in such a case)

One drawback is that udev doesn't know if the drivers are used or not,
so they will be unconditionally loaded at boot up. That's better
than not loading them at all, like it often happens.

Cc: Dave Jones <davej@redhat.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Jen Axboe <axboe@kernel.dk>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Huang Ying <ying.huang@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Renninger <trenn@suse.de>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-26 16:48:10 -08:00
Andi Kleen
644e9cbbe3 Add driver auto probing for x86 features v4
There's a growing number of drivers that support a specific x86 feature
or CPU.  Currently loading these drivers currently on a generic
distribution requires various driver specific hacks and it often
doesn't work.

This patch adds auto probing for drivers based on the x86 cpuid
information, in particular based on vendor/family/model number
and also based on CPUID feature bits.

For example a common issue is not loading the SSE 4.2 accelerated
CRC module: this can significantly lower the performance of BTRFS
which relies on fast CRC.

Another issue is loading the right CPUFREQ driver for the current CPU.
Currently distributions often try all all possible driver until
one sticks, which is not really a good way to do this.

It works with existing udev without any changes. The code
exports the x86 information as a generic string in sysfs
that can be matched by udev's pattern matching.

This scheme does not support numeric ranges, so if you want to
handle e.g. ranges of model numbers they have to be encoded
in ASCII or simply all models or families listed. Fixing
that would require changing udev.

Another issue is that udev will happily load all drivers that match,
there is currently no nice way to stop a specific driver from
being loaded if it's not needed (e.g. if you don't need fast CRC)
But there are not that many cpu specific drivers around and they're
all not that bloated, so this isn't a particularly serious issue.

Originally this patch added the modalias to the normal cpu
sysdevs. However sysdevs don't have all the infrastructure
needed for udev, so it couldn't really autoload drivers.
This patch instead adds the CPU modaliases to the cpuid devices,
which are real devices with full support for udev. This implies
that the cpuid driver has to be loaded to use this.

This patch just adds infrastructure, some driver conversions
in followups.

Thanks to Kay for helping with some sysfs magic.

v2: Constifcation, some updates
v4: (trenn@suse.de):
    - Use kzalloc instead of kmalloc to terminate modalias buffer
    - Use uppercase hex values to match correctly against hex values containing
      letters

Cc: Dave Jones <davej@redhat.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Jen Axboe <axboe@kernel.dk>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Len Brown <lenb@kernel.org>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Renninger <trenn@suse.de>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-26 16:44:41 -08:00
Tony Luck
08dda402d6 x86/mce: Replace hard coded hex constants with symbolic defines
Magic constants like 0x0134 in code just invite questions on
where they come from, what they mean, can they be changed.

Provide #defines for the architecturally defined MCACOD values
with a reference to the Intel Software Developers manual which
describes them.

Signed-off-by: Tony Luck <tony.luck@intel.com>
2012-01-26 16:02:22 -08:00
Prarit Bhargava
b0f4c4b32c bugs, x86: Fix printk levels for panic, softlockups and stack dumps
rsyslog will display KERN_EMERG messages on a connected
terminal.  However, these messages are useless/undecipherable
for a general user.

For example, after a softlockup we get:

 Message from syslogd@intel-s3e37-04 at Jan 25 14:18:06 ...
 kernel:Stack:

 Message from syslogd@intel-s3e37-04 at Jan 25 14:18:06 ...
 kernel:Call Trace:

 Message from syslogd@intel-s3e37-04 at Jan 25 14:18:06 ...
 kernel:Code: ff ff a8 08 75 25 31 d2 48 8d 86 38 e0 ff ff 48 89
 d1 0f 01 c8 0f ae f0 48 8b 86 38 e0 ff ff a8 08 75 08 b1 01 4c 89 e0 0f 01 c9 <e8> ea 69 dd ff 4c 29 e8 48 89 c7 e8 0f bc da ff 49 89 c4 49 89

This happens because the printk levels for these messages are
incorrect. Only an informational message should be displayed on
a terminal.

I modified the printk levels for various messages in the kernel
and tested the output by using the drivers/misc/lkdtm.c kernel
modules (ie, softlockups, panics, hard lockups, etc.) and
confirmed that the console output was still the same and that
the output to the terminals was correct.

For example, in the case of a softlockup we now see the much
more informative:

 Message from syslogd@intel-s3e37-04 at Jan 25 10:18:06 ...
 BUG: soft lockup - CPU4 stuck for 60s!

instead of the above confusing messages.

AFAICT, the messages no longer have to be KERN_EMERG.  In the
most important case of a panic we set console_verbose().  As for
the other less severe cases the correct data is output to the
console and /var/log/messages.

Successfully tested by me using the drivers/misc/lkdtm.c module.

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Cc: dzickus@redhat.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1327586134-11926-1-git-send-email-prarit@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-01-26 21:28:45 +01:00
Mika Westerberg
ecfdb0ac15 x86/mrst: Add msic_thermal platform support
This will let the MSIC driver to create platform device for the
thermal driver.

Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Cc: jacob.jun.pan@linux.intel.com
Link: http://lkml.kernel.org/n/tip-rh1jaft9tjpzfql76gd56h1q@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-01-26 21:23:56 +01:00
Mika Westerberg
15a713df41 x86/config: Select MSIC MFD driver on Intel Medfield platform
On Intel Medfield platform we use MSIC MFD driver to create
necessary platform devices so it is essential to have the driver
compiled into the kernel.

Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Cc: jacob.jun.pan@linux.intel.com
Link: http://lkml.kernel.org/n/tip-7hp1otk4wf4mg5pqohcwt06w@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-01-26 21:23:55 +01:00
Alan Cox
1a8359e411 x86/mid: Remove Intel Moorestown
All production devices operate in the Oaktrail configuration
with legacy PC elements present and an ACPI BIOS. Continue
stripping out the Moorestown elements from the tree leaving
Medfield.

Signed-off-by: Alan Cox <alan@linux.intel.com>
Cc: jacob.jun.pan@linux.intel.com
Link: http://lkml.kernel.org/n/tip-fvm1hgpq99jln6l0fbek68ik@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-01-26 21:23:53 +01:00
Jacob Pan
d450c088fb x86/mrst: Set ISA bus type for fake MP IRQs
We use MP IRQs for SFI presented timer interrupts, we should
also set mp_bus_not_pci for MP_ISA_BUS so that pin_2_irq mapping
is correct.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Dirk Brandewie <dirk.brandewie@gmail.com>
Link: http://lkml.kernel.org/n/tip-8h3rc1igpp8ir94aas69qmhk@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-01-26 21:23:52 +01:00
Jacob Pan
b3eea29c18 x86/ioapic: Use legacy_pic to set correct gsi-irq mapping
Using compile time NR_LEGACY_IRQS causes the wrong gsi-irq
mapping on non-PC platforms, such as Moorestown. This patch uses
legacy_pic abstraction to set the correct number of legacy
interrupts at runtime. For Moorestown, nr_legacy_irqs = 0. We
have 1:1 mapping for gsi-irq even within the legacy irq range.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Dirk Brandewie <dirk.brandewie@gmail.com>
Link: http://lkml.kernel.org/n/tip-kzvj4xp9tmicuoqoh2w05iay@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-01-26 21:23:50 +01:00
Jan Beulich
9d8e22777e x86-64: Handle byte-wise tail copying in memcpy() without a loop
While hard to measure, reducing the number of possibly/likely
mis-predicted branches can generally be expected to be slightly
better.

Other than apparent at the first glance, this also doesn't grow
the function size (the alignment gap to the next function just
gets smaller).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/4F218584020000780006F422@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-01-26 21:19:20 +01:00
Jan Beulich
2ab560911a x86-64: Fix memcpy() to support sizes of 4Gb and above
While currently there doesn't appear to be any reachable in-tree
case where such large memory blocks may be passed to memcpy(),
we already had hit the problem in our Xen kernels. Just like
done recently for mmeset(), rather than working around it,
prevent others from falling into the same trap by fixing this
long standing limitation.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/4F21846F020000780006F3FA@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-01-26 21:19:18 +01:00
Jan Beulich
fc395b9291 x86: Properly parenthesize cmpxchg() macro arguments
Quite oddly, all of the arguments passed through from the top
level macros to the second level which didn't need parentheses
had them, while the only expression (involving a parameter)
needing them didn't.

Very recently I got bitten by the lack thereof when using
something like "array + index" for the first operand, with
"array" being an array more narrow than int.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/4F2183A9020000780006F3E6@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-01-26 21:18:29 +01:00
Andreas Herrmann
5b68edc91c x86/microcode_amd: Add support for CPU family specific container files
We've decided to provide CPU family specific container files
(starting with CPU family 15h). E.g. for family 15h we have to
load microcode_amd_fam15h.bin instead of microcode_amd.bin

Rationale is that starting with family 15h patch size is larger
than 2KB which was hard coded as maximum patch size in various
microcode loaders (not just Linux).

Container files which include patches larger than 2KB cause
different kinds of trouble with such old patch loaders. Thus we
have to ensure that the default container file provides only
patches with size less than 2KB.

Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Borislav Petkov <borislav.petkov@amd.com>
Cc: <stable@kernel.org>
Link: http://lkml.kernel.org/r/20120120164412.GD24508@alberich.amd.com
[ documented the naming convention and tidied the code a bit. ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-01-26 12:06:39 +01:00
Andreas Herrmann
652847aa44 x86/amd: Add missing feature flag for fam15h models 10h-1fh processors
That is the last one missing for those CPUs.

Others were recently added with commits

 fb215366b3
 (KVM: expose latest Intel cpu new features (BMI1/BMI2/FMA/AVX2) to guest)

and

 commit 969df4b829
 (x86: Report cpb and eff_freq_ro flags correctly)

Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Link: http://lkml.kernel.org/r/20120120163823.GC24508@alberich.amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-01-26 12:06:38 +01:00
Jan Beulich
5d7244e7c9 x86-64: Fix memset() to support sizes of 4Gb and above
While currently there doesn't appear to be any reachable in-tree
case where such large memory blocks may be passed to memset()
(alloc_bootmem() being the primary non-reachable one, as it gets
called with suitably large sizes in FLATMEM configurations), we
have recently hit the problem a second time in our Xen kernels.

Rather than working around it a second time, prevent others from
falling into the same trap by fixing this long standing
limitation.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/4F05D992020000780006AA09@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-01-26 11:50:04 +01:00
Ingo Molnar
4e9f44ba29 MCE recovery (data path only)
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.10 (GNU/Linux)
 
 iQIcBAABAgAGBQJPHy7VAAoJEKurIx+X31iBHcsP/1VcGuxFAL5i/zBqUqhbS7BL
 s+4or1j3NOcxIePQ9egg1L/sLzD+jmo37ObFMTzFOLwuLeodtJF6e0DXQhR7bMKz
 UqOS4WAhNxRBtZtUqIbIiMoDG4Vny1atdqxDQKzmV88ulTG2+JE5U6sGjfTdWvX7
 gZA6Vj31Dz7p6scPT2j8tnLjFV+XvVJSBp/2rgi2Nw81UzBeIRZRiWZrBMLemPCU
 T82OEffnIpSdn60sktMN/ht99yGQO31zT0c+/72Z0ysZAPlTjFbW7CZJHPZmLIVB
 tPkoTRFOf4iwjy2pZNzs9bB8ord/As3IyTxAsfYUin4N2bX27n058uTQ3CqbgEz+
 pa6C5N0ZrV9plYa9BbgCHmNIkhEONIb3WtH27uh/hZOztDA2CXzPT5mm4FOzmrJ7
 DGVBqmXth6g2jYJNT/K2QgmVMZM0CeXQnoDJP54sXzv7F4dEM5P64Lz6E1kCd5Jf
 x9O1orDnEVXssgEPVtF/eEjIQK/vF7s1BUUlMBZJwdAyTwCiD8RvueG87bApnA2z
 eO8VS62akqjpDt5sHboAGJrjcuhqnkbgtG2dn0EqONzk8DJPnhFXVLmSbvH+KuTC
 OguH2LC5N7n9Wjr5a9Duw2DdIj8njvzFrKVzo/l6r3m99u/Jby54vGk2cPLwfGvp
 /9Y+SK2Ou6LSbPiRU4dP
 =ofSb
 -----END PGP SIGNATURE-----

Merge tag 'mce-recovery-for-tip' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras into x86/mce

Implement MCE recovery for the data load error path and assorted cleanups.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-01-26 11:40:13 +01:00
Jesper Juhl
5067cf53ca x86/boot-image: Don't leak phdrs in arch/x86/boot/compressed/misc.c::Parse_elf()
We allocate memory with malloc(), but neglect to free it before
the variable 'phdrs' goes out of scope --> leak.

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Link: http://lkml.kernel.org/r/alpine.LNX.2.00.1201232332590.8772@swampdragon.chaosbits.net
[ Mostly harmless. ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-01-26 11:30:29 +01:00
Daniel J Blueman
3fe54564a6 x86/numachip: Drop unnecessary conflict with EDAC
EDAC detection no longer crashes multi-node systems, so don't
conflict on it with NumaChip.

Signed-off-by: Daniel J Blueman <daniel@numascale-asia.com>
Cc: Steffen Persvold <sp@numascale.com>
Link: http://lkml.kernel.org/r/1327473349-28395-1-git-send-email-daniel@numascale-asia.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-01-26 11:03:03 +01:00
Cliff Wickman
d2ebc71d47 x86/uv: Fix uninitialized spinlocks
Initialize two spinlocks in tlb_uv.c and also properly define/initialize
the uv_irq_lock.

The lack of explicit initialization seems to be functionally
harmless, but it is diagnosed when these are turned on:

        CONFIG_DEBUG_SPINLOCK=y
        CONFIG_DEBUG_MUTEXES=y
        CONFIG_DEBUG_LOCK_ALLOC=y
        CONFIG_LOCKDEP=y

Signed-off-by: Cliff Wickman <cpw@sgi.com>
Cc: <stable@kernel.org>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Link: http://lkml.kernel.org/r/E1RnXd1-0003wU-PM@eag09.americas.sgi.com
[ Added the uv_irq_lock initialization fix by Dimitri Sivanich ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-01-26 10:58:34 +01:00
Russ Anderson
5a51467b14 x86/uv: Fix uv_gpa_to_soc_phys_ram() shift
uv_gpa_to_soc_phys_ram() was inadvertently ignoring the
shift values.  This fix takes the shift into account.

Signed-off-by: Russ Anderson <rja@sgi.com>
Cc: <stable@kernel.org>
Link: http://lkml.kernel.org/r/20120119020753.GA7228@sgi.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-01-26 10:58:27 +01:00
Linus Torvalds
701b259f44 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Davem says:

1) Fix JIT code generation on x86-64 for divide by zero, from Eric Dumazet.

2) tg3 header length computation correction from Eric Dumazet.

3) More build and reference counting fixes for socket memory cgroup
   code from Glauber Costa.

4) module.h snuck back into a core header after all the hard work we
   did to remove that, from Paul Gortmaker and Jesper Dangaard Brouer.

5) Fix PHY naming regression and add some new PCI IDs in stmmac, from
   Alessandro Rubini.

6) Netlink message generation fix in new team driver, should only advertise
   the entries that changed during events, from Jiri Pirko.

7) SRIOV VF registration and unregistration fixes, and also add a
   missing PCI ID, from Roopa Prabhu.

8) Fix infinite loop in tx queue flush code of brcmsmac, from Stanislaw Gruszka.

9) ftgmac100/ftmac100 build fix, missing interrupt.h include.

10) Memory leak fix in net/hyperv do_set_mutlicast() handling, from Wei Yongjun.

11) Off by one fix in netem packet scheduler, from Vijay Subramanian.

12) TCP loss detection fix from Yuchung Cheng.

13) TCP reset packet MD5 calculation uses wrong address, fix from Shawn Lu.

14) skge carrier assertion and DMA mapping fixes from Stephen Hemminger.

15) Congestion recovery undo performed at the wrong spot in BIC and CUBIC
    congestion control modules, fix from Neal Cardwell.

16) Ethtool ETHTOOL_GSSET_INFO is unnecessarily restrictive, from Michał Mirosław.

17) Fix triggerable race in ipv6 sysctl handling, from Francesco Ruggeri.

18) Statistics bug fixes in mlx4 from Eugenia Emantayev.

19) rds locking bug fix during info dumps, from your's truly.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (67 commits)
  rds: Make rds_sock_lock BH rather than IRQ safe.
  netprio_cgroup.h: dont include module.h from other includes
  net: flow_dissector.c missing include linux/export.h
  team: send only changed options/ports via netlink
  net/hyperv: fix possible memory leak in do_set_multicast()
  drivers/net: dsa/mv88e6xxx.c files need linux/module.h
  stmmac: added PCI identifiers
  llc: Fix race condition in llc_ui_recvmsg
  stmmac: fix phy naming inconsistency
  dsa: Add reporting of silicon revision for Marvell 88E6123/88E6161/88E6165 switches.
  tg3: fix ipv6 header length computation
  skge: add byte queue limit support
  mv643xx_eth: Add Rx Discard and Rx Overrun statistics
  bnx2x: fix compilation error with SOE in fw_dump
  bnx2x: handle CHIP_REVISION during init_one
  bnx2x: allow user to change ring size in ISCSI SD mode
  bnx2x: fix Big-Endianess in ethtool -t
  bnx2x: fixed ethtool statistics for MF modes
  bnx2x: credit-leakage fixup on vlan_mac_del_all
  macvlan: fix a possible use after free
  ...
2012-01-24 15:51:40 -08:00