linux

q3k/linux

Author	SHA1	Message	Date
Linus Torvalds	bb7762961d	Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (22 commits) x86: fix system without memory on node0 x86, mm: Fix node_possible_map logic mm, x86: remove MEMORY_HOTPLUG_RESERVE related code x86: make sparse mem work in non-NUMA mode x86: process.c, remove useless headers x86: merge process.c a bit x86: use sparse_memory_present_with_active_regions() on UMA x86: unify 64-bit UMA and NUMA paging_init() x86: Allow 1MB of slack between the e820 map and SRAT, not 4GB x86: Sanity check the e820 against the SRAT table using e820 map only x86: clean up and and print out initial max_pfn_mapped x86/pci: remove rounding quirk from e820_setup_gap() x86, e820, pci: reserve extra free space near end of RAM x86: fix typo in address space documentation x86: 46 bit physical address support on 64 bits x86, mm: fault.c, use printk_once() in is_errata93() x86: move per-cpu mmu_gathers to mm/init.c x86: move max_pfn_mapped and max_low_pfn_mapped to setup.c x86: unify noexec handling x86: remove (null) in /sys kernel_page_tables ...	2009-06-10 16:13:20 -07:00
Linus Torvalds	48c72d1ab4	Merge branch 'x86-microcode-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-microcode-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, microcode: Simplify vfree() use x86: microcode: use smp_call_function_single instead of set_cpus_allowed, cleanup of synchronization logic	2009-06-10 16:13:06 -07:00
Linus Torvalds	5301e0de34	Merge branch 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86_64: fix incorrect comments x86: unify restore_fpu_checking x86_32: introduce restore_fpu_checking()	2009-06-10 15:55:04 -07:00
Linus Torvalds	c44e3ed539	Merge branch 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86: cpu_debug: Remove model information to reduce encoding-decoding x86: fixup numa_node information for AMD CPU northbridge functions x86: k8 convert node_to_k8_nb_misc() from a macro to an inline function x86: cacheinfo: complete L2/L3 Cache and TLB associativity field definitions x86/docs: add description for cache_disable sysfs interface x86: cacheinfo: disable L3 ECC scrubbing when L3 cache index is disabled x86: cacheinfo: replace sysfs interface for cache_disable feature x86: cacheinfo: use cached K8 NB_MISC devices instead of scanning for it x86: cacheinfo: correct return value when cache_disable feature is not active x86: cacheinfo: use L3 cache index disable feature only for CPUs that support it	2009-06-10 15:51:15 -07:00
Linus Torvalds	7dc3ca39cb	Merge branch 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, nmi: Use predefined numbers instead of hardcoded one x86: asm/processor.h: remove double declaration x86, mtrr: replace MTRRdefType_MSR with msr-index's MSR_MTRRdefType x86, mtrr: replace MTRRfix4K_C0000_MSR with msr-index's MSR_MTRRfix4K_C0000 x86, mtrr: remove mtrr MSRs double declaration x86, mtrr: replace MTRRfix16K_80000_MSR with msr-index's MSR_MTRRfix16K_80000 x86, mtrr: replace MTRRfix64K_00000_MSR with msr-index's MSR_MTRRfix64K_00000 x86, mtrr: replace MTRRcap_MSR with msr-index's MSR_MTRRcap x86: mce: remove duplicated #include x86: msr-index.h remove duplicate MSR C001_0015 declaration x86: clean up arch/x86/kernel/tsc_sync.c a bit x86: use symbolic name for VM86_SIGNAL when used as vm86 default return x86: added 'ifndef _ASM_X86_IOMAP_H' to iomap.h x86: avoid multiple declaration of kstack_depth_to_print x86: vdso/vma.c declare vdso_enabled and arch_setup_additional_pages before they get used x86: clean up declarations and variables x86: apic/x2apic_cluster.c x86_cpu_to_logical_apicid should be static x86 early quirks: eliminate unused function	2009-06-10 15:49:36 -07:00
Linus Torvalds	aa98936e4f	Merge branch 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, 64-bit: ifdef out struct thread_struct::ip x86, 32-bit: ifdef out struct thread_struct::fs x86: clean up alternative.h	2009-06-10 15:49:10 -07:00
Linus Torvalds	99e97b860e	Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: fix typo in sched-rt-group.txt file ftrace: fix typo about map of kernel priority in ftrace.txt file. sched: properly define the sched_group::cpumask and sched_domain::span fields sched, timers: cleanup avenrun users sched, timers: move calc_load() to scheduler sched: Don't export sched_mc_power_savings on multi-socket single core system sched: emit thread info flags with stack trace sched: rt: document the risk of small values in the bandwidth settings sched: Replace first_cpu() with cpumask_first() in ILB nomination code sched: remove extra call overhead for schedule() sched: use group_first_cpu() instead of cpumask_first(sched_group_cpus()) wait: don't use __wake_up_common() sched: Nominate a power-efficient ilb in select_nohz_balancer() sched: Nominate idle load balancer from a semi-idle package. sched: remove redundant hierarchy walk in check_preempt_wakeup	2009-06-10 15:32:59 -07:00
Ingo Molnar	511b01bdf6	Revert "x86, bts: reenable ptrace branch trace support" This reverts commit `7e0bfad24d`. A late objection to the ABI has arrived: http://lkml.org/lkml/2009/6/10/253 Keep the ABI disabled out of caution, to not create premature user-space expectations. While the hw-branch-tracing variant uses and tests the BTS code. Cc: Peter Zijlstra <peterz@infradead.org> Cc: Markus Metzger <markus.t.metzger@intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-11 00:32:00 +02:00
Linus Torvalds	82782ca77d	Merge branch 'x86-kbuild-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-kbuild-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (46 commits) x86, boot: add new generated files to the appropriate .gitignore files x86, boot: correct the calculation of ZO_INIT_SIZE x86-64: align __PHYSICAL_START, remove __KERNEL_ALIGN x86, boot: correct sanity checks in boot/compressed/misc.c x86: add extension fields for bootloader type and version x86, defconfig: update kernel position parameters x86, defconfig: update to current, no material changes x86: make CONFIG_RELOCATABLE the default x86: default CONFIG_PHYSICAL_START and CONFIG_PHYSICAL_ALIGN to 16 MB x86: document new bzImage fields x86, boot: make kernel_alignment adjustable; new bzImage fields x86, boot: remove dead code from boot/compressed/head_.S x86, boot: use LOAD_PHYSICAL_ADDR on 64 bits x86, boot: make symbols from the main vmlinux available x86, boot: determine compressed code offset at compile time x86, boot: use appropriate rep string for move and clear x86, boot: zero EFLAGS on 32 bits x86, boot: set up the decompression stack as early as possible x86, boot: straighten out ranges to copy/zero in compressed/head.S x86, boot: stylistic cleanups for boot/compressed/head_64.S ... Fixed trivial conflict in arch/x86/configs/x86_64_defconfig manually	2009-06-10 15:30:41 -07:00
Linus Torvalds	f0d5e12bd4	Merge branch 'irq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'irq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (76 commits) x86, apic: Fix dummy apic read operation together with broken MP handling x86, apic: Restore irqs on fail paths x86: Print real IOAPIC version for x86-64 x86: enable_update_mptable should be a macro sparseirq: Allow early irq_desc allocation x86, io-apic: Don't mark pin_programmed early x86, irq: don't call mp_config_acpi_gsi() if update_mptable is not enabled x86, irq: update_mptable needs pci_routeirq x86: don't call read_apic_id if !cpu_has_apic x86, apic: introduce io_apic_irq_attr x86/pci: add 4 more return parameters to IO_APIC_get_PCI_irq_vector(), fix x86: read apic ID in the !acpi_lapic case x86: apic: Fixmap apic address even if apic disabled x86: display extended apic registers with print_local_APIC and cpu_debug code x86: read apic ID in the !acpi_lapic case x86: clean up and fix setup_clear/force_cpu_cap handling x86: apic: Check rev 3 fadt correctly for physical_apic bit x86/pci: update pirq_enable_irq() to setup io apic routing x86/acpi: move setup io apic routing out of CONFIG_ACPI scope x86/pci: add 4 more return parameters to IO_APIC_get_PCI_irq_vector() ...	2009-06-10 15:25:41 -07:00
Harald Welte	0fea615e52	CPUFREQ: Mark e_powersaver driver as EXPERIMENTAL and DANGEROUS The e_powersaver driver for VIA's C7 CPU's needs to be marked as DANGEROUS as it configures the CPU to power states that are out of specification. According to Centaur, all systems with C7 and Nano CPU's support the ACPI p-state method. Thus, the acpi-cpufreq driver should be used instead. Signed-off-by: Harald Welte <HaraldWelte@viatech.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-10 15:22:44 -07:00
Harald Welte	0de51088e6	CPUFREQ: Enable acpi-cpufreq driver for VIA/Centaur CPUs The VIA/Centaur C7, C7-M and Nano CPU's all support ACPI based cpu p-states using a MSR interface. The Linux driver just never made use of it, since in addition to the check for the EST flag it also checked if the vendor is Intel. Signed-off-by: Harald Welte <HaraldWelte@viatech.com> [ Removed the vendor checks entirely - Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-10 15:22:44 -07:00
Peter Zijlstra	bd2b5b1284	perf_counter: More aggressive frequency adjustment Also employ the overflow handler to adjust the frequency, this results in a stable frequency in about 40~50 samples, instead of that many ticks. This also means we can start sampling at a sample period of 1 without running head-first into the throttle. It relies on sched_clock() to accurately measure the time difference between the overflow NMIs. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-10 16:55:26 +02:00
Yong Wang	dc81081b2d	perf_counter/x86: Fix the model number of Intel Core2 processors Fix the model number of Intel Core2 processors according to the documentation: Intel Processor Identification with the CPUID Instruction: http://www.intel.com/support/processors/sb/cs-009861.htm Signed-off-by: Yong Wang <yong.y.wang@intel.com> Also-Reported-by: Arnd Bergmann <arnd@arndb.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> LKML-Reference: <20090610090612.GA26580@ywang-moblin2.bj.intel.com> [ Added two more model numbers suggested by Arnd Bergmann ] Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-10 13:04:43 +02:00
Borislav Petkov	b034c19f9f	x86: MSR: add methods for writing of an MSR on several CPUs Provide for concurrent MSR writes on all the CPUs in the cpumask. Also, add a temporary workaround for smp_call_function_many which skips the CPU we're executing on. Bart: zero out rv struct which is allocated on stack. CC: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Borislav Petkov <borislav.petkov@amd.com> Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>	2009-06-10 12:18:43 +02:00
Borislav Petkov	6bc1096d7a	x86: MSR: add a struct representation of an MSR Add a struct representing a 64bit MSR pair consisting of a low and high register part and convert msr_info to use it. Also, rename msr-on-cpu.c to msr.c. Side note: Put the cpumask.h include in __KERNEL__ space thus fixing an allmodconfig build failure in the headers_check target. CC: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>	2009-06-10 12:18:42 +02:00
Andi Kleen	a0861c02a9	KVM: Add VT-x machine check support VT-x needs an explicit MC vector intercept to handle machine checks in the hyper visor. It also has a special option to catch machine checks that happen during VT entry. Do these interceptions and forward them to the Linux machine check handler. Make it always look like user space is interrupted because the machine check handler treats kernel/user space differently. Thanks to Jiang Yunhong for help and testing. Cc: stable@kernel.org Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Huang Ying <ying.huang@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 12:27:08 +03:00
Nitin A Kamble	56b237e31a	KVM: VMX: Rename rmode.active to rmode.vm86_active That way the interpretation of rmode.active becomes more clear with unrestricted guest code. Signed-off-by: Nitin A Kamble <nitin.a.kamble@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:49:00 +03:00
Gleb Natapov	20f65983e3	KVM: Move "exit due to NMI" handling into vmx_complete_interrupts() To save us one reading of VM_EXIT_INTR_INFO. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:59 +03:00
Gleb Natapov	8db3baa2db	KVM: Disable CR8 intercept if tpr patching is active Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:59 +03:00
Gleb Natapov	36752c9b91	KVM: Do not migrate pending software interrupts. INTn will be re-executed after migration. If we wanted to migrate pending software interrupt we would need to migrate interrupt type and instruction length too, but we do not have all required info on SVM, so SVM->VMX migration would need to re-execute INTn anyway. To make it simple never migrate pending soft interrupt. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:59 +03:00
Gleb Natapov	44c11430b5	KVM: inject NMI after IRET from a previous NMI, not before. If NMI is received during handling of another NMI it should be injected immediately after IRET from previous NMI handler, but SVM intercept IRET before instruction execution so we can't inject pending NMI at this point and there is not way to request exit when NMI window opens. This patch fix SVM code to open NMI window after IRET by single stepping over IRET instruction. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:59 +03:00
Gleb Natapov	6a8b1d1312	KVM: Always request IRQ/NMI window if an interrupt is pending Currently they are not requested if there is pending exception. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:58 +03:00
Gleb Natapov	66fd3f7f90	KVM: Do not re-execute INTn instruction. Re-inject event instead. This is what Intel suggest. Also use correct instruction length when re-injecting soft fault/interrupt. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:58 +03:00
Gleb Natapov	f629cf8485	KVM: skip_emulated_instruction() decode instruction if size is not known Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:58 +03:00
Gleb Natapov	923c61bbc6	KVM: Remove irq_pending bitmap Only one interrupt vector can be injected from userspace irqchip at any given time so no need to store it in a bitmap. Put it into interrupt queue directly. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:57 +03:00
Gleb Natapov	fa9726b073	KVM: Do not allow interrupt injection from userspace if there is a pending event. The exception will immediately close the interrupt window. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:57 +03:00
Gleb Natapov	3298b75c88	KVM: Unprotect a page if #PF happens during NMI injection. It is done for exception and interrupt already. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:57 +03:00
Robert P. J. Day	58f8ac279a	KVM: Expand on "help" info to specify kvm intel and amd module names Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca> Cc: Avi Kivity <avi@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:55 +03:00
Marcelo Tosatti	8986ecc0ef	KVM: x86: check for cr3 validity in mmu_alloc_roots Verify the cr3 address stored in vcpu->arch.cr3 points to an existant memslot. If not, inject a triple fault. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:55 +03:00
Marcelo Tosatti	7c8a83b75a	KVM: MMU: protect kvm_mmu_change_mmu_pages with mmu_lock kvm_handle_hva, called by MMU notifiers, manipulates mmu data only with the protection of mmu_lock. Update kvm_mmu_change_mmu_pages callers to take mmu_lock, thus protecting against kvm_handle_hva. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:54 +03:00
Glauber Costa	310b5d306c	KVM: Deal with interrupt shadow state for emulated instructions We currently unblock shadow interrupt state when we skip an instruction, but failing to do so when we actually emulate one. This blocks interrupts in key instruction blocks, in particular sti; hlt; sequences If the instruction emulated is an sti, we have to block shadow interrupts. The same goes for mov ss. pop ss also needs it, but we don't currently emulate it. Without this patch, I cannot boot gpxe option roms at vmx machines. This is described at https://bugzilla.redhat.com/show_bug.cgi?id=494469 Signed-off-by: Glauber Costa <glommer@redhat.com> CC: H. Peter Anvin <hpa@zytor.com> CC: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:54 +03:00
Glauber Costa	2809f5d2c4	KVM: Replace ->drop_interrupt_shadow() by ->set_interrupt_shadow() This patch replaces drop_interrupt_shadow with the more general set_interrupt_shadow, that can either drop or raise it, depending on its parameter. It also adds ->get_interrupt_shadow() for future use. Signed-off-by: Glauber Costa <glommer@redhat.com> CC: H. Peter Anvin <hpa@zytor.com> CC: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:54 +03:00
Marcelo Tosatti	32f8840064	KVM: use smp_send_reschedule in kvm_vcpu_kick KVM uses a function call IPI to cause the exit of a guest running on a physical cpu. For virtual interrupt notification there is no need to wait on IPI receival, or to execute any function. This is exactly what the reschedule IPI does, without the overhead of function IPI. So use it instead of smp_call_function_single in kvm_vcpu_kick. Also change the "guest_mode" variable to a bit in vcpu->requests, and use that to collapse multiple IPI's that would be issued between the first one and zeroing of guest mode. This allows kvm_vcpu_kick to called with interrupts disabled. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:53 +03:00
Avi Kivity	d149c731e4	KVM: Update cpuid 1.ecx reporting Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:53 +03:00
Avi Kivity	069ebaa464	x86: Add cpu features MOVBE and POPCNT Add cpu feature bit support for the MOVBE and POPCNT instructions. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:52 +03:00
Avi Kivity	7faa4ee1c7	KVM: Add AMD cpuid bit: cr8_legacy, abm, misaligned sse, sse4, 3dnow prefetch Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:52 +03:00
Avi Kivity	8d753f369b	KVM: Fix cpuid feature misreporting MTRR, PAT, MCE, and MCA are all supported (to some extent) but not reported. Vista requires these features, so if userspace relies on kernel cpuid reporting, it loses support for Vista. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:52 +03:00
Jan Kiszka	d6a8c875f3	KVM: Drop request_nmi from stats The stats entry request_nmi is no longer used as the related user space interface was dropped. So clean it up. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:51 +03:00
Gleb Natapov	fe8e7f83de	KVM: SVM: Don't reinject event that caused a task switch If a task switch caused by an event remove it from the event queue. VMX already does that. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:51 +03:00
Andre Przywara	b586eb0253	KVM: SVM: Fix cross vendor migration issue in segment segment descriptor On AMD CPUs sometimes the DB bit in the stack segment descriptor is left as 1, although the whole segment has been made unusable. Clear it here to pass an Intel VMX entry check when cross vendor migrating. Signed-off-by: Andre Przywara <andre.przywara@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:51 +03:00
Glauber Costa	9b5843ddd2	KVM: fix apic_debug instances Apparently nobody turned this on in a while... setting apic_debug to something compilable, generates some errors. This patch fixes it. Signed-off-by: Glauber Costa <glommer@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:50 +03:00
Sheng Yang	522c68c441	KVM: Enable snooping control for supported hardware Memory aliases with different memory type is a problem for guest. For the guest without assigned device, the memory type of guest memory would always been the same as host(WB); but for the assigned device, some part of memory may be used as DMA and then set to uncacheable memory type(UC/WC), which would be a conflict of host memory type then be a potential issue. Snooping control can guarantee the cache correctness of memory go through the DMA engine of VT-d. [avi: fix build on ia64] Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:50 +03:00
Sheng Yang	4b12f0de33	KVM: Replace get_mt_mask_shift with get_mt_mask Shadow_mt_mask is out of date, now it have only been used as a flag to indicate if TDP enabled. Get rid of it and use tdp_enabled instead. Also put memory type logical in kvm_x86_ops->get_mt_mask(). Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:49 +03:00
Jan Blunck	9b62e5b10f	KVM: Wake up waitqueue before calling get_cpu() This moves the get_cpu() call down to be called after we wake up the waiters. Therefore the waitqueue locks can safely be rt mutex. Signed-off-by: Jan Blunck <jblunck@suse.de> Signed-off-by: Sven-Thorsten Dietrich <sven@thebigcorporation.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:49 +03:00
Gleb Natapov	14d0bc1f7c	KVM: Get rid of get_irq() callback It just returns pending IRQ vector from the queue for VMX/SVM. Get IRQ directly from the queue before migration and put it back after. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:49 +03:00
Gleb Natapov	16d7a19117	KVM: Fix userspace IRQ chip migration Re-put pending IRQ vector into interrupt_bitmap before migration. Otherwise it will be lost if migration happens in the wrong time. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:48 +03:00
Gleb Natapov	95ba827313	KVM: SVM: Add NMI injection support Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:48 +03:00
Gleb Natapov	c4282df98a	KVM: Get rid of arch.interrupt_window_open & arch.nmi_window_open They are recalculated before each use anyway. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:48 +03:00
Gleb Natapov	0a5fff1923	KVM: Do not report TPR write to userspace if new value bigger or equal to a previous one. Saves many exits to userspace in a case of IRQ chip in userspace. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:47 +03:00
Gleb Natapov	615d519305	KVM: sync_lapic_to_cr8() should always sync cr8 to V_TPR Even if IRQ chip is in userspace. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:47 +03:00
Gleb Natapov	115666dfc7	KVM: Remove kvm_push_irq() No longer used. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:47 +03:00
Gleb Natapov	1d6ed0cb95	KVM: Remove inject_pending_vectors() callback It is the same as inject_pending_irq() for VMX/SVM now. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:47 +03:00
Gleb Natapov	1cb948ae86	KVM: Remove exception_injected() callback. It always return false for VMX/SVM now. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:46 +03:00
Gleb Natapov	9222be18f7	KVM: SVM: Coalesce userspace/kernel irqchip interrupt injection logic Start to use interrupt/exception queues like VMX does. This also fix the bug that if exit was caused by a guest internal exception access to IDT the exception was not reinjected. Use EVENTINJ to inject interrupts. Use VINT only for detecting when IRQ windows is open again. EVENTINJ ensures the interrupt is injected immediately and not delayed. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:46 +03:00
Gleb Natapov	5df5664647	KVM: Use kvm_arch_interrupt_allowed() instead of checking interrupt_window_open directly kvm_arch_interrupt_allowed() also checks IF so drop the check. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:46 +03:00
Gleb Natapov	1f21e79aac	KVM: VMX: Cleanup vmx_intr_assist() Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:45 +03:00
Gleb Natapov	863e8e658e	KVM: VMX: Consolidate userspace and kernel interrupt injection for VMX Use the same callback to inject irq/nmi events no matter what irqchip is in use. Only from VMX for now. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:45 +03:00
Gleb Natapov	8061823a25	KVM: Make kvm_cpu_(has\|get)_interrupt() work for userspace irqchip too At the vector level, kernel and userspace irqchip are fairly similar. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:45 +03:00
Jan Kiszka	3438253926	KVM: MMU: Fix auditing code Fix build breakage of hpa lookup in audit_mappings_page. Moreover, make this function robust against shadow_notrap_nonpresent_pte entries. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:45 +03:00
Marcelo Tosatti	59839dfff5	KVM: x86: check for cr3 validity in ioctl_set_sregs Matt T. Yourst notes that kvm_arch_vcpu_ioctl_set_sregs lacks validity checking for the new cr3 value: "Userspace callers of KVM_SET_SREGS can pass a bogus value of cr3 to the kernel. This will trigger a NULL pointer access in gfn_to_rmap() when userspace next tries to call KVM_RUN on the affected VCPU and kvm attempts to activate the new non-existent page table root. This happens since kvm only validates that cr3 points to a valid guest physical memory page when code inside the guest sets cr3. However, kvm currently trusts the userspace caller (e.g. QEMU) on the host machine to always supply a valid page table root, rather than properly validating it along with the rest of the reloaded guest state." http://sourceforge.net/tracker/?func=detail&atid=893831&aid=2687641&group_id=180599 Check for a valid cr3 address in kvm_arch_vcpu_ioctl_set_sregs, triple fault in case of failure. Cc: stable@kernel.org Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:43 +03:00
Avi Kivity	463656c000	KVM: Replace kvmclock open-coded get_cpu_var() with the real thing Suggested by Ingo Molnar. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:42 +03:00
Gleb Natapov	8317c298ea	KVM: SVM: Skip instruction on a task switch only when appropriate If a task switch was initiated because off a task gate in IDT and IDT was accessed because of an external even the instruction should not be skipped. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:42 +03:00
Gleb Natapov	ba8afb6b0a	KVM: x86 emulator: Add new mode of instruction emulation: skip In the new mode instruction is decoded, but not executed. The EIP is moved to point after the instruction. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:42 +03:00
Gleb Natapov	e637b8238a	KVM: x86 emulator: Decode soft interrupt instructions Do not emulate them yet. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:41 +03:00
Gleb Natapov	84ce66a686	KVM: x86 emulator: Completely decode in/out at decoding stage Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:41 +03:00
Gleb Natapov	341de7e372	KVM: x86 emulator: Add unsigned byte immediate decode Extend "Source operand type" opcode description field to 4 bites to accommodate new option. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:41 +03:00
Gleb Natapov	d53c4777b3	KVM: x86 emulator: Complete decoding of call near in decode stage Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:41 +03:00
Gleb Natapov	b2833e3cde	KVM: x86 emulator: Complete short/near jcc decoding in decode stage Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:40 +03:00
Gleb Natapov	782b877c80	KVM: x86 emulator: Complete ljmp decoding at decode stage Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:40 +03:00
Gleb Natapov	0654169e73	KVM: x86 emulator: Add lcall decoding No emulation yet. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:40 +03:00
Gleb Natapov	a5f868bd45	KVM: x86 emulator: Add decoding of 16bit second immediate argument Such as segment number in lcall/ljmp Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:40 +03:00
Marcelo Tosatti	c2d0ee46e6	KVM: MMU: remove global page optimization logic Complexity to fix it not worthwhile the gains, as discussed in http://article.gmane.org/gmane.comp.emulators.kvm.devel/28649. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:39 +03:00
Marcelo Tosatti	ede2ccc517	KVM: PIT: fix count read and mode 0 handling Commit 46ee278652f4cbd51013471b64c7897ba9bcd1b1 causes Solaris 10 to hang on boot. Assuming that PIT counter reads should return 0 for an expired timer is wrong: when it is active, the counter never stops (see comment on __kpit_elapsed). Also arm a one shot timer for mode 0. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:39 +03:00
Gleb Natapov	2d03319654	KVM: x86 emulator: fix call near emulation The length of pushed on to the stack return address depends on operand size not address size. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:38 +03:00
Sheng Yang	4c26b4cd6f	KVM: MMU: Discard reserved bits checking on PDE bit 7-8 1. It's related to a Linux kernel bug which fixed by Ingo on `07a66d7c53`. The original code exists for quite a long time, and it would convert a PDE for large page into a normal PDE. But it fail to fit normal PDE well. With the code before Ingo's fix, the kernel would fall reserved bit checking with bit 8 - the remaining global bit of PTE. So the kernel would receive a double-fault. 2. After discussion, we decide to discard PDE bit 7-8 reserved checking for now. For this marked as reserved in SDM, but didn't checked by the processor in fact... Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:38 +03:00
Gleb Natapov	64a7ec0668	KVM: Fix unneeded instruction skipping during task switching. There is no need to skip instruction if the reason for a task switch is a task gate in IDT and access to it is caused by an external even. The problem is currently solved only for VMX since there is no reliable way to skip an instruction in SVM. We should emulate it instead. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:38 +03:00
Gleb Natapov	b237ac37a1	KVM: Fix task switch back link handling. Back link is written to a wrong TSS now. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:37 +03:00
Gleb Natapov	8843419048	KVM: VMX: Do not zero idt_vectoring_info in vmx_complete_interrupts(). We will need it later in task_switch(). Code in handle_exception() is dead. is_external_interrupt(vect_info) will always be false since idt_vectoring_info is zeroed in vmx_complete_interrupts(). Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:37 +03:00
Gleb Natapov	37b96e9880	KVM: VMX: Rewrite vmx_complete_interrupt()'s twisted maze of if() statements ...with a more straightforward switch(). Also fix a bug when NMI could be dropped on exit. Although this should never happen in practice, since NMIs can only be injected, never triggered internally by the guest like exceptions. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:37 +03:00
Gleb Natapov	7b4a25cb29	KVM: VMX: Fix handling of a fault during NMI unblocked due to IRET Bit 12 is undefined in any of the following cases: If the VM exit sets the valid bit in the IDT-vectoring information field. If the VM exit is due to a double fault. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:36 +03:00
Dong, Eddie	20c466b561	KVM: Use rsvd_bits_mask in load_pdptrs() Also remove bit 5-6 from rsvd_bits_mask per latest SDM. Signed-off-by: Eddie Dong <Eddie.Dong@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:36 +03:00
Sheng Yang	93ba03c2e2	KVM: VMX: Fix feature testing The testing of feature is too early now, before vmcs_config complete initialization. Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:36 +03:00
Sheng Yang	045471563d	KVM: VMX: Clean up Flex Priority related And clean paranthes on returns. Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:36 +03:00
Wei Yongjun	7a6ce84c74	KVM: remove pointless conditional before kfree() in lapic initialization Remove pointless conditional before kfree(). Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:35 +03:00
Avi Kivity	9645bb56b3	KVM: MMU: Use different shadows when EFER.NXE changes A pte that is shadowed when the guest EFER.NXE=1 is not valid when EFER.NXE=0; if bit 63 is set, the pte should cause a fault, and since the shadow EFER always has NX enabled, this won't happen. Fix by using a different shadow page table for different EFER.NXE bits. This allows vcpus to run correctly with different values of EFER.NXE, and for transitions on this bit to be handled correctly without requiring a full flush. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:35 +03:00
Dong, Eddie	82725b20e2	KVM: MMU: Emulate #PF error code of reserved bits violation Detect, indicate, and propagate page faults where reserved bits are set. Take care to handle the different paging modes, each of which has different sets of reserved bits. [avi: fix pte reserved bits for efer.nxe=0] Signed-off-by: Eddie Dong <eddie.dong@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:35 +03:00
Eddie Dong	a8b876b1a4	KVM: MMU: Fix comment in page_fault() The original one is for the code before refactoring. Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com> Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:34 +03:00
Sheng Yang	f9c617f611	KVM: VMX: Correct wrong vmcs field sizes EXIT_QUALIFICATION and GUEST_LINEAR_ADDRESS are natural width, not 64-bit. Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:34 +03:00
Avi Kivity	7d433b9f94	KVM: VMX: Make flexpriority module parameter reflect hardware capability If the hardware does not support flexpriority, zero the module parameter. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:34 +03:00
Gleb Natapov	78646121e9	KVM: Fix interrupt unhalting a vcpu when it shouldn't kvm_vcpu_block() unhalts vpu on an interrupt/timer without checking if interrupt window is actually opened. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:33 +03:00
Gleb Natapov	09cec75488	KVM: Timer event should not unconditionally unhalt vcpu. Currently timer events are processed before entering guest mode. Move it to main vcpu event loop since timer events should be processed even while vcpu is halted. Timer may cause interrupt/nmi to be injected and only then vcpu will be unhalted. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:33 +03:00
Avi Kivity	089d034e0c	KVM: VMX: Fold vm_need_ept() into callers Trivial. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:32 +03:00
Avi Kivity	575ff2dcb2	KVM: VMX: Zero ept module parameter if ept is not present Allows reading back hardware capability. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:32 +03:00
Avi Kivity	919818abc2	KVM: VMX: Zero the vpid module parameter if vpid is not supported This allows reading back how the hardware is configured. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:32 +03:00
Avi Kivity	4462d21a61	KVM: VMX: Annotate module parameters as __read_mostly Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:32 +03:00
Avi Kivity	736caefe15	KVM: VMX: Simplify module parameter names Instead of 'enable_vpid=1', use a simple 'vpid=1'. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:31 +03:00
Avi Kivity	6062d012ed	KVM: VMX: Rename kvm_handle_exit() to vmx_handle_exit() It is a static vmx-specific function. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:31 +03:00
Avi Kivity	c1f8bc04c6	KVM: VMX: Make module parameters readable Useful to see how the module was loaded. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:31 +03:00
Gleb Natapov	fe4c7b1914	KVM: reuse (pop\|push)_irq from svm.c in vmx.c The prioritized bit vector manipulation functions are useful in both vmx and svm. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:31 +03:00
Gleb Natapov	61c50edfcd	KVM: SVM: Remove duplicate code in svm_do_inject_vector() svm_do_inject_vector() reimplements pop_irq(). Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:30 +03:00
Amit Shah	7fe29e0faa	KVM: x86: Ignore reads to EVNTSEL MSRs We ignore writes to the performance counters and performance event selector registers already. Kaspersky antivirus reads the eventsel MSR causing it to crash with the current behaviour. Return 0 as data when the eventsel registers are read to stop the crash. Signed-off-by: Amit Shah <amit.shah@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:30 +03:00
Gleb Natapov	f00be0cae4	KVM: MMU: do not free active mmu pages in free_mmu_pages() free_mmu_pages() should only undo what alloc_mmu_pages() does. Free mmu pages from the generic VM destruction function, kvm_destroy_vm(). Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:30 +03:00
Sheng Yang	e56d532f20	KVM: Device assignment framework rework After discussion with Marcelo, we decided to rework device assignment framework together. The old problems are kernel logic is unnecessary complex. So Marcelo suggest to split it into a more elegant way: 1. Split host IRQ assign and guest IRQ assign. And userspace determine the combination. Also discard msi2intx parameter, userspace can specific KVM_DEV_IRQ_HOST_MSI \| KVM_DEV_IRQ_GUEST_INTX in assigned_irq->flags to enable MSI to INTx convertion. 2. Split assign IRQ and deassign IRQ. Import two new ioctls: KVM_ASSIGN_DEV_IRQ and KVM_DEASSIGN_DEV_IRQ. This patch also fixed the reversed _IOR vs _IOW in definition(by deprecated the old interface). [avi: replace homemade bitcount() by hweight_long()] Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:29 +03:00
Hannes Eder	386eb6e8b3	KVM: make 'lapic_timer_ops' and 'kpit_ops' static Fix this sparse warnings: arch/x86/kvm/lapic.c:916:22: warning: symbol 'lapic_timer_ops' was not declared. Should it be static? arch/x86/kvm/i8254.c:268:22: warning: symbol 'kpit_ops' was not declared. Should it be static? Signed-off-by: Hannes Eder <hannes@hanneseder.net> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:29 +03:00
Gleb Natapov	58c2dde17d	KVM: APIC: get rid of deliver_bitmask Deliver interrupt during destination matching loop. Signed-off-by: Gleb Natapov <gleb@redhat.com> Acked-by: Xiantao Zhang <xiantao.zhang@intel.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-06-10 11:48:27 +03:00
Gleb Natapov	e1035715ef	KVM: change the way how lowest priority vcpu is calculated The new way does not require additional loop over vcpus to calculate the one with lowest priority as one is chosen during delivery bitmap construction. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-06-10 11:48:27 +03:00
Gleb Natapov	343f94fe4d	KVM: consolidate ioapic/ipi interrupt delivery logic Use kvm_apic_match_dest() in kvm_get_intr_delivery_bitmask() instead of duplicating the same code. Use kvm_get_intr_delivery_bitmask() in apic_send_ipi() to figure out ipi destination instead of reimplementing the logic. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-06-10 11:48:27 +03:00
Gleb Natapov	6da7e3f643	KVM: APIC: kvm_apic_set_irq deliver all kinds of interrupts Get rid of ioapic_inj_irq() and ioapic_inj_nmi() functions. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-06-10 11:48:26 +03:00
Joerg Roedel	f5a1e9f895	KVM: MMU: remove call to kvm_mmu_pte_write from walk_addr There is no reason to update the shadow pte here because the guest pte is only changed to dirty state. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2009-06-10 11:48:26 +03:00
Marcelo Tosatti	d3c7b77d1a	KVM: unify part of generic timer handling Hide the internals of vcpu awakening / injection from the in-kernel emulated timers. This makes future changes in this logic easier and decreases the distance to more generic timer handling. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:25 +03:00
Marcelo Tosatti	fd66842370	KVM: PIT: remove usage of count_load_time for channel 0 We can infer elapsed time from hrtimer_expires_remaining. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:25 +03:00
Marcelo Tosatti	5a05d54554	KVM: PIT: remove unused scheduled variable Unused. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:25 +03:00
Marcelo Tosatti	a90ede7b17	KVM: x86: paravirt skip pit-through-ioapic boot check Skip the test which checks if the PIT is properly routed when using the IOAPIC, aimed at buggy hardware. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:24 +03:00
Matt T. Yourst	2dea4c84bc	KVM: x86: silence preempt warning on kvm_write_guest_time This issue just appeared in kvm-84 when running on 2.6.28.7 (x86-64) with PREEMPT enabled. We're getting syslog warnings like this many (but not all) times qemu tells KVM to run the VCPU: BUG: using smp_processor_id() in preemptible [00000000] code: qemu-system-x86/28938 caller is kvm_arch_vcpu_ioctl_run+0x5d1/0xc70 [kvm] Pid: 28938, comm: qemu-system-x86 2.6.28.7-mtyrel-64bit Call Trace: debug_smp_processor_id+0xf7/0x100 kvm_arch_vcpu_ioctl_run+0x5d1/0xc70 [kvm] ? __wake_up+0x4e/0x70 ? wake_futex+0x27/0x40 kvm_vcpu_ioctl+0x2e9/0x5a0 [kvm] enqueue_hrtimer+0x8a/0x110 _spin_unlock_irqrestore+0x27/0x50 vfs_ioctl+0x31/0xa0 do_vfs_ioctl+0x74/0x480 sys_futex+0xb4/0x140 sys_ioctl+0x99/0xa0 system_call_fastpath+0x16/0x1b As it turns out, the call trace is messed up due to gcc's inlining, but I isolated the problem anyway: kvm_write_guest_time() is being used in a non-thread-safe manner on preemptable kernels. Basically kvm_write_guest_time()'s body needs to be surrounded by preempt_disable() and preempt_enable(), since the kernel won't let us query any per-CPU data (indirectly using smp_processor_id()) without preemption disabled. The attached patch fixes this issue by disabling preemption inside kvm_write_guest_time(). [marcelo: surround only __get_cpu_var calls since the warning is harmless] Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:24 +03:00
Sheng Yang	d510d6cc65	KVM: Enable MSI-X for KVM assigned device This patch finally enable MSI-X. What we need for MSI-X: 1. Intercept one page in MMIO region of device. So that we can get guest desired MSI-X table and set up the real one. Now this have been done by guest, and transfer to kernel using ioctl KVM_SET_MSIX_NR and KVM_SET_MSIX_ENTRY. 2. Information for incoming interrupt. Now one device can have more than one interrupt, and they are all handled by one workqueue structure. So we need to identify them. The previous patch enable gsi_msg_pending_bitmap get this done. 3. Mapping from host IRQ to guest gsi as well as guest gsi to real MSI/MSI-X message address/data. We used same entry number for the host and guest here, so that it's easy to find the correlated guest gsi. What we lack for now: 1. The PCI spec said nothing can existed with MSI-X table in the same page of MMIO region, except pending bits. The patch ignore pending bits as the first step (so they are always 0 - no pending). 2. The PCI spec allowed to change MSI-X table dynamically. That means, the OS can enable MSI-X, then mask one MSI-X entry, modify it, and unmask it. The patch didn't support this, and Linux also don't work in this way. 3. The patch didn't implement MSI-X mask all and mask single entry. I would implement the former in driver/pci/msi.c later. And for single entry, userspace should have reposibility to handle it. Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:23 +03:00
Sheng Yang	bfd349d073	KVM: bit ops for deliver_bitmap It's also convenient when we extend KVM supported vcpu number in the future. Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:22 +03:00
Sheng Yang	110c2faeba	KVM: Update intr delivery func to accept unsigned long* bitmap Would be used with bit ops, and would be easily extended if KVM_MAX_VCPUS is increased. Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:22 +03:00
Avi Kivity	5897297bc2	KVM: VMX: Don't intercept MSR_KERNEL_GS_BASE Windows 2008 accesses this MSR often on context switch intensive workloads; since we run in guest context with the guest MSR value loaded (so swapgs can work correctly), we can simply disable interception of rdmsr/wrmsr for this MSR. A complication occurs since in legacy mode, we run with the host MSR value loaded. In this case we enable interception. This means we need two MSR bitmaps, one for legacy mode and one for long mode. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:21 +03:00
Avi Kivity	3e7c73e9b1	KVM: VMX: Don't use highmem pages for the msr and pio bitmaps Highmem pages are a pain, and saving three lowmem pages on i386 isn't worth the extra code. Signed-off-by: Avi Kivity <avi@redhat.com>	2009-06-10 11:48:21 +03:00
Chuck Ebbert	0b8c3d5ab0	x86: Clear TS in irq_ts_save() when in an atomic section The dynamic FPU context allocation changes caused the padlock driver to generate the below warning. Fix it by masking TS when doing padlock encryption operations in an atomic section. This solves: BUG: sleeping function called from invalid context at mm/slub.c:1602 in_atomic(): 1, irqs_disabled(): 0, pid: 82, name: cryptomgr_test Pid: 82, comm: cryptomgr_test Not tainted 2.6.29.4-168.test7.fc11.x86_64 #1 Call Trace: [<ffffffff8103ff16>] __might_sleep+0x10b/0x110 [<ffffffff810cd3b2>] kmem_cache_alloc+0x37/0xf1 [<ffffffff81018505>] init_fpu+0x49/0x8a [<ffffffff81012a83>] math_state_restore+0x3e/0xbc [<ffffffff813ac6d0>] do_device_not_available+0x9/0xb [<ffffffff810123ab>] device_not_available+0x1b/0x20 [<ffffffffa001c066>] ? aes_crypt+0x66/0x74 [padlock_aes] [<ffffffff8119a51a>] ? blkcipher_walk_next+0x257/0x2e0 [<ffffffff8119a731>] ? blkcipher_walk_first+0x18e/0x19d [<ffffffffa001c1fe>] aes_encrypt+0x9d/0xe5 [padlock_aes] [<ffffffffa0027253>] crypt+0x6b/0x114 [xts] [<ffffffffa001c161>] ? aes_encrypt+0x0/0xe5 [padlock_aes] [<ffffffffa001c161>] ? aes_encrypt+0x0/0xe5 [padlock_aes] [<ffffffffa0027390>] encrypt+0x49/0x4b [xts] [<ffffffff81199acc>] async_encrypt+0x3c/0x3e [<ffffffff8119dafc>] test_skcipher+0x1da/0x658 [<ffffffff811979c3>] ? crypto_spawn_tfm+0x8e/0xb1 [<ffffffff8119672d>] ? __crypto_alloc_tfm+0x11b/0x15f [<ffffffff811979c3>] ? crypto_spawn_tfm+0x8e/0xb1 [<ffffffff81199dbe>] ? skcipher_geniv_init+0x2b/0x47 [<ffffffff8119a905>] ? async_chainiv_init+0x5c/0x61 [<ffffffff8119dfdd>] alg_test_skcipher+0x63/0x9b [<ffffffff8119e1bc>] alg_test+0x12d/0x175 [<ffffffff8119c488>] cryptomgr_test+0x38/0x54 [<ffffffff8119c450>] ? cryptomgr_test+0x0/0x54 [<ffffffff8105c6c9>] kthread+0x4d/0x78 [<ffffffff8101264a>] child_rip+0xa/0x20 [<ffffffff81011f67>] ? restore_args+0x0/0x30 [<ffffffff8105c67c>] ? kthread+0x0/0x78 [<ffffffff81012640>] ? child_rip+0x0/0x20 Signed-off-by: Chuck Ebbert <cebbert@redhat.com> Cc: Suresh Siddha <suresh.b.siddha@intel.com> LKML-Reference: <20090609104050.50158cfe@dhcp-100-2-144.bos.redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-09 16:50:43 +02:00
Yong Wang	fecc8ac849	perf_counter, x86: Correct some event and umask values for Intel processors Correct some event and UMASK values according to Intel SDM, in the Nehalem and Atom tables. Signed-off-by: Yong Wang <yong.y.wang@intel.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> LKML-Reference: <20090609131553.GA12489@ywang-moblin2.bj.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-09 16:50:07 +02:00
Andreas Herrmann	42937e81a8	x86: Detect use of extended APIC ID for AMD CPUs Booting a 32-bit kernel on Magny-Cours results in the following panic: ... Using APIC driver default ... Overriding APIC driver with bigsmp ... Getting VERSION: 80050010 Getting VERSION: 80050010 Getting ID: 10000000 Getting ID: ef000000 Getting LVT0: 700 Getting LVT1: 10000 Kernel panic - not syncing: Boot APIC ID in local APIC unexpected (16 vs 0) Pid: 1, comm: swapper Not tainted 2.6.30-rcX #2 Call Trace: [<c05194da>] ? panic+0x38/0xd3 [<c0743102>] ? native_smp_prepare_cpus+0x259/0x31f [<c073b19d>] ? kernel_init+0x3e/0x141 [<c073b15f>] ? kernel_init+0x0/0x141 [<c020325f>] ? kernel_thread_helper+0x7/0x10 The reason is that default_get_apic_id handled extension of local APIC ID field just in case of XAPIC. Thus for this AMD CPU, default_get_apic_id() returns 0 and bigsmp_get_apic_id() returns 16 which leads to the respective kernel panic. This patch introduces a Linux specific feature flag to indicate support for extended APIC id (8 bits instead of 4 bits width) and sets the flag on AMD CPUs if applicable. Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com> Cc: <stable@kernel.org> LKML-Reference: <20090608135509.GA12431@alberich.amd.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-09 15:28:46 +02:00
Yinghai Lu	eaa958402e	cpumask: alloc zeroed cpumask for static cpumask_var_ts These are defined as static cpumask_var_t so if MAXSMP is not used, they are cleared already. Avoid surprises when MAXSMP is enabled. Signed-off-by: Yinghai Lu <yinghai.lu@kernel.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2009-06-09 22:30:27 +09:30
Joerg Roedel	e9a22a13c7	amd-iommu: remove unnecessary "AMD IOMMU: " prefix That prefix is already included in the DUMP_printk macro. So there is no need to repeat it in the format string. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>	2009-06-09 12:01:58 +02:00
Joerg Roedel	71ff3bca2f	amd-iommu: detach device explicitly before attaching it to a new domain This fixes a bug with a device that could not be assigned to a KVM guest because it is still assigned to a dma_ops protection domain. [chrisw: simply remove WARN_ON(), will always fire since dev->driver will be pci-sub] Signed-off-by: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>	2009-06-09 11:14:14 +02:00
Joerg Roedel	29150078d7	amd-iommu: remove BUS_NOTIFY_BOUND_DRIVER handling Handling this event causes device assignment in KVM to fail because the device gets re-attached as soon as the pci-stub registers as the driver for the device. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>	2009-06-09 10:54:18 +02:00
Joerg Roedel	d2dd01de99	Merge commit 'tip/core/iommu' into amd-iommu/fixes	2009-06-09 10:50:57 +02:00
Thomas Gleixner	820a644211	perf_counter, x86: Clean up hw_cache_event ids copies Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-08 23:10:42 +02:00
Thomas Gleixner	f86748e91a	perf_counter, x86: Implement generalized cache event types, add AMD support Fill in amd_hw_cache_event_id[] with the AMD CPU specific events, for family 0x0f, 0x10 and 0x11. There's apparently no distinction between load and store events, so we only fill in the load events. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-08 23:10:37 +02:00
Andreas Herrmann	c9690998ef	x86: memtest: remove 64-bit division Using gcc 3.3.5 a "make allmodconfig" + "CONFIG_KVM=n" triggers a build error: arch/x86/mm/built-in.o(.init.text+0x43f7): In function `__change_page_attr': arch/x86/mm/pageattr.c:114: undefined reference to `__udivdi3' make: *** [.tmp_vmlinux1] Error 1 The culprit turned out to be a division in arch/x86/mm/memtest.c For more info see this thread: http://marc.info/?l=linux-kernel&m=124416232620683 The patch entirely removes the division that caused the build error. [ Impact: build fix with certain GCC versions ] Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: xiyou.wangcong@gmail.com Cc: Andrew Morton <akpm@linux-foundation.org> Cc: <stable@kernel.org> LKML-Reference: <20090608170939.GB12431@alberich.amd.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-08 19:18:25 +02:00
Jack Steiner	c4ed3f04ba	x86, UV: Fix macros for multiple coherency domains Fix bug in the SGI UV macros that support systems with multiple coherency domains. The macros used for referencing global MMR (chipset registers) are failing to correctly "or" the NASID (node identifier) bits that reside above M+N. These high bits are supplied automatically by the chipset for memory accesses coming from the processor socket. However, the bits must be present for references to the special global MMR space used to map chipset registers. (See uv_hub.h for more details ...) The bug results in references to invalid/incorrect nodes. Signed-off-by: Jack Steiner <steiner@sgi.com> Cc: <stable@kernel.org> LKML-Reference: <20090608154405.GA16395@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-08 18:57:47 +02:00
Ingo Molnar	1123e3ad73	perf_counter: Clean up x86 boot messages Standardize and tidy up all the messages we print during perfcounter initialization. Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-08 12:29:30 +02:00
Thomas Gleixner	ad68922061	perf_counter, x86: Implement generalized cache event types, add Atom support Fill in core2_hw_cache_event_id[] with the Atom model specific events. The events can be used in all the tools via the -e (--event) parameter, for example "-e l1-misses" or -"-e l2-accesses" or "-e l2-write-misses". ( Note: these are straight from the Intel manuals - not tested yet.) Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-08 11:18:27 +02:00
Thomas Gleixner	0312af8416	perf_counter, x86: Implement generalized cache event types, add Core2 support Fill in core2_hw_cache_event_id[] with the Core2 model specific events. The events can be used in all the tools via the -e (--event) parameter, for example "-e l1-misses" or -"-e l2-accesses" or "-e l2-write-misses". Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-08 11:18:26 +02:00
Figo.zhang	aeef50bc04	x86, microcode: Simplify vfree() use vfree() does its own 'NULL' check, so no need for check before calling it. In v2, remove the stray newline. [ Impact: cleanup ] Signed-off-by: Figo.zhang <figo1802@gmail.com> Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com> LKML-Reference: <1244385036.3402.11.camel@myhost> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-07 16:35:11 +02:00
Lubomir Rintel	3aa6b186f8	x86: Fix non-lazy GS handling in sys_vm86() This fixes a stack corruption panic or null dereference oops due to a bad GS in resume_userspace() when returning from sys_vm86() and calling lockdep_sys_exit(). Only a problem when CONFIG_LOCKDEP and CONFIG_CC_STACKPROTECTOR enabled. Signed-off-by: Lubomir Rintel <lkundrak@v3.sk> Cc: H. Peter Anvin <hpa@zytor.com> LKML-Reference: <1244384628.2323.4.camel@bimbo> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-07 16:31:23 +02:00
Cyrill Gorcunov	a4046f8d29	x86, nmi: Use predefined numbers instead of hardcoded one [ Impact: cleanup ] Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> LKML-Reference: <20090607081937.GC4547@lenovo> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-07 16:22:02 +02:00
Cyrill Gorcunov	103428e57b	x86, apic: Fix dummy apic read operation together with broken MP handling Ingo Molnar reported that read_apic is buggy novadays: [ 0.000000] Using APIC driver default [ 0.000000] SMP: Allowing 1 CPUs, 0 hotplug CPUs [ 0.000000] Local APIC disabled by BIOS -- you can enable it with "lapic" [ 0.000000] APIC: disable apic facility [ 0.000000] ------------[ cut here ]------------ [ 0.000000] WARNING: at arch/x86/kernel/apic/apic.c:254 native_apic_read_dummy+0x2d/0x3b() [ 0.000000] Hardware name: HP OmniBook PC Indeed we still rely on apic->read operation for SMP compiled kernel. And instead of disfigure the SMP code with #ifdef we allow to call apic->read. To capture any unexpected results we check for apic->read being called for sane reason via WARN_ON_ONCE but(!) instead of OR we should use AND logical operation (thanks Yinghai for spotting the root of the problem). Along with that we could be have bad MP table and we are to fix it that way no SMP started and no complains about BIOS bug if apic was just disabled via command line. Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Yinghai Lu <yinghai@kernel.org> LKML-Reference: <20090607124840.GD4547@lenovo> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-07 16:08:05 +02:00
Jean Delvare	4a4aca641b	x86: Add quirk for reboot stalls on a Dell Optiplex 360 The Dell Optiplex 360 hangs on reboot, just like the Optiplex 330, so the same quirk is needed. Signed-off-by: Jean Delvare <jdelvare@suse.de> Cc: Steve Conklin <steve.conklin@canonical.com> Cc: Leann Ogasawara <leann.ogasawara@canonical.com> Cc: <stable@kernel.org> LKML-Reference: <200906051202.38311.jdelvare@suse.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-07 15:51:20 +02:00
Jaswinder Singh Rajput	5095f59bda	x86: cpu_debug: Remove model information to reduce encoding-decoding Remove model information, encoding/decoding and reduce bookkeeping. This, besides removing a lot of code and cleaning up the code, also enables these features on many more CPUs that were enumerated before. Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> LKML-Reference: <1244224637.8212.6.camel@ht.satnam> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-07 12:22:56 +02:00
Ingo Molnar	5f4457a4f6	Merge branch 'linus' into x86/cpu	2009-06-07 12:22:15 +02:00
Ingo Molnar	56fdd18c7b	Merge branch 'linus' into core/iommu Merge reason: This branch was on an -rc5 base so pull almost-2.6.30 to resync with the latest upstream fixes and make sure the combination works fine. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-07 11:35:05 +02:00
Linus Torvalds	ccc0d38ec1	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6: x86/pci: fix mmconfig detection with 32bit near 4g PCI: use fixed-up device class when configuring device	2009-06-06 14:33:54 -07:00
Ingo Molnar	75b5032212	Merge branch 'linus' into perfcounters/core Merge reason: Pick up the latest fixes before the -v8 perfcounters release. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-06 20:21:28 +02:00
Ingo Molnar	8326f44da0	perf_counter: Implement generalized cache event types Extend generic event enumeration with the PERF_TYPE_HW_CACHE method. This is a 3-dimensional space: { L1-D, L1-I, L2, ITLB, DTLB, BPU } x { load, store, prefetch } x { accesses, misses } User-space passes in the 3 coordinates and the kernel provides a counter. (if the hardware supports that type and if the combination makes sense.) Combinations that make no sense produce a -EINVAL. Combinations that are not supported by the hardware produce -ENOTSUP. Extend the tools to deal with this, and rewrite the event symbol parsing code with various popular aliases for the units and access methods above. So 'l1-cache-miss' and 'l1d-read-ops' are both valid aliases. ( x86 is supported for now, with the Nehalem event table filled in, and with Core2 and Atom having placeholder tables. ) Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-06 13:14:47 +02:00
Ingo Molnar	a21ca2cac5	perf_counter: Separate out attr->type from attr->config Counter type is a frequently used value and we do a lot of bit juggling by encoding and decoding it from attr->config. Clean this up by creating a separate attr->type field. Also clean up the various similarly complex user-space bits all around counter attribute management. The net improvement is significant, and it will be easier to add a new major type (which is what triggered this cleanup). (This changes the ABI, all tools are adapted.) (PowerPC build-tested.) Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-06 11:37:22 +02:00
Mark Langsdorf	fe2245c905	x86: enable GART-IOMMU only after setting up protection methods The current code to set up the GART as an IOMMU enables GART translations before it removes the aperture from the kernel memory map, sets the GART PTEs to UC, sets up the guard and scratch pages, or does a wbinvd(). This leaves the possibility of cache aliasing open and can cause system crashes. Re-order the code so as to enable the GART translations only after all safeguards are in place and the tlb has been flushed. AMD has tested this patch on both Istanbul systems and 1st generation Opteron systems with APG enabled and seen no adverse effects. Istanbul systems with HT Assist enabled sometimes see MCE errors due to cache artifacts with the unmodified code. Signed-off-by: Mark Langsdorf <mark.langsdorf@amd.com> Cc: <stable@kernel.org> Cc: Joerg Roedel <joerg.roedel@amd.com> Cc: akpm@linux-foundation.org Cc: jbarnes@virtuousgeek.org Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-06 09:42:09 +02:00
Dave Jones	2c701b1028	[CPUFREQ] powernow-k8: check space_id of _PCT registers to be FFH The powernow-k8 driver checks to see that the Performance Control/Status Registers are declared as FFH (functional fixed hardware) by the BIOS. However, this check got broken in the commit: `0e64a0c982` [CPUFREQ] checkpatch cleanups for powernow-k8 Fix based on an original patch from Naga Chumbalkar. Signed-off-by: Naga Chumbalkar <nagananda.chumbalkar@hp.com> Cc: Mark Langsdorf <mark.langsdorf@amd.com> Signed-off-by: Dave Jones <davej@redhat.com>	2009-06-05 13:25:25 -04:00
Peter Zijlstra	f7b6eb3fa0	x86: Set context.vdso before installing the mapping In order to make arch_vma_name() work from inside install_special_mapping() we need to set the context.vdso before calling it. ( This is needed for performance counters to be able to track this special executable area. ) Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-05 14:46:40 +02:00
Rusty Russell	2cb7878a3a	lguest: fix 'unhandled trap 13' with CONFIG_CC_STACKPROTECTOR We don't set up the canary; let's disable stack protector on boot.c so we can get into lguest_init, then set it up. As a side effect, switch_to_new_gdt() sets up %fs for us properly too. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-04 11:50:06 -07:00
Yinghai Lu	75e613cdc7	x86/pci: fix mmconfig detection with 32bit near 4g Pascal reported and bisected a commit: \| x86/PCI: don't call e820_all_mapped with -1 in the mmconfig case which broke one system system. ACPI: Using IOAPIC for interrupt routing PCI: MCFG configuration 0: base f0000000 segment 0 buses 0 - 255 PCI: MCFG area at f0000000 reserved in ACPI motherboard resources PCI: Using MMCONFIG for extended config space it didn't have PCI: updated MCFG configuration 0: base f0000000 segment 0 buses 0 - 63 anymore, and try to use 0xf000000 - 0xffffffff for mmconfig For 32bit, mcfg_res->end could be 32bit only (if 64 resources aren't used) So use end - 1 to pass the value in mcfg->end to avoid overflow. We don't need to worry about the e820 path, they are always 64 bit. Reported-by: Pascal Terjan <pterjan@mandriva.com> Bisected-by: Pascal Terjan <pterjan@mandriva.com> Tested-by: Pascal Terjan <pterjan@mandriva.com> Signed-off-by: Yinghai Lu <yinghai@kernel.org> Cc: stable@kernel.org Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>	2009-06-04 11:31:13 +01:00
Hidetoshi Seto	8051dbd2df	x86, mce: fix for mce counters Make the MCE counters work on 32bit and add poll count in arch_irq_stat_cpu. Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-06-03 14:48:59 -07:00
Andi Kleen	9b1beaf2b5	x86, mce: support action-optional machine checks Newer Intel CPUs support a new class of machine checks called recoverable action optional. Action Optional means that the CPU detected some form of corruption in the background and tells the OS about using a machine check exception. The OS can then take appropiate action, like killing the process with the corrupted data or logging the event properly to disk. This is done by the new generic high level memory failure handler added in a earlier patch. The high level handler takes the address with the failed memory and does the appropiate action, like killing the process. In this version of the patch the high level handler is stubbed out with a weak function to not create a direct dependency on the hwpoison branch. The high level handler cannot be directly called from the machine check exception though, because it has to run in a defined process context to be able to sleep when taking VM locks (it is not expected to sleep for a long time, just do so in some exceptional cases like lock contention) Thus the MCE handler has to queue a work item for process context, trigger process context and then call the high level handler from there. This patch adds two path to process context: through a per thread kernel exit notify_user() callback or through a high priority work item. The first runs when the process exits back to user space, the other when it goes to sleep and there is no higher priority process. The machine check handler will schedule both, and whoever runs first will grab the event. This is done because quick reaction to this event is critical to avoid a potential more fatal machine check when the corruption is consumed. There is a simple lock less ring buffer to queue the corrupted addresses between the exception handler and the process context handler. Then in process context it just calls the high level VM code with the corrupted PFNs. The code adds the required code to extract the failed address from the CPU's machine check registers. It doesn't try to handle all possible cases -- the specification has 6 different ways to specify memory address -- but only the linear address. Most of the required checking has been already done earlier in the mce_severity rule checking engine. Following the Intel recommendations Action Optional errors are only enabled for known situations (encoded in MCACODs). The errors are ignored otherwise, because they are action optional. v2: Improve comment, disable preemption while processing ring buffer (reported by Ying Huang) Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-06-03 14:48:59 -07:00
Andi Kleen	8fa8dd9e3a	x86, mce: define MCE_VECTOR Add MCE_VECTOR for the #MC exception. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-06-03 14:48:05 -07:00
Andi Kleen	9ff36ee966	x86, mce: rename mce_notify_user to mce_notify_irq Rename the mce_notify_user function to mce_notify_irq. The next patch will split the wakeup handling of interrupt context and of process context and it's better to give it a clearer name for this. Contains a fix from Ying Huang [ Impact: cleanup ] Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: Huang Ying <ying.huang@intel.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-06-03 14:48:04 -07:00
Andi Kleen	4ef702c10b	x86: fix panic with interrupts off (needed for MCE) For some time each panic() called with interrupts disabled triggered the !irqs_disabled() WARN_ON in smp_call_function(), producing ugly backtraces and confusing users. This is a common situation with machine checks for example which tend to call panic with interrupts disabled, but will also hit in other situations e.g. panic during early boot. In fact it means that panic cannot be called in many circumstances, which would be bad. This all started with the new fancy queued smp_call_function, which is then used by the shutdown path to shut down the other CPUs. On closer examination it turned out that the fancy RCU smp_call_function() does lots of things not suitable in a panic situation anyways, like allocating memory and relying on complex system state. I originally tried to patch this over by checking for panic there, but it was quite complicated and the original patch was also not very popular. This also didn't fix some of the underlying complexity problems. The new code in post 2.6.29 tries to patch around this by checking for oops_in_progress, but that is not enough to make this fully safe and I don't think that's a real solution because panic has to be reliable. So instead use an own vector to reboot. This makes the reboot code extremly straight forward, which is definitely a big plus in a panic situation where it is important to avoid relying on too much kernel state. The new simple code is also safe to be called from interupts off region because it is very very simple. There can be situations where it is important that panic is reliable. For example on a fatal machine check the panic is needed to get the system up again and running as quickly as possible. So it's important that panic is reliable and all function it calls simple. This is why I came up with this simple vector scheme. It's very hard to beat in simplicity. Vectors are not particularly precious anymore since all big systems are using per CPU vectors. Another possibility would have been to use an NMI similar to kdump, but there is still the problem that NMIs don't work reliably on some systems due to BIOS issues. NMIs would have been able to stop CPUs running with interrupts off too. In the sake of universal reliability I opted for using a non NMI vector for now. I put the reboot vector into the highest priority bucket of the APIC vectors and moved the 64bit UV_BAU message down instead into the next lower priority. [ Impact: bug fix, fixes an old regression ] Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-06-03 14:45:35 -07:00
Huang Ying	4611a6fa4b	x86, mce: export MCE severities coverage via debugfs The MCE severity judgement code is data-driven, so code coverage tools such as gcov can not be used for measuring coverage. Instead a dedicated coverage mechanism is implemented. The kernel keeps track of rules executed and reports them in debugfs. This is useful for increasing coverage of the mce-test testsuite. Right now it's unconditionally enabled because it's very little code. Signed-off-by: Huang Ying <ying.huang@intel.com> Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-06-03 14:45:34 -07:00
Andi Kleen	ed7290d0ee	x86, mce: implement new status bits The x86 architecture recently added some new machine check status bits: S(ignalled) and AR (Action-Required). Signalled allows to check if a specific event caused an exception or was just logged through CMCI. AR allows the kernel to decide if an event needs immediate action or can be delayed or ignored. Implement support for these new status bits. mce_severity() uses the new bits to grade the machine check correctly and decide what to do. The exception handler uses AR to decide to kill or not. The S bit is used to separate events between the poll/CMCI handler and the exception handler. Classical UC always leads to panic. That was true before anyways because the existing CPUs always passed a PCC with it. Also corrects the rules whether to kill in user or kernel context and how to handle missing RIPV. The machine check handler largely uses the mce-severity grading engine now instead of making its own decisions. This means the logic is centralized in one place. This is useful because it has to be evaluated multiple times. v2: Some rule fixes; Add AO events Fix RIPV, RIPV\|EIPV order (Ying Huang) Fix UCNA with AR=1 message (Ying Huang) Add comment about panicing in m_c_p. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-06-03 14:45:34 -07:00
Andi Kleen	86503560e4	x86, mce: print header/footer only once for multiple MCEs When multiple MCEs are printed print the "HARDWARE ERROR" header and "This is not a software error" footer only once. This makes the output much more compact with many CPUs. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-06-03 14:45:34 -07:00
Andi Kleen	29b0f591d6	x86, mce: default to panic timeout for machine checks Fatal machine checks can be logged to disk after boot, but only if the system did a warm reboot. That's unfortunately difficult with the default panic behaviour, which waits forever and the admin has to press the power button because modern systems usually miss a reset button. This clears the machine checks in the registers and make it impossible to log them. This patch changes the default for machine check panic to always reboot after 30s. Then the mce can be successfully logged after reboot. I believe this will improve machine check experience for any system running the X server. This is dependent on successfull boot logging of MCEs. This currently only works on Intel systems, on AMD there are quite a lot of systems around which leave junk in the machine check registers after boot, so it's disabled here. These systems will continue to default to endless waiting panic. v2: Only force panic timeout when it's shorter (H.Seto) v3: Only force timeout when there is no timeout (based on comment H.Seto) [ Fix changelog - HS ] Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-06-03 14:45:33 -07:00
Huang Ying	1b2797dcc9	x86, mce: improve mce_get_rip Assume IP on the stack is valid when either EIPV or RIPV are set. This influences whether the machine check exception handler decides to return or panic. This fixes a test case in the mce-test suite and is more compliant to the specification. This currently only makes a difference in a artificial testing scenario with the mce-test test suite. Also in addition do not force the EIPV to be valid with the exact register MSRs, and keep in trust the CS value on stack even if MSR is available. [AK: combination of patches from Huang Ying and Hidetoshi Seto, with new description by me] [add some description, no code changed - HS] Signed-off-by: Huang Ying <ying.huang@intel.com> Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-06-03 14:45:33 -07:00
Andi Kleen	ac9603754d	x86, mce: make non Monarch panic message "Fatal machine check" too ... instead of "Machine check". This is for consistency with the Monarch panic message. Based on a report from Ying Huang. v2: But add a descriptive postfix so that the test suite can distingush. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-06-03 14:45:12 -07:00
Andi Kleen	3c0797925f	x86, mce: switch x86 machine check handler to Monarch election. On Intel platforms machine check exceptions are always broadcast to all CPUs. This patch makes the machine check handler synchronize all these machine checks, elect a Monarch to handle the event and collect the worst event from all CPUs and then process it first. This has some advantages: - When there is a truly data corrupting error the system panics as quickly as possible. This improves containment of corrupted data and makes sure the corrupted data never hits stable storage. - The panics are synchronized and do not reenter the panic code on multiple CPUs (which currently does not handle this well). - All the errors are reported. Currently it often happens that another CPU happens to do the panic first, but reports useless information (empty machine check) because the real error happened on another CPU which came in later. This is a big advantage on Nehalem where the 8 threads per CPU lead to often the wrong CPU winning the race and dumping useless information on a machine check. The problem also occurs in a less severe form on older CPUs. - The system can detect when no CPUs detected a machine check and shut down the system. This can happen when one CPU is so badly hung that that it cannot process a machine check anymore or when some external agent wants to stop the system by asserting the machine check pin. This follows Intel hardware recommendations. - This matches the recommended error model by the CPU designers. - The events can be output in true severity order - When a panic happens on another CPU it makes sure to be actually be able to process the stop IPI by enabling interrupts. The code is extremly careful to handle timeouts while waiting for other CPUs. It can't rely on the normal timing mechanisms (jiffies, ktime_get) because of its asynchronous/lockless nature, so it uses own timeouts using ndelay() and a "SPINUNIT" The timeout is configurable. By default it waits for upto one second for the other CPUs. This can be also disabled. From some informal testing AMD systems do not see to broadcast machine checks, so right now it's always disabled by default on non Intel CPUs or also on very old Intel systems. Includes fixes from Ying Huang Fixed a "ecception" in a comment (H.Seto) Moved global_nwo reset later based on suggestion from H.Seto v2: Avoid duplicate messages [ Impact: feature, fixes long standing problems. ] Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-06-03 14:45:12 -07:00
Andi Kleen	f94b61c2c9	x86, mce: implement panic synchronization In some circumstances multiple CPUs can enter mce_panic() in parallel. This gives quite confused output because they will all dump the same machine check buffer. The other problem is that they would all panic in parallel, but not process each other's shutdown IPIs because interrupts are disabled. Detect this situation early on in mce_panic(). On the first CPU entering will do the panic, the others will just wait to be killed. For paranoia reasons in case the other CPU dies during the MCE I added a 5 seconds timeout. If it expires each CPU will panic on its own again. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-06-03 14:45:12 -07:00
Andi Kleen	ccc3c3192a	x86, mce: implement bootstrapping for machine check wakeups Machine checks support waking up the mcelog daemon quickly. The original wake up code for this was pretty ugly, relying on a idle notifier and a special process flag. The reason it did it this way is that the machine check handler is not subject to normal interrupt locking rules so it's not safe to call wake_up(). Instead it set a process flag and then either did the wakeup in the syscall return or in the idle notifier. This patch adds a new "bootstraping" method as replacement. The idea is that the handler checks if it's in a state where it is unsafe to call wake_up(). If it's safe it calls it directly. When it's not safe -- that is it interrupted in a critical section with interrupts disables -- it uses a new "self IPI" to trigger an IPI to its own CPU. This can be done safely because IPI triggers are atomic with some care. The IPI is raised once the interrupts are reenabled and can then safely call wake_up(). When APICs are disabled the event is just queued and will be picked up eventually by the next polling timer. I think that's a reasonable compromise, since it should only happen quite rarely. Contains fixes from Ying Huang. [ solve conflict on irqinit, make it work on 32bit (entry_arch.h) - HS ] Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-06-03 14:44:05 -07:00
Andi Kleen	bd19a5e6b7	x86, mce: check early in exception handler if panic is needed The exception handler should behave differently if the exception is fatal versus one that can be returned from. In the first case it should never clear any registers because these need to be preserved for logging after the next boot. Otherwise it should clear them on each CPU step by step so that other CPUs sharing the same bank don't see duplicate events. Otherwise we risk reporting events multiple times on any CPUs which have shared machine check banks, which is a common problem on Intel Nehalem which has both SMT (two CPU threads sharing banks) and shared machine check banks in the uncore. Determine early in a special pass if any event requires a panic. This uses the mce_severity() function added earlier. This is needed for the next patch. Also fixes a problem together with an earlier patch that corrected events weren't logged on a fatal MCE. [ Impact: Feature ] Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-06-03 14:40:39 -07:00
Andi Kleen	817f32d02a	x86, mce: add table driven machine check grading The machine check grading (as in deciding what should be done for a given register value) has to be done multiple times soon and it's also getting more complicated. So it makes sense to consolidate it into a single function. To get smaller and more straight forward and possibly more extensible code I opted towards a new table driven method. The various rules are put into a table when is then executed by a very simple interpreter. The grading engine is in a new file mce-severity.c. I also added a private include file mce-internal.h, because mce.h is already a bit too cluttered. This is dead code right now, but will be used in followon patches. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-06-03 14:40:39 -07:00
Andi Kleen	a0189c70e5	x86, mce: remove TSC print heuristic Previously mce_panic used a simple heuristic to avoid printing old so far unreported machine check events on a mce panic. This worked by comparing the TSC value at the start of the machine check handler with the event time stamp and only printing newer ones. This has a couple of issues, in particular on systems where the TSC is not fully synchronized between CPUs it could lose events or print old ones. It is also problematic with full system synchronization as it is added by the next patch. Remove the TSC heuristic and instead replace it with a simple heuristic to print corrected errors first and after that uncorrected errors and finally the worst machine check as determined by the machine check handler. This simplifies the code because there is no need to pass the original TSC value around. Contains fixes from Ying Huang [ Impact: bug fix, cleanup ] Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: Ying Huang <ying.huang@intel.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-06-03 14:40:39 -07:00
Andi Kleen	de8a84d85a	x86, mce: log corrected errors when panicing Normally the machine check handler ignores corrected errors and leaves them to machine_check_poll(). But when panicing mcp won't run, so log all errors. Note: this can still miss some cases until the "early no way out" patch later is applied too. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-06-03 14:40:39 -07:00
Andi Kleen	8ee08347c1	x86, mce: extend struct mce user interface with more information. Experience has shown that struct mce which is used to pass an machine check to the user space daemon currently a few limitations. Also some data which is useful to print at panic level is also missing. This patch addresses most of them. The same information is also printed out together with mce panic. struct mce can be painlessly extended in a compatible way, the mcelog user space code just ignores additional fields with a warning. - It doesn't provide a wall time timestamp. There have been a few complaints about that. Fix that by adding a 64bit time_t - It doesn't provide the exact CPU identification. This makes it awkward for mcelog to decode the event correctly, especially when there are variations in the supported MCE codes on different CPU models or when mcelog is running on a different host after a panic. Previously the administrator had to specify the correct CPU when mcelog ran on a different host, but with the more variation in machine checks now it's better to auto detect that. It's also useful for more detailed analysis of CPU events. Pass CPUID 1.EAX and the cpu vendor (as encoded in processor.h) instead. - Socket ID and initial APIC ID are useful to report because they allow to identify the failing CPU in some (not all) cases. This is also especially useful for the panic situation. This addresses one of the complaints from Thomas Gleixner earlier. - The MCG capabilities MSR needs to be reported for some advanced error processing in mcelog Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-06-03 14:40:38 -07:00
Andi Kleen	d620c67fb9	x86, mce: support more than 256 CPUs in struct mce The old struct mce had a limitation to 256 CPUs. But x86 Linux supports more than that now with x2apic. Add a new field extcpu to report the extended number. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-06-03 14:40:38 -07:00
Andi Kleen	f6fb0ac086	x86, mce: store record length into memory struct mce anchor This makes it easier for tools who want to extract the mcelog out of crash images or memory dumps to adapt to changing struct mce size. The length field replaces padding, so it's fully compatible. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-06-03 14:40:38 -07:00
Andi Kleen	ca84f69697	x86, mce: add MCE poll count to /proc/interrupts Keep a count of the machine check polls (or CMCI events) in /proc/interrupts. Andi needs this for debugging, but it's also useful in general to see what's going in by the kernel. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-06-03 14:40:38 -07:00
Andi Kleen	01ca79f141	x86, mce: add machine check exception count in /proc/interrupts Useful for debugging, but it's also good general policy to have a counter for all special interrupts there. This makes it easier to diagnose where a CPU is spending its time. [ Impact: feature, debugging tool ] Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-06-03 14:40:38 -07:00
Ingo Molnar	128f048f0f	perf_counter: Fix throttling lock-up Throttling logic is broken and we can lock up with too small hw sampling intervals. Make the throttling code more robust: disable counters even if we already disabled them. ( Also clean up whitespace damage i noticed while reading various pieces of code related to throttling. ) Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-03 23:39:51 +02:00
Cliff Wickman	0e2595cdfd	x86: Fix UV BAU activation descriptor init The UV tlb shootdown code has a serious initialization error. An array of structures [328] is initialized as if it were [32]. The array is indexed by (cpu number on the blade)8, so the short initialization works for up to 4 cpus on a blade. But above that, we provide an invalid opcode to the hub's broadcast assist unit. This patch changes the allocation of the array to use its symbolic dimensions for better clarity. And initializes all 32*8 entries. Shortened 'UV_ACTIVATION_DESCRIPTOR_SIZE' to 'UV_ADP_SIZE' per Ingo's recommendation. Tested on the UV simulator. Signed-off-by: Cliff Wickman <cpw@sgi.com> Cc: <stable@kernel.org> LKML-Reference: <E1M6lZR-0007kV-Aq@eag09.americas.sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-03 13:07:31 +02:00
Jiri Slaby	367d04c4ec	amd_iommu: fix lock imbalance In alloc_coherent there is an omitted unlock on the path where mapping fails. Add the unlock. [ Impact: fix lock imbalance in alloc_coherent ] Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Cc: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>	2009-06-03 10:34:55 +02:00
Yong Wang	a32881066e	perf_counter/x86: Remove the IRQ (non-NMI) handling bits Remove the IRQ (non-NMI) handling bits as NMI will be used always. Signed-off-by: Yong Wang <yong.y.wang@intel.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: John Kacur <jkacur@redhat.com> LKML-Reference: <20090603051255.GA2791@ywang-moblin2.bj.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-03 09:53:34 +02:00
Mike Galbraith	6799687a53	x86, boot: add new generated files to the appropriate .gitignore files git status complains of untracked (generated) files in arch/x86/boot.. # Untracked files: # (use "git add <file>..." to include in what will be committed) # # ../../arch/x86/boot/compressed/mkpiggy # ../../arch/x86/boot/compressed/piggy.S # ../../arch/x86/boot/compressed/vmlinux.lds # ../../arch/x86/boot/voffset.h # ../../arch/x86/boot/zoffset.h ..so adjust .gitignore files accordingly. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-06-02 21:13:30 -07:00
Peter Zijlstra	0d48696f87	perf_counter: Rename perf_counter_hw_event => perf_counter_attr The structure isn't hw only and when I read event, I think about those things that fall out the other end. Rename the thing. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Kacur <jkacur@redhat.com> Cc: Stephane Eranian <eranian@googlemail.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-02 21:45:33 +02:00
Peter Zijlstra	e4abb5d4f7	perf_counter: x86: Emulate longer sample periods Do as Power already does, emulate sample periods up to 2^63-1 by composing them of smaller values limited by hardware capabilities. Only once we wrap the software period do we generate an overflow event. Just 10 lines of new code. Reported-by: Stephane Eranian <eranian@googlemail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Kacur <jkacur@redhat.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-02 21:45:31 +02:00
Peter Zijlstra	8a016db386	perf_counter: Remove the last nmi/irq bits IRQ (non-NMI) sampling is not used anymore - remove the last few bits. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Kacur <jkacur@redhat.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-02 21:45:31 +02:00
Peter Zijlstra	b23f3325ed	perf_counter: Rename various fields A few renames: s/irq_period/sample_period/ s/irq_freq/sample_freq/ s/PERF_RECORD_/PERF_SAMPLE_/ s/record_type/sample_type/ And change both the new sample_type and read_format to u64. Reported-by: Stephane Eranian <eranian@googlemail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Kacur <jkacur@redhat.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-02 21:45:30 +02:00
Huang Ying	2cf4ac8beb	crypto: aes-ni - Add support for more modes Because kernel_fpu_begin() and kernel_fpu_end() operations are too slow, the performance gain of general mode implementation + aes-aesni is almost all compensated. The AES-NI support for more modes are implemented as follow: - Add a new AES algorithm implementation named __aes-aesni without kernel_fpu_begin/end() - Use fpu(<mode>(AES)) to provide kenrel_fpu_begin/end() invoking - Add <mode>(AES) ablkcipher, which uses cryptd(fpu(<mode>(AES))) to defer cryption to cryptd context in soft_irq context. Now the ctr, lrw, pcbc and xts support are added. Performance testing based on dm-crypt shows that cryption time can be reduced to 50% of general mode implementation + aes-aesni implementation. Signed-off-by: Huang Ying <ying.huang@intel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2009-06-02 14:04:16 +10:00
Huang Ying	150c7e8552	crypto: fpu - Add template for blkcipher touching FPU Blkcipher touching FPU need to be enclosed by kernel_fpu_begin() and kernel_fpu_end(). If they are invoked in cipher algorithm implementation, they will be invoked for each block, so that performance will be hurt, because they are "slow" operations. This patch implements "fpu" template, which makes these operations to be invoked for each request. Signed-off-by: Huang Ying <ying.huang@intel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2009-06-02 14:04:15 +10:00
Jiri Slaby	3d58829b05	x86, apic: Restore irqs on fail paths lapic_resume forgets to restore interrupts on fail paths. Fix that. Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Acked-by: Cyrill Gorcunov <gorcunov@openvz.org> LKML-Reference: <1243497289-18591-1-git-send-email-jirislaby@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: H. Peter Anvin <hpa@zytor.com>	2009-06-02 02:48:59 +02:00
Naga Chumbalkar	58f892e022	x86: Print real IOAPIC version for x86-64 Fix the fact that the IOAPIC version number in the x86_64 code path always gets assigned to 0, instead of the correct value. Before the patch: (from "dmesg" output): ACPI: IOAPIC (id[0x08] address[0xfec00000] gsi_base[0]) IOAPIC[0]: apic_id 8, version 0, address 0xfec00000, GSI 0-23 <--- After the patch: ACPI: IOAPIC (id[0x08] address[0xfec00000] gsi_base[0]) IOAPIC[0]: apic_id 8, version 32, address 0xfec00000, GSI 0-23 <--- History: io_apic_get_version() was compiled out of the x86_64 code path in the commit `f2c2cca3ac`: Author: Andi Kleen <ak@suse.de> Date: Tue Sep 26 10:52:37 2006 +0200 [PATCH] Remove APIC version/cpu capability mpparse checking/printing ACPI went to great trouble to get the APIC version and CPU capabilities of different CPUs before passing them to the mpparser. But all that data was used was to print it out. Actually it even faked some data based on the boot cpu, not on the actual CPU being booted. Remove all this code because it's not needed. Cc: len.brown@intel.com At the time, the IOAPIC version number was deliberately not printed in the x86_64 code path. However, after the x86 and x86_64 files were merged, the net result is that the IOAPIC version is printed incorrectly in the x86_64 code path. The patch below provides a fix. I have tested it with acpi, and with acpi=off, and did not see any problems. Signed-off-by: Naga Chumbalkar <nagananda.chumbalkar@hp.com> Acked-by: Yinghai Lu <yhlu.kernel@gmail.com> LKML-Reference: <20090416014230.4885.94926.sendpatchset@localhost.localdomain> Signed-off-by: Ingo Molnar <mingo@elte.hu> *************************	2009-06-02 02:03:18 +02:00
H. Peter Anvin	48b1fddbb1	Merge branch 'irq/numa' into x86/mce3 Merge reason: arch/x86/kernel/irqinit_{32,64}.c unified in irq/numa and modified in x86/mce3; this merge resolves the conflict. Conflicts: arch/x86/kernel/irqinit.c Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-06-01 15:25:31 -07:00
Ingo Molnar	ee4c24a5c9	Merge branch 'x86/cpufeature' into irq/numa Merge reason: irq/numa didnt build because this commit: `2759c32`: x86: don't call read_apic_id if !cpu_has_apic Had a dependency on x86/cpufeature changes. Pull in that (small) branch to fix the dependency. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-01 22:30:01 +02:00
Ingo Molnar	3d58f48ba0	Merge branch 'linus' into irq/numa Conflicts: arch/mips/sibyte/bcm1480/irq.c arch/mips/sibyte/sb1250/irq.c Merge reason: we gathered a few conflicts plus update to latest upstream fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-01 21:06:21 +02:00
Ingo Molnar	23db9f430b	Merge branch 'linus' into perfcounters/core Merge reason: merge almost-rc8 into perfcounters/core, which was -rc6 based - to pick up the latest upstream fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-06-01 10:01:39 +02:00
Joe Perches	61c8c67e3a	acpi-cpufreq: fix printk typo and indentation Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Len Brown <len.brown@intel.com>	2009-05-29 21:26:26 -04:00
Mel Gorman	32b154c0b0	x86: ignore VM_LOCKED when determining if hugetlb-backed page tables can be shared or not Addresses http://bugzilla.kernel.org/show_bug.cgi?id=13302 On x86 and x86-64, it is possible that page tables are shared beween shared mappings backed by hugetlbfs. As part of this, page_table_shareable() checks a pair of vma->vm_flags and they must match if they are to be shared. All VMA flags are taken into account, including VM_LOCKED. The problem is that VM_LOCKED is cleared on fork(). When a process with a shared memory segment forks() to exec() a helper, there will be shared VMAs with different flags. The impact is that the shared segment is sometimes considered shareable and other times not, depending on what process is checking. What happens is that the segment page tables are being shared but the count is inaccurate depending on the ordering of events. As the page tables are freed with put_page(), bad pmd's are found when some of the children exit. The hugepage counters also get corrupted and the Total and Free count will no longer match even when all the hugepage-backed regions are freed. This requires a reboot of the machine to "fix". This patch addresses the problem by comparing all flags except VM_LOCKED when deciding if pagetables should be shared or not for hugetlbfs-backed mapping. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Ingo Molnar <mingo@elte.hu> Cc: <stable@kernel.org> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: <starlight@binnacle.cx> Cc: Eric B Munson <ebmunson@us.ibm.com> Cc: Adam Litke <agl@us.ibm.com> Cc: Andy Whitcroft <apw@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-05-29 08:40:03 -07:00
Yong Wang	c323d95fa4	perf_counter/x86: Always use NMI for performance-monitoring interrupt Always use NMI for performance-monitoring interrupt as there could be racy situations if we switch between irq and nmi mode frequently. Signed-off-by: Yong Wang <yong.y.wang@intel.com> LKML-Reference: <20090529052835.GA13657@ywang-moblin2.bj.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-29 09:04:58 +02:00
H. Peter Anvin	38736072d4	x86, mce: drop "extern" from function prototypes in asm/mce.h Function prototypes don't need to be prefixed by "extern". [ Impact: cleanup ] Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 10:05:33 -07:00
Hidetoshi Seto	cd13adcc82	x86: trivial clean up for arch/x86/Kconfig Use tab. Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 10:05:21 -07:00
Andi Kleen	eb2a6ab729	x86: trivial clean up for irq_vectors.h Fix a wrong comment. Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:16 -07:00
Hidetoshi Seto	98a9c8c3ba	x86, mce: trivial clean up for mce-inject.c Fix for: WARNING: Use #include <linux/uaccess.h> instead of <asm/uaccess.h> +#include <asm/uaccess.h> WARNING: usage of NR_CPUS is often wrong - consider using cpu_possible(), num_possible_cpus(), for_each_possible_cpu(), etc + if (m.cpu >= NR_CPUS \|\| !cpu_online(m.cpu)) ERROR: trailing whitespace +/* $ Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:16 -07:00
Hidetoshi Seto	61a021a070	x86, mce: trivial clean up for mce_intel_64.c Fix for: WARNING: space prohibited between function name and open parenthesis '(' + for_each_online_cpu (cpu) { Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:16 -07:00
Hidetoshi Seto	34fa1967aa	x86, mce: trivial clean up for mce_amd_64.c Fix for followings: WARNING: Use #include <linux/percpu.h> instead of <asm/percpu.h> +#include <asm/percpu.h> ERROR: Macros with multiple statements should be enclosed in a do - while loop +#define THRESHOLD_ATTR(_name, _mode, _show, _store) \ +{ \ + .attr = {.name = __stringify(_name), .mode = _mode }, \ + .show = _show, \ + .store = _store, \ +}; WARNING: usage of NR_CPUS is often wrong - consider using cpu_possible(), num_possible_cpus(), for_each_possible_cpu(), etc + if (cpu >= NR_CPUS) Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:16 -07:00
Hidetoshi Seto	14a02530e2	x86, mce: trivial clean up for mce.c This fixs following checkpatch warnings: WARNING: Use #include <linux/uaccess.h> instead of <asm/uaccess.h> +#include <asm/uaccess.h> WARNING: Use #include <linux/smp.h> instead of <asm/smp.h> +#include <asm/smp.h> WARNING: line over 80 characters + set_bit(MCE_OVERFLOW, (unsigned long *)&mcelog.flags); WARNING: braces {} are not necessary for any arm of this statement + if (mce_notify_user()) { [...] + } else { [...] Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:16 -07:00
Hidetoshi Seto	cc3aec52ab	x86, mce: trivial clean up for therm_throt.c This patch removes following checkpatch warning: WARNING: Use #include <linux/cpu.h> instead of <asm/cpu.h> +#include <asm/cpu.h> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:15 -07:00
Hidetoshi Seto	9319cec8c1	x86, mce: use strict_strtoull Use strict_strtoull instead of simple_strtoull. Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:15 -07:00
Andi Kleen	b170204ddb	x86, mce: drop BKL in mce_open BKL is not needed for anything in mce_open because it has an own spinlock. Remove it. [ Impact: cleanup ] Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:15 -07:00
Andi Kleen	32561696c2	x86, mce: rename and align out2 label There's only a single out path in do_machine_check now, so rename the label from out2 to out. Also align it at the first column. [ Impact: minor cleanup, no functional changes ] Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:15 -07:00
Thomas Gleixner	8be9110569	x86, mce: remove mce_init unused argument Remove unused mce_init argument. [ Impact: cleanup ] Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:15 -07:00
Andi Kleen	fc016a49c2	x86, mce: remove unused mce_events variable Remove unused mce_events static variable. [ Impact: cleanup ] Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:15 -07:00
Andi Kleen	b56f642d2b	x86, mce: use extended sysattrs for the check_interval attribute. Instead of using own callbacks use the generic ones provided by the sysdev later. This finally allows to get rid of the ugly ACCESSOR macros. Should also save some text size. [ Impact: cleanup ] Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:15 -07:00
Andi Kleen	88921be302	x86, mce: synchronize core after machine check handling The example code in the IA32 SDM recommends to synchronize the CPU after machine check handling. So do that here. [ Impact: Spec compliance ] Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:14 -07:00
H. Peter Anvin	5706001aac	x86, mce: fix comment style in mce-inject.c Fix style of winged comment in mce-inject.c. [ Impact: comment only ] Signed-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:14 -07:00
H. Peter Anvin	a1ff41bfc1	x86, mce: add comment about mce_chrdev_ops being writable Add a comment explaining that mce_chrdev_ops is intentionally writable. [ Impact: comment only ] Signed-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:14 -07:00
Andi Kleen	ea149b36c7	x86, mce: add basic error injection infrastructure Allow user programs to write mce records into /dev/mcelog. When they do that a fake machine check is triggered to test the machine check code. This uses the MCE MSR wrappers added earlier. The implementation is straight forward. There is a struct mce record per CPU and the MCE MSR accesses get data from there if there is valid data injected there. This allows to test the machine check code relatively realistically because only the lowest layer of hardware access is intercepted. The test suite and injector are available at git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git git://git.kernel.org/pub/scm/utils/cpu/mce/mce-inject.git Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:14 -07:00
Andi Kleen	5f8c1a54ca	x86, mce: add MSR read wrappers for easier error injection This will be used by future patches to allow machine check error injection. Right now it's a nop, except for adding some wrappers around the MSR reads. This is early in the sequence to avoid too many conflicts. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:14 -07:00
Andi Kleen	de5619dfef	x86, mce: enable MCE_AMD for 32bit NEW_MCE That's very easy using the infrastructure enabled earlier for MCE_INTEL Untested. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:13 -07:00
Andi Kleen	7856f6cce4	x86, mce: enable MCE_INTEL for 32bit new MCE Enable the 64bit MCE_INTEL code (CMCI, thermal interrupts) for 32bit NEW_MCE. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:13 -07:00
Andi Kleen	4efc0670ba	x86, mce: use 64bit machine check code on 32bit The 64bit machine check code is in many ways much better than the 32bit machine check code: it is more specification compliant, is cleaner, only has a single code base versus one per CPU, has better infrastructure for recovery, has a cleaner way to communicate with user space etc. etc. Use the 64bit code for 32bit too. This is the second attempt to do this. There was one a couple of years ago to unify this code for 32bit and 64bit. Back then this ran into some trouble with K7s and was reverted. I believe this time the K7 problems (and some others) are addressed. I went over the old handlers and was very careful to retain all quirks. But of course this needs a lot of testing on old systems. On newer 64bit capable systems I don't expect much problems because they have been already tested with the 64bit kernel. I made this a CONFIG for now that still allows to select the old machine check code. This is mostly to make testing easier, if someone runs into a problem we can ask them to try with the CONFIG switched. The new code is default y for more coverage. Once there is confidence the 64bit code works well on older hardware too the CONFIG_X86_OLD_MCE and the associated code can be easily removed. This causes a behaviour change for 32bit installations. They now have to install the mcelog package to be able to log corrected machine checks. The 64bit machine check code only handles CPUs which support the standard Intel machine check architecture described in the IA32 SDM. The 32bit code has special support for some older CPUs which have non standard machine check architectures, in particular WinChip C3 and Intel P5. I made those a separate CONFIG option and kept them for now. The WinChip variant could be probably removed without too much pain, it doesn't really do anything interesting. P5 is also disabled by default (like it was before) because many motherboards have it miswired, but according to Alan Cox a few embedded setups use that one. Forward ported/heavily changed version of old patch, original patch included review/fixes from Thomas Gleixner, Bert Wesarg. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:13 -07:00
Andi Kleen	d896a940ef	x86, mce: remove oops_begin() use in 64bit machine check First 32bit doesn't have oops_begin, so it's a barrier of using this code on 32bit. On closer examination it turns out oops_begin is not a good idea in a machine check panic anyways. All oops_begin does it so check for recursive/parallel oopses and implement the "wait on oops" heuristic. But there's actually no good reason to lock machine checks against oopses or prevent them from recursion. Also "wait on oops" does not really make sense for a machine check too. Replace it with a manual bust_spinlocks/console_verbose. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:13 -07:00
Andi Kleen	8e97aef5f4	x86, mce: remove machine check handler idle notify on 64bit i386 has no idle notifiers, but the 64bit machine check code uses them to wake up mcelog from a fatal machine check exception. For corrected machine checks found by the poller or threshold interrupts going through an idle notifier is not needed because the wake_up can is just done directly and doesn't need the idle notifier. It is only needed for logging exceptions. To be honest I never liked the idle notifier even though I signed off on it. On closer investigation the code actually turned out to be nearly. Right now machine check exceptions on x86 are always unrecoverable (lead to panic due to PCC), which means we never execute the idle notifier path. The only exception is the somewhat weird tolerant==3 case, which ignores PCC. I'll fix this in a future patch in a much cleaner way. So remove the "mcelog wakeup through idle notifier" code from 64bit. This allows to compile the 64bit machine check handler on 32bit which doesn't have idle notifiers. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:12 -07:00
Andi Kleen	d7c3c9a609	x86, mce: move mce_disabled option into common 32bit/64bit code It's the same function, so let's share it. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:12 -07:00
Andi Kleen	04b2b1a4df	x86, mce: rename 64bit mce_dont_init to mce_disabled Give it the same name as on 32bit. This makes further merging easier. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:12 -07:00
Andi Kleen	5d7279268b	x86, mce: use a call vector to call the 64bit mce handler Allows to call different machine check handlers from the low level machine check entry vector. This is needed for later when it will be used for 32bit too. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:12 -07:00
Andi Kleen	2e6f694fde	x86, mce: port K7 bank 0 quirk to 64bit mce code Various K7 have broken bank 0s. Don't enable it by default Port from the 32bit code. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:12 -07:00
Andi Kleen	06b7a7a5ec	x86, mce: implement the PPro bank 0 quirk in the 64bit machine check code Quoting the comment: * SDM documents that on family 6 bank 0 should not be written * because it aliases to another special BIOS controlled * register. * But it's not aliased anymore on model 0x1a+ * Don't ignore bank 0 completely because there could be a valid * event later, merely don't write CTL0. This is mostly a port on the 32bit code, except that 32bit always didn't write it and didn't have the 0x1a heuristic. I checked with the CPU designers that the quirk is not required starting with this model. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:12 -07:00
Andi Kleen	3cde5c8c83	x86, mce: initial steps to make 64bit mce code 32bit clean Replace unsigned long with u64s if they need to contain 64bit values. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:12 -07:00
Thomas Gleixner	01c6680a54	x86, mce: Cleanup MCG definitions Decode more magic constants and turn them into symbols. [ Sort definitions bitwise, introduce MCG_EXT_CNT - HS ] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:12 -07:00
Thomas Gleixner	ba2d0f2b0c	x86, mce: Cleanup symbols in intel thermal codes Decode magic constants and turn them into symbols. [ Cleanup to use symbols already exists - HS ] [ Impact: cleanup ] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:11 -07:00
Ingo Molnar	b659294b77	x86, mce: print number of MCE banks The number of MCE banks supported by a CPU is a useful number to know, so print it out during CPU initialization. [ Impact: add printout ] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:11 -07:00
Ingo Molnar	cb491fca55	x86, mce: Rename sysfs variables Shorten variable names. This also compacts the code a bit. device_mce => mce_dev mce_device_initialized => mce_dev_initialized mce_attribute => mce_attrs [ Impact: cleanup ] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:11 -07:00
Ingo Molnar	dba3725d44	x86, mce: unify move mce_64.c => mce.c and glue it up in the Makefile. Remove mce_32.c Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:11 -07:00
Ingo Molnar	711c2e481c	x86, mce: unify, prepare for 32-bit v2 Prepare the 64-bit mce_64.c code side to be built on 32-bit. [ includes ifdef relocation by Andi Kleen ] Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Andi Kleen <ak@firstfloor.org> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:11 -07:00
Ingo Molnar	a988d334ae	x86, mce: unify, prepare codes Move current 32-bit mce_32.c code into mce_64.c. [ Remove unused artifact stop/restart_mce pointed by Andi Kleen ] Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Andi Kleen <ak@firstfloor.org> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:11 -07:00
Ingo Molnar	06b851d982	x86, mce: unify, prepare 64bit in mce.h Prepare mce.h for unification, so that it will build on 32-bit x86 kernels too. [ Impact: cleanup ] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:11 -07:00
Thomas Gleixner	a65d086235	x86, mce: unify Intel thermal init Mechanic unification. No change in code. [ Impact: cleanup, 32-bit / 64-bit unification ] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:10 -07:00
Thomas Gleixner	6cc6f3ebd1	x86, mce: unify Intel thermal init, prepare Prepare for unification, make two intel_init_thermal equal. [ Impact: cleanup ] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:10 -07:00
Ingo Molnar	1cb2a8e176	x86, mce: clean up mce_amd_64.c Make the coding style match that of the rest of the x86 arch code. [ Impact: cleanup ] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:10 -07:00
Ingo Molnar	cb6f3c155b	x86, mce: clean up therm_throt.c Make the coding style match that of the rest of the x86 arch code. [ Impact: cleanup ] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:10 -07:00
Ingo Molnar	bdbfbdd5e8	x86, mce: clean up non-fatal.c Make the coding style match that of the rest of the x86 arch code. [ Impact: cleanup ] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:10 -07:00
Ingo Molnar	91425084f7	x86, mce: clean up winchip.c Make the coding style match that of the rest of the x86 arch code. [ Impact: cleanup ] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:10 -07:00
Ingo Molnar	efee4ca809	x86, mce: clean up k7.c Make the coding style match that of the rest of the x86 arch code. [ Impact: cleanup ] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:10 -07:00
Ingo Molnar	ea2566ff80	x86, mce: clean up p6.c Make the coding style match that of the rest of the x86 arch code. [ Impact: cleanup ] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:09 -07:00
Ingo Molnar	ed8bc7ed9a	x86, mce: clean up p5.c Make the coding style match that of the rest of the x86 arch code. [ Impact: cleanup ] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:09 -07:00
Ingo Molnar	c5aaf0e070	x86, mce: clean up p4.c Make the coding style match that of the rest of the x86 arch code. [ Impact: cleanup ] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:09 -07:00
Ingo Molnar	3b58dfd04b	x86, mce: clean up mce_32.c Make the coding style match that of the rest of the x86 arch code. [ Impact: cleanup ] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:09 -07:00
Ingo Molnar	e9eee03e99	x86, mce: clean up mce_64.c This file has been modified many times along the years, by multiple authors, so the general style and structure has diverged in a number of areas making this file hard to read. So fix the coding style match that of the rest of the x86 arch code. [ Impact: cleanup ] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:09 -07:00
Hidetoshi Seto	13503fa913	x86, mce: Cleanup param parser - Fix the comment formatting. - The error path does not return 0, and printk lacks level and "\n". - Move __setup("nomce") next to mcheck_disable(). - Improve readability etc. [ Impact: cleanup ] Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Acked-by: Andi Kleen <ak@linux.intel.com> LKML-Reference: <49CB3F38.7090703@jp.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-28 09:24:09 -07:00
Joerg Roedel	83cce2b69e	Merge branches 'amd-iommu/fixes', 'amd-iommu/debug', 'amd-iommu/suspend-resume' and 'amd-iommu/extended-allocator' into amd-iommu/2.6.31 Conflicts: arch/x86/kernel/amd_iommu.c arch/x86/kernel/amd_iommu_init.c	2009-05-28 18:23:56 +02:00
Joerg Roedel	47bccd6bb2	amd-iommu: don't free dma adresses below 512MB with CONFIG_IOMMU_STRESS This will test the automatic aperture enlargement code. This is important because only very few devices will ever trigger this code path. So force it under CONFIG_IOMMU_STRESS. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>	2009-05-28 18:18:33 +02:00
Joerg Roedel	f5e9705c64	amd-iommu: don't preallocate page tables with CONFIG_IOMMU_STRESS This forces testing of on-demand page table allocation code. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>	2009-05-28 18:18:08 +02:00
Joerg Roedel	fe16f088a8	amd-iommu: disable round-robin allocator for CONFIG_IOMMU_STRESS Disabling the round-robin allocator results in reusing the same dma-addresses again very fast. This is a good test if the iotlb flushing is working correctly. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>	2009-05-28 18:17:13 +02:00
Joerg Roedel	d9cfed9254	amd-iommu: remove amd_iommu_size kernel parameter This parameter is not longer necessary when aperture increases dynamically. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>	2009-05-28 18:16:49 +02:00
Joerg Roedel	11b83888ae	amd-iommu: enlarge the aperture dynamically By dynamically increasing the aperture the extended allocator is now ready for use. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>	2009-05-28 18:15:57 +02:00
Joerg Roedel	00cd122ae5	amd-iommu: handle exlusion ranges and unity mappings in alloc_new_range This patch makes sure no reserved addresses are allocated in an dma_ops domain when the aperture is increased dynamically. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>	2009-05-28 18:15:19 +02:00
Joerg Roedel	9cabe89b99	amd-iommu: move aperture_range allocation code to seperate function This patch prepares the dynamic increasement of dma_ops domain apertures. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>	2009-05-28 18:14:35 +02:00
Joerg Roedel	803b8cb4d9	amd-iommu: change dma_dom->next_bit to dma_dom->next_address Simplify the code a little bit by using the same unit for all address space related state in the dma_ops domain structure. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>	2009-05-28 18:14:26 +02:00
Joerg Roedel	384de72910	amd-iommu: make address allocator aware of multiple aperture ranges This patch changes the AMD IOMMU address allocator to allow up to 32 aperture ranges per dma_ops domain. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>	2009-05-28 18:14:15 +02:00
Joerg Roedel	53812c115c	amd-iommu: handle page table allocation failures in dma_ops code The code will be required when the aperture size increases dynamically in the extended address allocator. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>	2009-05-28 18:13:43 +02:00
Joerg Roedel	8bda3092bc	amd-iommu: move page table allocation code to seperate function This patch makes page table allocation usable for dma_ops code. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>	2009-05-28 18:13:20 +02:00
Joerg Roedel	c3239567a2	amd-iommu: introduce aperture_range structure This is a preperation for extended address allocator. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>	2009-05-28 18:12:52 +02:00
Joerg Roedel	736501ee00	amd-iommu: implement suspend/resume This patch puts everything together and enables suspend/resume support in the AMD IOMMU driver. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>	2009-05-28 18:11:39 +02:00
Joerg Roedel	05f92db9f4	amd_iommu: un __init functions required for suspend/resume This patch makes sure that no function required for suspend/resume of AMD IOMMU driver is thrown away after boot. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>	2009-05-28 18:10:56 +02:00
Joerg Roedel	7d7a110c61	amd-iommu: add function to flush tlb for all devices This function is required for suspend/resume support with AMD IOMMU enabled. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>	2009-05-28 18:10:43 +02:00
Joerg Roedel	bfd1be1857	amd-iommu: add function to flush tlb for all domains This function is required for suspend/resume support with AMD IOMMU enabled. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>	2009-05-28 18:10:12 +02:00
Joerg Roedel	92ac4320af	amd-iommu: add function to disable all iommus This function is required for suspend/resume support with AMD IOMMU enabled. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>	2009-05-28 18:09:26 +02:00
Joerg Roedel	d91cecdd79	amd-iommu: remove support for msi-x Current hardware uses msi instead of msi-x so this code it not necessary and can not be tested. The best thing is to drop this code. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>	2009-05-28 18:09:18 +02:00
Joerg Roedel	fab6afa309	amd-iommu: drop pointless iommu-loop in msi setup code It is not necessary to loop again over all IOMMUs in this code. So drop the loop. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>	2009-05-28 18:09:08 +02:00
Joerg Roedel	58492e1288	amd-iommu: consolidate hardware initialization to one function This patch restructures the AMD IOMMU initialization code to initialize all hardware registers with one single function call. This is helpful for suspend/resume support. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>	2009-05-28 18:08:58 +02:00
Joerg Roedel	3bd221724a	amd-iommu: introduce for_each_iommu* macros This patch introduces the for_each_iommu and for_each_iommu_safe macros to simplify the developers life when having to iterate over all AMD IOMMUs in the system. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>	2009-05-28 18:08:50 +02:00
Chris Wright	c1eee67b2d	amd iommu: properly detach from protection domain on ->remove Some drivers may use the dma api during ->remove which will cause a protection domain to get reattached to a device. Delay the detach until after the driver is completely unbound. [ joro: added a little merge helper ] [ Impact: fix too early device<->domain removal ] Signed-off-by: Chris Wright <chrisw@redhat.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>	2009-05-28 18:06:54 +02:00
Joerg Roedel	0bc252f430	amd-iommu: make sure only ivmd entries are parsed The bug never triggered. But it should be fixed to protect against broken ACPI tables in the future. [ Impact: protect against broken ivrs acpi table ] Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>	2009-05-28 18:06:47 +02:00
Neil Turton	7455aab1f9	amd-iommu: fix the handling of device aliases in the AMD IOMMU driver. The devid parameter to set_dev_entry_from_acpi is the requester ID rather than the device ID since it is used to index the IOMMU device table. The handling of IVHD_DEV_ALIAS used to pass the device ID. This patch fixes it to pass the requester ID. [ Impact: fix setting the wrong req-id in acpi-table parsing ] Signed-off-by: Neil Turton <nturton@solarflare.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>	2009-05-28 18:06:38 +02:00
Neil Turton	421f909c80	amd-iommu: fix an off-by-one error in the AMD IOMMU driver. The variable amd_iommu_last_bdf holds the maximum bdf of any device controlled by an IOMMU, so the number of device entries needed is amd_iommu_last_bdf+1. The function tbl_size used amd_iommu_last_bdf instead. This would be a problem if the last device were a large enough power of 2. [ Impact: fix amd_iommu_last_bdf off-by-one error ] Signed-off-by: Neil Turton <nturton@solarflare.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>	2009-05-28 18:06:27 +02:00
Joerg Roedel	2e8b569614	amd-iommu: disable device isolation with CONFIG_IOMMU_STRESS With device isolation disabled we can test better for race conditions in dma_ops related code. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>	2009-05-28 17:56:57 +02:00
Joerg Roedel	2be69c79e9	x86/iommu: add IOMMU_STRESS Kconfig entry This Kconfig option is intended to enable various code paths or parameters in IOMMU implementations to stress test the code and/or the hardware. This can also be done by disabling optimizations in the code when this option is switched on. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>	2009-05-28 17:55:33 +02:00
Joerg Roedel	b3b99ef8b4	amd-iommu: move protection domain printk to dump code This information is only helpful for debugging. Don't print it anymore unless explicitly requested. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>	2009-05-28 17:55:08 +02:00
Joerg Roedel	02acc43a29	amd-iommu: print ivmd information to dmesg when requested Add information about device memory mapping requirements for the IOMMU as described in the IVRS ACPI table to the kernel log if amd_iommu_dump was specified on the kernel command line. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>	2009-05-28 17:53:30 +02:00
Joerg Roedel	42a698f40a	amd-iommu: print ivhd information to dmesg when requested Add information about devices belonging to an IOMMU as described in the IVRS ACPI table to the kernel log if amd_iommu_dump was specified on the kernel command line. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>	2009-05-28 17:52:04 +02:00
Joerg Roedel	9c72041f71	amd-iommu: add dump for iommus described in ivrs table Add information about IOMMU devices described in the IVRS ACPI table to the kernel log if amd_iommu_dump was specified on the kernel command line. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>	2009-05-28 17:50:56 +02:00
Joerg Roedel	fefda117dd	amd-iommu: add amd_iommu_dump parameter This kernel parameter will be useful to get some AMD IOMMU related information in dmesg that is not necessary for the default user but may be helpful in debug situations. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>	2009-05-28 17:49:56 +02:00
Petr Tesarik	7d96fd41ca	x86: move rdtsc_barrier() into the TSC vread method The *fence instructions were moved to vsyscall_64.c by commit `cb9e35dce9`. But this breaks the vDSO, because vread methods are also called from there. Besides, the synchronization might be unnecessary for other time sources than TSC. [ Impact: fix potential time warp in VDSO ] Signed-off-by: Petr Tesarik <ptesarik@suse.cz> LKML-Reference: <9d0ea9ea0f866bdc1f4d76831221ae117f11ea67.1243241859.git.ptesarik@suse.cz> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@kernel.org>	2009-05-28 14:15:54 +02:00
Pallipadi, Venkatesh	ee1ca48fae	ACPI: Disable ARB_DISABLE on platforms where it is not needed ARB_DISABLE is a NOP on all of the recent Intel platforms. For such platforms, reduce contention on c3_lock by skipping the fake ARB_DISABLE. Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: Len Brown <len.brown@intel.com>	2009-05-27 21:57:30 -04:00
Yinghai Lu	abfe0af981	x86: enable_update_mptable should be a macro instead of declaring one variant as an inline function... because other case is a variable Signed-off-by: Yinghai Lu <yinghai@kernel.org> LKML-Reference: <4A13B344.7030307@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-28 01:59:05 +02:00
Linus Torvalds	cd86a536c8	Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86: avoid back to back on_each_cpu in cpa_flush_array x86, relocs: ignore R_386_NONE in kernel relocation entries	2009-05-26 15:06:12 -07:00
Pallipadi, Venkatesh	2171787be2	x86: avoid back to back on_each_cpu in cpa_flush_array Cleanup cpa_flush_array() to avoid back to back on_each_cpu() calls. [ Impact: optimizes fix `0af48f42df` ] Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-26 13:12:12 -07:00
Andreas Herrmann	ca446d0635	[CPUFREQ] powernow-k8: determine exact CPU frequency for HW Pstates Slightly modified by trenn@suse.de -> only do this on fam 10h and fam 11h. Currently powernow-k8 determines CPU frequency from ACPI PSS objects, but according to AMD family 11h BKDG this frequency is just a rounded value: "CoreFreq (MHz) = The CPU COF specified by MSRC001_00[6B:64][CpuFid] rounded to the nearest 100 Mhz." As a consequnce powernow-k8 reports wrong CPU frequency on some systems, e.g. on Turion X2 Ultra: powernow-k8: Found 1 AMD Turion(tm)X2 Ultra DualCore Mobile ZM-82 processors (2 cpu cores) (version 2.20.00) powernow-k8: 0 : pstate 0 (2200 MHz) powernow-k8: 1 : pstate 1 (1100 MHz) powernow-k8: 2 : pstate 2 (600 MHz) But this is wrong as frequency for Pstate2 is 550 MHz. x86info reports it correctly: #x86info -a \|grep Pstate ... Pstate-0: fid=e, did=0, vid=24 (2200MHz) Pstate-1: fid=e, did=1, vid=30 (1100MHz) Pstate-2: fid=e, did=2, vid=3c (550MHz) (current) Solution is to determine the frequency directly from Pstate MSRs instead of using rounded values from ACPI table. Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com> Signed-off-by: Thomas Renninger <trenn@suse.de> Signed-off-by: Dave Jones <davej@redhat.com>	2009-05-26 12:04:51 -04:00
Thomas Renninger	df1829770d	[CPUFREQ] powernow-k8 cleanup msg if BIOS does not export ACPI _PSS cpufreq data - Make the message shorter and easier to grep for - Use printk_once instead of WARN_ONCE (functionality of these was mixed) Signed-off-by: Thomas Renninger <trenn@suse.de> Cc: Langsdorf, Mark <mark.langsdorf@amd.com> Signed-off-by: Dave Jones <davej@redhat.com>	2009-05-26 12:04:51 -04:00
Dave Jones	d38e73e8da	[CPUFREQ] powernow-k7 build fix when ACPI=n arch/x86/kernel/cpu/cpufreq/powernow-k7.c:172: warning: 'invalidate_entry' defined but not used Reported-by: Toralf Förster <toralf.foerster@gmx.de> Signed-off-by: Dave Jones <davej@redhat.com>	2009-05-26 12:04:50 -04:00
Jarod Wilson	4319503779	[CPUFREQ] add atom family to p4-clockmod Some atom procs don't do freq scaling (such as the atom 330 on my own littlefalls2 board). By adding the atom family here, we at least get the benefit of passive cooling in a thermal emergency. Not sure how to see that its actually helping any, but the driver does bind and claim its functioning on my atom 330. Signed-off-by: Jarod Wilson <jarod@redhat.com> Signed-off-by: Dave Jones <davej@redhat.com>	2009-05-26 12:04:50 -04:00
Ingo Molnar	aaba98018b	perf_counter, x86: Make NMI lockups more robust We have a debug check that detects stuck NMIs and returns with the PMU disabled in the global ctrl MSR - but i managed to trigger a situation where this was not enough to deassert the NMI. So clear/reset the full PMU and keep the disable count balanced when exiting from here. This way the box produces a debug warning but stays up and is more debuggable. [ Impact: in case of PMU related bugs, recover more gracefully ] Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Kacur <jkacur@redhat.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-26 09:52:03 +02:00
Ingo Molnar	79202ba9ff	perf_counter, x86: Fix APIC NMI programming My Nehalem box locks up in certain situations (with an always-asserted NMI causing a lockup) if the PMU LVT entry is programmed between NMI and IRQ mode with a high frequency. Standardize exlusively on NMIs instead. [ Impact: fix lockup ] Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Kacur <jkacur@redhat.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-26 09:49:28 +02:00
Tejun Heo	46176b4f6b	x86, relocs: ignore R_386_NONE in kernel relocation entries For relocatable 32bit kernels, boot/compressed/relocs.c processes relocation entries in the kernel image and appends it to the kernel image such that boot/compressed/head_32.S can relocate the kernel. The kernel image is one statically linked object and only uses two relocation types - R_386_PC32 and R_386_32, of the two only the latter needs massaging during kernel relocation and thus handled by relocs. R_386_PC32 is ignored and all other relocation types are considered error. When the target of a relocation resides in a discarded section, binutils doesn't throw away the relocation record but nullifies it by changing it to R_386_NONE, which unfortunately makes relocs fail. The problem was triggered by yet out-of-tree x86 stack unwind patches but given the binutils behavior, ignoring R_386_NONE is the right thing to do. The problem has been tracked down to binutils behavior by Jan Beulich. [ Impact: fix build with certain binutils by ignoring R_386_NONE ] Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jan Beulich <JBeulich@novell.com> Cc: Ingo Molnar <mingo@elte.hu> LKML-Reference: <4A1B8150.40702@kernel.org> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-25 22:52:49 -07:00
Linus Torvalds	b18f1e2199	Merge branch 'kvm-updates/2.6.30' of git://git.kernel.org/pub/scm/virt/kvm/kvm * 'kvm-updates/2.6.30' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: Fix PDPTR reloading on CR4 writes KVM: Make paravirt tlb flush also reload the PAE PDPTRs	2009-05-25 15:51:27 -07:00
Ingo Molnar	53b441a565	Revert "perf_counter, x86: speed up the scheduling fast-path" This reverts commit `b68f1d2e7a`. It is causing problems (stuck/stuttering profiling) - when mixed NMI and non-NMI counters are used. Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: John Kacur <jkacur@redhat.com> LKML-Reference: <20090525153931.703093461@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-25 21:41:28 +02:00
Peter Zijlstra	a78ac32587	perf_counter: Generic per counter interrupt throttle Introduce a generic per counter interrupt throttle. This uses the perf_counter_overflow() quick disable to throttle a specific counter when its going too fast when a pmu->unthrottle() method is provided which can undo the quick disable. Power needs to implement both the quick disable and the unthrottle method. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: John Kacur <jkacur@redhat.com> LKML-Reference: <20090525153931.703093461@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-25 21:41:12 +02:00
Peter Zijlstra	48e22d56ec	perf_counter: x86: Remove interrupt throttle remove the x86 specific interrupt throttle Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: John Kacur <jkacur@redhat.com> LKML-Reference: <20090525153931.616671838@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-25 21:41:12 +02:00
Peter Zijlstra	ff99be573e	perf_counter: x86: Expose INV and EDGE bits Expose the INV and EDGE bits of the PMU to raw configs. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: John Kacur <jkacur@redhat.com> LKML-Reference: <20090525153931.494709027@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-25 21:41:11 +02:00
Avi Kivity	a2edf57f51	KVM: Fix PDPTR reloading on CR4 writes The processor is documented to reload the PDPTRs while in PAE mode if any of the CR4 bits PSE, PGE, or PAE change. Linux relies on this behaviour when zapping the low mappings of PAE kernels during boot. The code already handled changes to CR4.PAE; augment it to also notice changes to PSE and PGE. This triggered while booting an F11 PAE kernel; the futex initialization code runs before any CR3 reloads and writes to a NULL pointer; the futex subsystem ended up uninitialized, killing PI futexes and pulseaudio which uses them. Cc: stable@kernel.org Signed-off-by: Avi Kivity <avi@redhat.com>	2009-05-25 20:00:53 +03:00
Avi Kivity	a8cd0244e9	KVM: Make paravirt tlb flush also reload the PAE PDPTRs The paravirt tlb flush may be used not only to flush TLBs, but also to reload the four page-directory-pointer-table entries, as it is used as a replacement for reloading CR3. Change the code to do the entire CR3 reloading dance instead of simply flushing the TLB. Cc: stable@kernel.org Signed-off-by: Avi Kivity <avi@redhat.com>	2009-05-25 20:00:50 +03:00
Tejun Heo	71c9d8b68b	x86: Remove remap percpu allocator for the time being Remap percpu allocator has subtle bug when combined with page attribute changing. Remap percpu allocator aliases PMD pages for the first chunk and as pageattr doesn't know about the alias it ends up updating page attributes of the original mapping thus leaving the alises in inconsistent state which might lead to subtle data corruption. Please read the following threads for more information: http://thread.gmane.org/gmane.linux.kernel/835783 The following is the proposed fix which teaches pageattr about percpu aliases. http://thread.gmane.org/gmane.linux.kernel/837157 However, the above changes are deemed too pervasive for upstream inclusion for 2.6.30 release, so this patch essentially disables the remap allocator for the time being. Signed-off-by: Tejun Heo <tj@kernel.org> LKML-Reference: <4A1A0A27.4050301@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-25 05:37:55 +02:00
H. Peter Anvin	ee0736627d	Merge branch 'x86/urgent' into x86/setup Resolved conflicts: arch/x86/boot/memory.c Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-23 16:42:19 -07:00
venkatesh.pallipadi@intel.com	0af48f42df	x86: cpa_flush_array wbinvd should be done on all CPUs cpa_flush_array seems to prefer wbinvd() over clflush at 4M threshold. clflush needs to be done on only one CPU as per instruction definition. wbinvd() however, should be done on all CPUs. [ Impact: fix missing flush which could cause data corruption ] Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-22 13:33:59 -07:00
venkatesh.pallipadi@intel.com	0b827537e3	x86: bugfix wbinvd() model check instead of family check wbinvd is supported on all CPUs 486 or later. But, pageattr.c is checking x86_model >= 4 before wbinvd(), which looks like an oversight bug. It was first introduced at one place by changeset `d7c8f21a8c` and got copied over to second place in the same file later. [ Impact: fix missing cache flush on early-model CPUs, potential data corruption ] Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-22 13:33:27 -07:00
Suresh Siddha	0c752a9335	x86: introduce noxsave boot parameter Introduce "noxsave" boot parameter which will disable the cpu's xsave/xrstor capabilities. Useful for debugging and working around xsave related issues. [ Impact: make it possible to debug problems in the field ] Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-22 13:10:54 -07:00
H. Peter Anvin	bca23dba76	x86, setup: revert ACPI 3 E820 extended attributes support Remove ACPI 3 E820 extended memory attributes support. At least one vendor actively set all the flags to zero, but left ECX on return at 24. This bug may be present in other BIOSes. The breakage functionally means the ACPI 3 flags are probably completely useless, and that no OS any time soon is going to rely on their existence. Therefore, drop support completely. We may want to revisit this question in the future, if we find ourselves actually needing the flags. This reverts all or part of the following checkins: `cd670599b7` `c549e71d07` However, retain the part from the latter commit that copies e820 into a temporary buffer; that is an unrelated BIOS workaround. Put in a comment to explain that part. See https://bugzilla.redhat.com/show_bug.cgi?id=499396 for some additional information. [ Impact: detect all memory on affected machines ] Reported-by: Thomas J. Baker <tjb@unh.edu> Signed-off-by: H. Peter Anvin <hpa@zytor.com> Acked-by: Len Brown <len.brown@intel.com> Cc: Chuck Ebbert <cebbert@redhat.com> Cc: Kyle McMartin <kmcmartin@redhat.com> Cc: Matt Domsch <matt_domsch@dell.com>	2009-05-22 11:14:02 -07:00
Paul Mackerras	a63eaf34ae	perf_counter: Dynamically allocate tasks' perf_counter_context struct This replaces the struct perf_counter_context in the task_struct with a pointer to a dynamically allocated perf_counter_context struct. The main reason for doing is this is to allow us to transfer a perf_counter_context from one task to another when we do lazy PMU switching in a later patch. This has a few side-benefits: the task_struct becomes a little smaller, we save some memory because only tasks that have perf_counters attached get a perf_counter_context allocated for them, and we can remove the inclusion of <linux/perf_counter.h> in sched.h, meaning that we don't end up recompiling nearly everything whenever perf_counter.h changes. The perf_counter_context structures are reference-counted and freed when the last reference is dropped. A context can have references from its task and the counters on its task. Counters can outlive the task so it is possible that a context will be freed well after its task has exited. Contexts are allocated on fork if the parent had a context, or otherwise the first time that a per-task counter is created on a task. In the latter case, we set the context pointer in the task struct locklessly using an atomic compare-and-exchange operation in case we raced with some other task in creating a context for the subject task. This also removes the task pointer from the perf_counter struct. The task pointer was not used anywhere and would make it harder to move a context from one task to another. Anything that needed to know which task a counter was attached to was already using counter->ctx->task. The __perf_counter_init_context function moves up in perf_counter.c so that it can be called from find_get_context, and now initializes the refcount, but is otherwise unchanged. We were potentially calling list_del_counter twice: once from __perf_counter_exit_task when the task exits and once from __perf_counter_remove_from_context when the counter's fd gets closed. This adds a check in list_del_counter so it doesn't do anything if the counter has already been removed from the lists. Since perf_counter_task_sched_in doesn't do anything if the task doesn't have a context, and leaves cpuctx->task_ctx = NULL, this adds code to __perf_install_in_context to set cpuctx->task_ctx if necessary, i.e. in the case where the current task adds the first counter to itself and thus creates a context for itself. This also adds similar code to __perf_counter_enable to handle a similar situation which can arise when the counters have been disabled using prctl; that also leaves cpuctx->task_ctx = NULL. [ Impact: refactor counter context management to prepare for new feature ] Signed-off-by: Paul Mackerras <paulus@samba.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> LKML-Reference: <18966.10075.781053.231153@cargo.ozlabs.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-22 12:18:19 +02:00
Zhang Rui	88dff4936c	x86: DMI match for the Sony VGN-Z540N as it needs BIOS reboot x86: DMI match for the Sony VGN-Z540N as it needs BIOS reboot, see: http://bugzilla.kernel.org/show_bug.cgi?id=12901 [ Impact: fix hung reboot on certain systems ] Signed-off-by: Zhang Rui <rui.zhang@intel.com> Cc: Len Brown <lenb@kernel.org> LKML-Reference: <1242963350.32574.53.camel@rzhang-dt> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-22 09:11:30 +02:00
H. Peter Anvin	c6ac4c18fb	x86, boot: correct the calculation of ZO_INIT_SIZE Correct the calculation of ZO_INIT_SIZE (the amount of memory we need during decompression). One symbol (ZO_startup_32) was missing from zoffset.h, and another (ZO_z_extract_offset) was misspelled. [ Impact: build fix ] Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-20 11:26:09 -07:00
Ingo Molnar	34adc80622	perf_counter: Fix context removal deadlock Disable the PMU globally before removing a counter from a context. This fixes the following lockup: [22081.741922] ------------[ cut here ]------------ [22081.746668] WARNING: at arch/x86/kernel/cpu/perf_counter.c:803 intel_pmu_handle_irq+0x9b/0x24e() [22081.755624] Hardware name: X8DTN [22081.758903] perfcounters: irq loop stuck! [22081.762985] Modules linked in: [22081.766136] Pid: 11082, comm: perf Not tainted 2.6.30-rc6-tip #226 [22081.772432] Call Trace: [22081.774940] <NMI> [<ffffffff81019aed>] ? intel_pmu_handle_irq+0x9b/0x24e [22081.781993] [<ffffffff81019aed>] ? intel_pmu_handle_irq+0x9b/0x24e [22081.788368] [<ffffffff8104505c>] ? warn_slowpath_common+0x77/0xa3 [22081.794649] [<ffffffff810450d3>] ? warn_slowpath_fmt+0x40/0x45 [22081.800696] [<ffffffff81019aed>] ? intel_pmu_handle_irq+0x9b/0x24e [22081.807080] [<ffffffff814d1a72>] ? perf_counter_nmi_handler+0x3f/0x4a [22081.813751] [<ffffffff814d2d09>] ? notifier_call_chain+0x58/0x86 [22081.819951] [<ffffffff8105b250>] ? notify_die+0x2d/0x32 [22081.825392] [<ffffffff814d1414>] ? do_nmi+0x8e/0x242 [22081.830538] [<ffffffff814d0f0a>] ? nmi+0x1a/0x20 [22081.835342] [<ffffffff8117e102>] ? selinux_file_free_security+0x0/0x1a [22081.842105] [<ffffffff81018793>] ? x86_pmu_disable_counter+0x15/0x41 [22081.848673] <<EOE>> [<ffffffff81018f3d>] ? x86_pmu_disable+0x86/0x103 [22081.855512] [<ffffffff8108fedd>] ? __perf_counter_remove_from_context+0x0/0xfe [22081.862926] [<ffffffff8108fcbc>] ? counter_sched_out+0x30/0xce [22081.868909] [<ffffffff8108ff36>] ? __perf_counter_remove_from_context+0x59/0xfe [22081.876382] [<ffffffff8106808a>] ? smp_call_function_single+0x6c/0xe6 [22081.882955] [<ffffffff81091b96>] ? perf_release+0x86/0x14c [22081.888600] [<ffffffff810c4c84>] ? __fput+0xe7/0x195 [22081.893718] [<ffffffff810c213e>] ? filp_close+0x5b/0x62 [22081.899107] [<ffffffff81046a70>] ? put_files_struct+0x64/0xc2 [22081.905031] [<ffffffff8104841a>] ? do_exit+0x1e2/0x6ef [22081.910360] [<ffffffff814d0a60>] ? _spin_lock_irqsave+0x9/0xe [22081.916292] [<ffffffff8104898e>] ? do_group_exit+0x67/0x93 [22081.921953] [<ffffffff810489cc>] ? sys_exit_group+0x12/0x16 [22081.927759] [<ffffffff8100baab>] ? system_call_fastpath+0x16/0x1b [22081.934076] ---[ end trace 3a3936ce3e1b4505 ]--- And could potentially also fix the lockup reported by Marcelo Tosatti. Also, print more debug info in case of a detected lockup. [ Impact: fix lockup ] Reported-by: Marcelo Tosatti <mtosatti@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-20 20:12:54 +02:00
Yinghai Lu	4c6f18fc81	x86, io-apic: Don't mark pin_programmed early Peter bisected that: \| commit `b9c61b7007` \| Date: Wed May 6 10:10:06 2009 -0700 \| \| x86/pci: update pirq_enable_irq() to setup io apic routing \| \| So we can set io apic routing only when enabling the device irq. wrecked his opteron box, ata1 interrupts fail to get through. ata1 is using irq 11: [ 1.451839] sata_svw 0000:01:0e.0: version 2.3 [ 1.456333] sata_svw 0000:01:0e.0: PCI INT A -> GSI 11 (level, low) -> IRQ 11 [ 1.463639] scsi0 : sata_svw [ 1.466949] scsi1 : sata_svw [ 1.470022] scsi2 : sata_svw [ 1.473090] scsi3 : sata_svw [ 1.476112] ata1: SATA max UDMA/133 mmio m8192@0xff3fe000 port 0xff3fe000 irq 11 [ 1.483490] ata2: SATA max UDMA/133 mmio m8192@0xff3fe000 port 0xff3fe100 irq 11 [ 1.490870] ata3: SATA max UDMA/133 mmio m8192@0xff3fe000 port 0xff3fe200 irq 11 [ 1.498247] ata4: SATA max UDMA/133 mmio m8192@0xff3fe000 port 0xff3fe300 irq 11 that pin is overlapped with pin with legacy ones. We should not set bits in pin_programmed here, so that those bit could be set later via io_apic_set_pci_routing(). [ Impact: fix boot hang on certain systems ] Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Yinghai Lu <yinghai.lu@kernel.org> Tested-by: Peter Zijlstra <peterz@infradead.org> Cc: Jack Steiner <steiner@sgi.com> LKML-Reference: <4A119990.9020606@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-19 14:26:51 +02:00
Jaswinder Singh Rajput	4aee2ad461	x86: asm/processor.h: remove double declaration Remove double declaration of: extern void init_scattered_cpuid_features(struct cpuinfo_x86 c); extern unsigned int init_intel_cacheinfo(struct cpuinfo_x86 c); extern unsigned short num_cache_leaves; they are already defined in the same file. [ Impact: cleanup ] Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com> LKML-Reference: <1242733021.3377.1.camel@localhost.localdomain> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-19 14:16:24 +02:00
Linus Torvalds	13bba6fda9	Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86: Fix performance regression caused by paravirt_ops on native kernels xen: use header for EXPORT_SYMBOL_GPL x86, 32-bit: fix kernel_trap_sp() x86: fix percpu_{to,from}_op() x86: mtrr: Fix high_width computation when phys-addr is >= 44bit x86: Fix false positive section mismatch warnings in the apic code	2009-05-18 09:17:37 -07:00
Linus Torvalds	0130b2d701	Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: tracing: Append prompt in /debug/tracing/README file x86/function-graph: fix constraint for recording old return value	2009-05-18 09:15:41 -07:00
Ingo Molnar	1079cac0f4	Merge commit 'v2.6.30-rc6' into tracing/core Merge reason: we were on an -rc4 base, sync up to -rc6 Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-18 10:15:35 +02:00
Ingo Molnar	b68f1d2e7a	perf_counter, x86: speed up the scheduling fast-path We have to set up the LVT entry only at counter init time, not at every switch-in time. There's friction between NMI and non-NMI use here - we'll probably remove the per counter configurability of it - but until then, dont slow down things ... [ Impact: micro-optimization ] Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-18 09:37:09 +02:00
Yinghai Lu	f1bdb52388	x86, irq: don't call mp_config_acpi_gsi() if update_mptable is not enabled Len expressed concern that the update_mptable feature has side-effects on the ACPI code. Make it sure explicitly that the code only ever gets called if the (default disabled) update_mptable boot quirk option is disabled. [ Impact: isolate the update_mptable feature from ACPI code more ] Signed-off-by: Yinghai Lu <yinghai@kernel.org> Cc: Len Brown <lenb@kernel.org> LKML-Reference: <4A0DC832.5090200@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-18 09:33:29 +02:00
Yinghai Lu	629e15d245	x86, irq: update_mptable needs pci_routeirq To get all device irq routing and to save them. This is basically an implicit pci=routeirq enablement if (and on if) the update_mptable boot option (which is off by default) has been specified. [ Impact: extend the update_mptable boot opion's scope ] Signed-off-by: Yinghai Lu <yinghai@kernel.org> Cc: Jesse Barnes <jbarnes@virtuousgeek.org> LKML-Reference: <4A0DB7B4.4060702@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-18 09:33:17 +02:00
Yinghai Lu	35d5a9a614	x86: fix system without memory on node0 Jack found a boot crash on a system which doesn't have memory on node0. It turns out with recent per_cpu changes, node_number for BSP will always be 0, and it is not consistent to cpu_to_node() that might set it to a different (nearer) node already. aka when numa_set_node() for node0 is called early before per_cpu area is setup: two places touched that per_cpu(node_number,): 1. in cpu/common.c::cpu_init() and it is not for BP \| #ifdef CONFIG_NUMA \| if (cpu != 0 && percpu_read(node_number) == 0 && \| cpu_to_node(cpu) != NUMA_NO_NODE) \| percpu_write(node_number, cpu_to_node(cpu)); \| #endif for BP: traps_init ==> cpu_init for AP: start_secondary ==> cpu_init 2. cpu/intel.c or amd.c::srat_detect_node via numa_set_node() for BP: check_bugs ==> identify_boot_cpu ==> identify_cpu() that is rather later before numa_node_id() is used for BP... for AP: start_secondary => smp_callin => smp_store_cpu_info() => => identify_secondary_cpu => identify_cpu() so try to set that for BP earlier in setup_per_cpu_areas(), and don't bother to set that for APs there (it will be updated later and will be used later) (and don't mess the 0 before the copying BP per_cpu data to APs) [ Impact: fix boot crash on memoryless node-0 ] Reported-and-tested-by: Jack Steiner <steiner@sgi.com> Cc: Tejun Heo <htejun@gmail.com> Signed-off-by: Yinghai Lu <yinghai@kernel.org> LKML-Reference: <4A0C4A02.7050401@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-18 09:27:09 +02:00
Yinghai Lu	7c43769a97	x86, mm: Fix node_possible_map logic Recently there were some changes to the meaning of node_possible_map, and it is quite strange: - the node without memory would be set in node_possible_map - but some node with less NODE_MIN_SIZE will be kicked out of node_possible_map. fix it by adding strict_setup_node_bootmem(). Also, remove unparse_node(). so result will be: 1. cpu_to_node() will return online node only (nearest one) 2. apicid_to_node() still returns the node that could be not online but is set in node_possible_map. 3. node_possible_map will include nodes that mem on it are less NODE_MIN_SIZE v2: after move_cpus_to_node change. [ Impact: get node_possible_map right ] Signed-off-by: Yinghai Lu <yinghai@kernel.org> Tested-by: Jack Steiner <steiner@sgi.com> LKML-Reference: <4A0C49BE.6080800@kernel.org> [ v3: various small cleanups and comment clarifications ] Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-18 09:21:04 +02:00
Yinghai Lu	888a589f6b	mm, x86: remove MEMORY_HOTPLUG_RESERVE related code after: \| commit `b263295dbf` \| Author: Christoph Lameter <clameter@sgi.com> \| Date: Wed Jan 30 13:30:47 2008 +0100 \| \| x86: 64-bit, make sparsemem vmemmap the only memory model we don't have MEMORY_HOTPLUG_RESERVE anymore. Historically, x86-64 had an architecture-specific method for memory hotplug whereby it scanned the SRAT for physical memory ranges that could be potentially used for memory hot-add later. By reserving those ranges without physical memory, the memmap would be allocated and left dormant until needed. This depended on the DISCONTIG memory model which has been removed so the code implementing HOTPLUG_RESERVE is now dead. This patch removes the dead code used by MEMORY_HOTPLUG_RESERVE. (Changelog authored by Mel.) v2: updated changelog, and remove hotadd= in doc [ Impact: remove dead code ] Signed-off-by: Yinghai Lu <yinghai@kernel.org> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: Mel Gorman <mel@csn.ul.ie> Workflow-found-OK-by: Andrew Morton <akpm@linux-foundation.org> LKML-Reference: <4A0C4910.7090508@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-18 09:13:31 +02:00
Ingo Molnar	b286e21868	Merge commit 'v2.6.30-rc6' into x86/mm Merge reason: sync up to -rc6 which has changes to mm/ which we are going to touch in the commits to follow as well. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-18 09:12:51 +02:00
Yinghai Lu	2759c3287d	x86: don't call read_apic_id if !cpu_has_apic should not call that if apic is disabled. [ Impact: fix crash on certain UP configs ] Signed-off-by: Yinghai Lu <yinghai@kernel.org> Cc: Cyrill Gorcunov <gorcunov@gmail.com> LKML-Reference: <4A09CCBB.2000306@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-18 08:43:25 +02:00
Yinghai Lu	e5198075c6	x86, apic: introduce io_apic_irq_attr according to Ingo, io_apic irq-setup related functions have too many parameters with a repetitive signature. So reduce related funcs to get less params by passing a pointer to a newly defined io_apic_irq_attr structure. v2: io_apic_irq ==> irq_attr triggering ==> trigger v3: add set_io_apic_irq_attr [ Impact: cleanup ] Signed-off-by: Yinghai Lu <yinghai@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jesse Barnes <jbarnes@virtuousgeek.org> Cc: Len Brown <lenb@kernel.org> LKML-Reference: <4A08ACD3.2070401@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-18 08:38:55 +02:00
Ingo Molnar	dc3f81b129	Merge commit 'v2.6.30-rc6' into perfcounters/core Merge reason: this branch was on an -rc4 base, merge it up to -rc6 to get the latest upstream fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-18 07:37:49 +02:00
Ingo Molnar	d2517a49d5	perf_counter, x86: fix zero irq_period counters The quirk to irq_period unearthed an unrobustness we had in the hw_counter initialization sequence: we left irq_period at 0, which was then quirked up to 2 ... which then generated a _lot_ of interrupts during 'perf stat' runs, slowed them down and skewed the counter results in general. Initialize irq_period to the maximum instead. [ Impact: fix perf stat results ] Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-17 12:27:37 +02:00
Jeremy Fitzhardinge	b4ecc12699	x86: Fix performance regression caused by paravirt_ops on native kernels Xiaohui Xin and some other folks at Intel have been looking into what's behind the performance hit of paravirt_ops when running native. It appears that the hit is entirely due to the paravirtualized spinlocks introduced by: \| commit `8efcbab674` \| Date: Mon Jul 7 12:07:51 2008 -0700 \| \| paravirt: introduce a "lock-byte" spinlock implementation The extra call/return in the spinlock path is somehow causing an increase in the cycles/instruction of somewhere around 2-7% (seems to vary quite a lot from test to test). The working theory is that the CPU's pipeline is getting upset about the call->call->locked-op->return->return, and seems to be failing to speculate (though I haven't seen anything definitive about the precise reasons). This doesn't entirely make sense, because the performance hit is also visible on unlock and other operations which don't involve locked instructions. But spinlock operations clearly swamp all the other pvops operations, even though I can't imagine that they're nearly as common (there's only a .05% increase in instructions executed). If I disable just the pv-spinlock calls, my tests show that pvops is identical to non-pvops performance on native (my measurements show that it is actually about .1% faster, but Xiaohui shows a .05% slowdown). Summary of results, averaging 10 runs of the "mmperf" test, using a no-pvops build as baseline: nopv Pv-nospin Pv-spin CPU cycles 100.00% 99.89% 102.18% instructions 100.00% 100.10% 100.15% CPI 100.00% 99.79% 102.03% cache ref 100.00% 100.84% 100.28% cache miss 100.00% 90.47% 88.56% cache miss rate 100.00% 89.72% 88.31% branches 100.00% 99.93% 100.04% branch miss 100.00% 103.66% 107.72% branch miss rt 100.00% 103.73% 107.67% wallclock 100.00% 99.90% 102.20% The clear effect here is that the 2% increase in CPI is directly reflected in the final wallclock time. (The other interesting effect is that the more ops are out of line calls via pvops, the lower the cache access and miss rates. Not too surprising, but it suggests that the non-pvops kernel is over-inlined. On the flipside, the branch misses go up correspondingly...) So, what's the fix? Paravirt patching turns all the pvops calls into direct calls, so _spin_lock etc do end up having direct calls. For example, the compiler generated code for paravirtualized _spin_lock is: <_spin_lock+0>: mov %gs:0xb4c8,%rax <_spin_lock+9>: incl 0xffffffffffffe044(%rax) <_spin_lock+15>: callq 0xffffffff805a5b30 <_spin_lock+22>: retq The indirect call will get patched to: <_spin_lock+0>: mov %gs:0xb4c8,%rax <_spin_lock+9>: incl 0xffffffffffffe044(%rax) <_spin_lock+15>: callq <__ticket_spin_lock> <_spin_lock+20>: nop; nop / or whatever 2-byte nop */ <_spin_lock+22>: retq One possibility is to inline _spin_lock, etc, when building an optimised kernel (ie, when there's no spinlock/preempt instrumentation/debugging enabled). That will remove the outer call/return pair, returning the instruction stream to a single call/return, which will presumably execute the same as the non-pvops case. The downsides arel 1) it will replicate the preempt_disable/enable code at eack lock/unlock callsite; this code is fairly small, but not nothing; and 2) the spinlock definitions are already a very heavily tangled mass of #ifdefs and other preprocessor magic, and making any changes will be non-trivial. The other obvious answer is to disable pv-spinlocks. Making them a separate config option is fairly easy, and it would be trivial to enable them only when Xen is enabled (as the only non-default user). But it doesn't really address the common case of a distro build which is going to have Xen support enabled, and leaves the open question of whether the native performance cost of pv-spinlocks is worth the performance improvement on a loaded Xen system (10% saving of overall system CPU when guests block rather than spin). Still it is a reasonable short-term workaround. [ Impact: fix pvops performance regression when running native ] Analysed-by: "Xin Xiaohui" <xiaohui.xin@intel.com> Analysed-by: "Li Xin" <xin.li@intel.com> Analysed-by: "Nakajima Jun" <jun.nakajima@intel.com> Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Acked-by: H. Peter Anvin <hpa@zytor.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Xen-devel <xen-devel@lists.xensource.com> LKML-Reference: <4A0B62F7.5030802@goop.org> [ fixed the help text ] Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-15 20:07:42 +02:00
Jaswinder Singh Rajput	52650257ea	x86, mtrr: replace MTRRdefType_MSR with msr-index's MSR_MTRRdefType Use standard msr-index.h's MSR declaration and no need to declare again. [ Impact: cleanup, no object code change ] Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-15 07:49:01 -07:00
Jaswinder Singh Rajput	ba5673ff1f	x86, mtrr: replace MTRRfix4K_C0000_MSR with msr-index's MSR_MTRRfix4K_C0000 Use standard msr-index.h's MSR declaration and no need to declare again. [ Impact: cleanup, no object code change ] Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-15 07:49:01 -07:00
Jaswinder Singh Rajput	654ac05801	x86, mtrr: remove mtrr MSRs double declaration Removed MTRR MSR from mtrr/mtrr.h as these are already declared in msr-index.h and nobody is using them: MTRRfix16K_A0000_MSR MTRRfix4K_C8000_MSR MTRRfix4K_D0000_MSR MTRRfix4K_D8000_MSR MTRRfix4K_E0000_MSR MTRRfix4K_E8000_MSR MTRRfix4K_F0000_MSR MTRRfix4K_F8000_MSR Use standard msr-index.h's MSR declaration and no need to declare again [ Impact: cleanup, no object code change ] Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-15 07:49:01 -07:00
Jaswinder Singh Rajput	7d9d55e449	x86, mtrr: replace MTRRfix16K_80000_MSR with msr-index's MSR_MTRRfix16K_80000 Use standard msr-index.h's MSR declaration and no need to declare again [ Impact: cleanup, no object code change ] Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-15 07:49:00 -07:00
Jaswinder Singh Rajput	a036c7a358	x86, mtrr: replace MTRRfix64K_00000_MSR with msr-index's MSR_MTRRfix64K_00000 Use standard msr-index.h's MSR declaration and no need to declare again. [ Impact: cleanup, no object code change ] Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-15 07:49:00 -07:00
Jaswinder Singh Rajput	d9bcc01d58	x86, mtrr: replace MTRRcap_MSR with msr-index's MSR_MTRRcap Use standard msr-index.h's MSR declaration and no need to declare again. [ Impact: cleanup, no object code change ] Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-15 07:49:00 -07:00
Peter Zijlstra	60db5e09c1	perf_counter: frequency based adaptive irq_period Instead of specifying the irq_period for a counter, provide a target interrupt frequency and dynamically adapt the irq_period to match this frequency. [ Impact: new perf-counter attribute/feature ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> LKML-Reference: <20090515132018.646195868@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-15 15:26:56 +02:00
Jason Wessel	33ab1979bc	kgdb,i386: use address that SP register points to in the exception frame The treatment of the SP register is different on x86_64 and i386. This is a regression fix that lived outside the mainline kernel from 2.6.27 to now. The regression was a result of the original merge consolidation of the i386 and x86_64 archs to x86. The incorrectly reported SP on i386 prevented stack tracebacks from working correctly in gdb. Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2009-05-15 07:56:25 -05:00
Ingo Molnar	9029a5e380	perf_counter: x86: Protect against infinite loops in intel_pmu_handle_irq() intel_pmu_handle_irq() can lock up in an infinite loop if the hardware does not allow the acking of irqs. Alas, this happened in testing so make this robust and emit a warning if it happens in the future. Also, clean up the IRQ handlers a bit. [ Impact: improve perfcounter irq/nmi handling robustness ] Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-15 09:47:06 +02:00
Ingo Molnar	1c80f4b598	perf_counter: x86: Disallow interval of 1 On certain CPUs i have observed a stuck PMU if interval was set to 1 and NMIs were used. The PMU had PMC0 set in MSR_CORE_PERF_GLOBAL_STATUS, but it was not possible to ack it via MSR_CORE_PERF_GLOBAL_OVF_CTRL, and the NMI loop got stuck infinitely. [ Impact: fix rare hangs during high perfcounter load ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-15 09:47:05 +02:00
Peter Zijlstra	a4016a79fc	perf_counter: x86: Robustify interrupt handling Two consecutive NMIs could daze and confuse the machine when the first would handle the overflow of both counters. [ Impact: fix false-positive syslog messages under multi-session profiling ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-15 09:47:03 +02:00
Peter Zijlstra	9e35ad388b	perf_counter: Rework the perf counter disable/enable The current disable/enable mechanism is: token = hw_perf_save_disable(); ... /* do bits */ ... hw_perf_restore(token); This works well, provided that the use nests properly. Except we don't. x86 NMI/INT throttling has non-nested use of this, breaking things. Therefore provide a reference counter disable/enable interface, where the first disable disables the hardware, and the last enable enables the hardware again. [ Impact: refactor, simplify the PMU disable/enable logic ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-15 09:47:02 +02:00
Peter Zijlstra	962bf7a66e	perf_counter: x86: Fix up the amd NMI/INT throttle perf_counter_unthrottle() restores throttle_ctrl, buts its never set. Also, we fail to disable all counters when throttling. [ Impact: fix rare stuck perf-counters when they are throttled ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-15 09:47:01 +02:00
Peter Zijlstra	a026dfecc0	perf_counter: x86: Allow unpriviliged use of NMIs Apply sysctl_perf_counter_priv to NMIs. Also, fail the counter creation instead of silently down-grading to regular interrupts. [ Impact: allow wider perf-counter usage ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-15 09:46:57 +02:00
Ingo Molnar	f5a5a2f6e6	perf_counter: x86: Fix throttling If counters are disabled globally when a perfcounter IRQ/NMI hits, and if we throttle in that case, we'll promote the '0' value to the next lapic IRQ and disable all perfcounters at that point, permanently ... Fix it. [ Impact: fix hung perfcounters under load ] Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-15 09:46:56 +02:00
Peter Zijlstra	ec3232bdf8	perf_counter: x86: More accurate counter update Take the counter width into account instead of assuming 32 bits. In particular Nehalem has 44 bit wide counters, and all arithmetics should happen on a 44-bit signed integer basis. [ Impact: fix rare event imprecision, warning message on Nehalem ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-15 09:46:54 +02:00
Steven Rostedt	29a679754b	x86/stacktrace: return 0 instead of -1 for stack ops If we return -1 in the ops->stack for the stacktrace saving, we end up breaking out of the loop if the stack we are tracing is in the exception stack. This causes traces like: <idle>-0 [002] 34263.745825: raise_softirq_irqoff <-__blk_complete_request <idle>-0 [002] 34263.745826: <= 0 <= 0 <= 0 <= 0 <= 0 <= 0 <= 0 By returning "0" instead, the irq stack is saved as well, and we see: <idle>-0 [003] 883.280992: raise_softirq_irqoff <-__hrtimer_star t_range_ns <idle>-0 [003] 883.280992: <= hrtimer_start_range_ns <= tick_nohz_restart_sched_tick <= cpu_idle <= start_secondary <= <= 0 <= 0 [ Impact: record stacks from interrupts ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-05-14 23:19:09 -04:00
Steven Rostedt	aa512a27e9	x86/function-graph: fix constraint for recording old return value After upgrading from gcc 4.2.2 to 4.4.0, the function graph tracer broke. Investigating, I found that in the asm that replaces the return value, gcc was using the same register for the old value as it was for the new value. mov (addr), old mov new, (addr) But if old and new are the same register, we clobber new with old! I first thought this was a bug in gcc 4.4.0 and reported it: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40132 Andrew Pinski responded (quickly), saying that it was correct gcc behavior and the code needed to denote old as an "early clobber". Instead of "=r"(old), we need "=&r"(old). [Impact: keep function graph tracer from breaking with gcc 4.4.0 ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-05-13 13:52:19 -04:00
Arun R Bharadwaj	5c333864a6	timers: Identifying the existing pinned timers * Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-04-16 12:11:36]: The following pinned hrtimers have been identified and marked: 1)sched_rt_period_timer 2)tick_sched_timer 3)stack_trace_timer_fn [ tglx: fixup the hrtimer pinned mode ] Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-05-13 16:52:42 +02:00
Randy Dunlap	44408ad736	xen: use header for EXPORT_SYMBOL_GPL mmu.c needs to #include module.h to prevent these warnings: arch/x86/xen/mmu.c:239: warning: data definition has no type or storage class arch/x86/xen/mmu.c:239: warning: type defaults to 'int' in declaration of 'EXPORT_SYMBOL_GPL' arch/x86/xen/mmu.c:239: warning: parameter names (without types) in function declaration [ Impact: cleanup ] Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-13 15:43:55 +02:00
Peter Zijlstra	5bb9efe33e	perf_counter: fix print debug irq disable inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage. bash/15802 [HC0[0]:SC0[0]:HE1:SE1] takes: (sysrq_key_table_lock){?.....}, Don't unconditionally enable interrupts in the perf_counter_print_debug() path. [ Impact: fix potential deadlock pointed out by lockdep ] LKML-Reference: <new-submission> Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>	2009-05-13 08:17:37 +02:00
H. Peter Anvin	c4f68236e4	x86-64: align __PHYSICAL_START, remove __KERNEL_ALIGN Handle the misconfiguration where CONFIG_PHYSICAL_START is incompatible with CONFIG_PHYSICAL_ALIGN. This is a configuration error, but one which arises easily since Kconfig doesn't have the smarts to express the true relationship between these two variables. Hence, align __PHYSICAL_START the same way we align LOAD_PHYSICAL_ADDR in <asm/boot.h>. For non-relocatable kernels, this would cause the boot to fail. [ Impact: fix boot failures for non-relocatable kernels ] Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-12 11:41:42 -07:00
H. Peter Anvin	7ed42a28b2	x86, boot: correct sanity checks in boot/compressed/misc.c arch/x86/boot/compressed/misc.c contains several sanity checks on the output address. Correct constraints that are no longer correct: - the alignment test should be MIN_KERNEL_ALIGN on both 32 and 64 bits. - the 64 bit maximum address was set to 2^40, which was the limit of one specific x86-64 implementation. Change the test to 2^46, the current Linux limit, and at least try to test the end rather than the beginning. - for non-relocatable kernels, test against LOAD_PHYSICAL_ADDR on both 32 and 64 bits. [ Impact: fix potential boot failure due to invalid tests ] Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2009-05-12 11:33:08 -07:00
Yinghai Lu	4797f6b021	x86: read apic ID in the !acpi_lapic case Ed found that on 32-bit, boot_cpu_physical_apicid is not read right, when the mptable is broken. Interestingly, actually three paths use/set it: 1. acpi: at that time that is already read from reg 2. mptable: only read from mptable 3. no madt, and no mptable, that use default apic id 0 for 64-bit, -1 for 32-bit so we could read the apic id for the 2/3 path. We trust the hardware register more than we trust a BIOS data structure (the mptable). We can also avoid the double set_fixmap() when acpi_lapic is used, and also need to move cpu_has_apic earlier and call apic_disable(). Also when need to update the apic id, we'd better read and set the apic version as well - so that quirks are applied precisely. v2: make path 3 with 64bit, use -1 as apic id, so could read it later. v3: fix whitespace problem pointed out by Ed Swierk v5: fix boot crash [ Impact: get correct apic id for bsp other than acpi path ] Reported-by: Ed Swierk <eswierk@aristanetworks.com> Signed-off-by: Yinghai Lu <yinghai@kernel.org> Acked-by: Cyrill Gorcunov <gorcunov@openvz.org> LKML-Reference: <49FC85A9.2070702@kernel.org> [ v4: sanity-check in the ACPI case too ] Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-12 12:22:06 +02:00
Ingo Molnar	6cda3eb62e	Merge branch 'x86/apic' into irq/numa Merge reason: both topics modify the APIC code but were able to do it in parallel so far. An upcoming patch generates a conflict so merge them to avoid the conflict. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-12 12:17:36 +02:00

... 5 6 7 8 9 ...

8145 commits