* 'x86-xen-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (42 commits)
xen: cache cr0 value to avoid trap'n'emulate for read_cr0
xen/x86-64: clean up warnings about IST-using traps
xen/x86-64: fix breakpoints and hardware watchpoints
xen: reserve Xen start_info rather than e820 reserving
xen: add FIX_TEXT_POKE to fixmap
lguest: update lazy mmu changes to match lguest's use of kvm hypercalls
xen: honour VCPU availability on boot
xen: add "capabilities" file
xen: drop kexec bits from /sys/hypervisor since kexec isn't implemented yet
xen/sys/hypervisor: change writable_pt to features
xen: add /sys/hypervisor support
xen/xenbus: export xenbus_dev_changed
xen: use device model for suspending xenbus devices
xen: remove suspend_cancel hook
xen/dev-evtchn: clean up locking in evtchn
xen: export ioctl headers to userspace
xen: add /dev/xen/evtchn driver
xen: add irq_from_evtchn
xen: clean up gate trap/interrupt constants
xen: set _PAGE_NX in __supported_pte_mask before pagetable construction
...
Xiaohui Xin and some other folks at Intel have been looking into what's
behind the performance hit of paravirt_ops when running native.
It appears that the hit is entirely due to the paravirtualized
spinlocks introduced by:
| commit 8efcbab674
| Date: Mon Jul 7 12:07:51 2008 -0700
|
| paravirt: introduce a "lock-byte" spinlock implementation
The extra call/return in the spinlock path is somehow
causing an increase in the cycles/instruction of somewhere around 2-7%
(seems to vary quite a lot from test to test). The working theory is
that the CPU's pipeline is getting upset about the
call->call->locked-op->return->return, and seems to be failing to
speculate (though I haven't seen anything definitive about the precise
reasons). This doesn't entirely make sense, because the performance
hit is also visible on unlock and other operations which don't involve
locked instructions. But spinlock operations clearly swamp all the
other pvops operations, even though I can't imagine that they're
nearly as common (there's only a .05% increase in instructions
executed).
If I disable just the pv-spinlock calls, my tests show that pvops is
identical to non-pvops performance on native (my measurements show that
it is actually about .1% faster, but Xiaohui shows a .05% slowdown).
Summary of results, averaging 10 runs of the "mmperf" test, using a
no-pvops build as baseline:
nopv Pv-nospin Pv-spin
CPU cycles 100.00% 99.89% 102.18%
instructions 100.00% 100.10% 100.15%
CPI 100.00% 99.79% 102.03%
cache ref 100.00% 100.84% 100.28%
cache miss 100.00% 90.47% 88.56%
cache miss rate 100.00% 89.72% 88.31%
branches 100.00% 99.93% 100.04%
branch miss 100.00% 103.66% 107.72%
branch miss rt 100.00% 103.73% 107.67%
wallclock 100.00% 99.90% 102.20%
The clear effect here is that the 2% increase in CPI is
directly reflected in the final wallclock time.
(The other interesting effect is that the more ops are
out of line calls via pvops, the lower the cache access
and miss rates. Not too surprising, but it suggests that
the non-pvops kernel is over-inlined. On the flipside,
the branch misses go up correspondingly...)
So, what's the fix?
Paravirt patching turns all the pvops calls into direct calls, so
_spin_lock etc do end up having direct calls. For example, the compiler
generated code for paravirtualized _spin_lock is:
<_spin_lock+0>: mov %gs:0xb4c8,%rax
<_spin_lock+9>: incl 0xffffffffffffe044(%rax)
<_spin_lock+15>: callq *0xffffffff805a5b30
<_spin_lock+22>: retq
The indirect call will get patched to:
<_spin_lock+0>: mov %gs:0xb4c8,%rax
<_spin_lock+9>: incl 0xffffffffffffe044(%rax)
<_spin_lock+15>: callq <__ticket_spin_lock>
<_spin_lock+20>: nop; nop /* or whatever 2-byte nop */
<_spin_lock+22>: retq
One possibility is to inline _spin_lock, etc, when building an
optimised kernel (ie, when there's no spinlock/preempt
instrumentation/debugging enabled). That will remove the outer
call/return pair, returning the instruction stream to a single
call/return, which will presumably execute the same as the non-pvops
case. The downsides arel 1) it will replicate the
preempt_disable/enable code at eack lock/unlock callsite; this code is
fairly small, but not nothing; and 2) the spinlock definitions are
already a very heavily tangled mass of #ifdefs and other preprocessor
magic, and making any changes will be non-trivial.
The other obvious answer is to disable pv-spinlocks. Making them a
separate config option is fairly easy, and it would be trivial to
enable them only when Xen is enabled (as the only non-default user).
But it doesn't really address the common case of a distro build which
is going to have Xen support enabled, and leaves the open question of
whether the native performance cost of pv-spinlocks is worth the
performance improvement on a loaded Xen system (10% saving of overall
system CPU when guests block rather than spin). Still it is a
reasonable short-term workaround.
[ Impact: fix pvops performance regression when running native ]
Analysed-by: "Xin Xiaohui" <xiaohui.xin@intel.com>
Analysed-by: "Li Xin" <xin.li@intel.com>
Analysed-by: "Nakajima Jun" <jun.nakajima@intel.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Xen-devel <xen-devel@lists.xensource.com>
LKML-Reference: <4A0B62F7.5030802@goop.org>
[ fixed the help text ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
mmu.c needs to #include module.h to prevent these warnings:
arch/x86/xen/mmu.c:239: warning: data definition has no type or storage class
arch/x86/xen/mmu.c:239: warning: type defaults to 'int' in declaration of 'EXPORT_SYMBOL_GPL'
arch/x86/xen/mmu.c:239: warning: parameter names (without types) in function declaration
[ Impact: cleanup ]
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
stts() is implemented in terms of read_cr0/write_cr0 to update the
state of the TS bit. This happens during context switch, and so
is fairly performance critical. Rather than falling back to
a trap-and-emulate native read_cr0, implement our own by caching
the last-written value from write_cr0 (the TS bit is the only one
we really care about).
Impact: optimise Xen context switches
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Ignore known IST-using traps. Aside from the debugger traps, they're
low-level faults which Xen will handle for us, so the kernel needn't
worry about them. Keep warning in case unknown trap starts using IST.
Impact: suppress spurious warnings
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Native x86-64 uses the IST mechanism to run int3 and debug traps on
an alternative stack. Xen does not do this, and so the frames were
being misinterpreted by the ptrace code. This change special-cases
these two exceptions by using Xen variants which run on the normal
kernel stack properly.
Impact: avoid crash or bad data when IST trap is invoked under Xen
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Use reserve_early rather than e820 reservations for Xen start info and mfn->pfn
table, so that the memory use is a bit more self-documenting.
[ Impact: cleanup ]
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Xen-devel <xen-devel@lists.xensource.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
LKML-Reference: <4A032EF1.6070708@goop.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Conflicts:
arch/frv/include/asm/pgtable.h
arch/x86/include/asm/required-features.h
arch/x86/xen/mmu.c
Merge reason: x86/xen was on a .29 base still, move it to a fresher
branch and pick up Xen fixes as well, plus resolve
conflicts
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The Xen pagetables are no longer implicitly reserved as part of the other
i386_start_kernel reservations, so make sure we explicitly reserve them.
This prevents them from being released into the general kernel free page
pool and reused.
[ Impact: fix Xen guest crash ]
Also-Bisected-by: Bryan Donlan <bdonlan@gmail.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Xen-devel <xen-devel@lists.xensource.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
LKML-Reference: <4A032EEC.30509@goop.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Pass clocksource pointer to the read() callback for clocksources. This
allows us to share the callback between multiple instances.
[hugh@veritas.com: fix powerpc build of clocksource pass clocksource mods]
[akpm@linux-foundation.org: cleanup]
Signed-off-by: Magnus Damm <damm@igel.co.jp>
Acked-by: John Stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-rc1/xen/core' of git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen:
xen: add FIX_TEXT_POKE to fixmap
xen: honour VCPU availability on boot
xen: clean up gate trap/interrupt constants
xen: set _PAGE_NX in __supported_pte_mask before pagetable construction
xen: resume interrupts before system devices.
xen/mmu: weaken flush_tlb_other test
xen/mmu: some early pagetable cleanups
Xen: Add virt_to_pfn helper function
x86-64: remove PGE from must-have feature list
xen: mask XSAVE from cpuid
NULL noise: arch/x86/xen/smp.c
xen: remove xen_load_gdt debug
xen: make xen_load_gdt simpler
xen: clean up xen_load_gdt
xen: split construction of p2m mfn tables from registration
xen: separate p2m allocation from setting
xen: disable preempt for leave_lazy_mmu
Use phys_addr_t for receiving a physical address argument instead of
unsigned long. This allows fixmap to handle pages higher than 4GB on
x86-32.
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
FIX_TEXT_POKE[01] are used to map kernel addresses, so they're mapping
pfns, not mfns.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
FIX_TEXT_POKE[01] are used to map kernel addresses, so they're mapping
pfns, not mfns.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Some 64-bit machines don't support the NX flag in ptes.
Check for NX before constructing the kernel pagetables.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Impact: fixes crashing bug
There's no particular problem with getting an empty cpu mask,
so just shortcut-return if we get one.
Avoids crash reported by Christophe Saout <christophe@saout.de>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
1. make sure early-allocated ptes are pinned, so they can be later
unpinned
2. don't pin pmd+pud, just make them RO
3. scatter some __inits around
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Xen leaves XSAVE set in cpuid, but doesn't allow cr4.OSXSAVE
to be set. This confuses the kernel and it ends up crashing on
an xsetbv instruction.
At boot time, try to set cr4.OSXSAVE, and mask XSAVE out of
cpuid it we can't. This will produce a spurious error from Xen,
but allows us to support XSAVE if/when Xen does.
This also factors out the cpuid mask decisions to boot time.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Fix this sparse warnings:
arch/x86/xen/smp.c:316:52: warning: Using plain integer as NULL pointer
arch/x86/xen/smp.c:421:60: warning: Using plain integer as NULL pointer
Signed-off-by: Hannes Eder <hannes@hanneseder.net>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Remove use of multicall machinery which is unused (gdt loading
is never performance critical). This removes the implicit use
of percpu variables, which simplifies understanding how
the percpu code's use of load_gdt interacts with this code.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Build the p2m_mfn_list_list early with the rest of the p2m table, but
register it later when the real shared_info structure is in place.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
When doing very early p2m setting, we need to separate setting
from allocation, so split things up accordingly.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
xen_mc_flush() requires preemption to be disabled for its own sanity,
so disable it while we're flushing.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
* commit 'origin/master': (4825 commits)
Fix build errors due to CONFIG_BRANCH_TRACER=y
parport: Use the PCI IRQ if offered
tty: jsm cleanups
Adjust path to gpio headers
KGDB_SERIAL_CONSOLE check for module
Change KCONFIG name
tty: Blackin CTS/RTS
Change hardware flow control from poll to interrupt driven
Add support for the MAX3100 SPI UART.
lanana: assign a device name and numbering for MAX3100
serqt: initial clean up pass for tty side
tty: Use the generic RS485 ioctl on CRIS
tty: Correct inline types for tty_driver_kref_get()
splice: fix deadlock in splicing to file
nilfs2: support nanosecond timestamp
nilfs2: introduce secondary super block
nilfs2: simplify handling of active state of segments
nilfs2: mark minor flag for checkpoint created by internal operation
nilfs2: clean up sketch file
nilfs2: super block operations fix endian bug
...
Conflicts:
arch/x86/include/asm/thread_info.h
arch/x86/lguest/boot.c
drivers/xen/manage.c
Some 64-bit machines don't support the NX flag in ptes.
Check for NX before constructing the kernel pagetables.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Impact: fixes crashing bug
There's no particular problem with getting an empty cpu mask,
so just shortcut-return if we get one.
Avoids crash reported by Christophe Saout <christophe@saout.de>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
1. make sure early-allocated ptes are pinned, so they can be later
unpinned
2. don't pin pmd+pud, just make them RO
3. scatter some __inits around
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Xen leaves XSAVE set in cpuid, but doesn't allow cr4.OSXSAVE
to be set. This confuses the kernel and it ends up crashing on
an xsetbv instruction.
At boot time, try to set cr4.OSXSAVE, and mask XSAVE out of
cpuid it we can't. This will produce a spurious error from Xen,
but allows us to support XSAVE if/when Xen does.
This also factors out the cpuid mask decisions to boot time.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Fix this sparse warnings:
arch/x86/xen/smp.c:316:52: warning: Using plain integer as NULL pointer
arch/x86/xen/smp.c:421:60: warning: Using plain integer as NULL pointer
Signed-off-by: Hannes Eder <hannes@hanneseder.net>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Remove use of multicall machinery which is unused (gdt loading
is never performance critical). This removes the implicit use
of percpu variables, which simplifies understanding how
the percpu code's use of load_gdt interacts with this code.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Build the p2m_mfn_list_list early with the rest of the p2m table, but
register it later when the real shared_info structure is in place.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
When doing very early p2m setting, we need to separate setting
from allocation, so split things up accordingly.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
xen_mc_flush() requires preemption to be disabled for its own sanity,
so disable it while we're flushing.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Impact: remove obsolete checks, simplification
Lift restrictions on preemption with lazy mmu mode, as it is now allowed.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Impact: fix lazy context switch API
Pass the previous and next tasks into the context switch start
end calls, so that the called functions can properly access the
task state (esp in end_context_switch, in which the next task
is not yet completely current).
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Impact: allow preemption during lazy mmu updates
If we're in lazy mmu mode when context switching, leave
lazy mmu mode, but remember the task's state in
TIF_LAZY_MMU_UPDATES. When we resume the task, check this
flag and re-enter lazy mmu mode if its set.
This sets things up for allowing lazy mmu mode while preemptible,
though that won't actually be active until the next change.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Impact: simplification, prepare for later changes
Make lazy cpu mode more specific to context switching, so that
it makes sense to do more context-switch specific things in
the callbacks.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Impact: new interface
Add a brk()-like allocator which effectively extends the bss in order
to allow very early code to do dynamic allocations. This is better than
using statically allocated arrays for data in subsystems which may never
get used.
The space for brk allocations is in the bss ELF segment, so that the
space is mapped properly by the code which maps the kernel, and so
that bootloaders keep the space free rather than putting a ramdisk or
something into it.
The bss itself, delimited by __bss_stop, ends before the brk area
(__brk_base to __brk_limit). The kernel text, data and bss is reserved
up to __bss_stop.
Any brk-allocated data is reserved separately just before the kernel
pagetable is built, as that code allocates from unreserved spaces
in the e820 map, potentially allocating from any unused brk memory.
Ultimately any unused memory in the brk area is used in the general
kernel memory pool.
Initially the brk space is set to 1MB, which is probably much larger
than any user needs (the largest current user is i386 head_32.S's code
to build the pagetables to map the kernel, which can get fairly large
with a big kernel image and no PSE support). So long as the system
has sufficient memory for the bootloader to reserve the kernel+1MB brk,
there are no bad effects resulting from an over-large brk.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
The virtually mapped percpu space causes us two problems:
- for hypercalls which take an mfn, we need to do a full pagetable
walk to convert the percpu va into an mfn, and
- when a hypercall requires a page to be mapped RO via all its aliases,
we need to make sure its RO in both the percpu mapping and in the
linear mapping
This primarily affects the gdt and the vcpu info structure.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Xen-devel <xen-devel@lists.xensource.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tejun Heo <htejun@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>