Commit Graph

44314 Commits (08ed38b68099f2a492196414b08a7f5dd8dc3537)

Author SHA1 Message Date
Brice Goglin de3c450704 [PATCH] myri10ge: Full vlan frame in small_bytes
Receive full vlan frames into smalls when running with a jumbo MTU.

Signed-off-by: Brice Goglin <brice@myri.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
2006-12-11 09:54:06 -05:00
Brice Goglin 52ea6fb39b [PATCH] myri10ge: drop contiguous skb routines
Drop the old routines that used the physically contigous skb now
that we use the physical pages. And rename myri10ge_page_rx_done()
to myri10ge_rx_done() as it was previously.

Signed-off-by: Brice Goglin <brice@myri.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
2006-12-11 09:54:06 -05:00
Brice Goglin c7dab99b08 [PATCH] myri10ge: switch to page-based skb
Switch to physical page skb, by calling the new page-based
allocation routines and using myri10ge_page_rx_done().

Signed-off-by: Brice Goglin <brice@myri.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
2006-12-11 09:54:06 -05:00
Brice Goglin dd50f3361f [PATCH] myri10ge: add page-based skb routines
Add physical page skb allocation routines and page based rx_done,
to be used by upcoming patches.

Signed-off-by: Brice Goglin <brice@myri.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
2006-12-11 09:54:06 -05:00
Brice Goglin 6250223e05 [PATCH] myri10ge: indentation cleanups
Indentation cleanups to synchronize to our tree which is automatically
indent'ed.

Signed-off-by: Brice Goglin <brice@myri.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
2006-12-11 09:54:06 -05:00
Stephen Hemminger 7fe26a60e0 [PATCH] chelsio: working NAPI
This driver tries to enable/disable NAPI at runtime, but
does so in an unsafe manner, and the NAPI interrupt handling is
a mess. Replace it with a compile time selected NAPI implementation.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
2006-12-11 09:51:07 -05:00
Haavard Skinnemoen 0f0d84e52c [PATCH] MACB: Use __raw register access
Since macb is a chip-internal device, use __raw_readl and
__raw_writel instead of readl/writel. This will perform native-endian
accesses, which is the right thing to do on both AVR32 and ARM devices.

Signed-off-by: Haavard Skinnemoen <hskinnemoen@atmel.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
2006-12-11 09:31:28 -05:00
Haavard Skinnemoen d836cae4f6 [PATCH] MACB: Use struct delayed_work instead of struct work_struct
The macb driver calls schedule_delayed_work() and friends, so we need
to use a struct delayed_work along with it. The conversion was
explained by David Howells on lkml Dec 5 2006:

http://lkml.org/lkml/2006/12/5/269

Signed-off-by: Haavard Skinnemoen <hskinnemoen@atmel.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
2006-12-11 09:31:28 -05:00
Scott Wood 68dc44af63 [PATCH] ucc_geth: Initialize mdio_lock.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
2006-12-11 09:31:28 -05:00
Scott Wood 1083cfe112 [PATCH] ucc_geth: compilation error fixes
Fix compilation failures when building the ucc_geth driver with spinlock
debugging.

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
2006-12-11 09:31:28 -05:00
Andrew Victor 99eeb8dfb1 AT91 MMC update for 2.6.19
The driver is usable on the newer SAM9 processors so replace all text
references to AT91RM9200 with just AT91.

The controller bug where all the words are byte-swapped is fixed on the
AT91SAM9 processors.  The byte-swapping work-around therefore only needs
to be done if cpu_is_at91rm9200().
[Original patch from Wojtek Kaniewski]

The AT91RM9200 and AT91SAM9260 processors support two MMC/SD slots - the
slot which is connected is now passed via the platform_data and the
correct slot selected in the AT91_MCI_SDCR register.

The driver should not be calling at91_set_gpio_output() since the VCC
pin should have already been configured as an output in the
processor/board setup code.  The driver should call
at91_set_gpio_value().

Signed-off-by: Andrew Victor <andrew@sanpeople.com>
Signed-off-by: Pierre Ossman <drzeus@drzeus.cx>
2006-12-11 12:43:35 +01:00
Pierre Ossman a98087cf81 mmc: Change SDHCI iomem error to a warning
Some controllers report an invalid iomem size, but seem to work
correctly anyway. Change our current error to just a warning and
hope it doesn't cause too much problems.

Signed-off-by: Pierre Ossman <drzeus@drzeus.cx>
2006-12-11 09:48:42 +01:00
Vitaly Wool 7b30d281b9 mmc: fix "prev->state: 2 != TASK_RUNNING??" problem on SD/MMC card removal
Currently on SD/MMC card removal the system exhibits the following message (the platform is ARM Versatile):

    prev->state: 2 != TASK_RUNNING??
    mmcqd/762[CPU#0]: BUG in __schedule at linux-2.6/kernel/sched.c:3826

(akpm: someone tried to fix this, but it's still wrong)

Signed-off-by: Vitaly Wool <vitalywool@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Pierre Ossman <drzeus@drzeus.cx>
2006-12-11 09:48:16 +01:00
Andrew Victor f3a8efa90b AT91 MMC 5 : Minor cleanups
A number of small cleanups to the AT91RM9200 MMC driver:
 - fix warnings generated by pr_debug().
 - prepend "AT91 MMC:" to printk() messages.

Signed-off-by: Andrew Victor <andrew@sanpeople.com>
Signed-off-by: Pierre Ossman <drzeus@drzeus.cx>
2006-12-11 09:47:21 +01:00
Andrew Victor df05a303e3 AT91 MMC 4 : Interrupt handler cleanup
This patch simplifies the AT91RM9200 MMC interrupt handler code so that
it doesn't re-read the Interrupt Status and Interrupt Mask registers
multiple times.

Also defined AT91_MCI_ERRORS instead of using the hard-coded 0xffff0000.

Signed-off-by: Andrew Victor <andrew@sanpeople.com>
Signed-off-by: Pierre Ossman <drzeus@drzeus.cx>
2006-12-11 09:47:12 +01:00
Andrew Victor 3dd3b039d4 AT91 MMC 3 : Move global mci_clk variable
Move the global 'mci_clk' variable into the local 'at91mci_host'
structure.

Signed-off-by: Andrew Victor <andrew@sanpeople.com>
Signed-off-by: Pierre Ossman <drzeus@drzeus.cx>
2006-12-11 09:47:02 +01:00
Andrew Victor 17ea0595f4 AT91 MMC 2 : Use platform resources
Use the I/O base-address and IRQ passed to the driver via the
platform_device resources instead of using hardcoded values.

Signed-off-by: Andrew Victor <andrew@sanpeople.com>
Signed-off-by: Pierre Ossman <drzeus@drzeus.cx>
2006-12-11 09:46:51 +01:00
Andrew Victor e0b19b8365 AT91 MMC 1: Pass host structure.
The I/O base address is now stored in the 'at91mci_host' structure.  We
therefore have to pass this structure to at91_mci_read() and
at91_mci_write().

Signed-off-by: Andrew Victor <andrew@sanpeople.com>
Signed-off-by: Pierre Ossman <drzeus@drzeus.cx>
2006-12-11 09:46:37 +01:00
Jeremy Fitzhardinge 73c9ceab40 [POWERPC] Generic BUG for powerpc
This makes powerpc use the generic BUG machinery.  The biggest reports the
function name, since it is redundant with kallsyms, and not needed in general.

There is an overall reduction of code, since module_32/64 duplicated several
functions.

Unfortunately there's no way to tell gcc that BUG won't return, so the BUG
macro includes a goto loop.  This will generate a real jmp instruction, which
is never used.

[akpm@osdl.org: build fix]
[paulus@samba.org: remove infinite loop in BUG_ON]
Signed-off-by: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Andi Kleen <ak@muc.de>
Cc: Hugh Dickens <hugh@veritas.com>
Cc: Michael Ellerman <michael@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-12-11 16:35:07 +11:00
Paul Mackerras 973c1fabc7 Merge branch 'for_paulus' of master.kernel.org:/pub/scm/linux/kernel/git/galak/powerpc 2006-12-11 16:31:42 +11:00
Kumar Gala d10f73480b [PPC] Fix compile failure do to introduction of PHY_POLL
PHY_POLL is defined in <linux/phy.h> include it in <linux/fsl_devices.h> so
board code will have it defined.

Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
2006-12-10 23:26:16 -06:00
Kumar Gala c86c676cca Merge branch '85xx' into for_paulus 2006-12-10 23:16:08 -06:00
Kumar Gala 45d8e7aaf4 [POWERPC] Only export __mtdcr/__mfdcr if CONFIG_PPC_DCR is set
On 85xx we don't build in dcr support because the core doesn't implement the
instructions.  This caused problems when building an 85xx kernel.  Additionally
made it so we only build __mtdcr/__mfdcr if we are CONFIG_PPC_DCR_NATIVE.

The 85xx build issue wasPointed out by Dai Haruki.

Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
2006-12-10 23:15:47 -06:00
Benjamin Herrenschmidt 4383162c8f [POWERPC] Remove old dcr.S
When I renamed dcr.S to dcr_low.S (and added dcr.c) it looks like the
old dcr.S file didn't properly get removed.  This fixes it.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-12-11 15:13:47 +11:00
Paul Mackerras 39f44be375 [POWERPC] Fix SPU coredump code for max_fdset removal
Commit bbea9f6966 removed the max_fdset
element of struct fdtable.  It appears that checking max_fds is
sufficient now.

Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-12-11 15:13:37 +11:00
Benjamin Herrenschmidt dae4828d66 [POWERPC] Fix irq routing on some 32-bit PowerMacs
The changes to use pci_read_irq_line() broke interrupt parsing
on some 32-bit powermacs (oops).  The reason is a bit obscure.
The code to parse interrupts happens earlier now, during
pcibios_fixup() as the PCI bus is being probed.  However, the
current implementation pci_device_to_OF_node() for 32-bit
powerpc relies, on machines like PowerMac which renumber PCI buses,
on a table called pci_OF_bus_map containing a map of bus numbers
between the kernel and the firmware which is setup only later.
Thus, it fails to match the device node.  In addition, some of
Apple internal PCI devices lack a proper PCI_INTERRUPT_PIN, thus
preventing the fallback mapping code to work.

This patch fixes it by making pci_device_to_OF_node() 32-bit
implementation use a different algorithm that works without
using the pci_OF_bus_map thing (which I intend to deprecate
anyway). It's a bit slower but that function isn't called in
any hot path hopefully.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2006-12-11 14:45:53 +11:00
Geoff Levand 74e95d5de9 [POWERPC] ps3: Add vuart support
Adds support for the PS3 virtual UART (vuart).  The vuart provides a
bi-directional byte stream data link between logical partitions.

This is needed for the ps3 graphics driver and the ps3 power
control support to be able to communicate with the lv1 policy
module.

Signed-off-by: Geoff Levand <geoffrey.levand@am.sony.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-12-11 13:49:53 +11:00
Paul Mackerras 0204568a08 [POWERPC] Support ibm,dynamic-reconfiguration-memory nodes
For PAPR partitions with large amounts of memory, the firmware has an
alternative, more compact representation for the information about the
memory in the partition and its NUMA associativity information.  This
adds the code to the kernel to parse this alternative representation.

The other part of this patch is telling the firmware that we can
handle the alternative representation.  There is however a subtlety
here, because the firmware will invoke a reboot if the memory
representation we request is different from the representation that
firmware is currently using.  This is because firmware can't change
the representation on the fly.  Further, some firmware versions used
on POWER5+ machines have a bug where this reboot leaves the machine
with an altered value of load-base, which will prevent any kernel
booting until it is reset to the normal value (0x4000).  Because of
this bug, we do NOT set fake_elf.rpanote.new_mem_def = 1, and thus we
do not request the new representation on POWER5+ and earlier machines.
We do request the new representation on POWER6, which uses the
ibm,client-architecture-support call.

Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-12-11 13:49:49 +11:00
Ralf Baechle 9202f32558 [MIPS] Export local_flush_data_cache_page for sake of IDE.
On a CPU with aliases the IDE core needs to flush caches in the special
IDE variants of insw, insl etc.  If IDE support is built as a module this
will only work if local_flush_data_cache_page happens is exported as a
module.

As per policy export local_flush_data_cache_page as GPL symbol only.

Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-12-10 21:52:11 +00:00
Ralf Baechle f8bf35a914 [MIPS] Export pm_power_off
This is required for ipmi_poweroff.c to work as a module.

Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-12-10 21:52:11 +00:00
Ralf Baechle ae32ffd65b [MIPS] Export csum_partial_copy_nocheck.
ibmtr.c and typhoon.c use it.

Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-12-10 21:52:11 +00:00
Ralf Baechle 2d911e9a4e [MIPS] Move die and die_if_kernel() from system.h to ptrace.h
This eleminates the need to include ptrace.h into system.h and fixes a
harmless namespace conflict on the PC symbol in bpck.c.

Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-12-10 21:52:11 +00:00
Ralf Baechle 86384d5441 [MIPS] Discard .exit.text at linktime.
This fixes fairly unobvious breakage of various drivers.

Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-12-10 21:52:11 +00:00
Ralf Baechle 5b1d221e62 [MIPS] Fix build of several IDE drivers by providing pci_get_legacy_ide_irq
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-12-10 21:52:11 +00:00
Herbert Xu 3263263f70 [CRYPTO] dm-crypt: Select CRYPTO_CBC
As CBC is the default chaining method for cryptoloop, we should select
it from cryptoloop to ease the transition.  Spotted by Rene Herman.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:18:57 -08:00
Cal Peake 0258736a0a [PATCH] add MODULE_* attributes to bit reversal library
Add MODULE_* attributes to the new bit reversal library. Most notably
MODULE_LICENSE which prevents superfluous kernel tainting.

Signed-off-by: Cal Peake <cp@absolutedigital.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 10:07:52 -08:00
Linus Torvalds edb16bec41 Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/sparc-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/davem/sparc-2.6:
  [SPARC64]: Fix several kprobes bugs.
  [SPARC64]: Update defconfig.
  [SPARC64]: dma remove extra brackets
  [SPARC{32,64}]: Propagate ptrace_traceme() return value.
  [SPARC64]: Replace kmalloc+memset with kzalloc
  [SPARC]: Check kzalloc() return value in SUN4D irq/iommu init.
  [SPARC]: Replace kmalloc+memset with kzalloc
  [SPARC64]: Run ctrl-alt-del action for sun4v powerdown request.
  [SPARC64]: Unaligned accesses to userspace are hard errors.
  [SPARC64]: Call do_mathemu on illegal instruction traps too.
  [SPARC64]: Update defconfig.
  [SPARC64]: Add irqtrace/stacktrace/lockdep support.
2006-12-10 10:00:00 -08:00
Linus Torvalds bb7320d1d9 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/mchehab/v4l-dvb
* 'master' of master.kernel.org:/pub/scm/linux/kernel/git/mchehab/v4l-dvb: (132 commits)
  V4L/DVB 4949b: Fix container_of pointer retreival
  V4L/DVB (4949a): Fix INIT_WORK
  V4L/DVB (4949): Cxusb: codingstyle cleanups
  V4L/DVB (4948): Cxusb: Convert tuner functions to use dvb_pll_attach
  V4L/DVB (4947): Cx88: trivial cleanups
  V4L/DVB (4946): Cx88: Move cx88_dvb_bus_ctrl out of the card-specific area
  V4L/DVB (4945): Cx88: consolidate cx22702_config structs
  V4L/DVB (4944): Cx88: Convert DViCO FusionHDTV Hybrid to use dvb_pll_attach
  V4L/DVB (4943): Cx88: cleanup dvb_pll_attach for lgdt3302 tuners
  V4L/DVB (4953): Usbvision minor fixes
  V4L/DVB (4951): Add version.h, since it is required for VIDIOC_QUERYCAP
  V4L/DVB (4940): Or51211: Changed SNR and signal strength calculations
  V4L/DVB (4939): Or51132: Changed SNR and signal strength reporting
  V4L/DVB (4938): Cx88: Convert lgdt3302 tuning function to use dvb_pll_attach
  V4L/DVB (4941): Remove LINUX_VERSION_CODE and fix identations
  V4L/DVB (4942): Whitespace cleanups
  V4L/DVB (4937): Usbvision cleanup and code reorganization
  V4L/DVB (4936): Make MT4049FM5 tuner to set FM Gain to Normal
  V4L/DVB (4935): Added the capability of selecting fm gain by tuner
  V4L/DVB (4934): Usbvision radio requires GainNormal at e register
  ...
2006-12-10 09:59:18 -08:00
Avi Kivity 6aa8b732ca [PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net

mailing list: kvm-devel@lists.sourceforge.net
  (http://lists.sourceforge.net/lists/listinfo/kvm-devel)

The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture.  The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace.  Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.

Using this driver, one can start multiple virtual machines on a host.

Each virtual machine is a process on the host; a virtual cpu is a thread in
that process.  kill(1), nice(1), top(1) work as expected.  In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode.  Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm).  Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.

The driver supports i386 and x86_64 hosts and guests.  All combinations are
allowed except x86_64 guest on i386 host.  For i386 guests and hosts, both pae
and non-pae paging modes are supported.

SMP hosts and UP guests are supported.  At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.

Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch.  We plan to address this in two ways:

- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables

Currently a virtual desktop is responsive but consumes a lot of CPU.  Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization.  Linux/X is slower, probably due
to X being in a separate process.

In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.

Caveats (akpm: might no longer be true):

- The Windows install currently bluescreens due to a problem with the
  virtual APIC.  We are working on a fix.  A temporary workaround is to
  use an existing image or install through qemu
- Windows 64-bit does not work.  That's also true for qemu, so it's
  probably a problem with the device model.

[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:57:22 -08:00
Daniel Walker f5f1a24a2c [PATCH] clocksource: small cleanup
Mostly changing alignment.  Just some general cleanup.

[akpm@osdl.org: build fix]
Signed-off-by: Daniel Walker <dwalker@mvista.com>
Acked-by: John Stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:57:22 -08:00
Daniel Walker 2b0137001d [PATCH] clocksource: add usage of CONFIG_SYSFS
Simply adds some ifdefs to remove clocksoure sysfs code when CONFIG_SYSFS
isn't turn on.

Signed-off-by: Daniel Walker <dwalker@mvista.com>
Acked-by: John Stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:57:22 -08:00
Arjan van de Ven 2b2842146c [PATCH] user of the jiffies rounding patch: Slab
This patch introduces users of the round_jiffies() function in the slab code.

The slab code has a few "run every second" timers for background work; these
are obviously not timing critical as long as they happen roughly at the right
frequency.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:57:22 -08:00
Arjan van de Ven 44d306e150 [PATCH] user of the jiffies rounding code: JBD
This patch introduces a user: of the round_jiffies() function; the "5 second"
ext3/jbd wakeup.

While "every 5 seconds" doesn't sound as a problem, there can be many of these
(and these timers do add up over all the kernel).  The "5 second" wakeup isn't
really timing sensitive; in addition even with rounding it'll still happen
every 5 seconds (with the exception of the very first time, which is likely to
be rounded up to somewhere closer to 6 seconds)

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:57:22 -08:00
Arjan van de Ven 4c36a5dec2 [PATCH] round_jiffies infrastructure
Introduce a round_jiffies() function as well as a round_jiffies_relative()
function.  These functions round a jiffies value to the next whole second.
The primary purpose of this rounding is to cause all "we don't care exactly
when" timers to happen at the same jiffy.

This avoids multiple timers firing within the second for no real reason;
with dynamic ticks these extra timers cause wakeups from deep sleep CPU
sleep states and thus waste power.

The exact wakeup moment is skewed by the cpu number, to avoid all cpus from
waking up at the exact same time (and hitting the same lock/cachelines
there)

[akpm@osdl.org: fix variable type]
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:57:22 -08:00
Vadim Lobanov 5466b456ed [PATCH] fdtable: Implement new pagesize-based fdtable allocator
This patch provides an improved fdtable allocation scheme, useful for
expanding fdtable file descriptor entries.  The main focus is on the fdarray,
as its memory usage grows 128 times faster than that of an fdset.

The allocation algorithm sizes the fdarray in such a way that its memory usage
increases in easy page-sized chunks. The overall algorithm expands the allowed
size in powers of two, in order to amortize the cost of invoking vmalloc() for
larger allocation sizes. Namely, the following sizes for the fdarray are
considered, and the smallest that accommodates the requested fd count is
chosen:

    pagesize / 4
    pagesize / 2
    pagesize      <- memory allocator switch point
    pagesize * 2
    pagesize * 4
    ...etc...

Unlike the current implementation, this allocation scheme does not require a
loop to compute the optimal fdarray size, and can be done in efficient
straightline code.

Furthermore, since the fdarray overflows the pagesize boundary long before any
of the fdsets do, it makes sense to optimize run-time by allocating both
fdsets in a single swoop.  Even together, they will still be, by far, smaller
than the fdarray.  The fdtable->open_fds is now used as the anchor for the
fdset memory allocation.

Signed-off-by: Vadim Lobanov <vlobanov@speakeasy.net>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:57:22 -08:00
Vadim Lobanov 4fd45812cb [PATCH] fdtable: Remove the free_files field
An fdtable can either be embedded inside a files_struct or standalone (after
being expanded).  When an fdtable is being discarded after all RCU references
to it have expired, we must either free it directly, in the standalone case,
or free the files_struct it is contained within, in the embedded case.

Currently the free_files field controls this behavior, but we can get rid of
it entirely, as all the necessary information is already recorded.  We can
distinguish embedded and standalone fdtables using max_fds, and if it is
embedded we can divine the relevant files_struct using container_of().

Signed-off-by: Vadim Lobanov <vlobanov@speakeasy.net>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:57:22 -08:00
Vadim Lobanov bbea9f6966 [PATCH] fdtable: Make fdarray and fdsets equal in size
Currently, each fdtable supports three dynamically-sized arrays of data: the
fdarray and two fdsets.  The code allows the number of fds supported by the
fdarray (fdtable->max_fds) to differ from the number of fds supported by each
of the fdsets (fdtable->max_fdset).

In practice, it is wasteful for these two sizes to differ: whenever we hit a
limit on the smaller-capacity structure, we will reallocate the entire fdtable
and all the dynamic arrays within it, so any delta in the memory used by the
larger-capacity structure will never be touched at all.

Rather than hogging this excess, we shouldn't even allocate it in the first
place, and keep the capacities of the fdarray and the fdsets equal.  This
patch removes fdtable->max_fdset.  As an added bonus, most of the supporting
code becomes simpler.

Signed-off-by: Vadim Lobanov <vlobanov@speakeasy.net>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:57:22 -08:00
Vadim Lobanov f3d19c90fb [PATCH] fdtable: Delete pointless code in dup_fd()
The dup_fd() function creates a new files_struct and fdtable embedded inside
that files_struct, and then possibly expands the fdtable using expand_files().

The out_release error path is invoked when expand_files() returns an error
code.  However, when this attempt to expand fails, the fdtable is left in its
original embedded form, so it is pointless to try to free the associated
fdarray and fdsets.

Signed-off-by: Vadim Lobanov <vlobanov@speakeasy.net>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:57:21 -08:00
Zach Brown 5eb6c7a2ab [PATCH] dio: lock refcount operations
The wait_for_more_bios() function name was poorly chosen.  While looking to
clean it up it I noticed that the dio struct refcounting between the bio
completion and dio submission paths was racey.

The bio submission path was simply freeing the dio struct if
atomic_dec_and_test() indicated that it dropped the final reference.

The aio bio completion path was dereferencing its dio struct pointer *after
dropping its reference* based on the remaining number of references.

These two paths could race and result in the aio bio completion path
dereferencing a freed dio, though this was not observed in the wild.

This moves the refcount under the bio lock so that bio completion can drop
its reference and decide to wake all in one atomic step.

Once testing and waking is locked dio_await_one() can test its sleeping
condition and mark itself uninterruptible under the lock.  It gets simpler
and wait_for_more_bios() disappears.

The addition of the interrupt masking spin lock acquiry in dio_bio_submit()
looks alarming.  This lock acquiry existed in that path before the recent
dio completion patch set.  We shouldn't expect significant performance
regression from returning to the behaviour that existed before the
completion clean up work.

This passed 4k block ext3 O_DIRECT fsx and aio-stress on an SMP machine.

Signed-off-by: Zach Brown <zach.brown@oracle.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Suparna Bhattacharya <suparna@in.ibm.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: <xfs-masters@oss.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:57:21 -08:00
Zach Brown 8459d86aff [PATCH] dio: only call aio_complete() after returning -EIOCBQUEUED
The only time it is safe to call aio_complete() is when the ->ki_retry
function returns -EIOCBQUEUED to the AIO core.  direct_io_worker() has
historically done this by relying on its caller to translate positive return
codes into -EIOCBQUEUED for the aio case.  It did this by trying to keep
conditionals in sync.  direct_io_worker() knew when finished_one_bio() was
going to call aio_complete().  It would reverse the test and wait and free the
dio in the cases it thought that finished_one_bio() wasn't going to.

Not surprisingly, it ended up getting it wrong.  'ret' could be a negative
errno from the submission path but it failed to communicate this to
finished_one_bio().  direct_io_worker() would return < 0, it's callers
wouldn't raise -EIOCBQUEUED, and aio_complete() would be called.  In the
future finished_one_bio()'s tests wouldn't reflect this and aio_complete()
would be called for a second time which can manifest as an oops.

The previous cleanups have whittled the sync and async completion paths down
to the point where we can collapse them and clearly reassert the invariant
that we must only call aio_complete() after returning -EIOCBQUEUED.
direct_io_worker() will only return -EIOCBQUEUED when it is not the last to
drop the dio refcount and the aio bio completion path will only call
aio_complete() when it is the last to drop the dio refcount.
direct_io_worker() can ensure that it is the last to drop the reference count
by waiting for bios to drain.  It does this for sync ops, of course, and for
partial dio writes that must fall back to buffered and for aio ops that saw
errors during submission.

This means that operations that end up waiting, even if they were issued as
aio ops, will not call aio_complete() from dio.  Instead we return the return
code of the operation and let the aio core call aio_complete().  This is
purposely done to fix a bug where AIO DIO file extensions would call
aio_complete() before their callers have a chance to update i_size.

Now that direct_io_worker() is explicitly returning -EIOCBQUEUED its callers
no longer have to translate for it.  XFS needs to be careful not to free
resources that will be used during AIO completion if -EIOCBQUEUED is returned.
 We maintain the previous behaviour of trying to write fs metadata for O_SYNC
aio+dio writes.

Signed-off-by: Zach Brown <zach.brown@oracle.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Suparna Bhattacharya <suparna@in.ibm.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: <xfs-masters@oss.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:57:21 -08:00