Commit graph

118804 commits

Author SHA1 Message Date
Tejun Heo
89f97496e8 block: fix __blkdev_get() for removable devices
Commit 0762b8bde9 moved disk_get_part()
in front of recursive get on the whole disk, which caused removable
devices to try disk_get_part() before rescanning after a new media is
inserted, which might fail legit open attempts or give the old
partition.

This patch fixes the problem by moving disk_get_part() after
__blkdev_get() on the whole disk.

This problem was spotted by Borislav Petkov.

Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: Borislav Petkov <petkovbb@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-11-06 08:41:56 +01:00
Suresh Siddha
561920a0d2 generic-ipi: fix the smp_mb() placement
smp_mb() is needed (to make the memory operations visible globally) before
sending the ipi on the sender and the receiver (on Alpha atleast) needs
smp_read_barrier_depends() in the handler before reading the call_single_queue
list in a lock-free fashion.

On x86, x2apic mode register accesses for sending IPI's don't have serializing
semantics. So the need for smp_mb() before sending the IPI becomes more
critical in x2apic mode.

Remove the unnecessary smp_mb() in csd_flag_wait(), as the presence of that
smp_mb() doesn't mean anything on the sender, when the ipi receiver is not
doing any thing special (like memory fence) after clearing the CSD_FLAG_WAIT.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-11-06 08:41:56 +01:00
Mike Anderson
e78042e5b8 blk: move blk_delete_timer call in end_that_request_last
Move the calling  blk_delete_timer to later in end_that_request_last to
address an issue where blkdev_dequeue_request may have add a timer for the
request.

Signed-off-by: Mike Anderson <andmike@linux.vnet.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-11-06 08:41:56 +01:00
Tejun Heo
2920ebbd65 block: add timer on blkdev_dequeue_request() not elv_next_request()
Block queue supports two usage models - one where block driver peeks
at the front of queue using elv_next_request(), processes it and
finishes it and the other where block driver peeks at the front of
queue, dequeue the request using blkdev_dequeue_request() and finishes
it.  The latter is more flexible as it allows the driver to process
multiple commands concurrently.

These two inconsistent usage models affect the block layer
implementation confusing.  For some, elv_next_request() is considered
the issue point while others consider blkdev_dequeue_request() the
issue point.

Till now the inconsistency mostly affect only accounting, so it didn't
really break anything seriously; however, with block layer timeout,
this inconsistency hits hard.  Block layer considers
elv_next_request() the issue point and adds timer but SCSI layer
thinks it was just peeking and when the request can't process the
command right away, it's just left there without further processing.
This makes the request dangling on the timer list and, when the timer
goes off, the request which the SCSI layer and below think is still on
the block queue ends up in the EH queue, causing various problems - EH
hang (failed count goes over busy count and EH never wakes up),
WARN_ON() and oopses as low level driver trying to handle the unknown
command, etc. depending on the timing.

As SCSI midlayer is the only user of block layer timer at the moment,
moving blk_add_timer() to elv_dequeue_request() fixes the problem;
however, this two usage models definitely need to be cleaned up in the
future.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-11-06 08:41:55 +01:00
Jeremy Fitzhardinge
f92131c3dd bio: define __BIOVEC_PHYS_MERGEABLE
Define __BIOVEC_PHYS_MERGEABLE as the default implementation of
BIOVEC_PHYS_MERGEABLE, so that its available for reuse within an
arch-specific definition of BIOVEC_PHYS_MERGEABLE.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-11-06 08:41:55 +01:00
FUJITA Tomonori
43381785a5 block: remove unused ll_new_mergeable()
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-11-06 08:41:55 +01:00
Bjorn Helgaas
da85f865b1 x86: mention ACPI in top-level Kconfig menu
Impact: clarify menuconfig text

Mention ACPI in the top-level menu to give a clue as to where
it lives. This matches what ia64 does.

Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-06 08:16:19 +01:00
NeilBrown
a53a6c8575 md: fix bug in raid10 recovery.
Adding a spare to a raid10 doesn't cause recovery to start.
This is due to an silly type in
  commit 6c2fce2ef6
and so is a bug in 2.6.27 and .28-rc.

Thanks to Thomas Backlund for bisecting to find this.

Cc: Thomas Backlund <tmb@mandriva.org>
Cc: stable@kernel.org

Signed-off-by: NeilBrown <neilb@suse.de>
2008-11-06 17:28:20 +11:00
NeilBrown
cb3ac42b8a md: revert the recent addition of a call to the BLKRRPART ioctl.
It turns out that it is only safe to call blkdev_ioctl when the device
is actually open (as ->bd_disk is set to NULL on last close).  And it
is quite possible for do_md_stop to be called when the device is not
open.  So discard the call to blkdev_ioctl(BLKRRPART) which was
added in
   commit 934d9c23b4

It is just as easy to call this ioctl from userspace when needed (on
mdadm -S) so leave it out of the kernel

Signed-off-by: NeilBrown <neilb@suse.de>
2008-11-06 17:28:01 +11:00
Yinghai Lu
1b48976880 x86: size NR_IRQS on 32-bit systems the same way as 64-bit
Impact: make NR_IRQS big enough for system with lots of apic/pins

If lots of IO_APIC's are there (or can be there), size the same way
as 64-bit, depending on MAX_IO_APICS and NR_CPUS.

This fixes the boot problem reported by Ben Hutchings on a 32-bit
server with 5 IO-APICs and 240 IO-APIC pins.

Signed-off-by: Yinghai <yinghai@kernel.org>
Tested-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-06 07:23:22 +01:00
Ben Hutchings
c78d0cf292 x86: don't allow nr_irqs > NR_IRQS
Impact: fix boot hang on 32-bit systems with more than 224 IO-APIC pins

On some 32-bit systems with a lot of IO-APICs probe_nr_irqs() can
return a value larger than NR_IRQS. This will lead to probe_irq_on()
overrunning the irq_desc array.

I hit this when running net-next-2.6 (close to 2.6.28-rc3) on a
Supermicro dual Xeon system.  NR_IRQS is 224 but probe_nr_irqs() detects
5 IOAPICs and returns 240.  Here are the log messages:

Tue Nov  4 16:53:47 2008 ACPI: IOAPIC (id[0x01] address[0xfec00000] gsi_base[0])
Tue Nov  4 16:53:47 2008 IOAPIC[0]: apic_id 1, version 32, address 0xfec00000, GSI 0-23
Tue Nov  4 16:53:47 2008 ACPI: IOAPIC (id[0x02] address[0xfec81000] gsi_base[24])
Tue Nov  4 16:53:47 2008 IOAPIC[1]: apic_id 2, version 32, address 0xfec81000, GSI 24-47
Tue Nov  4 16:53:47 2008 ACPI: IOAPIC (id[0x03] address[0xfec81400] gsi_base[48])
Tue Nov  4 16:53:47 2008 IOAPIC[2]: apic_id 3, version 32, address 0xfec81400, GSI 48-71
Tue Nov  4 16:53:47 2008 ACPI: IOAPIC (id[0x04] address[0xfec82000] gsi_base[72])
Tue Nov  4 16:53:47 2008 IOAPIC[3]: apic_id 4, version 32, address 0xfec82000, GSI 72-95
Tue Nov  4 16:53:47 2008 ACPI: IOAPIC (id[0x05] address[0xfec82400] gsi_base[96])
Tue Nov  4 16:53:47 2008 IOAPIC[4]: apic_id 5, version 32, address 0xfec82400, GSI 96-119
Tue Nov  4 16:53:47 2008 ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 high edge)
Tue Nov  4 16:53:47 2008 ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
Tue Nov  4 16:53:47 2008 Enabling APIC mode:  Flat.  Using 5 I/O APICs

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Acked-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-06 07:23:21 +01:00
Steve Glendinning
fd9abb3d97 SMSC LAN911x and LAN921x vendor driver
Attached is a driver for SMSC's LAN911x and LAN921x families of embedded
ethernet controllers.

There is an existing smc911x driver in the tree; this is intended to
replace it.  Dustin McIntire (the author of the smc911x driver) has
expressed his support for switching to this driver.

This driver contains workarounds for all known hardware issues, and has
been tested on all flavours of the chip on multiple architectures.

This driver now uses phylib, so this patch also adds support for the
device's internal phy

Signed-off-by: Steve Glendinning <steve.glendinning@smsc.com>
Signed-off-by: Bahadir Balban <Bahadir.Balban@arm.com>
Signed-off-by: Dustin Mcintire <dustin@sensoria.com>
Signed-off-by: Bill Gatliff <bgat@billgatliff.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
2008-11-06 00:58:40 -05:00
Hannes Hering
c5916cf8db ehea: Fix some whitespace issues
This patch removes some trailing whitespaces and spaces before tabs.

Signed-off-by: Hannes Hering <hering2@de.ibm.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
2008-11-06 00:50:56 -05:00
Ben Hutchings
739bb23d72 sfc: Do not reset when hardware monitor detects a fault
The TX watchdog should trigger a reset, but a temperature/power alarm
should not as this is unlikely to solve the problem.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
2008-11-06 00:50:15 -05:00
Ben Hutchings
3e133c44d2 sfc: Use lm87 and lm90 drivers for board temperature/power monitoring
Add board monitoring to periodic work whenever link is down.
For SFE4001, report when a fault has caused the PHY to turn off.
For SFE4002, switch XFP PHY into low-power state in case of a fault.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
2008-11-06 00:50:09 -05:00
Ben Hutchings
f41507245e sfc: Expose flash region storing boot code as MTD
The boot code that appears as a PCI expansion ROM on the SFC4000 is
stored in flash.  Expose this as a standard MTD device to allow for
in-place upgrades.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
2008-11-06 00:49:57 -05:00
Ben Hutchings
0a95f56323 sfc: Clean up non-volatile memory partitioning
Move flash and EEPROM partition boundary constants into spi.h and rename
them to be consistent.

Add a comment on the partitioning.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
2008-11-06 00:49:56 -05:00
Ben Hutchings
2883f552f2 sfc: Correct address of gPXE boot configuration in EEPROM
Due to a hardware bug, the originally assigned range cannot reliably
be used for boot configuration and must not be modifiable through
ethtool.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
2008-11-06 00:49:50 -05:00
Jay Vosburgh
fd989c8332 bonding: alternate agg selection policies for 802.3ad
This patch implements alternative aggregator selection policies
for 802.3ad.  The existing policy, now termed "stable," selects the active
aggregator by greatest bandwidth, and only reselects a new aggregator
if the active aggregator is entirely disabled (no more ports or all ports
down).

	This patch adds two new policies: bandwidth and count, selecting
the active aggregator by total bandwidth (like the stable policy) or by
the number of ports in the aggregator, respectively.  These two policies
also differ from the stable policy in that they will reselect the active
aggregator when availability-related changes occur in the bond (e.g.,
link state change).

	This permits "gang failover" within 802.3ad, allowing redundant
aggregators along parallel paths to always maintain the "best" aggregator
as the active aggregator (rather than having to wait for the active to
entirely fail).

	This patch also updates the driver version to 3.5.0.

Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
2008-11-06 00:49:47 -05:00
Jay Vosburgh
6146b1a4da bonding: Fix ALB mode to balance traffic on VLANs
The current ALB function that processes incoming ARPs
does not handle traffic for VLANs configured above bonding.  This causes
traffic on those VLANs to all be assigned the same slave.  This patch
corrects that misbehavior by locating the bonding interface nested below
the VLAN interface.

	Bug reported by Sven Anders <anders@anduras.de>, who also
tested an earlier version of this patch and confirmed that it resolved
the problem.

Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
2008-11-06 00:49:40 -05:00
Brian Haley
305d552acc bonding: send IPv6 neighbor advertisement on failover
This patch adds better IPv6 failover support for bonding devices,
especially when in active-backup mode and there are only IPv6 addresses
configured, as reported by Alex Sidorenko.

- Creates a new file, net/drivers/bonding/bond_ipv6.c, for the
   IPv6-specific routines.  Both regular bonds and VLANs over bonds
   are supported.

- Adds a new tunable, num_unsol_na, to limit the number of unsolicited
   IPv6 Neighbor Advertisements that are sent on a failover event.
   Default is 1.

- Creates two new IPv6 neighbor discovery functions:

   ndisc_build_skb()
   ndisc_send_skb()

   These were required to support VLANs since we have to be able to
   add the VLAN id to the skb since ndisc_send_na() and friends
   shouldn't be asked to do this.  These two routines are basically
   __ndisc_send() split into two pieces, in a slightly different order.

- Updates Documentation/networking/bonding.txt and bumps the rev of bond
   support to 3.4.0.

On failover, this new code will generate one packet:

- An unsolicited IPv6 Neighbor Advertisement, which helps the switch
   learn that the address has moved to the new slave.

Testing has shown that sending just the NA results in pretty good
behavior when in active-back mode, I saw no lost ping packets for example.

Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
2008-11-06 00:49:37 -05:00
Jarek Poplawski
61c9eaf900 pkt_sched: Fix qdisc len in qdisc_peek_dequeued()
A packet dequeued and stored as gso_skb in qdisc_peek_dequeued() should
be seen as part of the queue for sch->q.qlen queries until it's really
dequeued with qdisc_dequeue_peeked(), so qlen needs additional updating
in these functions. (Updating qstats.backlog shouldn't matter here.)

Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-05 16:02:34 -08:00
Eric W. Biederman
0a36b345ab net: Don't leak packets when a netns is going down
I have been tracking for a while a case where when the
network namespace exits the cleanup gets stck in an
endless precessess of:

unregister_netdevice: waiting for lo to become free. Usage count = 3
unregister_netdevice: waiting for lo to become free. Usage count = 3
unregister_netdevice: waiting for lo to become free. Usage count = 3
unregister_netdevice: waiting for lo to become free. Usage count = 3
unregister_netdevice: waiting for lo to become free. Usage count = 3
unregister_netdevice: waiting for lo to become free. Usage count = 3
unregister_netdevice: waiting for lo to become free. Usage count = 3

It turns out that if you listen on a multicast address an unsubscribe
packet is sent when the network device goes down.   If you shutdown
the network namespace without carefully cleaning up this can trigger
the unsubscribe packet to be sent over the loopback interface while
the network namespace is going down.

All of which is fine except when we drop the packet and forget to
free it leaking the skb and the dst entry attached to.  As it
turns out the dst entry hold a reference to the idev which holds
the dev and keeps everything from being cleaned up.  Yuck!

By fixing my earlier thinko and add the needed kfree_skb and everything
cleans up beautifully. 

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-05 16:00:24 -08:00
Eric W. Biederman
ae33bc40c0 net: Guaranetee the proper ordering of the loopback device.
I was recently hunting a bug that occurred in network namespace
cleanup.  In looking at the code it became apparrent that we have
and will continue to have cases where if we have anything going
on in a network namespace there will be assumptions that the
loopback device is present.   Things like sending igmp unsubscribe
messages when we bring down network devices invokes the routing
code which assumes that at least the loopback driver is present.

Therefore to avoid magic initcall ordering hackery that is hard
to follow and hard to get right insert a call to register the
loopback device directly from net_dev_init().    This guarantes
that the loopback device is the first device registered and
the last network device to go away.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-05 16:00:02 -08:00
Eric W. Biederman
d0c082cea6 netns: Delete virtual interfaces during namespace cleanup
When physical devices are inside of network namespace and that
network namespace terminates we can not make them go away.  We
have to keep them and moving them to the initial network namespace
is the best we can do.

For virtual devices left in a network namespace that is exiting
we have no need to preserve them and we now have the infrastructure
that allows us to delete them.  So delete virtual devices when we
exit a network namespace.  Keeping the necessary user space clean up
after a network namespace exits much more tractable.

Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Acked-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-05 15:59:38 -08:00
Geert Uytterhoeven
dc8a0843a4 [JFFS2] fix race condition in jffs2_lzo_compress()
deflate_mutex protects the globals lzo_mem and lzo_compress_buf.  However,
jffs2_lzo_compress() unlocks deflate_mutex _before_ it has copied out the
compressed data from lzo_compress_buf.  Correct this by moving the mutex
unlock after the copy.

In addition, document what deflate_mutex actually protects.

Cc: stable@kernel.org
Signed-off-by: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
Acked-by: Richard Purdie <rpurdie@openedhand.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2008-11-05 23:22:02 +01:00
Randy Dunlap
b0d5fdef52 net/9p: fix printk format warnings
Fix printk format warnings in net/9p.
Built cleanly on 7 arches.

net/9p/client.c:820: warning: format '%llx' expects type 'long long unsigned int', but argument 4 has type 'u64'
net/9p/client.c:820: warning: format '%llx' expects type 'long long unsigned int', but argument 5 has type 'u64'
net/9p/client.c:867: warning: format '%llx' expects type 'long long unsigned int', but argument 4 has type 'u64'
net/9p/client.c:867: warning: format '%llx' expects type 'long long unsigned int', but argument 5 has type 'u64'
net/9p/client.c:932: warning: format '%llx' expects type 'long long unsigned int', but argument 5 has type 'u64'
net/9p/client.c:932: warning: format '%llx' expects type 'long long unsigned int', but argument 6 has type 'u64'
net/9p/client.c:982: warning: format '%llx' expects type 'long long unsigned int', but argument 4 has type 'u64'
net/9p/client.c:982: warning: format '%llx' expects type 'long long unsigned int', but argument 5 has type 'u64'
net/9p/client.c:1025: warning: format '%llx' expects type 'long long unsigned int', but argument 4 has type 'u64'
net/9p/client.c:1025: warning: format '%llx' expects type 'long long unsigned int', but argument 5 has type 'u64'
net/9p/client.c:1227: warning: format '%llx' expects type 'long long unsigned int', but argument 7 has type 'u64'
net/9p/client.c:1227: warning: format '%llx' expects type 'long long unsigned int', but argument 12 has type 'u64'
net/9p/client.c:1227: warning: format '%llx' expects type 'long long unsigned int', but argument 8 has type 'u64'
net/9p/client.c:1227: warning: format '%llx' expects type 'long long unsigned int', but argument 13 has type 'u64'
net/9p/client.c:1252: warning: format '%llx' expects type 'long long unsigned int', but argument 7 has type 'u64'
net/9p/client.c:1252: warning: format '%llx' expects type 'long long unsigned int', but argument 12 has type 'u64'
net/9p/client.c:1252: warning: format '%llx' expects type 'long long unsigned int', but argument 8 has type 'u64'
net/9p/client.c:1252: warning: format '%llx' expects type 'long long unsigned int', but argument 13 has type 'u64'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2008-11-05 13:19:07 -06:00
Roel Kluin
9f3e9bbe62 unsigned fid->fid cannot be negative
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2008-11-05 13:19:07 -06:00
Huang Weiyi
1558c62149 9p: rdma: remove duplicated #include
Removed duplicated #include <rdma/ib_verbs.h> in
net/9p/trans_rdma.c.

Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2008-11-05 13:19:07 -06:00
Tom Tucker
45abdf1c7b p9: Fix leak of waitqueue in request allocation path
If a T or R fcall cannot be allocated, the function returns an error
but neglects to free the wait queue that was successfully allocated.

If it comes through again a second time this wq will be overwritten
with a new allocation and the old allocation will be leaked.

Also, if the client is subsequently closed, the close path will
attempt to clean up these allocations, so set the req fields to
NULL to avoid duplicate free.

Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2008-11-05 13:19:06 -06:00
Tom Tucker
82b189eaaf 9p: Remove unneeded free of fcall for Flush
T and R fcall are reused until the client is destroyed. There does
not need to be a special case for Flush

Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2008-11-05 13:19:06 -06:00
Tom Tucker
cac23d6505 9p: Make all client spin locks IRQ safe
The client lock must be IRQ safe. Some of the lock acquisition paths
took regular spin locks.

Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2008-11-05 13:19:06 -06:00
Tom Tucker
517ac45af4 9p: rdma: Set trans prior to requesting async connection ops
The RDMA connection manager is fundamentally asynchronous.
Since the async callback context is the client pointer, the
transport in the client struct needs to be set prior to calling
the first async op.

Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2008-11-05 13:19:06 -06:00
Ingo Molnar
9fcd18c9e6 sched: re-tune balancing
Impact: improve wakeup affinity on NUMA systems, tweak SMP systems

Given the fixes+tweaks to the wakeup-buddy code, re-tweak the domain
balancing defaults on NUMA and SMP systems.

Turn on SD_WAKE_AFFINE which was off on x86 NUMA - there's no reason
why we would not want to have wakeup affinity across nodes as well.
(we already do this in the standard NUMA template.)

lat_ctx on a NUMA box is particularly happy about this change:

before:

 |   phoenix:~/l> ./lat_ctx -s 0 2
 |   "size=0k ovr=2.60
 |   2 5.70

after:

 |   phoenix:~/l> ./lat_ctx -s 0 2
 |   "size=0k ovr=2.65
 |   2 2.07

a 2.75x speedup.

pipe-test is similarly happy about it too:

 |  phoenix:~/sched-tests> ./pipe-test
 |   18.26 usecs/loop.
 |   14.70 usecs/loop.
 |   14.38 usecs/loop.
 |   10.55 usecs/loop.              # +WAKE_AFFINE on domain0+domain1
 |   8.63 usecs/loop.
 |   8.59 usecs/loop.
 |   9.03 usecs/loop.
 |   8.94 usecs/loop.
 |   8.96 usecs/loop.
 |   8.63 usecs/loop.

Also:

 - disable SD_BALANCE_NEWIDLE on NUMA and SMP domains (keep it for siblings)
 - enable SD_WAKE_BALANCE on SMP domains

Sysbench+postgresql improves all around the board, quite significantly:

           .28-rc3-11474e2c  .28-rc3-11474e2c-tune
-------------------------------------------------
    1:             571              688    +17.08%
    2:            1236             1206    -2.55%
    4:            2381             2642    +9.89%
    8:            4958             5164    +3.99%
   16:            9580             9574    -0.07%
   32:            7128             8118    +12.20%
   64:            7342             8266    +11.18%
  128:            7342             8064    +8.95%
  256:            7519             7884    +4.62%
  512:            7350             7731    +4.93%
-------------------------------------------------
  SUM:           55412            59341    +6.62%

So it's a win both for the runup portion, the peak area and the tail.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-05 18:04:38 +01:00
Eric W. Biederman
467622ef2a [MTD] [NOR] Fix cfi_send_gen_cmd handling of x16 devices in x8 mode (v4)
For "unlock" cycles to 16bit devices in 8bit compatibility mode we need
to use the byte addresses 0xaaa and 0x555. These effectively match
the word address 0x555 and 0x2aa, except the latter has its low bit set.

Most chips don't care about the value of the 'A-1' pin in x8 mode,
but some -- like the ST M29W320D -- do. So we need to be careful to
set it where appropriate.

cfi_send_gen_cmd is only ever passed addresses where the low byte
is 0x00, 0x55 or 0xaa. Of those, only addresses ending 0xaa are
affected by this patch, by masking in the extra low bit when the device
is known to be in compatibility mode.

[dwmw2: Do it only when (cmd_ofs & 0xff) == 0xaa]
v4: Fix  stupid typo in cfi_build_cmd_addr that failed to compile
    I'm writing this patch way to late at night.
v3: Bring all of the work back into cfi_build_cmd_addr
    including calling of map_bankwidth(map) and cfi_interleave(cfi)
    So every caller doesn't need to.
v2: Only modified the address if we our device_type is larger than our
    bus width.

Cc: stable@kernel.org
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2008-11-05 14:40:25 +01:00
David S. Miller
518a09ef11 tcp: Fix recvmsg MSG_PEEK influence of blocking behavior.
Vito Caputo noticed that tcp_recvmsg() returns immediately from
partial reads when MSG_PEEK is used.  In particular, this means that
SO_RCVLOWAT is not respected.

Simply remove the test.  And this matches the behavior of several
other systems, including BSD.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-05 03:36:01 -08:00
Alexey Dobriyan
efb9a8c28c netfilter: netns ct: walk netns list under RTNL
netns list (just list) is under RTNL. But helper and proto unregistration
happen during rmmod when RTNL is not held, and that's how it was tested:
modprobe/rmmod vs clone(CLONE_NEWNET)/exit.

BUG: unable to handle kernel paging request at 0000000000100100	<===
IP: [<ffffffffa009890f>] nf_conntrack_l4proto_unregister+0x96/0xae [nf_conntrack]
PGD 15e300067 PUD 15e1d8067 PMD 0
Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
last sysfs file: /sys/kernel/uevent_seqnum
CPU 0
Modules linked in: nf_conntrack_proto_sctp(-) nf_conntrack_proto_dccp(-) af_packet iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter ip_tables xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 sr_mod cdrom [last unloaded: nf_conntrack_proto_sctp]
Pid: 16758, comm: rmmod Not tainted 2.6.28-rc2-netns-xfrm #3
RIP: 0010:[<ffffffffa009890f>]  [<ffffffffa009890f>] nf_conntrack_l4proto_unregister+0x96/0xae [nf_conntrack]
RSP: 0018:ffff88015dc1fec8  EFLAGS: 00010212
RAX: 0000000000000000 RBX: 00000000001000f8 RCX: 0000000000000000
RDX: ffffffffa009575c RSI: 0000000000000003 RDI: ffffffffa00956b5
RBP: ffff88015dc1fed8 R08: 0000000000000002 R09: 0000000000000000
R10: 0000000000000000 R11: ffff88015dc1fe48 R12: ffffffffa0458f60
R13: 0000000000000880 R14: 00007fff4c361d30 R15: 0000000000000880
FS:  00007f624435a6f0(0000) GS:ffffffff80521580(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000100100 CR3: 0000000168969000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process rmmod (pid: 16758, threadinfo ffff88015dc1e000, task ffff880179864218)
Stack:
 ffffffffa0459100 0000000000000000 ffff88015dc1fee8 ffffffffa0457934
 ffff88015dc1ff78 ffffffff80253fef 746e6e6f635f666e 6f72705f6b636172
 00707463735f6f74 ffffffff8024cb30 00000000023b8010 0000000000000000
Call Trace:
 [<ffffffffa0457934>] nf_conntrack_proto_sctp_fini+0x10/0x1e [nf_conntrack_proto_sctp]
 [<ffffffff80253fef>] sys_delete_module+0x19f/0x1fe
 [<ffffffff8024cb30>] ? trace_hardirqs_on_caller+0xf0/0x114
 [<ffffffff803ea9b2>] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [<ffffffff8020b52b>] system_call_fastpath+0x16/0x1b
Code: 13 35 e0 e8 c4 6c 1a e0 48 8b 1d 6d c6 46 e0 eb 16 48 89 df 4c 89 e2 48 c7 c6 fc 85 09 a0 e8 61 cd ff ff 48 8b 5b 08 48 83 eb 08 <48> 8b 43 08 0f 18 08 48 8d 43 08 48 3d 60 4f 50 80 75 d3 5b 41
RIP  [<ffffffffa009890f>] nf_conntrack_l4proto_unregister+0x96/0xae [nf_conntrack]
 RSP <ffff88015dc1fec8>
CR2: 0000000000100100
---[ end trace bde8ac82debf7192 ]---

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-05 03:03:18 -08:00
Benjamin Thery
e3ec6cfc26 ipv6: fix run pending DAD when interface becomes ready
With some net devices types, an IPv6 address configured while the
interface was down can stay 'tentative' forever, even after the interface
is set up. In some case, pending IPv6 DADs are not executed when the
device becomes ready.

I observed this while doing some tests with kvm. If I assign an IPv6 
address to my interface eth0 (kvm driver rtl8139) when it is still down
then the address is flagged tentative (IFA_F_TENTATIVE). Then, I set
eth0 up, and to my surprise, the address stays 'tentative', no DAD is
executed and the address can't be pinged.

I also observed the same behaviour, without kvm, with virtual interfaces
types macvlan and veth.

Some easy steps to reproduce the issue with macvlan:

1. ip link add link eth0 type macvlan
2. ip -6 addr add 2003::ab32/64 dev macvlan0
3. ip addr show dev macvlan0
   ... 
   inet6 2003::ab32/64 scope global tentative
   ...
4. ip link set macvlan0 up
5. ip addr show dev macvlan0
   ...
   inet6 2003::ab32/64 scope global tentative
   ...
   Address is still tentative

I think there's a bug in net/ipv6/addrconf.c, addrconf_notify():
addrconf_dad_run() is not always run when the interface is flagged IF_READY.
Currently it is only run when receiving NETDEV_CHANGE event. Looks like
some (virtual) devices doesn't send this event when becoming up.

For both NETDEV_UP and NETDEV_CHANGE events, when the interface becomes
ready, run_pending should be set to 1. Patch below.

'run_pending = 1' could be moved below the if/else block but it makes 
the code less readable.

Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-05 01:43:57 -08:00
Eric Dumazet
270acefafe net: sk_free_datagram() should use sk_mem_reclaim_partial()
I noticed a contention on udp_memory_allocated on regular UDP applications.

While tcp_memory_allocated is seldom used, it appears each incoming UDP frame
is currently touching udp_memory_allocated when queued, and when received by
application.

One possible solution is to use sk_mem_reclaim_partial() instead of
sk_mem_reclaim(), so that we keep a small reserve (less than one page)
of memory for each UDP socket.

We did something very similar on TCP side in commit
9993e7d313
([TCP]: Do not purge sk_forward_alloc entirely in tcp_delack_timer())

A more complex solution would need to convert prot->memory_allocated to
use a percpu_counter with batches of 64 or 128 pages.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-05 01:38:06 -08:00
Randy Dunlap
b22cecdd8f net/9p: fix printk format warnings
Fix printk format warnings in net/9p.
Built cleanly on 7 arches.

net/9p/client.c:820: warning: format '%llx' expects type 'long long unsigned int', but argument 4 has type 'u64'
net/9p/client.c:820: warning: format '%llx' expects type 'long long unsigned int', but argument 5 has type 'u64'
net/9p/client.c:867: warning: format '%llx' expects type 'long long unsigned int', but argument 4 has type 'u64'
net/9p/client.c:867: warning: format '%llx' expects type 'long long unsigned int', but argument 5 has type 'u64'
net/9p/client.c:932: warning: format '%llx' expects type 'long long unsigned int', but argument 5 has type 'u64'
net/9p/client.c:932: warning: format '%llx' expects type 'long long unsigned int', but argument 6 has type 'u64'
net/9p/client.c:982: warning: format '%llx' expects type 'long long unsigned int', but argument 4 has type 'u64'
net/9p/client.c:982: warning: format '%llx' expects type 'long long unsigned int', but argument 5 has type 'u64'
net/9p/client.c:1025: warning: format '%llx' expects type 'long long unsigned int', but argument 4 has type 'u64'
net/9p/client.c:1025: warning: format '%llx' expects type 'long long unsigned int', but argument 5 has type 'u64'
net/9p/client.c:1227: warning: format '%llx' expects type 'long long unsigned int', but argument 7 has type 'u64'
net/9p/client.c:1227: warning: format '%llx' expects type 'long long unsigned int', but argument 12 has type 'u64'
net/9p/client.c:1227: warning: format '%llx' expects type 'long long unsigned int', but argument 8 has type 'u64'
net/9p/client.c:1227: warning: format '%llx' expects type 'long long unsigned int', but argument 13 has type 'u64'
net/9p/client.c:1252: warning: format '%llx' expects type 'long long unsigned int', but argument 7 has type 'u64'
net/9p/client.c:1252: warning: format '%llx' expects type 'long long unsigned int', but argument 12 has type 'u64'
net/9p/client.c:1252: warning: format '%llx' expects type 'long long unsigned int', but argument 8 has type 'u64'
net/9p/client.c:1252: warning: format '%llx' expects type 'long long unsigned int', but argument 13 has type 'u64'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-05 01:35:55 -08:00
Peter Zijlstra
02479099c2 sched: fix buddies for group scheduling
Impact: scheduling order fix for group scheduling

For each level in the hierarchy, set the buddy to point to the right entity.
Therefore, when we do the hierarchical schedule, we have a fair chance of
ending up where we meant to.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-05 10:30:15 +01:00
Peter Zijlstra
4793241be4 sched: backward looking buddy
Impact: improve/change/fix wakeup-buddy scheduling

Currently we only have a forward looking buddy, that is, we prefer to
schedule to the task we last woke up, under the presumption that its
going to consume the data we just produced, and therefore will have
cache hot benefits.

This allows co-waking producer/consumer task pairs to run ahead of the
pack for a little while, keeping their cache warm. Without this, we
would interleave all pairs, utterly trashing the cache.

This patch introduces a backward looking buddy, that is, suppose that
in the above scenario, the consumer preempts the producer before it
can go to sleep, we will therefore miss the wakeup from consumer to
producer (its already running, after all), breaking the cycle and
reverting to the cache-trashing interleaved schedule pattern.

The backward buddy will try to schedule back to the task that woke us
up in case the forward buddy is not available, under the assumption
that the last task will be the one with the most cache hot task around
barring current.

This will basically allow a task to continue after it got preempted.

In order to avoid starvation, we allow either buddy to get wakeup_gran
ahead of the pack.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-05 10:30:14 +01:00
Peter Zijlstra
d95f98d069 sched: fix fair preempt check
Impact: fix cross-class preemption

Inter-class wakeup preemptions should go on class order.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-05 10:30:13 +01:00
Peter Zijlstra
f4b6755fb3 sched: cleanup fair task selection
Impact: cleanup

Clean up task selection

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-05 10:30:13 +01:00
Stephen Rothwell
454666eb78 powerpc: Fix "unused variable" warning in pci_dlpar.c
This gets rid of this build warning:

arch/powerpc/platforms/pseries/pci_dlpar.c: In function 'init_phb_dynamic':
arch/powerpc/platforms/pseries/pci_dlpar.c:192: warning: unused variable 'b'

This is one of the very few warnings left in a ppc64_defconfig build and
getting rid of it will make it easier to see future introduced ones (in
fact this was introduced very recently).

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2008-11-05 19:59:08 +11:00
Alexey Dobriyan
9c8b4aff18 powerpc/cell: Fix compile error in ras.c
This fixes this error on Cell when CONFIG_KEXEC = n:

arch/powerpc/platforms/cell/ras.c:299: error: implicit declaration of function 'crash_shutdown_register'

We have to include <asm/kexec.h> because it contains the dummy
definition of crash_shutdown_register that is used when
CONFIG_KEXEC=n, but <linux/kexec.h> doesn't include <asm/kexec.h> in
that case.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2008-11-05 19:59:08 +11:00
Alexey Dobriyan
fce4d58353 powerpc/ps3: Fix compile error in ps3-lpm.c
Compiling with CONFIG_SMP = n and CONFIG_PS3_LPM != n gives this error:

drivers/ps3/ps3-lpm.c:838: error: implicit declaration of function 'get_hard_smp_processor_id'

This fixes it.  We have to include <asm/smp.h> rather than
<linux/smp.h> because the UP definition of get_hard_smp_processor_id()
is in <asm/smp.h>, and <linux/smp.h> only includes <asm/smp.h> if
CONFIG_SMP = y.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Geoff Levand <geoffrey.levand@am.sony.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2008-11-05 19:59:08 +11:00
Gerrit Renker
d99a7bd210 dccp: Cleanup routines for feature negotiation
This inserts the required de-allocation routines for memory allocated
by feature negotiation in the socket destructors, replacing
dccp_feat_clean() in one instance.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-04 23:56:30 -08:00
Gerrit Renker
ac75773c27 dccp: Per-socket initialisation of feature negotiation
This provides feature-negotiation initialisation for both DCCP sockets
and DCCP request_sockets, to support feature negotiation during
connection setup.

It also resolves a FIXME regarding the congestion control
initialisation.

Thanks to Wei Yongjun for help with the IPv6 side of this patch.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-04 23:55:49 -08:00
Gerrit Renker
61e6473efb dccp: List management for new feature negotiation
This adds list initial fields and list management functions for the
new feature negotiation implementation.

Thanks to Arnaldo for suggestions and improvements.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-04 23:54:04 -08:00