Commit Graph

799 Commits (01627d968c8b5e2810fe8c417b406b968297c236)

Author SHA1 Message Date
Michał Mirosław 6cb6a27c45 net: Call netdev_features_change() from netdev_update_features()
Issue FEAT_CHANGE notification when features are changed by
netdev_update_features().  This will allow changes made by extra constraints
on e.g. MTU change to be properly propagated like changes via ethtool.

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-02 22:48:47 -07:00
Lucas De Marchi 25985edced Fix common misspellings
Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
2011-03-31 11:26:23 -03:00
Daniel Lezcano 79b569f0ec netdev: fix mtu check when TSO is enabled
In case the device where is coming from the packet has TSO enabled,
we should not check the mtu size value as this one could be bigger
than the expected value.

This is the case for the macvlan driver when the lower device has
TSO enabled. The macvlan inherit this feature and forward the packets
without fragmenting them. Then the packets go through dev_forward_skb
and are dropped. This patch fix this by checking TSO is not enabled
when we want to check the mtu size.

Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-30 02:42:17 -07:00
stephen hemminger edf947f100 bridge: notify applications if address of bridge device changes
The mac address of the bridge device may be changed when a new interface
is added to the bridge. If this happens, then the bridge needs to call
the network notifiers to tickle any other systems that care. Since bridge
can be a module, this also means exporting the notifier function.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-27 23:35:02 -07:00
Amerigo Wang 3b261ade42 net: remove useless comments in net/core/dev.c
The code itself can explain what it is doing, no need these comments.

Signed-off-by: WANG Cong <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-27 23:34:59 -07:00
Michał Mirosław 27660515a2 net: implement dev_disable_lro() hw_features compatibility
Implement compatibility with new hw_features for dev_disable_lro().
This is a transition path - dev_disable_lro() should be later
integrated into netdev_fix_features() after all drivers are converted.

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-22 01:00:26 -07:00
Jiri Pirko 8a4eb5734e net: introduce rx_handler results and logic around that
This patch allows rx_handlers to better signalize what to do next to
it's caller. That makes skb->deliver_no_wcard no longer needed.

kernel-doc for rx_handler_result is taken from Nicolas' patch.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Reviewed-by: Nicolas de Pesloüan <nicolas.2p.debian@free.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-16 12:53:54 -07:00
David S. Miller 33175d84ee Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/bnx2x/bnx2x_cmn.c
2011-03-10 14:26:00 -08:00
Vasiliy Kulikov 8909c9ad8f net: don't allow CAP_NET_ADMIN to load non-netdev kernel modules
Since a8f80e8ff9 any process with
CAP_NET_ADMIN may load any module from /lib/modules/.  This doesn't mean
that CAP_NET_ADMIN is a superset of CAP_SYS_MODULE as modules are
limited to /lib/modules/**.  However, CAP_NET_ADMIN capability shouldn't
allow anybody load any module not related to networking.

This patch restricts an ability of autoloading modules to netdev modules
with explicit aliases.  This fixes CVE-2011-1019.

Arnd Bergmann suggested to leave untouched the old pre-v2.6.32 behavior
of loading netdev modules by name (without any prefix) for processes
with CAP_SYS_MODULE to maintain the compatibility with network scripts
that use autoloading netdev modules by aliases like "eth0", "wlan0".

Currently there are only three users of the feature in the upstream
kernel: ipip, ip_gre and sit.

    root@albatros:~# capsh --drop=$(seq -s, 0 11),$(seq -s, 13 34) --
    root@albatros:~# grep Cap /proc/$$/status
    CapInh:	0000000000000000
    CapPrm:	fffffff800001000
    CapEff:	fffffff800001000
    CapBnd:	fffffff800001000
    root@albatros:~# modprobe xfs
    FATAL: Error inserting xfs
    (/lib/modules/2.6.38-rc6-00001-g2bf4ca3/kernel/fs/xfs/xfs.ko): Operation not permitted
    root@albatros:~# lsmod | grep xfs
    root@albatros:~# ifconfig xfs
    xfs: error fetching interface information: Device not found
    root@albatros:~# lsmod | grep xfs
    root@albatros:~# lsmod | grep sit
    root@albatros:~# ifconfig sit
    sit: error fetching interface information: Device not found
    root@albatros:~# lsmod | grep sit
    root@albatros:~# ifconfig sit0
    sit0      Link encap:IPv6-in-IPv4
	      NOARP  MTU:1480  Metric:1

    root@albatros:~# lsmod | grep sit
    sit                    10457  0
    tunnel4                 2957  1 sit

For CAP_SYS_MODULE module loading is still relaxed:

    root@albatros:~# grep Cap /proc/$$/status
    CapInh:	0000000000000000
    CapPrm:	ffffffffffffffff
    CapEff:	ffffffffffffffff
    CapBnd:	ffffffffffffffff
    root@albatros:~# ifconfig xfs
    xfs: error fetching interface information: Device not found
    root@albatros:~# lsmod | grep xfs
    xfs                   745319  0

Reference: https://lkml.org/lkml/2011/2/24/203

Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Kees Cook <kees.cook@canonical.com>
Signed-off-by: James Morris <jmorris@namei.org>
2011-03-10 10:25:19 +11:00
Jiri Pirko e3f48d37cf net: allow handlers to be processed for orig_dev
This was there before, I forgot about this. Allows deliveries to
ptype_base handlers registered for orig_dev. I presume this is still
desired.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Reviewed-by: Nicolas de Pesloüan <nicolas.2p.debian@free.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-07 15:37:16 -08:00
David S. Miller 63d8ea7f93 net: Forgot to commit net/core/dev.c part of Jiri's ->rx_handler patch.
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-28 10:48:59 -08:00
Michał Mirosław 14d1232f49 net: avoid initial "Features changed" message
Avoid "Features changed" message and ndo_set_features call on device
registration caused by automatic enabling of GSO and GRO. Driver should
have enabled hardware offloads it set in features, so the ndo_set_features()
is not needed at registration time.

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-23 14:23:31 -08:00
Michał Mirosław 8e9b59b219 Fix "(unregistered net_device): Features changed" message
Fix netdev_update_features() messages on register time by moving
the call further in register_netdevice(). When
netdev->reg_state != NETREG_REGISTERED, netdev_name() returns
"(unregistered netdevice)" even if the dev's name is already filled.

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-23 14:23:31 -08:00
David S. Miller 2a3bcfdde6 Merge branch 'for-davem' of git://git.kernel.org/pub/scm/linux/kernel/git/bwh/sfc-next-2.6 2011-02-22 10:21:36 -08:00
David S. Miller da935c66ba Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	Documentation/feature-removal-schedule.txt
	drivers/net/e1000e/netdev.c
	net/xfrm/xfrm_policy.c
2011-02-19 19:17:35 -08:00
Eric Dumazet ceaaec98ad net: deinit automatic LIST_HEAD
commit 9b5e383c11 (net: Introduce
unregister_netdevice_many()) left an active LIST_HEAD() in
rollback_registered(), with possible memory corruption.

Even if device is freed without touching its unreg_list (and therefore
touching the previous memory location holding LISTE_HEAD(single), better
close the bug for good, since its really subtle.

(Same fix for default_device_exit_batch() for completeness)

Reported-by: Michal Hocko <mhocko@suse.cz>
Tested-by: Michal Hocko <mhocko@suse.cz>
Reported-by: Eric W. Biderman <ebiderman@xmission.com>
Tested-by: Eric W. Biderman <ebiderman@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Ingo Molnar <mingo@elte.hu>
CC: Octavian Purdila <opurdila@ixiacom.com>
CC: stable <stable@kernel.org> [.33+]
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-18 11:49:36 -08:00
Linus Torvalds f87e6f4793 net: dont leave active on stack LIST_HEAD
Eric W. Biderman and Michal Hocko reported various memory corruptions
that we suspected to be related to a LIST head located on stack, that
was manipulated after thread left function frame (and eventually exited,
so its stack was freed and reused).

Eric Dumazet suggested the problem was probably coming from commit
443457242b (net: factorize
sync-rcu call in unregister_netdevice_many)

This patch fixes __dev_close() and dev_close() to properly deinit their
respective LIST_HEAD(single) before exiting.

References: https://lkml.org/lkml/2011/2/16/304
References: https://lkml.org/lkml/2011/2/14/223

Reported-by: Michal Hocko <mhocko@suse.cz>
Tested-by: Michal Hocko <mhocko@suse.cz>
Reported-by: Eric W. Biderman <ebiderman@xmission.com>
Tested-by: Eric W. Biderman <ebiderman@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Ingo Molnar <mingo@elte.hu>
CC: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-18 11:49:35 -08:00
Michał Mirosław 5455c6998d net: Introduce new feature setting ops
This introduces a new framework to handle device features setting.
It consists of:
  - new fields in struct net_device:
	+ hw_features - features that hw/driver supports toggling
	+ wanted_features - features that user wants enabled, when possible
  - new netdev_ops:
	+ feat = ndo_fix_features(dev, feat) - API checking constraints for
		enabling features or their combinations
	+ ndo_set_features(dev) - API updating hardware state to match
		changed dev->features
  - new ethtool commands:
	+ ETHTOOL_GFEATURES/ETHTOOL_SFEATURES: get/set dev->wanted_features
		and trigger device reconfiguration if resulting dev->features
		changed
	+ ETHTOOL_GSTRINGS(ETH_SS_FEATURES): get feature bits names (meaning)

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-17 14:16:33 -08:00
Michał Mirosław 212b573f55 ethtool: enable GSO and GRO by default
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-17 14:16:32 -08:00
Ben Hutchings 69a19ee60d net: RPS: Make hardware-accelerated RFS conditional on NETIF_F_NTUPLE
For testing and debugging purposes it is useful to be able to disable
hardware acceleration of RFS without disabling RFS altogether.  Since
this is a similar feature to 'n-tuple' flow steering through the
ethtool API, test the same feature flag that controls that.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
2011-02-15 20:36:11 +00:00
David S. Miller f878b995b0 Merge branch 'for-davem' of git://git.kernel.org/pub/scm/linux/kernel/git/bwh/sfc-next-2.6 2011-02-15 12:25:19 -08:00
Ben Hutchings 5c56580b74 net: Adjust TX queue kobjects if number of queues changes during unregister
If the root qdisc for a net device is mqprio, and the driver's
ndo_setup_tc() operation dynamically adds and remvoes TX queues,
netif_set_real_num_tx_queues() will be called during device
unregistration to remove the extra TX queues when the qdisc is
destroyed.  Currently this causes the corresponding kobjects
to be leaked, and the device's reference count never drops to 0.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
2011-02-15 19:45:33 +00:00
Jiri Pirko 1765a57533 net: make dev->master general
dev->master is now tightly connected to bonding driver. This patch makes
this pointer more general and ready to be used by others.

 - netdev_set_master() - bond specifics moved to new function
   netdev_set_bond_master()
 - introduced netif_is_bond_slave() to check if device is a bonding slave

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-13 10:42:07 -08:00
Jiri Pirko d59cfde2fb net: remove the unnecessary dance around skb_bond_should_drop
No need to check (master) twice and to drive in and out the header file.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Reviewed-by: Nicolas de Pesloüan <nicolas.2p.debian@free.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-13 10:42:06 -08:00
David S. Miller 263fb5b1bf Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/e1000e/netdev.c
2011-02-08 17:19:01 -08:00
David S. Miller 8d3bdbd55a net: Fix lockdep regression caused by initializing netdev queues too early.
In commit aa94210411 ("net: init ingress
queue") we moved the allocation and lock initialization of the queues
into alloc_netdev_mq() since register_netdevice() is way too late.

The problem is that dev->type is not setup until the setup()
callback is invoked by alloc_netdev_mq(), and the dev->type is
what determines the lockdep class to use for the locks in the
queues.

Fix this by doing the queue allocation after the setup() callback
runs.

This is safe because the setup() callback is not allowed to make any
state changes that need to be undone on error (memory allocations,
etc.).  It may, however, make state changes that are undone by
free_netdev() (such as netif_napi_add(), which is done by the
ipoib driver's setup routine).

The previous code also leaked a reference to the &init_net namespace
object on RX/TX queue allocation failures.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-08 15:02:50 -08:00
David S. Miller bd4a6974cc Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 2011-02-04 14:28:58 -08:00
Andy Gospodarek 6d152e23ad gro: reset skb_iif on reuse
Like Herbert's change from a few days ago:

66c46d741e gro: Reset dev pointer on reuse

this may not be necessary at this point, but we should still clean up
the skb->skb_iif.  If not we may end up with an invalid valid for
skb->skb_iif when the skb is reused and the check is done in
__netif_receive_skb.

Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-02 14:53:25 -08:00
Tom Herbert 8587523640 net: Check rps_flow_table when RPS map length is 1
In get_rps_cpu, add check that the rps_flow_table for the device is
NULL when trying to take fast path when RPS map length is one.
Without this, RFS is effectively disabled if map length is one which
is not correct.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-31 16:23:42 -08:00
David S. Miller 5403c8a295 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 2011-01-31 13:13:24 -08:00
Herbert Xu 66c46d741e gro: Reset dev pointer on reuse
On older kernels the VLAN code may zero skb->dev before dropping
it and causing it to be reused by GRO.

Unfortunately we didn't reset skb->dev in that case which causes
the next GRO user to get a bogus skb->dev pointer.

This particular problem no longer happens with the current upstream
kernel due to changes in VLAN processing.

However, for correctness we should still reset the skb->dev pointer
in the GRO reuse function in case a future user does the same thing.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-29 22:36:24 -08:00
Eric Dumazet ccf434380d net: fix dev_seq_next()
Commit c6d14c8456 (net: Introduce for_each_netdev_rcu() iterator)
added a race in dev_seq_next().

The rcu_dereference() call should be done _before_ testing the end of
list, or we might return a wrong net_device if a concurrent thread
changes net_device list under us.

Note : discovered thanks to a sparse warning :

net/core/dev.c:3919:9: error: incompatible types in comparison expression
(different address spaces)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-27 15:02:56 -08:00
Michał Mirosław acd1130e87 net: reduce and unify printk level in netdev_fix_features()
Reduce printk() levels to KERN_INFO in netdev_fix_features() as this will
be used by ethtool and might spam dmesg unnecessarily.

This converts the function to use netdev_info() instead of plain printk().

As a side effect, bonding and bridge devices will now log dropped features
on every slave device change.

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-24 15:45:15 -08:00
Michał Mirosław 04ed3e741d net: change netdev->features to u32
Quoting Ben Hutchings: we presumably won't be defining features that
can only be enabled on 64-bit architectures.

Occurences found by `grep -r` on net/, drivers/net, include/

[ Move features and vlan_features next to each other in
  struct netdev, as per Eric Dumazet's suggestion -DaveM ]

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-24 15:32:47 -08:00
Michał Mirosław 57422dc530 net: Move check of checksum features to netdev_fix_features()
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-24 15:29:11 -08:00
Ben Hutchings c445477d74 net: RPS: Enable hardware acceleration of RFS
Allow drivers for multiqueue hardware with flow filter tables to
accelerate RFS.  The driver must:

1. Set net_device::rx_cpu_rmap to a cpu_rmap of the RX completion
IRQs (in queue order).  This will provide a mapping from CPUs to the
queues for which completions are handled nearest to them.

2. Implement net_device_ops::ndo_rx_flow_steer.  This operation adds
or replaces a filter steering the given flow to the given RX queue, if
possible.

3. Periodically remove filters for which rps_may_expire_flow() returns
true.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-24 14:53:01 -08:00
David S. Miller 5bdc22a565 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	net/sched/sch_hfsc.c
	net/sched/sch_htb.c
	net/sched/sch_tbf.c
2011-01-24 14:09:35 -08:00
David S. Miller e92427b289 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6 2011-01-24 13:17:06 -08:00
Eric Dumazet c506653d35 net: arp_ioctl() must hold RTNL
Commit 941666c2e3 "net: RCU conversion of dev_getbyhwaddr() and
arp_ioctl()" introduced a regression, reported by Jamie Heilman.
"arp -Ds 192.168.2.41 eth0 pub" triggered the ASSERT_RTNL() assert
in pneigh_lookup()

Removing RTNL requirement from arp_ioctl() was a mistake, just revert
that part.

Reported-by: Jamie Heilman <jamie@audible.transient.net>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-24 13:16:16 -08:00
Eric Dumazet bb134d2298 net: netif_setup_tc() is static
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-21 13:08:27 -08:00
Eric Dumazet a2da570d62 net_sched: RCU conversion of stab
This patch converts stab qdisc management to RCU, so that we can perform
the qdisc_calculate_pkt_len() call before getting qdisc lock.

This shortens the lock's held time in __dev_xmit_skb().

This permits more qdiscs to get TCQ_F_CAN_BYPASS status, avoiding lot of
cache misses and so reducing latencies.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Patrick McHardy <kaber@trash.net>
CC: Jesper Dangaard Brouer <hawk@diku.dk>
CC: Jarek Poplawski <jarkao2@gmail.com>
CC: Jamal Hadi Salim <hadi@cyberus.ca>
CC: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-20 16:59:32 -08:00
Eric Dumazet 3fbd8758b0 net: dev_close_many() is static
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Octavian Purdila <opurdila@ixiacom.com>
Reviewed-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-20 16:55:30 -08:00
John Fastabend 4f57c087de net: implement mechanism for HW based QOS
This patch provides a mechanism for lower layer devices to
steer traffic using skb->priority to tx queues. This allows
for hardware based QOS schemes to use the default qdisc without
incurring the penalties related to global state and the qdisc
lock. While reliably receiving skbs on the correct tx ring
to avoid head of line blocking resulting from shuffling in
the LLD. Finally, all the goodness from txq caching and xps/rps
can still be leveraged.

Many drivers and hardware exist with the ability to implement
QOS schemes in the hardware but currently these drivers tend
to rely on firmware to reroute specific traffic, a driver
specific select_queue or the queue_mapping action in the
qdisc.

By using select_queue for this drivers need to be updated for
each and every traffic type and we lose the goodness of much
of the upstream work. Firmware solutions are inherently
inflexible. And finally if admins are expected to build a
qdisc and filter rules to steer traffic this requires knowledge
of how the hardware is currently configured. The number of tx
queues and the queue offsets may change depending on resources.
Also this approach incurs all the overhead of a qdisc with filters.

With the mechanism in this patch users can set skb priority using
expected methods ie setsockopt() or the stack can set the priority
directly. Then the skb will be steered to the correct tx queues
aligned with hardware QOS traffic classes. In the normal case with
single traffic class and all queues in this class everything
works as is until the LLD enables multiple tcs.

To steer the skb we mask out the lower 4 bits of the priority
and allow the hardware to configure upto 15 distinct classes
of traffic. This is expected to be sufficient for most applications
at any rate it is more then the 8021Q spec designates and is
equal to the number of prio bands currently implemented in
the default qdisc.

This in conjunction with a userspace application such as
lldpad can be used to implement 8021Q transmission selection
algorithms one of these algorithms being the extended transmission
selection algorithm currently being used for DCB.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-19 23:31:10 -08:00
Vlad Dogaru cbda10fa97 net_device: add support for network device groups
Net devices can now be grouped, enabling simpler manipulation from
userspace. This patch adds a group field to the net_device structure, as
well as rtnetlink support to query and modify it.

Signed-off-by: Vlad Dogaru <ddvlad@rosedu.org>
Acked-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-19 23:31:09 -08:00
Linus Torvalds 1268afe676 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (41 commits)
  sctp: user perfect name for Delayed SACK Timer option
  net: fix can_checksum_protocol() arguments swap
  Revert "netlink: test for all flags of the NLM_F_DUMP composite"
  gianfar: Fix misleading indentation in startup_gfar()
  net/irda/sh_irda: return to RX mode when TX error
  net offloading: Do not mask out NETIF_F_HW_VLAN_TX for vlan.
  USB CDC NCM: tx_fixup() race condition fix
  ns83820: Avoid bad pointer deref in ns83820_init_one().
  ipv6: Silence privacy extensions initialization
  bnx2x: Update bnx2x version to 1.62.00-4
  bnx2x: Fix AER setting for BCM57712
  bnx2x: Fix BCM84823 LED behavior
  bnx2x: Mark full duplex on some external PHYs
  bnx2x: Fix BCM8073/BCM8727 microcode loading
  bnx2x: LED fix for BCM8727 over BCM57712
  bnx2x: Common init will be executed only once after POR
  bnx2x: Swap BCM8073 PHY polarity if required
  iwlwifi: fix valid chain reading from EEPROM
  ath5k: fix locking in tx_complete_poll_work
  ath9k_hw: do PA offset calibration only on longcal interval
  ...
2011-01-19 20:25:45 -08:00
Eric Dumazet d402786ea4 net: fix can_checksum_protocol() arguments swap
commit 0363466866 (net offloading: Convert checksums to use
centrally computed features.) mistakenly swapped can_checksum_protocol()
arguments.

This broke IPv6 on bnx2 for instance, on NIC without TCPv6 checksum
offloads.

Reported-by: Hans de Bruin <jmdebruin@xmsnet.nl>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-19 14:15:21 -08:00
David S. Miller a5db219f4c Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 2011-01-18 16:28:31 -08:00
Jesse Gross 6ee400aafb net offloading: Do not mask out NETIF_F_HW_VLAN_TX for vlan.
In netif_skb_features() we return only the features that are valid for vlans
if we have a vlan packet.  However, we should not mask out NETIF_F_HW_VLAN_TX
since it enables transmission of vlan tags and is obviously valid.

Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-18 16:13:50 -08:00
Linus Torvalds d018b6f4f1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (47 commits)
  GRETH: resolve SMP issues and other problems
  GRETH: handle frame error interrupts
  GRETH: avoid writing bad speed/duplex when setting transfer mode
  GRETH: fixed skb buffer memory leak on frame errors
  GRETH: GBit transmit descriptor handling optimization
  GRETH: fix opening/closing
  GRETH: added raw AMBA vendor/device number to match against.
  cassini: Fix build bustage on x86.
  e1000e: consistent use of Rx/Tx vs. RX/TX/rx/tx in comments/logs
  e1000e: update Copyright for 2011
  e1000: Avoid unhandled IRQ
  r8169: keep firmware in memory.
  netdev: tilepro: Use is_unicast_ether_addr helper
  etherdevice.h: Add is_unicast_ether_addr function
  ks8695net: Use default implementation of ethtool_ops::get_link
  ks8695net: Disable non-working ethtool operations
  USB CDC NCM: Don't deref NULL in cdc_ncm_rx_fixup() and don't use uninitialized variable.
  vxge: Remember to release firmware after upgrading firmware
  netdev: bfin_mac: Remove is_multicast_ether_addr use in netdev_for_each_mc_addr
  ipsec: update MAX_AH_AUTH_LEN to support sha512
  ...
2011-01-14 13:25:30 -08:00
Eric Dumazet 1ac9ad1394 net: remove dev_txq_stats_fold()
After recent changes, (percpu stats on vlan/tunnels...), we dont need
anymore per struct netdev_queue tx_bytes/tx_packets/tx_dropped counters.

Only remaining users are ixgbe, sch_teql, gianfar & macvlan :

1) ixgbe can be converted to use existing tx_ring counters.

2) macvlan incremented txq->tx_dropped, it can use the
dev->stats.tx_dropped counter.

3) sch_teql : almost revert ab35cd4b8f (Use net_device internal stats)
    Now we have ndo_get_stats64(), use it, even for "unsigned long"
fields (No need to bring back a struct net_device_stats)

4) gianfar adds a stats structure per tx queue to hold
tx_bytes/tx_packets

This removes a lockdep warning (and possible lockup) in rndis gadget,
calling dev_get_stats() from hard IRQ context.

Ref: http://www.spinics.net/lists/netdev/msg149202.html

Reported-by: Neil Jones <neiljay@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Jarek Poplawski <jarkao2@gmail.com>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
CC: Sandeep Gopalpet <sandeep.kumar@freescale.com>
CC: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-13 21:44:34 -08:00
Linus Torvalds 008d23e485 Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits)
  Documentation/trace/events.txt: Remove obsolete sched_signal_send.
  writeback: fix global_dirty_limits comment runtime -> real-time
  ppc: fix comment typo singal -> signal
  drivers: fix comment typo diable -> disable.
  m68k: fix comment typo diable -> disable.
  wireless: comment typo fix diable -> disable.
  media: comment typo fix diable -> disable.
  remove doc for obsolete dynamic-printk kernel-parameter
  remove extraneous 'is' from Documentation/iostats.txt
  Fix spelling milisec -> ms in snd_ps3 module parameter description
  Fix spelling mistakes in comments
  Revert conflicting V4L changes
  i7core_edac: fix typos in comments
  mm/rmap.c: fix comment
  sound, ca0106: Fix assignment to 'channel'.
  hrtimer: fix a typo in comment
  init/Kconfig: fix typo
  anon_inodes: fix wrong function name in comment
  fix comment typos concerning "consistent"
  poll: fix a typo in comment
  ...

Fix up trivial conflicts in:
 - drivers/net/wireless/iwlwifi/iwl-core.c (moved to iwl-legacy.c)
 - fs/ext4/ext4.h

Also fix missed 'diabled' typo in drivers/net/bnx2x/bnx2x.h while at it.
2011-01-13 10:05:56 -08:00
Eric Dumazet bfe0d0298f net_sched: factorize qdisc stats handling
HTB takes into account skb is segmented in stats updates.
Generalize this to all schedulers.

They should use qdisc_bstats_update() helper instead of manipulating
bstats.bytes and bstats.packets

Add bstats_update() helper too for classes that use
gnet_stats_basic_packed fields.

Note : Right now, TCQ_F_CAN_BYPASS shortcurt can be taken only if no
stab is setup on qdisc.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-10 16:07:54 -08:00
Tom Herbert 36909ea438 net: Add alloc_netdev_mqs function
Added alloc_netdev_mqs function which allows the number of transmit and
receive queues to be specified independenty.  alloc_netdev_mq was
changed to a macro to call the new function.  Also added
alloc_etherdev_mqs with same purpose.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-10 16:05:30 -08:00
Jesse Gross 0363466866 net offloading: Convert checksums to use centrally computed features.
In order to compute the features for other offloads (primarily
scatter/gather), we need to first check the ability of the NIC to
offload the checksum for the packet.  Since we have already computed
this, we can directly use the result instead of figuring it out
again.

Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-09 23:35:35 -08:00
Jesse Gross 02932ce9e2 net offloading: Convert skb_need_linearize() to use precomputed features.
This switches skb_need_linearize() to use the features that have
been centrally computed.  In doing so, this fixes a problem where
scatter/gather should not be used because the card does not support
checksum offloading on that type of packet.  On device registration
we only check that some form of checksum offloading is available if
scatter/gatther is enabled but we must also check at transmission
time.  Examples of this include IPv6 or vlan packets on a NIC that
only supports IPv4 offloading.

Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-09 23:35:35 -08:00
Jesse Gross 91ecb63c07 net offloading: Convert dev_gso_segment() to use precomputed features.
This switches dev_gso_segment() to use the device features computed
by the centralized routine.  In doing so, it fixes a problem where
it would always use dev->features, instead of those appropriate
to the number of vlan tags if any are present.

Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-09 23:35:34 -08:00
Jesse Gross fc741216db net offloading: Pass features into netif_needs_gso().
Now that there is a single function that can compute the device
features relevant to a packet, we don't want to run it for each
offload.  This converts netif_needs_gso() to take the features
of the device, rather than computing them itself.

Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-09 23:35:34 -08:00
Jesse Gross f01a5236bd net offloading: Generalize netif_get_vlan_features().
netif_get_vlan_features() is currently only used by netif_needs_gso(),
so it only concerns itself with GSO features.  However, several other
places also should take into account the contents of the packet when
deciding whether to offload to hardware.  This generalizes the function
to return features about all of the various forms of offloading.  Since
offloads tend to be linked together, this avoids duplicating the logic
in each location (i.e. the scatter/gather code also needs the checksum
logic).

Suggested-by: Michał Mirosław <mirqus@gmail.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-09 23:35:33 -08:00
Jesse Gross 9497a0518e net offloading: Accept NETIF_F_HW_CSUM for all protocols.
We currently only have software fallback for one type of checksum: the
TCP/UDP one's complement.  This means that a protocol that uses hardware
offloading for a different type of checksum (FCoE, SCTP) must directly
check the device's features and do the right thing ahead of time.  By
the time we get to dev_can_checksum(), we're only deciding whether to
apply the one algorithm in software or hardware.  NETIF_F_HW_CSUM has the
same capabilities as the software version, so we should always use it if
present.  The primary advantage of this is multiply tagged vlans can use
hardware checksumming.

Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-09 23:35:33 -08:00
Jiri Kosina 4b7bd36470 Merge branch 'master' into for-next
Conflicts:
	MAINTAINERS
	arch/arm/mach-omap2/pm24xx.c
	drivers/scsi/bfa/bfa_fcpim.c

Needed to update to apply fixes for which the old branch was too
outdated.
2010-12-22 18:57:02 +01:00
Eric Dumazet 70978182d4 net: timestamp cloned packet in dev_queue_xmit_nit
Le vendredi 17 décembre 2010 à 10:26 +0100, Eric Dumazet a écrit :

>
> I think we can add this after latest Changli patch :
>
> He does one skb_clone() before calling the sniffers.
> We could set timestamp on this clone, instead of original skb.
>
> Problem solved.
>

[PATCH net-next-2.6] net: timestamp cloned packet in dev_queue_xmit_nit

Now we do one clone of skb if at least one sniffer might take packet,
we also can do the skb timestamping on the clone and let original packet
unchanged.

This is a generalization of commit 8caf153974 (net: sch_netem: Fix an
inconsistency in ingress netem timestamps.)

This way, we can have a good idea when packets are delivered to our
stack (tcpdump -i ifb0), while a tcpdump on original device gives
timestamps right before ingressing.

This also speedup our stack, avoiding taking timestamps if not needed.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Changli Gao <xiaosuo@gmail.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Jarek Poplawski <jarkao2@gmail.com>
Acked-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-21 10:50:38 -08:00
Changli Gao 71d9dec24d net: increase skb->users instead of skb_clone()
In dev_queue_xmit_nit(), we have to clone skbs as we need to mangle skbs,
however, we don't need to clone skbs for all the packet_types.

Except for the first packet_type, we increase skb->users instead of
skb_clone().

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-19 21:50:20 -08:00
Michał Mirosław 55508d601d net: Use skb_checksum_start_offset()
Replace skb->csum_start - skb_headroom(skb) with skb_checksum_start_offset().

Note for usb/smsc95xx: skb->data - skb->head == skb_headroom(skb).

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-16 14:43:14 -08:00
Octavian Purdila 443457242b net: factorize sync-rcu call in unregister_netdevice_many
Add dev_close_many and dev_deactivate_many to factorize another
sync-rcu operation on the netdevice unregister path.

$ modprobe dummy numdummies=10000
$ ip link set dev dummy* up
$ time rmmod dummy

Without the patch           With the patch

real    0m 24.63s           real    0m 5.15s
user    0m 0.00s            user    0m 0.00s
sys     0m 6.05s            sys     0m 5.14s

Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-16 14:04:44 -08:00
Changli Gao b236da6931 net: use NUMA_NO_NODE instead of the magic number -1
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-16 13:16:06 -08:00
Vladislav Zolotarov a3d22a68d7 bnx2x: Take the distribution range definition out of skb_tx_hash()
Move the calcualation of the Tx hash for a given hash range into a separate
function and define the skb_tx_hash(), which calculates a Tx hash for a
[0; dev->real_num_tx_queues - 1] hash values range, using this
function (__skb_tx_hash()).

Signed-off-by: Vladislav Zolotarov <vladz@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-16 13:15:53 -08:00
Eric Dumazet 15c2d75f49 net: call dev_queue_xmit_nit() after skb_dst_drop()
Avoid some atomic ops on dst refcount, calling dev_queue_xmit_nit()
after skb_dst_drop() in dev_hard_start_xmit().

When queueing a packet into af_packet socket, we drop dst anyway.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-08 10:39:54 -08:00
Eric Dumazet 941666c2e3 net: RCU conversion of dev_getbyhwaddr() and arp_ioctl()
Le dimanche 05 décembre 2010 à 09:19 +0100, Eric Dumazet a écrit :

> Hmm..
>
> If somebody can explain why RTNL is held in arp_ioctl() (and therefore
> in arp_req_delete()), we might first remove RTNL use in arp_ioctl() so
> that your patch can be applied.
>
> Right now it is not good, because RTNL wont be necessarly held when you
> are going to call arp_invalidate() ?

While doing this analysis, I found a refcount bug in llc, I'll send a
patch for net-2.6

Meanwhile, here is the patch for net-next-2.6

Your patch then can be applied after mine.

Thanks

[PATCH] net: RCU conversion of dev_getbyhwaddr() and arp_ioctl()

dev_getbyhwaddr() was called under RTNL.

Rename it to dev_getbyhwaddr_rcu() and change all its caller to now use
RCU locking instead of RTNL.

Change arp_ioctl() to use RCU instead of RTNL locking.

Note: this fix a dev refcount bug in llc

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-08 10:07:24 -08:00
Changli Gao aa94210411 net: init ingress queue
The dev field of ingress queue is forgot to initialized, then NULL
pointer dereference happens in qdisc_alloc().

Move inits of tx queues to netif_alloc_netdev_queues().

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-08 09:43:27 -08:00
Michał Mirosław 7903264402 net: Fix too optimistic NETIF_F_HW_CSUM features
NETIF_F_HW_CSUM is a superset of NETIF_F_IP_CSUM+NETIF_F_IPV6_CSUM, but
some drivers miss the difference. Fix this and also fix UFO dependency
on checksumming offload as it makes the same mistake in assumptions.

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Acked-by: Jon Mason <jon.mason@exar.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-06 12:59:04 -08:00
Eric Dumazet f2cd2d3e9b net sched: use xps information for qdisc NUMA affinity
Allocate qdisc memory according to NUMA properties of cpus included in
xps map.

To be effective, qdisc should be (re)setup after changes
of /sys/class/net/eth<n>/queues/tx-<n>/xps_cpus

I added a numa_node field in struct netdev_queue, containing NUMA node
if all cpus included in xps_cpus share same node, else -1.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-01 12:47:42 -08:00
Tom Herbert bf26414510 xps: Add CONFIG_XPS
This patch adds XPS_CONFIG option to enable and disable XPS.  This is
done in the same manner as RPS_CONFIG.  This is also fixes build
failure in XPS code when SMP is not enabled.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-28 18:24:14 -08:00
Tom Herbert 1d24eb4815 xps: Transmit Packet Steering
This patch implements transmit packet steering (XPS) for multiqueue
devices.  XPS selects a transmit queue during packet transmission based
on configuration.  This is done by mapping the CPU transmitting the
packet to a queue.  This is the transmit side analogue to RPS-- where
RPS is selecting a CPU based on receive queue, XPS selects a queue
based on the CPU (previously there was an XPS patch from Eric
Dumazet, but that might more appropriately be called transmit completion
steering).

Each transmit queue can be associated with a number of CPUs which will
use the queue to send packets.  This is configured as a CPU mask on a
per queue basis in:

/sys/class/net/eth<n>/queues/tx-<n>/xps_cpus

The mappings are stored per device in an inverted data structure that
maps CPUs to queues.  In the netdevice structure this is an array of
num_possible_cpu structures where each structure holds and array of
queue_indexes for queues which that CPU can use.

The benefits of XPS are improved locality in the per queue data
structures.  Also, transmit completions are more likely to be done
nearer to the sending thread, so this should promote locality back
to the socket on free (e.g. UDP).  The benefits of XPS are dependent on
cache hierarchy, application load, and other factors.  XPS would
nominally be configured so that a queue would only be shared by CPUs
which are sharing a cache, the degenerative configuration woud be that
each CPU has it's own queue.

Below are some benchmark results which show the potential benfit of
this patch.  The netperf test has 500 instances of netperf TCP_RR test
with 1 byte req. and resp.

bnx2x on 16 core AMD
   XPS (16 queues, 1 TX queue per CPU)  1234K at 100% CPU
   No XPS (16 queues)                   996K at 100% CPU

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-24 11:44:20 -08:00
Tom Herbert 3853b5841c xps: Improvements in TX queue selection
In dev_pick_tx, don't do work in calculating queue
index or setting
the index in the sock unless the device has more than one queue.  This
allows the sock to be set only with a queue index of a multi-queue
device which is desirable if device are stacked like in a tunnel.

We also allow the mapping of a socket to queue to be changed.  To
maintain in order packet transmission a flag (ooo_okay) has been
added to the sk_buff structure.  If a transport layer sets this flag
on a packet, the transmit queue can be changed for the socket.
Presumably, the transport would set this if there was no possbility
of creating OOO packets (for instance, there are no packets in flight
for the socket).  This patch includes the modification in TCP output
for setting this flag.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-24 11:44:19 -08:00
David S. Miller 6b35308850 net: Export netif_get_vlan_features().
ERROR: "netif_get_vlan_features" [drivers/net/xen-netfront.ko] undefined!

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-15 20:15:03 -08:00
Tom Herbert fe8222406c net: Simplify RX queue allocation
This patch move RX queue allocation to alloc_netdev_mq and freeing of
the queues to free_netdev (symmetric to TX queue allocation).  Each
kobject RX queue takes a reference to the queue's device so that the
device can't be freed before all the kobjects have been released-- this
obviates the need for reference counts specific to RX queues.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-15 10:57:28 -08:00
Tom Herbert ed9af2e839 net: Move TX queue allocation to alloc_netdev_mq
TX queues are now allocated in alloc_netdev_mq and freed in
free_netdev.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-15 10:56:54 -08:00
Jesse Gross 58e998c6d2 offloading: Force software GSO for multiple vlan tags.
We currently use vlan_features to check for TSO support if there is
a vlan tag.  However, it's quite likely that the NIC is not able to
do TSO when there is an arbitrary number of tags.  Therefore if there
is more than one tag (in-band or out-of-band), fall back to software
emulation.

Signed-off-by: Jesse Gross <jesse@nicira.com>
CC: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-15 09:22:53 -08:00
Jesse Gross c8d5bcd1af offloading: Support multiple vlan tags in GSO.
We assume that hardware TSO can't support multiple levels of vlan tags
but we allow it to be done.  Therefore, enable GSO to parse these tags
so we can fallback to software.

Signed-off-by: Jesse Gross <jesse@nicira.com>
CC: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-15 09:22:53 -08:00
Jesse Gross e1e78db628 offloading: Make scatter/gather more tolerant of vlans.
When checking if it is necessary to linearize a packet, we currently
use vlan_features if the packet contains either an in-band or out-
of-band vlan tag.  However, in-band tags aren't special in any way
for scatter/gather since they are part of the packet buffer and are
simply more data to DMA.  Therefore, only use vlan_features for out-
of-band tags, which could potentially have some interaction with
scatter/gather.

Signed-off-by: Jesse Gross <jesse@nicira.com>
CC: Ben Hutchings <bhutchings@solarflare.com>
Reviewed-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-15 09:22:52 -08:00
Joe Perches b194a3674f net/core/dev.c: Update WARN uses
Coalesce long formats.
Add missing newlines.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-09 09:22:31 -08:00
Tom Herbert df32cc193a net: check queue_index from sock is valid for device
In dev_pick_tx recompute the queue index if the value stored in the
socket is greater than or equal to the number of real queues for the
device.  The saved index in the sock structure is not guaranteed to
be appropriate for the egress device (this could happen on a route
change or in presence of tunnelling).  The result of the queue index
being bad would be to return a bogus queue (crash could prersumably
follow).

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-01 12:55:52 -07:00
Uwe Kleine-König b595076a18 tree-wide: fix comment/printk typos
"gadget", "through", "command", "maintain", "maintain", "controller", "address",
"between", "initiali[zs]e", "instead", "function", "select", "already",
"equal", "access", "management", "hierarchy", "registration", "interest",
"relative", "memory", "offset", "already",

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-11-01 15:38:34 -04:00
Ben Hutchings 66c68bcc48 net: NETIF_F_HW_CSUM does not imply FCoE CRC offload
NETIF_F_HW_CSUM indicates the ability to update an TCP/IP-style 16-bit
checksum with the checksum of an arbitrary part of the packet data,
whereas the FCoE CRC is something entirely different.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Cc: stable@kernel.org [2.6.32+]
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-27 11:37:29 -07:00
Ben Hutchings af1905dbec net: Fix some corner cases in dev_can_checksum()
dev_can_checksum() incorrectly returns true in these cases:

1. The skb has both out-of-band and in-band VLAN tags and the device
   supports checksum offload for the encapsulated protocol but only with
   one layer of encapsulation.
2. The skb has a VLAN tag and the device supports generic checksumming
   but not in conjunction with VLAN encapsulation.

Rearrange the VLAN tag checks to avoid these.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-27 11:37:29 -07:00
Eric Dumazet 6e3f7faf3e rps: add __rcu annotations
Add __rcu annotations to :
	(struct netdev_rx_queue)->rps_map
	(struct netdev_rx_queue)->rps_flow_table
	struct rps_sock_flow_table *rps_sock_flow_table;

And use appropriate rcu primitives.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-25 14:18:27 -07:00
Eric Dumazet 198caeca3e ipv6: ip6_ptr rcu annotations
(struct net_device)->ip6_ptr is rcu protected :

add __rcu annotation and proper rcu primitives.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-25 13:09:43 -07:00
David S. Miller 11a766ce91 net: Increase xmit RECURSION_LIMIT to 10.
Three is definitely too low, and we know from reports that GRE tunnels
stacked as deeply as 37 levels cause stack overflows, so pick some
reasonable value between those two.

Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-25 12:51:55 -07:00
Linus Torvalds 5f05647dd8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1699 commits)
  bnx2/bnx2x: Unsupported Ethtool operations should return -EINVAL.
  vlan: Calling vlan_hwaccel_do_receive() is always valid.
  tproxy: use the interface primary IP address as a default value for --on-ip
  tproxy: added IPv6 support to the socket match
  cxgb3: function namespace cleanup
  tproxy: added IPv6 support to the TPROXY target
  tproxy: added IPv6 socket lookup function to nf_tproxy_core
  be2net: Changes to use only priority codes allowed by f/w
  tproxy: allow non-local binds of IPv6 sockets if IP_TRANSPARENT is enabled
  tproxy: added tproxy sockopt interface in the IPV6 layer
  tproxy: added udp6_lib_lookup function
  tproxy: added const specifiers to udp lookup functions
  tproxy: split off ipv6 defragmentation to a separate module
  l2tp: small cleanup
  nf_nat: restrict ICMP translation for embedded header
  can: mcp251x: fix generation of error frames
  can: mcp251x: fix endless loop in interrupt handler if CANINTF_MERRF is set
  can-raw: add msg_flags to distinguish local traffic
  9p: client code cleanup
  rds: make local functions/variables static
  ...

Fix up conflicts in net/core/dev.c, drivers/net/pcmcia/smc91c92_cs.c and
drivers/net/wireless/ath/ath9k/debug.c as per David
2010-10-23 11:47:02 -07:00
David S. Miller 2198a10b50 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	net/core/dev.c
2010-10-21 08:43:05 -07:00
stephen hemminger d0c2b0d265 napi: unexport napi_reuse_skb
The function napi_reuse_skb is only used inside core.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-21 04:26:38 -07:00
Ben Greear d2ed817766 net/core: Allow tagged VLAN packets to flow through VETH devices.
When there are VLANs on a VETH device, the packets being transmitted
through the VETH device may be 4 bytes bigger than MTU.  A check
in dev_forward_skb did not take this into account and so dropped
these packets.

This patch is needed at least as far back as 2.6.34.7 and should
be considered for -stable.

Signed-off-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-21 04:06:29 -07:00
Jesse Gross 3701e51382 vlan: Centralize handling of hardware acceleration.
Currently each driver that is capable of vlan hardware acceleration
must be aware of the vlan groups that are configured and then pass
the stripped tag to a specialized receive function.  This is

different from other types of hardware offload in that it places a
significant amount of knowledge in the driver itself rather keeping
it in the networking core.

This makes vlan offloading function more similarly to other forms
of offloading (such as checksum offloading or TSO) by doing the
following:
* On receive, stripped vlans are passed directly to the network
core, without attempting to check for vlan groups or reconstructing
the header if no group
* vlans are made less special by folding the logic into the main
receive routines
* On transmit, the device layer will add the vlan header in software
if the hardware doesn't support it, instead of spreading that logic
out in upper layers, such as bonding.

There are a number of advantages to this:
* Fixes all bugs with drivers incorrectly dropping vlan headers at once.
* Avoids having to disable VLAN acceleration when in promiscuous mode
(good for bridging since it always puts devices in promiscuous mode).
* Keeps VLAN tag separate until given to ultimate consumer, which
avoids needing to do header reconstruction as in tg3 unless absolutely
necessary.
* Consolidates common code in core networking.

Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-21 01:26:53 -07:00
Jesse Gross 7b9c609037 vlan: Enable software emulation for vlan accleration.
Currently users of hardware vlan accleration need to know whether
the device supports it before generating packets.  However, vlan
acceleration will soon be available in a more flexible manner so
knowing ahead of time becomes much more difficult.  This adds
a software fallback path for vlan packets on devices without the
necessary offloading support, similar to other types of hardware
accleration.

Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-21 01:26:52 -07:00
Tom Herbert e6484930d7 net: allocate tx queues in register_netdevice
This patch introduces netif_alloc_netdev_queues which is called from
register_device instead of alloc_netdev_mq.  This makes TX queue
allocation symmetric with RX allocation.  Also, queue locks allocation
is done in netdev_init_one_queue.  Change set_real_num_tx_queues to
fail if requested number < 1 or greater than number of allocated
queues.

Signed-off-by: Tom Herbert <therbert@google.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-20 02:27:59 -07:00
Tom Herbert bd25fa7ba5 net: cleanups in RX queue allocation
Clean up in RX queue allocation.  In netif_set_real_num_rx_queues
return error on attempt to set zero queues, or requested number is
greater than number of allocated queues.  In netif_alloc_rx_queues,
do BUG_ON if queue_count is zero.

Signed-off-by: Tom Herbert <therbert@google.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-20 02:27:59 -07:00
Tom Herbert 55513fb428 net: fail alloc_netdev_mq if queue count < 1
In alloc_netdev_mq fail if requested queue_count < 1.

Signed-off-by: Tom Herbert <therbert@google.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-20 02:27:58 -07:00
Eric Dumazet 29b4433d99 net: percpu net_device refcount
We tried very hard to remove all possible dev_hold()/dev_put() pairs in
network stack, using RCU conversions.

There is still an unavoidable device refcount change for every dst we
create/destroy, and this can slow down some workloads (routers or some
app servers, mmap af_packet)

We can switch to a percpu refcount implementation, now dynamic per_cpu
infrastructure is mature. On a 64 cpus machine, this consumes 256 bytes
per device.

On x86, dev_hold(dev) code :

before
        lock    incl 0x280(%ebx)
after:
        movl    0x260(%ebx),%eax
        incl    fs:(%eax)

Stress bench :

(Sending 160.000.000 UDP frames,
IP route cache disabled, dual E5540 @2.53GHz,
32bit kernel, FIB_TRIE)

Before:

real    1m1.662s
user    0m14.373s
sys     12m55.960s

After:

real    0m51.179s
user    0m15.329s
sys     10m15.942s

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-12 12:35:25 -07:00
Tom Herbert 4315d834c1 net: Fix rxq ref counting
The rx->count reference is used to track reference counts to the
number of rx-queue kobjects created for the device.  This patch
eliminates initialization of the counter in netif_alloc_rx_queues
and instead increments the counter each time a kobject is created.
This is now symmetric with the decrement that is done when an object is
released.

Signed-off-by: Tom Herbert <therbert@google.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-08 14:34:32 -07:00
Ben Hutchings 4e7f79511e net: Update kernel-doc for netif_set_real_num_rx_queues()
Synchronise the comment with the preceding implementation change.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-08 10:33:39 -07:00
John Fastabend 3d3211ef5c net: netif_set_real_num_rx_queues may cap num_rx_queues at init time
Do not set num_rx_queues in netif_set_real_num_rx_queues() some
drivers will increase the real_num_rx_queues later due to a feature
changes or available interrupts increasing. By setting num_rx_queues
here this ends up creating a cap on the number of rx queues
available.

For example the ixgbe driver sets the max number of queues it intends
to use ever then sets the current number in use with the
netif_set_num_{rx|tx}_queues calls. With the current implementation
the number of rx queues gets limited so when a feature such as DCB
or FCoE is enabled the queues are no longer available.

kobjects will only be allocated for real_num_rx_queues so the waste
in memory is minimal.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-06 23:35:15 -07:00
Eric Dumazet caf586e5f2 net: add a core netdev->rx_dropped counter
In various situations, a device provides a packet to our stack and we
drop it before it enters protocol stack :
- softnet backlog full (accounted in /proc/net/softnet_stat)
- bad vlan tag (not accounted)
- unknown/unregistered protocol (not accounted)

We can handle a per-device counter of such dropped frames at core level,
and automatically adds it to the device provided stats (rx_dropped), so
that standard tools can be used (ifconfig, ip link, cat /proc/net/dev)

This is a generalization of commit 8990f468a (net: rx_dropped
accounting), thus reverting it.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-05 14:47:55 -07:00
Eric Dumazet 24824a09e3 net: dynamic ingress_queue allocation
ingress being not used very much, and net_device->ingress_queue being
quite a big object (128 or 256 bytes), use a dynamic allocation if
needed (tc qdisc add dev eth0 ingress ...)

dev_ingress_queue(dev) helper should be used only with RTNL taken.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-05 00:23:44 -07:00
Eric Dumazet bfa5ae63b8 net: rename netdev rx_queue to ingress_queue
There is some confusion with rx_queue name after RPS, and net drivers
private rx_queue fields.

I suggest to rename "struct net_device"->rx_queue to ingress_queue.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-29 13:25:53 -07:00
Eric Dumazet 745e20f1b6 net: add a recursion limit in xmit path
As tunnel devices are going to be lockless, we need to make sure a
misconfigured machine wont enter an infinite loop.

Add a percpu variable, and limit to three the number of stacked xmits.

Reported-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-29 13:23:09 -07:00
Ben Hutchings 62fe0b40ab net: Allow changing number of RX queues after device allocation
For RPS, we create a kobject for each RX queue based on the number of
queues passed to alloc_netdev_mq().  However, drivers generally do not
determine the numbers of hardware queues to use until much later, so
this usually represents the maximum number the driver may use and not
the actual number in use.

For TX queues, drivers can update the actual number using
netif_set_real_num_tx_queues().  Add a corresponding function for RX
queues, netif_set_real_num_rx_queues().

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-27 22:09:49 -07:00
David S. Miller e40051d134 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/qlcnic/qlcnic_init.c
	net/ipv4/ip_output.c
2010-09-27 01:03:03 -07:00
Eric Dumazet 1b4bf461f0 rps: allocate rx queues in register_netdevice only
Instead of having two places were we allocate dev->_rx, introduce
netif_alloc_rx_queues() helper and call it only from
register_netdevice(), not from alloc_netdev_mq()

Goal is to let drivers change dev->num_rx_queues after allocating netdev
and before registering it.

This also removes a lot of ifdefs in net/core/dev.c

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-26 19:04:07 -07:00
Eric Dumazet c5256c5123 net: propagate NETIF_F_HIGHDMA to vlans
Automatically allows vlans to get NETIF_F_HIGHDMA if underlying device
supports it.

On 32bit arches (and more precisely if CONFIG_HIGHMEM is enabled), it
can help to reduce cost of illegal_highdma() and __skb_linearize()
calls.

Tested on tg3 , bnx2, bonding, this worked very well.

This is a generalization of a patch provided by Yi Zou & Jeff Kirsher.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-26 18:27:15 -07:00
Ingo Molnar 7ed569206e Merge commit 'v2.6.36-rc5' into perf/core
Merge reason: Pick up the latest fixes in -rc5.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-09-21 13:55:11 +02:00
David Lamparter 3b27e10555 netns: keep vlan slaves on master netns move
previously, if a vlan master device was moved from one network namespace
to another, all 802.1q and macvlan slaves were deleted.

we can use dev->reg_state to figure out whether dev_change_net_namespace
is happening, since that won't set dev->reg_state NETREG_UNREGISTERING.
so, this changes 8021q and macvlan to ignore NETDEV_UNREGISTER when
reg_state is not NETREG_UNREGISTERING.

Signed-off-by: David Lamparter <equinox@diac24.net>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-17 16:46:04 -07:00
Stephen Rothwell caeda9b926 net: include inetdevice.h for rcu_dereference_raw api change
rcu_dereference_raw() now needs to know the type of its argument.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-16 21:39:16 -07:00
Brandon Philips 16c3ea785f net: enable GRO by default for vlan devices
Currently vlan devices don't have GRO by default as none of the Ethernet
drivers add NETIF_F_GRO to their vlan_features.

As GRO is a software feature add GRO to dev->vlan_features in
register_netdevice() and let vlan_dev_init() take care that it gets
enabled only when dev->features has NETIF_F_GRO too.

Signed-off-by: Brandon Philips <bphilips@suse.de>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-15 22:32:29 -07:00
Eric Dumazet 95ae6b228f ipv4: ip_ptr cleanups
dev->ip_ptr is protected by rtnl and rcu.

Yet some places dont use appropriate primitives and/or locking rules.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-15 22:06:05 -07:00
Ingo Molnar 3aabae7d9d Merge branch 'tip/perf/core' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into perf/core 2010-09-15 10:27:31 +02:00
Eric Dumazet ef885afbf8 net: use rcu_barrier() in rollback_registered_many
netdev_wait_allrefs() waits that all references to a device vanishes.

It currently uses a _very_ pessimistic 250 ms delay between each probe.
Some users reported that no more than 4 devices can be dismantled per
second, this is a pretty serious problem for some setups.

Most of the time, a refcount is about to be released by an RCU callback,
that is still in flight because rollback_registered_many() uses a
synchronize_rcu() call instead of rcu_barrier(). Problem is visible if
number of online cpus is one, because synchronize_rcu() is then a no op.

time to remove 50 ipip tunnels on a UP machine :

before patch : real 11.910s
after patch : real 1.250s

Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reported-by: Octavian Purdila <opurdila@ixiacom.com>
Reported-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-14 14:27:29 -07:00
David S. Miller e548833df8 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	net/mac80211/main.c
2010-09-09 22:27:33 -07:00
Changli Gao 6febfca98f net: rps: add the shortcut for one rps_cpus
When there is only one rps_cpus, skb_get_rxhash() can be eliminated.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-08 13:10:53 -07:00
Helmut Schaa deabc772f3 net: fix tx queue selection for bridged devices implementing select_queue
When a net device is implementing the select_queue callback and is part of
a bridge, frames coming from the bridge already have a tx queue associated
to the socket (introduced in commit a4ee3ce329,
"net: Use sk_tx_queue_mapping for connected sockets"). The call to
sk_tx_queue_get will then return the tx queue used by the bridge instead
of calling the select_queue callback.

In case of mac80211 this broke QoS which is implemented by using the
select_queue callback. Furthermore it introduced problems with rt2x00
because frames with the same TID and RA sometimes appeared on different
tx queues which the hw cannot handle correctly.

Fix this by always calling select_queue first if it is available and only
afterwards use the socket tx queue mapping.

Signed-off-by: Helmut Schaa <helmut.schaa@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-07 13:57:20 -07:00
Koki Sanagi 07dc22e729 skb: Add tracepoints to freeing skb
This patch adds tracepoint to consume_skb and add trace_kfree_skb
before __kfree_skb in skb_free_datagram_locked and net_tx_action.
Combinating with tracepoint on dev_hard_start_xmit, we can check
how long it takes to free transmitted packets. And using it, we can
calculate how many packets driver had at that time. It is useful when
a drop of transmitted packet is a problem.

            sshd-6828  [000] 112689.258154: consume_skb: skbaddr=f2d99bb8

Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Kaneshige Kenji <kaneshige.kenji@jp.fujitsu.com>
Cc: Izumo Taku <izumi.taku@jp.fujitsu.com>
Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Scott Mcmillan <scott.a.mcmillan@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
LKML-Reference: <4C724364.50903@jp.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2010-09-07 17:51:53 +02:00
Koki Sanagi cf66ba58b5 netdev: Add tracepoints to netdev layer
This patch adds tracepoint to dev_queue_xmit, dev_hard_start_xmit,
netif_rx and netif_receive_skb. These tracepoints help you to monitor
network driver's input/output.

          <idle>-0     [001] 112447.902030: netif_rx: dev=eth1 skbaddr=f3ef0900 len=84
          <idle>-0     [001] 112447.902039: netif_receive_skb: dev=eth1 skbaddr=f3ef0900 len=84
            sshd-6828  [000] 112447.903257: net_dev_queue: dev=eth4 skbaddr=f3fca538 len=226
            sshd-6828  [000] 112447.903260: net_dev_xmit: dev=eth4 skbaddr=f3fca538 len=226 rc=0

Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Kaneshige Kenji <kaneshige.kenji@jp.fujitsu.com>
Cc: Izumo Taku <izumi.taku@jp.fujitsu.com>
Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Scott Mcmillan <scott.a.mcmillan@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
LKML-Reference: <4C72431E.3000901@jp.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2010-09-07 17:51:33 +02:00
Eric Dumazet c07b68e841 net: dev_add_pack() & __dev_remove_pack() changes
Add a small helper ptype_head() to get the head to manipulate

dev_add_pack() & __dev_remove_pack() can use a spinlock without
blocking BH, since softirq use RCU, and these functions are run from
process context only.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-02 10:12:06 -07:00
Eric Dumazet 86cac58b71 skge: add GRO support
- napi_gro_flush() is exported from net/core/dev.c, to avoid
  an irq_save/irq_restore in the packet receive path.
- use napi_gro_receive() instead of netif_receive_skb()
- use napi_gro_flush() before calling __napi_complete()
- turn on NETIF_F_GRO by default
- Tested on a Marvell 88E8001 Gigabit NIC

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-01 10:57:55 -07:00
Eric Dumazet 40d0802b3e gro: __napi_gro_receive() optimizations
compare_ether_header() can have a special implementation on 64 bit
arches if CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is defined.

__napi_gro_receive() and vlan_gro_common() can avoid a conditional
branch to perform device match.

On x86_64, __napi_gro_receive() has now 38 instructions instead of 53

As gcc-4.4.3 still choose to not inline it, add inline keyword to this
performance critical function.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-26 22:03:08 -07:00
David S. Miller 21dc330157 net: Rename skb_has_frags to skb_has_frag_list
SKBs can be "fragmented" in two ways, via a page array (called
skb_shinfo(skb)->frags[]) and via a list of SKBs (called
skb_shinfo(skb)->frag_list).

Since skb_has_frags() tests the latter, it's name is confusing
since it sounds more like it's testing the former.

Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-23 00:13:46 -07:00
Changli Gao 05532121da net: 802.1q: make vlan_hwaccel_do_receive() return void
vlan_hwaccel_do_receive() always returns 0, so make it return void.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-22 21:03:33 -07:00
David S. Miller d3c6e7ad09 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 2010-08-21 23:32:24 -07:00
Changli Gao 1003489e06 net: rps: fix the wrong network header pointer
__skb_get_rxhash() was broken after the commit:

 commit bfb564e739
 Author: Krishna Kumar <krkumar2@in.ibm.com>
 Date:   Wed Aug 4 06:15:52 2010 +0000

 core: Factor out flow calculation from get_rps_cpu

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-21 22:54:49 -07:00
Changli Gao 12fcdefb36 net: rps: use proto_ports_offset() to handle the AH message correctly
The SPI isn't at the beginning of an AH message.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-19 17:16:23 -07:00
Changli Gao dbe5775bbc net: rps: skip fragment when computing rxhash
Fragmented IP packets may have no transfer header, so when computing
rxhash, we should skip them.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-19 17:10:38 -07:00
Changli Gao 2d47b45951 net: rps: reset network header before calling skb_get_rxhash()
skb_get_rxhash() assumes the network header pointer of the skb is set
properly after the commit:

commit bfb564e739
Author: Krishna Kumar <krkumar2@in.ibm.com>
Date:   Wed Aug 4 06:15:52 2010 +0000

    core: Factor out flow calculation from get_rps_cpu

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-19 17:08:37 -07:00
Oliver Hartkopp 2244d07bfa net: simplify flags for tx timestamping
This patch removes the abstraction introduced by the union skb_shared_tx in
the shared skb data.

The access of the different union elements at several places led to some
confusion about accessing the shared tx_flags e.g. in skb_orphan_try().

    http://marc.info/?l=linux-netdev&m=128084897415886&w=2

Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-19 00:08:30 -07:00
Jarek Poplawski e5093aec2e net: Fix a memmove bug in dev_gro_receive()
>Xin Xiaohui wrote:
> I looked into the code dev_gro_receive(), found the code here:
> if the frags[0] is pulled to 0, then the page will be released,
> and memmove() frags left.
> Is that right? I'm not sure if memmove do right or not, but
> frags[0].size is never set after memove at least. what I think
> a simple way is not to do anything if we found frags[0].size == 0.
> The patch is as followed.
...

This version of the patch fixes the bug directly in memmove.

Reported-by: "Xin, Xiaohui" <xiaohui.xin@intel.com>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-17 17:37:28 -07:00
Krishna Kumar bfb564e739 core: Factor out flow calculation from get_rps_cpu
Factor out flow calculation code from get_rps_cpu, since other
functions can use the same code.

Revisions:

v2 (Ben): Separate flow calcuation out and use in select queue.
v3 (Arnd): Don't re-implement MIN.
v4 (Changli): skb->data points to ethernet header in macvtap, and
	make a fast path. Tested macvtap with this patch.
v5 (Changli):
	- Cache skb->rxhash in skb_get_rxhash
	- macvtap may not have pow(2) queues, so change code for
	  queue selection.
    (Arnd):
	- Use first available queue if all fails.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-16 21:06:24 -07:00
Changli Gao cece1945bf net: disable preemption before call smp_processor_id()
Although netif_rx() isn't expected to be called in process context with
preemption enabled, it'd better handle this case. And this is why get_cpu()
is used in the non-RPS #ifdef branch. If tree RCU is selected,
rcu_read_lock() won't disable preemption, so preempt_disable() should be
called explictly.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-07 20:35:43 -07:00
Jarek Poplawski ce9e76c845 net: Fix napi_gro_frags vs netpoll path
The netpoll_rx_on() check in __napi_gro_receive() skips part of the
"common" GRO_NORMAL path, especially "pull:" in dev_gro_receive(),
where at least eth header should be copied for entirely paged skbs.

Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-05 13:21:25 -07:00
David S. Miller 3578b0c8ab Revert "net: remove zap_completion_queue"
This reverts commit 15e83ed788.

As explained by Johannes Berg, the optimization made here is
invalid.  Or, at best, incomplete.

Not only destructor invocation, but conntract entry releasing
must be executed outside of hw IRQ context.

So just checking "skb->destructor" is insufficient.

Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-03 00:24:04 -07:00
Changli Gao a427615e04 net: cleanup inclusion
Commit ab95bfe01f replaces bridge and macvlan
hooks in __netif_receive_skb(), so dev.c doesn't need to include their headers.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-02 22:45:49 -07:00
Stephen Hemminger de38483010 net: ingress filter message limit
If user misconfigures ingress and causes a redirection loop, don't
overwhelm the log.  This is also a error case so make it unlikely.
Found by inspection, luckily not in real system.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-01 00:33:23 -07:00
David S. Miller bb7e95c8fd Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/bnx2x_main.c

Merge bnx2x bug fixes in by hand... :-/

Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-27 21:01:35 -07:00
Ben Greear c736eefadb net: dev_forward_skb should call nf_reset
With conn-track zones and probably with different network
namespaces, the netfilter logic needs to be re-calculated
on packet receive.  If the netfilter logic is not reset,
it will not be recalculated properly.  This patch adds
the nf_reset logic to dev_forward_skb.

Signed-off-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-25 21:58:46 -07:00
David S. Miller 11fe883936 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/vhost/net.c
	net/bridge/br_device.c

Fix merge conflict in drivers/vhost/net.c with guidance from
Stephen Rothwell.

Revert the effects of net-2.6 commit 573201f36f
since net-next-2.6 has fixes that make bridge netpoll work properly thus
we don't need it disabled.

Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-20 18:25:24 -07:00
Eric Dumazet bd27290a59 net: 64bit stats for netdev_queue
Since struct netdev_queue tx_bytes/tx_packets/tx_dropped are already
protected by _xmit_lock, its easy to convert these fields to u64 instead
of unsigned long.
This completes 64bit stats for devices using them (vlan, macvlan, ...)

Strictly, we could avoid the locking in dev_txq_stats_fold() on 64bit
arches, but its slow path and we prefer keep it simple.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-19 09:35:40 -07:00
Richard Cochran c1f19b51d1 net: support time stamping in phy devices.
This patch adds a new networking option to allow hardware time stamps
from PHY devices. When enabled, likely candidates among incoming and
outgoing network packets are offered to the PHY driver for possible
time stamping. When accepted by the PHY driver, incoming packets are
deferred for later delivery by the driver.

The patch also adds phylib driver methods for the SIOCSHWTSTAMP ioctl
and callbacks for transmit and receive time stamping. Drivers may
optionally implement these functions.

Signed-off-by: Richard Cochran <richard.cochran@omicron.at>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-18 19:15:26 -07:00
Tom Herbert b0f77d0eae net: fix problem in reading sock TX queue
Fix problem in reading the tx_queue recorded in a socket.  In
dev_pick_tx, the TX queue is read by doing a check with
sk_tx_queue_recorded on the socket, followed by a sk_tx_queue_get.
The problem is that there is not mutual exclusion across these
calls in the socket so it it is possible that the queue in the
sock can be invalidated after sk_tx_queue_recorded is called so
that sk_tx_queue get returns -1, which sets 65535 in queue_index
and thus dev_pick_tx returns 65536 which is a bogus queue and
can cause crash in dev_queue_xmit.

We fix this by only calling sk_tx_queue_get which does the proper
checks.  The interface is that sk_tx_queue_get returns the TX queue
if the sock argument is non-NULL and TX queue is recorded, else it
returns -1.  sk_tx_queue_recorded is no longer used so it can be
completely removed.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-14 20:50:29 -07:00
Eric Dumazet 87fd308cfc net: skb_tx_hash() fix relative to skb_orphan_try()
commit fc6055a5ba (net: Introduce skb_orphan_try()) added early
orphaning of skbs.

This unfortunately added a performance regression in skb_tx_hash() in
case of stacked devices (bonding, vlans, ...)

Since skb->sk is now NULL, we cannot access sk->sk_hash anymore to
spread tx packets to multiple NIC queues on multiqueue devices.

skb_tx_hash() in this case only uses skb->protocol, same value for all
flows.

skb_orphan_try() can copy sk->sk_hash into skb->rxhash and skb_tx_hash()
can use this saved sk_hash value to compute its internal hash value.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-14 15:33:27 -07:00
Ben Hutchings d77535162e net: Document that dev_get_stats() returns the given pointer
Document that dev_get_stats() returns the same stats pointer it was
given.  Remove const qualification from the returned pointer since the
caller may do what it likes with that structure.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-09 17:41:57 -07:00
Ben Hutchings 3cfde79c6c net: Get rid of rtnl_link_stats64 / net_device_stats union
In commit be1f3c2c02 "net: Enable 64-bit
net device statistics on 32-bit architectures" I redefined struct
net_device_stats so that it could be used in a union with struct
rtnl_link_stats64, avoiding the need for explicit copying or
conversion between the two.  However, this is unsafe because there is
no locking required and no lock consistently held around calls to
dev_get_stats() and use of the statistics structure it returns.

In commit 28172739f0 "net: fix 64 bit
counters on 32 bit arches" Eric Dumazet dealt with that problem by
requiring callers of dev_get_stats() to provide storage for the
result.  This means that the net_device::stats64 field and the padding
in struct net_device_stats are now redundant, so remove them.

Update the comment on net_device_ops::ndo_get_stats64 to reflect its
new usage.

Change dev_txq_stats_fold() to use struct rtnl_link_stats64, since
that is what all its callers are really using and it is no longer
going to be compatible with struct net_device_stats.

Eric Dumazet suggested the separate function for the structure
conversion.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-09 17:41:56 -07:00
David S. Miller 597e608a84 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 2010-07-07 15:59:38 -07:00
Eric Dumazet 28172739f0 net: fix 64 bit counters on 32 bit arches
There is a small possibility that a reader gets incorrect values on 32
bit arches. SNMP applications could catch incorrect counters when a
32bit high part is changed by another stats consumer/provider.

One way to solve this is to add a rtnl_link_stats64 param to all
ndo_get_stats64() methods, and also add such a parameter to
dev_get_stats().

Rule is that we are not allowed to use dev->stats64 as a temporary
storage for 64bit stats, but a caller provided area (usually on stack)

Old drivers (only providing get_stats() method) need no changes.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-07 14:58:56 -07:00