* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (34 commits)
virtio_net: Add schedule check to napi_enable call
x25: Do not reference freed memory.
pch_can: fix tseg1/tseg2 setting issue
isdn: hysdn: Kill (partially buggy) CVS regision log reporting.
can: softing_cs needs slab.h
pch_gbe: Fix the issue which a driver locks when rx offload is set by ethtool
netfilter: nf_conntrack: set conntrack templates again if we return NF_REPEAT
pch_can: fix module reload issue with MSI
pch_can: fix rmmod issue
pch_can: fix 800k comms issue
net: Fix lockdep regression caused by initializing netdev queues too early.
net/caif: Fix dangling list pointer in freed object on error.
USB CDC NCM errata updates for cdc_ncm host driver
CDC NCM errata updates for cdc.h
ixgbe: update version string
ixgbe: cleanup variable initialization
ixgbe: limit VF access to network traffic
ixgbe: fix for 82599 erratum on Header Splitting
ixgbe: fix variable set but not used warnings by gcc 4.6
e1000: add support for Marvell Alaska M88E1118R PHY
...
In x25_link_free(), we destroy 'nb' before dereferencing
'nb->dev'. Don't do this, because 'nb' might be freed
by then.
Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
Tested-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The TCP tracking code has a special case that allows to return
NF_REPEAT if we receive a new SYN packet while in TIME_WAIT state.
In this situation, the TCP tracking code destroys the existing
conntrack to start a new clean session.
[DESTROY] tcp 6 src=192.168.0.2 dst=192.168.1.2 sport=38925 dport=8000 src=192.168.1.2 dst=192.168.1.100 sport=8000 dport=38925 [ASSURED]
[NEW] tcp 6 120 SYN_SENT src=192.168.0.2 dst=192.168.1.2 sport=38925 dport=8000 [UNREPLIED] src=192.168.1.2 dst=192.168.1.100 sport=8000 dport=38925
However, this is a problem for the iptables' CT target event filtering
which will not work in this case since the conntrack template will not
be there for the new session. To fix this, we reassign the conntrack
template to the packet if we return NF_REPEAT.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
In commit aa94210411 ("net: init ingress
queue") we moved the allocation and lock initialization of the queues
into alloc_netdev_mq() since register_netdevice() is way too late.
The problem is that dev->type is not setup until the setup()
callback is invoked by alloc_netdev_mq(), and the dev->type is
what determines the lockdep class to use for the locks in the
queues.
Fix this by doing the queue allocation after the setup() callback
runs.
This is safe because the setup() callback is not allowed to make any
state changes that need to be undone on error (memory allocations,
etc.). It may, however, make state changes that are undone by
free_netdev() (such as netif_napi_add(), which is done by the
ipoib driver's setup routine).
The previous code also leaked a reference to the &init_net namespace
object on RX/TX queue allocation failures.
Signed-off-by: David S. Miller <davem@davemloft.net>
rtnl_link_ops->setup(), and the "setup" callback passed to alloc_netdev*(),
cannot make state changes which need to be undone on failure. There is
no cleanup mechanism available at this point.
So we have to add the caif private instance to the global list once we
are sure that register_netdev() has succedded in ->newlink().
Otherwise, if register_netdev() fails, the caller will invoke free_netdev()
and we will have a reference to freed up memory on the chnl_net_list.
Signed-off-by: David S. Miller <davem@davemloft.net>
We access the data inside the skbs of two fragments directly using memmove
during the merge. The data of the skb could span over multiple skb pages. An
direct access without knowledge about the pages would lead to an invalid memory
access.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
[lindner_marek@yahoo.de: Move return from function to the end]
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
Originally x25_parse_facilities returned
-1 for an error
0 meaning 0 length facilities
>0 the length of the facilities parsed.
5ef41308f9 ("x25: Prevent crashing when parsing bad X.25 facilities") introduced more
error checking in x25_parse_facilities however used 0 to indicate bad parsing
a6331d6f9a ("memory corruption in X.25 facilities parsing") followed this further for
DTE facilities, again using 0 for bad parsing.
The meaning of 0 got confused in the callers.
If the facilities are messed up we can't determine where the data starts.
So patch makes all parsing errors return -1 and ensures callers close and don't use the skb further.
Reported-by: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Andrew Hendry <andrew.hendry@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using skb_header_cloned to check if it's safe to write to the skb is not
enough - mac80211 also touches the tailroom of the skb.
Initially this check was only used to increase a counter, however this
commit changed the code to also skip skb data reallocation if no extra
head/tailroom was needed:
commit 4cd06a344d
mac80211: skip unnecessary pskb_expand_head calls
It added a regression at least with iwl3945, which is fixed by this patch.
Reported-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Tested-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (68 commits)
net: can: janz-ican3: world-writable sysfs termination file
net: can: at91_can: world-writable sysfs files
MAINTAINERS: update email ids of the be2net driver maintainers.
bridge: Don't put partly initialized fdb into hash
r8169: prevent RxFIFO induced loops in the irq handler.
r8169: RxFIFO overflow oddities with 8168 chipsets.
r8169: use RxFIFO overflow workaround for 8168c chipset.
include/net/genetlink.h: Allow genlmsg_cancel to accept a NULL argument
net: Provide compat support for SIOCGETMIFCNT_IN6 and SIOCGETSGCNT_IN6.
net: Support compat SIOCGETVIFCNT ioctl in ipv4.
net: Fix bug in compat SIOCGETSGCNT handling.
niu: Fix races between up/down and get_stats.
tcp_ecn is an integer not a boolean
atl1c: Add missing PCI device ID
s390: Fix possibly wrong size in strncmp (smsgiucv)
s390: Fix wrong size in memcmp (netiucv)
qeth: allow OSA CHPARM change in suspend state
qeth: allow HiperSockets framesize change in suspend
qeth: add more strict MTU checking
qeth: show new mac-address if its setting fails
...
The fdb_create() puts a new fdb into hash with only addr set. This is
not good, since there are callers, that search the hash w/o the lock
and access all the other its fields.
Applies to current netdev tree.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 709b46e8d9 ("net: Add compat
ioctl support for the ipv4 multicast ioctl SIOCGETSGCNT") added the
correct plumbing to handle SIOCGETSGCNT properly.
However, whilst definiting a proper "struct compat_sioc_sg_req" it
isn't actually used in ipmr_compat_ioctl().
Correct this oversight.
Signed-off-by: David S. Miller <davem@davemloft.net>
Like Herbert's change from a few days ago:
66c46d741e gro: Reset dev pointer on reuse
this may not be necessary at this point, but we should still clean up
the skb->skb_iif. If not we may end up with an invalid valid for
skb->skb_iif when the skb is reused and the check is done in
__netif_receive_skb.
Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the off-channel TX is done with remain-on-channel
offloaded to hardware, the reported cookie is wrong as
in that case we shouldn't use the SKB as the cookie but
need to instead use the corresponding r-o-c cookie
(XOR'ed with 2 to prevent API mismatches).
Fix this by keeping track of the hw_roc_skb pointer
just for the status processing and use the correct
cookie to report in this case. We can't use the
hw_roc_skb pointer itself because it is NULL'ed when
the frame is transmitted to prevent it being used
twice.
This fixes a bug where the P2P state machine in the
supplicant gets stuck because it never gets a correct
result for its transmitted frame.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
For the following rule:
iptables -I PREROUTING -t raw -j CT --ctevents assured
The event delivered looks like the following:
[UPDATE] tcp 6 src=192.168.0.2 dst=192.168.1.2 sport=37041 dport=80 src=192.168.1.2 dst=192.168.1.100 sport=80 dport=37041 [ASSURED]
Note that the TCP protocol state is not included. For that reason
the CT event filtering is not very useful for conntrackd.
To resolve this issue, instead of conditionally setting the CT events
bits based on the ctmask, we always set them and perform the filtering
in the late stage, just before the delivery.
Thus, the event delivered looks like the following:
[UPDATE] tcp 6 432000 ESTABLISHED src=192.168.0.2 dst=192.168.1.2 sport=37041 dport=80 src=192.168.1.2 dst=192.168.1.100 sport=80 dport=37041 [ASSURED]
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
In 135367b "netfilter: xtables: change xt_target.checkentry return type",
the type returned by checkentry was changed from boolean to int, but the
return values where not adjusted.
arptables: Input/output error
This broke arptables with the mangle target since it returns true
under success, which is interpreted by xtables as >0, thus
returning EIO.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
In my testing of 2.6.37 I was occassionally getting a warning about
sysctl table entries being unregistered in the wrong order. Digging
in it turns out this dates back to the last great sysctl reorg done
where Al Viro introduced the requirement that sysctl directories
needed to be created before and destroyed after the files in them.
It turns out that in that great reorg /proc/sys/net/ipv6/neigh was
overlooked. So this patch fixes that oversight and makes an annoying
warning message go away.
>------------[ cut here ]------------
>WARNING: at kernel/sysctl.c:1992 unregister_sysctl_table+0x134/0x164()
>Pid: 23951, comm: kworker/u:3 Not tainted 2.6.37-350888.2010AroraKernelBeta.fc14.x86_64 #1
>Call Trace:
> [<ffffffff8103e034>] warn_slowpath_common+0x80/0x98
> [<ffffffff8103e061>] warn_slowpath_null+0x15/0x17
> [<ffffffff810452f8>] unregister_sysctl_table+0x134/0x164
> [<ffffffff810e7834>] ? kfree+0xc4/0xd1
> [<ffffffff813439b2>] neigh_sysctl_unregister+0x22/0x3a
> [<ffffffffa02cd14e>] addrconf_ifdown+0x33f/0x37b [ipv6]
> [<ffffffff81331ec2>] ? skb_dequeue+0x5f/0x6b
> [<ffffffffa02ce4a5>] addrconf_notify+0x69b/0x75c [ipv6]
> [<ffffffffa02eb953>] ? ip6mr_device_event+0x98/0xa9 [ipv6]
> [<ffffffff813d2413>] notifier_call_chain+0x32/0x5e
> [<ffffffff8105bdea>] raw_notifier_call_chain+0xf/0x11
> [<ffffffff8133cdac>] call_netdevice_notifiers+0x45/0x4a
> [<ffffffff8133d2b0>] rollback_registered_many+0x118/0x201
> [<ffffffff8133d3af>] unregister_netdevice_many+0x16/0x6d
> [<ffffffff8133d571>] default_device_exit_batch+0xa4/0xb8
> [<ffffffff81337c42>] ? cleanup_net+0x0/0x194
> [<ffffffff81337a2a>] ops_exit_list+0x4e/0x56
> [<ffffffff81337d36>] cleanup_net+0xf4/0x194
> [<ffffffff81053318>] process_one_work+0x187/0x280
> [<ffffffff8105441b>] worker_thread+0xff/0x19f
> [<ffffffff8105431c>] ? worker_thread+0x0/0x19f
> [<ffffffff8105776d>] kthread+0x7d/0x85
> [<ffffffff81003824>] kernel_thread_helper+0x4/0x10
> [<ffffffff810576f0>] ? kthread+0x0/0x85
> [<ffffffff81003820>] ? kernel_thread_helper+0x0/0x10
>---[ end trace 8a7e9310b35e9486 ]---
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In get_rps_cpu, add check that the rps_flow_table for the device is
NULL when trying to take fast path when RPS map length is one.
Without this, RFS is effectively disabled if map length is one which
is not correct.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When an IPSEC SA is still being set up, __xfrm_lookup() will return
-EREMOTE and so ip_route_output_flow() will return a blackhole route.
This can happen in a sndmsg call, and after d33e455337 ("net: Abstract
default MTU metric calculation behind an accessor.") this leads to a
crash in ip_append_data() because the blackhole dst_ops have no
default_mtu() method and so dst_mtu() calls a NULL pointer.
Fix this by adding default_mtu() methods (that simply return 0, matching
the old behavior) to the blackhole dst_ops.
The IPv4 part of this patch fixes a crash that I saw when using an IPSEC
VPN; the IPv6 part is untested because I don't have an IPv6 VPN, but it
looks to be needed as well.
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The batman-adv vis server has to a stack which stores all information
about packets which should be send later. This stack is protected
with a spinlock that is used to prevent concurrent write access to it.
The send_vis_packets function has to take all elements from the stack
and send them to other hosts over the primary interface. The send will
be initiated without the lock which protects the stack.
The implementation using list_for_each_entry_safe has the problem that
it stores the next element as "safe ptr" to allow the deletion of the
current element in the list. The list may be modified during the
unlock/lock pair in the loop body which may make the safe pointer
not pointing to correct next element.
It is safer to remove and use the first element from the stack until no
elements are available. This does not need reduntant information which
would have to be validated each time the lock was removed.
Reported-by: Russell Senior <russell@personaltelco.net>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
The free_info function will be called when no reference to the info
object exists anymore. It must be ensured that the allocated memory
gets freed and not only the elements which are managed by the info
object.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
A newly created vis info object must be removed when it couldn't be
added to the hash. The old_info which has to be replaced was already
removed and isn't related to the hash anymore.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
SIOCGETSGCNT is not a unique ioctl value as it it maps tio SIOCPROTOPRIVATE +1,
which unfortunately means the existing infrastructure for compat networking
ioctls is insufficient. A trivial compact ioctl implementation would conflict
with:
SIOCAX25ADDUID
SIOCAIPXPRISLT
SIOCGETSGCNT_IN6
SIOCGETSGCNT
SIOCRSSCAUSE
SIOCX25SSUBSCRIP
SIOCX25SDTEFACILITIES
To make this work I have updated the compat_ioctl decode path to mirror the
the normal ioctl decode path. I have added an ipv4 inet_compat_ioctl function
so that I can have ipv4 specific compat ioctls. I have added a compat_ioctl
function into struct proto so I can break out ioctls by which kind of ip socket
I am using. I have added a compat_raw_ioctl function because SIOCGETSGCNT only
works on raw sockets. I have added a ipmr_compat_ioctl that mirrors the normal
ipmr_ioctl.
This was necessary because unfortunately the struct layout for the SIOCGETSGCNT
has unsigned longs in it so changes between 32bit and 64bit kernels.
This change was sufficient to run a 32bit ip multicast routing daemon on a
64bit kernel.
Reported-by: Bill Fenner <fenner@aristanetworks.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On older kernels the VLAN code may zero skb->dev before dropping
it and causing it to be reused by GRO.
Unfortunately we didn't reset skb->dev in that case which causes
the next GRO user to get a bogus skb->dev pointer.
This particular problem no longer happens with the current upstream
kernel due to changes in VLAN processing.
However, for correctness we should still reset the skb->dev pointer
in the GRO reuse function in case a future user does the same thing.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
They are bogus. The basic idea is that I wanted to make sure
that prefixed routes never bind to peers.
The test I used was whether RTF_CACHE was set.
But first of all, the RTF_CACHE flag is set at different spots
depending upon which ip6_rt_copy() caller you're talking about.
I've validated all of the code paths, and even in the future
where we bind peers more aggressively (for route metric COW'ing)
we never bind to prefix'd routes, only fully specified ones.
This even applies when addrconf or icmp6 routes are allocated.
Signed-off-by: David S. Miller <davem@davemloft.net>
pskb_expand_head() triggers a kmemcheck warning when copy of
skb_shared_info is done in pskb_expand_head()
This is because destructor_arg field is not necessarily initialized at
this point. Add kmemcheck_annotate_variable() call in __alloc_skb() to
instruct kmemcheck this is a normal situation.
Resolves bugzilla.kernel.org 27212
Reference: https://bugzilla.kernel.org/show_bug.cgi?id=27212
Reported-by: Christian Casteyde <casteyde.christian@free.fr>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
I'm testing an API that uses IFLA_AF_SPEC attribute.
In the rtnetlink core , the set_link_af() member
of the rtnl_af_ops struct receives the nested attribute
(as I expected), but the validate_link_af() member
receives the parent attribute.
IMO, this patch fixes this.
Signed-off-by: Kurt Van Dijck <kurt.van.dijck@eia.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/econet/af_econet.c: In function ‘econet_sendmsg’:
net/econet/af_econet.c:494: warning: label ‘error’ defined but not used
net/econet/af_econet.c:268: warning: unused variable ‘sk’
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Phil Blundell <philb@gnu.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Like ipv4, we have to propagate the ipv6 route peer into
the ipsec top-level route during instantiation.
Signed-off-by: David S. Miller <davem@davemloft.net>
The hash_iterate removal introduced a bug leading to a kernel panic when
fetching the vis data on a vis server. That commit forgot to rename one
variable name, which this commit fixes now.
Reported-by: Russell Senior <russell@personaltelco.net>
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
This patch fixes a bug that causes TCP RST packets to be generated
on otherwise correctly behaved applications, e.g., no unread data
on close,..., etc. To trigger the bug, at least two conditions must
be met:
1. The FIN flag is set on the last data packet, i.e., it's not on a
separate, FIN only packet.
2. The size of the last data chunk on the receive side matches
exactly with the size of buffer posted by the receiver, and the
receiver closes the socket without any further read attempt.
This bug was first noticed on our netperf based testbed for our IW10
proposal to IETF where a large number of RST packets were observed.
netperf's read side code meets the condition 2 above 100%.
Before the fix, tcp_data_queue() will queue the last skb that meets
condition 1 to sk_receive_queue even though it has fully copied out
(skb_copy_datagram_iovec()) the data. Then if condition 2 is also met,
tcp_recvmsg() often returns all the copied out data successfully
without actually consuming the skb, due to a check
"if ((chunk = len - tp->ucopy.len) != 0) {"
and
"len -= chunk;"
after tcp_prequeue_process() that causes "len" to become 0 and an
early exit from the big while loop.
I don't see any reason not to free the skb whose data have been fully
consumed in tcp_data_queue(), regardless of the FIN flag. We won't
get there if MSG_PEEK is on. Am I missing some arcane cases related
to urgent data?
Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some drivers (e.g. ath9k) do not always disable beacons when they're
supposed to. When an interface is changed using the change_interface op,
the mode specific sdata part is in an undefined state and trying to
get a beacon at this point can produce weird crashes.
To fix this, add a check for ieee80211_sdata_running before using
anything from the sdata.
Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Cc: stable@kernel.org
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This reverts the following set of commits:
d1ed113f16 ("ipv6: remove duplicate neigh_ifdown")
29ba5fed1b ("ipv6: don't flush routes when setting loopback down")
9d82ca98f7 ("ipv6: fix missing in6_ifa_put in addrconf")
2de7957072 ("ipv6: addrconf: don't remove address state on ifdown if the address is being kept")
8595805aaf ("IPv6: only notify protocols if address is compeletely gone")
27bdb2abcc ("IPv6: keep tentative addresses in hash table")
93fa159abe ("IPv6: keep route for tentative address")
8f37ada5b5 ("IPv6: fix race between cleanup and add/delete address")
84e8b803f1 ("IPv6: addrconf notify when address is unavailable")
dc2b99f71e ("IPv6: keep permanent addresses on admin down")
because the core semantic change to ipv6 address handling on ifdown
has broken some things, in particular "disable_ipv6" sysctl handling.
Stephen has made several attempts to get things back in working order,
but nothing has restored disable_ipv6 fully yet.
Reported-by: Eric W. Biederman <ebiederm@xmission.com>
Tested-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The information required to find the nfs_client cooresponding to the incoming
back channel request is contained in the NFS layer. Perform minimal checking
in the RPC layer pg_authenticate method, and push more detailed checking into
the NFS layer where the nfs_client can be found.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
There is a conflict between commit b00916b1 and a77f5db3. This patch resolves
the conflict by clearing the heap allocation in ethtool_get_regs().
Cc: stable@kernel.org
Signed-off-by: Eugene Teo <eugeneteo@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Do not handle PMTU vs. route lookup creation any differently
wrt. offlink routes, always clone them.
Reported-by: PK <runningdoglackey@yahoo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The IEEE get/set app handlers use generic routines and do not
require the net_device to implement the dcbnl_ops routines. This
patch makes it symmetric so user space and drivers do not have
to handle the CEE version and IEEE DCBx versions differently.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit a8b690f98b (tcp: Fix slowness in read /proc/net/tcp)
introduced a bug in handling of SYN_RECV sockets.
st->offset represents number of sockets found since beginning of
listening_hash[st->bucket].
We should not reset st->offset when iterating through
syn_table[st->sbucket], or else if more than ~25 sockets (if
PAGE_SIZE=4096) are in SYN_RECV state, we exit from listening_get_next()
with a too small st->offset
Next time we enter tcp_seek_last_pos(), we are not able to seek past
already found sockets.
Reported-by: PK <runningdoglackey@yahoo.com>
CC: Tom Herbert <therbert@google.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Family was hard-coded to AF_INET but should be daddr->family.
This fixes crashes when unlinking ipv6 peer entries, since the
unlink code was looking up the base pointer properly.
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Suppose that several linear skbs of the same flow were received by GRO. They
were thus merged into one skb with a frag_list. Then a new skb of the same flow
arrives, but it is a paged skb with data starting in its frags[].
Before adding the skb to the frag_list skb_gro_receive() will of course adjust
the skb to throw away the headers. It correctly modifies the page_offset and
size of the frag, but it leaves incorrect information in the skb:
->data_len is not decreased at all.
->len is decreased only by headlen, as if no change were done to the frag.
Later in a receiving process this causes skb_copy_datagram_iovec() to return
-EFAULT and this is seen in userspace as the result of the recv() syscall.
In practice the bug can be reproduced with the sfc driver. By default the
driver uses an adaptive scheme when it switches between using
napi_gro_receive() (with skbs) and napi_gro_frags() (with pages). The bug is
reproduced when under rx load with enough successful GRO merging the driver
decides to switch from the former to the latter.
Manual control is also possible, so reproducing this is easy with netcat:
- on machine1 (with sfc): nc -l 12345 > /dev/null
- on machine2: nc machine1 12345 < /dev/zero
- on machine1:
echo 1 > /sys/module/sfc/parameters/rx_alloc_method # use skbs
echo 2 > /sys/module/sfc/parameters/rx_alloc_method # use pages
- See that nc has quit suddenly.
[v2: Modified by Eric Dumazet to avoid advancing skb->data past the end
and to use a temporary variable.]
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>