The fragments sysctls also contains some, that are to be
visible, but read-only in net namespaces.
The naming in net/core/sysctl_net_core.c is - tables, that are
to be registered in namespaces have a "ns" word in their names.
So rename ones in ipv4/ip_fragment.c and ipv6/reassembly.c to
fit this.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (73 commits)
net: Fix typo in net/core/sock.c.
ppp: Do not free not yet unregistered net device.
netfilter: xt_iprange: module aliases for xt_iprange
netfilter: ctnetlink: dump conntrack ID in event messages
irda: Fix a misalign access issue. (v2)
sctp: Fix use of uninitialized pointer
cipso: Relax too much careful cipso hash function.
tcp FRTO: work-around inorder receivers
tcp FRTO: Fix fallback to conventional recovery
New maintainer for Intel ethernet adapters
DM9000: Use delayed work to update MII PHY state
DM9000: Update and fix driver debugging messages
DM9000: Add __devinit and __devexit attributes to probe and remove
sky2: fix simple define thinko
[netdrvr] sfc: sfc: Add self-test support
[netdrvr] sfc: Increment rx_reset when reported as driver event
[netdrvr] sfc: Remove unused macro EFX_XAUI_RETRAIN_MAX
[netdrvr] sfc: Fix code formatting
[netdrvr] sfc: Remove kernel-doc comments for removed members of struct efx_nic
[netdrvr] sfc: Remove garbage from comment
...
The cipso_v4_cache is allocated to contain CIPSO_V4_CACHE_BUCKETS
buckets. The CIPSO_V4_CACHE_BUCKETS = 1 << CIPSO_V4_CACHE_BUCKETBITS,
where CIPSO_V4_CACHE_BUCKETBITS = 7.
The bucket-selection function for this hash is calculated like this:
bkt = hash & (CIPSO_V4_CACHE_BUCKETBITS - 1);
^^^
i.e. picking only 4 buckets of possible 128 :)
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Paul Moore <paul.moore@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If receiver consumes segments successfully only in-order, FRTO
fallback to conventional recovery produces RTO loop because
FRTO's forward transmissions will always get dropped and need to
be resent, yet by default they're not marked as lost (which are
the only segments we will retransmit in CA_Loss).
Price to pay about this is occassionally unnecessarily
retransmitting the forward transmission(s). SACK blocks help
a bit to avoid this, so it's mainly a concern for NewReno case
though SACK is not fully immune either.
This change has a side-effect of fixing SACKFRTO problem where
it didn't have snd_nxt of the RTO time available anymore when
fallback become necessary (this problem would have only occured
when RTO would occur for two or more segments and ECE arrives
in step 3; no need to figure out how to fix that unless the
TODO item of selective behavior is considered in future).
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Reported-by: Damon L. Chesser <damon@damtek.com>
Tested-by: Damon L. Chesser <damon@damtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It seems that commit 009a2e3e4e ("[TCP] FRTO: Improve
interoperability with other undo_marker users") run into
another land-mine which caused fallback to conventional
recovery to break:
1. Cumulative ACK arrives after FRTO retransmission
2. tcp_try_to_open sees zero retrans_out, clears retrans_stamp
which should be kept like in CA_Loss state it would be
3. undo_marker change allowed tcp_packet_delayed to return
true because of the cleared retrans_stamp once FRTO is
terminated causing LossUndo to occur, which means all loss
markings FRTO made are reverted.
This means that the conventional recovery basically recovered
one loss per RTT, which is not that efficient. It was quite
unobvious that the undo_marker change broken something like
this, I had a quite long session to track it down because of
the non-intuitiviness of the bug (luckily I had a trivial
reproducer at hand and I was also able to learn to use kprobes
in the process as well :-)).
This together with the NewReno+FRTO fix and FRTO in-order
workaround this fixes Damon's problems, this and the first
mentioned are enough to fix Bugzilla #10063.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Reported-by: Damon L. Chesser <damon@damtek.com>
Tested-by: Damon L. Chesser <damon@damtek.com>
Tested-by: Sebastian Hyrwall <zibbe@cisko.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds needed_headroom/needed_tailroom members to struct
net_device and updates many places that allocate sbks to use them. Not
all of them can be converted though, and I'm sure I missed some (I
mostly grepped for LL_RESERVED_SPACE)
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (32 commits)
net: Added ASSERT_RTNL() to dev_open() and dev_close().
can: Fix can_send() handling on dev_queue_xmit() failures
netns: Fix arbitrary net_device-s corruptions on net_ns stop.
netfilter: Kconfig: default DCCP/SCTP conntrack support to the protocol config values
netfilter: nf_conntrack_sip: restrict RTP expect flushing on error to last request
macvlan: Fix memleak on device removal/crash on module removal
net/ipv4: correct RFC 1122 section reference in comment
tcp FRTO: SACK variant is errorneously used with NewReno
e1000e: don't return half-read eeprom on error
ucc_geth: Don't use RX clock as TX clock.
cxgb3: Use CAP_SYS_RAWIO for firmware
pcnet32: delete non NAPI code from driver.
fs_enet: Fix a memory leak in fs_enet_mdio_probe
[netdrvr] eexpress: IPv6 fails - multicast problems
3c59x: use netstats in net_device structure
3c980-TX needs EXTRA_PREAMBLE
fix warning in drivers/net/appletalk/cops.c
e1000e: Add support for BM PHYs on ICH9
uli526x: fix endianness issues in the setup frame
uli526x: initialize the hardware prior to requesting interrupts
...
RFC 1122 does not have a section 3.1.2.2. The requirement to silently
discard datagrams with a bad checksum is in section 3.2.1.2 instead.
Addresses http://bugzilla.kernel.org/show_bug.cgi?id=10611
Signed-off-by: J.H.M. Dassen (Ray) <jdassen@debian.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Note: there's actually another bug in FRTO's SACK variant, which
is the causing failure in NewReno case because of the error
that's fixed here. I'll fix the SACK case separately (it's
a separate bug really, though related, but in order to fix that
I need to audit tp->snd_nxt usage a bit).
There were two places where SACK variant of FRTO is getting
incorrectly used even if SACK wasn't negotiated by the TCP flow.
This leads to incorrect setting of frto_highmark with NewReno
if a previous recovery was interrupted by another RTO.
An eventual fallback to conventional recovery then incorrectly
considers one or couple of segments as forward transmissions
though they weren't, which then are not LOST marked during
fallback making them "non-retransmittable" until the next RTO.
In a bad case, those segments are really lost and are the only
one left in the window. Thus TCP needs another RTO to continue.
The next FRTO, however, could again repeat the same events
making the progress of the TCP flow extremely slow.
In order for these events to occur at all, FRTO must occur
again in FRTOs step 3 while the key segments must be lost as
well, which is not too likely in practice. It seems to most
frequently with some small devices such as network printers
that *seem* to accept TCP segments only in-order. In cases
were key segments weren't lost, things get automatically
resolved because those wrongly marked segments don't need to be
retransmitted in order to continue.
I found a reproducer after digging up relevant reports (few
reports in total, none at netdev or lkml I know of), some
cases seemed to indicate middlebox issues which seems now
to be a false assumption some people had made. Bugzilla
#10063 _might_ be related. Damon L. Chesser <damon@damtek.com>
had a reproducable case and was kind enough to tcpdump it
for me. With the tcpdump log it was quite trivial to figure
out.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
net_cls_act: act_simple dont ignore realloc code
iwlwifi: make IWLWIFI a tristate
Revert "atm: Do not free already unregistered net device."
dccp: return -EINVAL on invalid feature length
irda: fix !PNP support for drivers/net/irda/smsc-ircc2.c
irda: fix !PNP support in drivers/net/irda/nsc-ircc.c
net_cls_act: Make act_simple use of netlink policy.
ip: Use inline function dst_metric() instead of direct access to dst->metric[]
ip: Make use of the inline function dst_metric_locked()
atm: Bad locking on br2684_devs modifications.
atm: Do not free already unregistered net device.
mac80211: Do not free net device after it is unregistered.
bridge: Consolidate error paths in br_add_bridge().
bridge: Net device leak in br_add_bridge().
niu: Fix probing regression for maramba on-board chips.
lapbeth: Release ->ethdev when unregistering device.
xfrm: convert empty xfrm_audit_* macros to functions
net: Fix useless comment reference loop.
sch_htb: remove from event queue in htb_parent_to_leaf()
There are functions to refer to the value of dst->metric[THE_METRIC-1]
directly without use of a inline function "dst_metric" defined in
net/dst.h.
The following patch changes them to use the inline function
consistently.
Signed-off-by: Satoru SATOH <satoru.satoh@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (47 commits)
rose: Wrong list_lock argument in rose_node seqops
netns: Fix reassembly timer to use the right namespace
netns: Fix device renaming for sysfs
bnx2: Update version to 1.7.5.
bnx2: Update RV2P firmware for 5709.
bnx2: Zero out context memory for 5709.
bnx2: Fix register test on 5709.
bnx2: Fix remote PHY initial link state.
bnx2: Refine remote PHY locking.
bridge: forwarding table information for >256 devices
tg3: Update version to 3.92
tg3: Add link state reporting to UMP firmware
tg3: Fix ethtool loopback test for 5761 BX devices
tg3: Fix 5761 NVRAM sizes
tg3: Use constant 500KHz MI clock on adapters with a CPMU
hci_usb.h: fix hard-to-trigger race
dccp: ccid2.c, ccid3.c use clamp(), clamp_t()
net: remove NR_CPUS arrays in net/core/dev.c
net: use get/put_unaligned_* helpers
bluetooth: use get/put_unaligned_* helpers
...
The check for PDE->data != NULL becomes useless after the replacement
of proc_net_fops_create with proc_create_data.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simply replace proc_create and further data assigned with proc_create_data.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename div64_64 to div64_u64 to make it consistent with the other divide
functions, so it clearly includes the type of the divide. Move its definition
to math64.h as currently no architecture overrides the generic implementation.
They can still override it of course, but the duplicated declarations are
avoided.
Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: Avi Kivity <avi@qumranet.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (53 commits)
tcp: Overflow bug in Vegas
[IPv4] UFO: prevent generation of chained skb destined to UFO device
iwlwifi: move the selects to the tristate drivers
ipv4: annotate a few functions __init in ipconfig.c
atm: ambassador: vcc_sf semaphore to mutex
MAINTAINERS: The socketcan-core list is subscribers-only.
netfilter: nf_conntrack: padding breaks conntrack hash on ARM
ipv4: Update MTU to all related cache entries in ip_rt_frag_needed()
sch_sfq: use del_timer_sync() in sfq_destroy()
net: Add compat support for getsockopt (MCAST_MSFILTER)
net: Several cleanups for the setsockopt compat support.
ipvs: fix oops in backup for fwmark conn templates
bridge: kernel panic when unloading bridge module
bridge: fix error handling in br_add_if()
netfilter: {nfnetlink,ip,ip6}_queue: fix skb_over_panic when enlarging packets
netfilter: x_tables: fix net namespace leak when reading /proc/net/xxx_tables_names
netfilter: xt_TCPOPTSTRIP: signed tcphoff for ipv6_skip_exthdr() retval
tcp: Limit cwnd growth when deferring for GSO
tcp: Allow send-limited cwnd to grow up to max_burst when gso disabled
[netdrvr] gianfar: Determine TBIPA value dynamically
...
From: Lachlan Andrew <lachlan.andrew@gmail.com>
There is an overflow bug in net/ipv4/tcp_vegas.c for large BDPs
(e.g. 400Mbit/s, 400ms). The multiplication (old_wnd *
vegas->baseRTT) << V_PARAM_SHIFT overflows a u32.
[ Fix tcp_veno.c too, it has similar calculations. -DaveM ]
Signed-off-by: David S. Miller <davem@davemloft.net>
Problem: ip_append_data() could wrongly generate a chained skb for
devices which support UFO. When sk_write_queue is not empty
(e.g. MSG_MORE), __instead__ of appending data into the next nr_frag
of the queued skb, a new chained skb is created.
I would normally assume UFO device should get data in nr_frags and not
in frag_list. Later the udp4_hwcsum_outgoing() resets csum to NONE
and skb_gso_segment() has oops.
Proposal:
1. Even length is less than mtu, employ ip_ufo_append_data()
and append data to the __existed__ skb in the sk_write_queue.
2. ip_ufo_append_data() is fixed due to a wrong manipulation of
peek-ing and later enqueue-ing of the same skb. Now, enqueuing is
always performed, because on error the further
ip_flush_pending_frames() would release the queued skb.
Signed-off-by: Kostya B <bkostya@hotmail.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
A few functions are only used from __init context.
So annotate these with __init for consistency and silence
the following warnings:
WARNING: net/ipv4/built-in.o(.text+0x2a876): Section mismatch
in reference from the function ic_bootp_init() to
the variable .init.data:bootp_packet_type
WARNING: net/ipv4/built-in.o(.text+0x2a907): Section mismatch
in reference from the function ic_bootp_cleanup() to
the variable .init.data:bootp_packet_type
Note: The warnings only appear with CONFIG_DEBUG_SECTION_MISMATCH=y
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some drivers have duplicated unlikely() macros. IS_ERR() already has
unlikely() in itself.
This patch cleans up such pointless code.
Signed-off-by: Hirofumi Nakagawa <hnakagawa@miraclelinux.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Jeff Garzik <jeff@garzik.org>
Cc: Paul Clements <paul.clements@steeleye.com>
Cc: Richard Purdie <rpurdie@rpsys.net>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Cc: David Brownell <david-b@pacbell.net>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Jaroslav Kysela <perex@perex.cz>
Cc: Takashi Iwai <tiwai@suse.de>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
commit 0794935e "[NETFILTER]: nf_conntrack: optimize hash_conntrack()"
results in ARM platforms hashing uninitialised padding. This padding
doesn't exist on other architectures.
Fix this by replacing NF_CT_TUPLE_U_BLANK() with memset() to ensure
everything is initialised. There were only 4 bytes that
NF_CT_TUPLE_U_BLANK() wasn't clearing anyway (or 12 bytes on ARM).
Signed-off-by: Philip Craig <philipc@snapgear.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add struct net_device parameter to ip_rt_frag_needed() and update MTU to
cache entries where ifindex is specified. This is similar to what is
already done in ip_rt_redirect().
Signed-off-by: Timo Teras <timo.teras@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for getsockopt for MCAST_MSFILTER for
both IPv4 and IPv6. It depends on the previous setsockopt patch,
and uses the same method.
Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes bug http://bugzilla.kernel.org/show_bug.cgi?id=10556
where conn templates with protocol=IPPROTO_IP can oops backup box.
Result from ip_vs_proto_get() should be checked because
protocol value can be invalid or unsupported in backup. But
for valid message we should not fail for templates which use
IPPROTO_IP. Also, add checks to validate message limits and
connection state. Show state NONE for templates using IPPROTO_IP.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
While reinjecting *bigger* modified versions of IPv6 packets using
libnetfilter_queue, things work fine on a 2.6.24 kernel (2.6.22 too)
but I get the following on recents kernels (2.6.25, trace below is
against today's net-2.6 git tree):
skb_over_panic: text:c04fddb0 len:696 put:632 head:f7592c00 data:f7592c00 tail:0xf7592eb8 end:0xf7592e80 dev:eth0
------------[ cut here ]------------
invalid opcode: 0000 [#1] PREEMPT
Process sendd (pid: 3657, ti=f6014000 task=f77c31d0 task.ti=f6014000)
Stack: c071e638 c04fddb0 000002b8 00000278 f7592c00 f7592c00 f7592eb8 f7592e80
f763c000 f6bc5200 f7592c40 f6015c34 c04cdbfc f6bc5200 00000278 f6015c60
c04fddb0 00000020 f72a10c0 f751b420 00000001 0000000a 000002b8 c065582c
Call Trace:
[<c04fddb0>] ? nfqnl_recv_verdict+0x1c0/0x2e0
[<c04cdbfc>] ? skb_put+0x3c/0x40
[<c04fddb0>] ? nfqnl_recv_verdict+0x1c0/0x2e0
[<c04fd115>] ? nfnetlink_rcv_msg+0xf5/0x160
[<c04fd03e>] ? nfnetlink_rcv_msg+0x1e/0x160
[<c04fd020>] ? nfnetlink_rcv_msg+0x0/0x160
[<c04f8ed7>] ? netlink_rcv_skb+0x77/0xa0
[<c04fcefc>] ? nfnetlink_rcv+0x1c/0x30
[<c04f8c73>] ? netlink_unicast+0x243/0x2b0
[<c04cfaba>] ? memcpy_fromiovec+0x4a/0x70
[<c04f9406>] ? netlink_sendmsg+0x1c6/0x270
[<c04c8244>] ? sock_sendmsg+0xc4/0xf0
[<c011970d>] ? set_next_entity+0x1d/0x50
[<c0133a80>] ? autoremove_wake_function+0x0/0x40
[<c0118f9e>] ? __wake_up_common+0x3e/0x70
[<c0342fbf>] ? n_tty_receive_buf+0x34f/0x1280
[<c011d308>] ? __wake_up+0x68/0x70
[<c02cea47>] ? copy_from_user+0x37/0x70
[<c04cfd7c>] ? verify_iovec+0x2c/0x90
[<c04c837a>] ? sys_sendmsg+0x10a/0x230
[<c011967a>] ? __dequeue_entity+0x2a/0xa0
[<c011970d>] ? set_next_entity+0x1d/0x50
[<c0345397>] ? pty_write+0x47/0x60
[<c033d59b>] ? tty_default_put_char+0x1b/0x20
[<c011d2e9>] ? __wake_up+0x49/0x70
[<c033df99>] ? tty_ldisc_deref+0x39/0x90
[<c033ff20>] ? tty_write+0x1a0/0x1b0
[<c04c93af>] ? sys_socketcall+0x7f/0x260
[<c0102ff9>] ? sysenter_past_esp+0x6a/0x91
[<c05f0000>] ? snd_intel8x0m_probe+0x270/0x6e0
=======================
Code: 00 00 89 5c 24 14 8b 98 9c 00 00 00 89 54 24 0c 89 5c 24 10 8b 40 50 89 4c 24 04 c7 04 24 38 e6 71 c0 89 44 24 08 e8 c4 46 c5 ff <0f> 0b eb fe 55 89 e5 56 89 d6 53 89 c3 83 ec 0c 8b 40 50 39 d0
EIP: [<c04ccdfc>] skb_over_panic+0x5c/0x60 SS:ESP 0068:f6015bf8
Looking at the code, I ended up in nfq_mangle() function (called by
nfqnl_recv_verdict()) which performs a call to skb_copy_expand() due to
the increased size of data passed to the function. AFAICT, it should ask
for 'diff' instead of 'diff - skb_tailroom(e->skb)'. Because the
resulting sk_buff has not enough space to support the skb_put(skb, diff)
call a few lines later, this results in the call to skb_over_panic().
The patch below asks for allocation of a copy with enough space for
mangled packet and the same amount of headroom as old sk_buff. While
looking at how the regression appeared (e2b58a67), I noticed the same
pattern in ipq_mangle_ipv6() and ipq_mangle_ipv4(). The patch corrects
those locations too.
Tested with bigger reinjected IPv6 packets (nfqnl_mangle() path), things
are ok (2.6.25 and today's net-2.6 git tree).
Signed-off-by: Arnaud Ebalard <arno@natisbad.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes inappropriately large cwnd growth on sender-limited flows
when GSO is enabled, limiting cwnd growth to 64k.
Signed-off-by: John Heffner <johnwheffner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This changes the logic in tcp_is_cwnd_limited() so that cwnd may grow
up to tcp_max_burst() even when sk_can_gso() is false, or when
sysctl_tcp_tso_win_divisor != 0.
Signed-off-by: John Heffner <johnwheffner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
iwlwifi: Allow building iwl3945 without iwl4965.
wireless: Fix compile error with wifi & leds
tcp: Fix slab corruption with ipv6 and tcp6fuzz
ipv4/ipv6 compat: Fix SSM applications on 64bit kernels.
[IPSEC]: Use digest_null directly for auth
sunrpc: fix missing kernel-doc
can: Fix copy_from_user() results interpretation
Revert "ipv6: Fix typo in net/ipv6/Kconfig"
tipc: endianness annotations
ipv6: result of csum_fold() is already 16bit, no need to cast
[XFRM] AUDIT: Fix flowlabel text format ambibuity.
From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
This fixes a regression added by ec3c0982a2
("[TCP]: TCP_DEFER_ACCEPT updates - process as established")
tcp_v6_do_rcv()->tcp_rcv_established(), the latter goes to step5, where
eventually skb can be freed via tcp_data_queue() (drop: label), then if
check for tcp_defer_accept_check() returns true and thus
tcp_rcv_established() returns -1, which forces tcp_v6_do_rcv() to jump
to reset: label, which in turn will pass through discard: label and free
the same skb again.
Tested by Eric Sesterhenn.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-By: Patrick McManus <mcmanus@ducksong.com>
Add support on 64-bit kernels for seting 32-bit compatible MCAST*
socket options.
Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_probe has a bounds-checking bug that causes many programs (less,
python) to crash reading /proc/net/tcp_probe. When it outputs a log
line to the reader, it only checks if that line alone will fit in the
reader's buffer, rather than that line and all the previous lines it
has already written.
tcpprobe_read also returns the wrong value if copy_to_user fails--it
just passes on the return value of copy_to_user (number of bytes not
copied), which makes a failure look like a success.
This patch fixes the buffer overflow and sets the return value to
-EFAULT if copy_to_user fails.
Patch is against latest net-2.6; tested briefly and seems to fix the
crashes in less and python.
Signed-off-by: Tom Quetchenbach <virtualphtn@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (22 commits)
tun: Multicast handling in tun_chr_ioctl() needs proper locking.
[NET]: Fix heavy stack usage in seq_file output routines.
[AF_UNIX] Initialise UNIX sockets before general device initcalls
[RTNETLINK]: Fix bogus ASSERT_RTNL warning
iwlwifi: Fix built-in compilation of iwlcore (part 2)
tun: Fix minor race in TUNSETLINK ioctl handling.
ppp_generic: use stats from net_device structure
iwlwifi: Don't unlock priv->mutex if it isn't locked
wireless: rndis_wlan: modparam_workaround_interval is never below 0.
prism54: prism54_get_encode() test below 0 on unsigned index
mac80211: update mesh EID values
b43: Workaround DMA quirks
mac80211: fix use before check of Qdisc length
net/mac80211/rx.c: fix off-by-one
mac80211: Fix race between ieee80211_rx_bss_put and lookup routines.
ath5k: Fix radio identification on AR5424/2424
ssb: Fix all-ones boardflags
b43: Add more btcoexist workarounds
b43: Fix HostFlags data types
b43: Workaround invalid bluetooth settings
...
Plan C: we can follow the Al Viro's proposal about %n like in this patch.
The same applies to udp, fib (the /proc/net/route file), rt_cache and
sctp debug. This is minus ~150-200 bytes for each.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
iwlwifi: Fix built-in compilation of iwlcore
net: Unexport move_addr_to_{kernel,user}
rt2x00: Select LEDS_CLASS.
iwlwifi: Select LEDS_CLASS.
leds: Do not guard NEW_LEDS with HAS_IOMEM
[IPSEC]: Fix catch-22 with algorithm IDs above 31
time: Export set_normalized_timespec.
tcp: Make use of before macro in tcp_input.c
hamradio: Remove unneeded and deprecated cli()/sti() calls in dmascc.c
[NETNS]: Remove empty ->init callback.
[DCCP]: Convert do_gettimeofday() to getnstimeofday().
[NETNS]: Don't initialize err variable twice.
[NETNS]: The ip6_fib_timer can work with garbage on net namespace stop.
[IPV4]: Convert do_gettimeofday() to getnstimeofday().
[IPV4]: Make icmp_sk_init() static.
[IPV6]: Make struct ip6_prohibit_entry_template static.
tcp: Trivial fix to correct function name in a comment in net/ipv4/tcp.c
[NET]: Expose netdevice dev_id through sysfs
skbuff: fix missing kernel-doc notation
[ROSE]: Fix soft lockup wrt. rose_node_list_lock
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
rose: Socket lock was not released before returning to user space
hci_usb: remove code obfuscation
drivers/net/appletalk: use time_before, time_before_eq, etc
drivers/atm: use time_before, time_before_eq, etc
hci_usb: do not initialize static variables to 0
tg3: 5701 DMA corruption fix
atm nicstar: Removal of debug code containing deprecated calls to cli()/sti()
iwlwifi: Fix unconditional access to station->tidp[].agg.
netfilter: Fix SIP conntrack build with NAT disabled.
netfilter: Fix SCTP nat build.
What do_gettimeofday() does is to call getnstimeofday() and
to convert the result from timespec{} to timeval{}.
After that, these callers convert the result again to msec.
Use getnstimeofday() and convert the units at once.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes the needlessly global icmp_sk_init() static.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a trivial fix to correct function name in a comment in
net/ipv4/tcp.c.
Signed-off-by: Satoru SATOH <satoru.satoh@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
None of these files use any of the functionality promised by
asm/semaphore.h. It's possible that they rely on it dragging in some
unrelated header file, but I can't build all these files, so we'll have
fix any build failures as they come up.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
This deblats ~200 bytes when ipv6 and dccp are 'y'.
Besides, this will ease compilation issues for patches
I'm working on to make inet hash tables more scalable
wrt net namespaces.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
[TCP]: Add return value indication to tcp_prune_ofo_queue().
PS3: gelic: fix the oops on the broken IE returned from the hypervisor
b43legacy: fix DMA mapping leakage
mac80211: remove message on receiving unexpected unencrypted frames
Update rt2x00 MAINTAINERS entry
Add rfkill to MAINTAINERS file
rfkill: Fix device type check when toggling states
b43legacy: Fix usage of struct device used for DMAing
ssb: Fix usage of struct device used for DMAing
MAINTAINERS: move to generic repository for iwlwifi
b43legacy: fix initvals loading on bcm4303
rtl8187: Add missing priv->vif assignments
netconsole: only set CON_PRINTBUFFER if the user specifies a netconsole
[CAN]: Update documentation of struct sockaddr_can
MAINTAINERS: isdn4linux@listserv.isdn4linux.de is subscribers-only
[TCP]: Fix never pruned tcp out-of-order queue.
[NET_SCHED] sch_api: fix qdisc_tree_decrease_qlen() loop
This makes sit-generated traffic enter the namespace.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This one was also disabled by default for sanity.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
I.e. set the proper net and mark as NETNS_LOCAL.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
As for the IPIP tunnel, there are some ip_route_output_key()
calls in there that require a proper net so give one to them.
And a proper net for the __get_dev_by_index hanging around.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Very similar to what was done for the IPIP code.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Everything is prepared for this change now. Create on in
init callback, use it over the code and destroy on net exit.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is the part#2 of the patch #2 - get the proper net for
these functions. This change in a separate patch in order not
to get lost in a large previous patch.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The fallback device and hashes are to become per-net, but many
code doesn't have anything to get the struct net pointer from.
So pass the proper net there with an extra argument.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Set the proper net before calling register_netdev and disable
the tunnel device netns changing.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are some ip_route_output_key() calls in there that require
a proper net so give one to them.
Besides - give a proper net to a single __get_dev_by_index call
in ipip_tunnel_bind_dev().
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Either net or ipip_net already exists in all the required
places, so just use one.
Besides, tune net_init and net_exit calls to respectively
initialize the hashes and destroy devices.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is the part#2 of the previous patch - get the proper
net for these functions.
I make it in a separate patch, so that this change does not
get lost in a large previous patch.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The hashes of tunnels will be per-net too, so prepare all the
functions that uses them for this change by adding an argument.
Use init_net temporarily in places, where the net does not exist
explicitly yet.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Create on in ipip_init_net(), use it all over the code (the
proper place to get the net from already exists) and destroy
in ipip_net_exit().
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Needed can only be more strict than what was checked by the
earlier common case check for non-tail skbs, thus
cwnd_len <= needed will never match in that case anyway.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Returns non-zero if tp->out_of_order_queue was seen non-empty.
This allows tcp_try_rmem_schedule() to return early.
Signed-off-by: Vitaliy Gusev <vgusev@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_prune_queue() doesn't prune an out-of-order queue at all.
Therefore sk_rmem_schedule() can fail but the out-of-order queue isn't
pruned . This can lead to tcp deadlock state if the next two
conditions are held:
1. There are a sequence hole between last received in
order segment and segments enqueued to the out-of-order queue.
2. Size of all segments in the out-of-order queue is more than tcp_mem[2].
Signed-off-by: Vitaliy Gusev <vgusev@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (31 commits)
[BRIDGE]: Fix crash in __ip_route_output_key with bridge netfilter
[NETFILTER]: ipt_CLUSTERIP: fix race between clusterip_config_find_get and _entry_put
[IPV6] ADDRCONF: Don't generate temporary address for ip6-ip6 interface.
[IPV6] ADDRCONF: Ensure disabling multicast RS even if privacy extensions are disabled.
[IPV6]: Use appropriate sock tclass setting for routing lookup.
[IPV6]: IPv6 extension header structures need to be packed.
[IPV6]: Fix ipv6 address fetching in raw6_icmp_error().
[NET]: Return more appropriate error from eth_validate_addr().
[ISDN]: Do not validate ISDN net device address prior to interface-up
[NET]: Fix kernel-doc for skb_segment
[SOCK] sk_stamp: should be initialized to ktime_set(-1L, 0)
net: check for underlength tap writes
net: make struct tun_struct private to tun.c
[SCTP]: IPv4 vs IPv6 addresses mess in sctp_inet[6]addr_event.
[SCTP]: Fix compiler warning about const qualifiers
[SCTP]: Fix protocol violation when receiving an error lenght INIT-ACK
[SCTP]: Add check for hmac_algo parameter in sctp_verify_param()
[NET_SCHED] cls_u32: refcounting fix for u32_delete()
[DCCP]: Fix skb->cb conflicts with IP
[AX25]: Potential ax25_uid_assoc-s leaks on module unload.
...
I was asked about "why don't we perform a sk_net filtering in
bind_conflict calls, like we do in other sock lookup places"
for a couple of times.
Can we please add a comment about why we do not need one?
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Directly call IPv4 and IPv6 variants where the address family is
easily known.
Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Connection tracking helpers (specifically FTP) need to be called
before NAT sequence numbers adjustments are performed to be able
to compare them against previously seen ones. We've introduced
two new hooks around 2.6.11 to maintain this ordering when NAT
modules were changed to get called from conntrack helpers directly.
The cost of netfilter hooks is quite high and sequence number
adjustments are only rarely needed however. Add a RCU-protected
sequence number adjustment function pointer and call it from
IPv4 conntrack after calling the helper.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Adding extensions to confirmed conntracks is not allowed to avoid races
on reallocation. Don't setup NAT for confirmed conntracks in case NAT
module is loaded late.
The has one side-effect, the connections existing before the NAT module
was loaded won't enter the bysource hash. The only case where this actually
makes a difference is in case of SNAT to a multirange where the IP before
NAT is also part of the range. Since old connections don't enter the
bysource hash the first new connection from the IP will have a new address
selected. This shouldn't matter at all.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Locally generated ICMP packets have a reference to the conntrack entry
of the original packet manually attached by icmp_send(). Therefore the
check for locally originated untracked ICMP redirects can never be
true.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Move the UDP-Lite conntrack checksum validation to a generic helper
similar to nf_checksum() and make it fall back to nf_checksum()
in case the full packet is to be checksummed and hardware checksums
are available. This is to be used by DCCP conntrack, which also
needs to verify partial checksums.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Move responsibility for setting the IP_NAT_RANGE_PROTO_SPECIFIED flag
to the NAT protocol, properly propagate errors and get rid of ugly
return value convention.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Move to nf_nat_proto_common and rename to nf_nat_proto_... since they're
also used by protocols that don't have port numbers.
Signed-off-by: Patrick McHardy <kaber@trash.net>
The port rover should not get overwritten when using random mode,
otherwise other rules will also use more or less random ports.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Rule dumping is performed in two steps: first userspace gets the
ruleset size using getsockopt(SO_GET_INFO) and allocates memory,
then it calls getsockopt(SO_GET_ENTRIES) to actually dump the
ruleset. When another process changes the ruleset in between the
sizes from the first getsockopt call doesn't match anymore and
the kernel aborts. Unfortunately it returns EAGAIN, as for multiple
other possible errors, so userspace can't distinguish this case
from real errors.
Return EAGAIN so userspace can retry the operation.
Fixes (with current iptables SVN version) netfilter bugzilla #104.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Commit 9335f047fe aka
"[NETFILTER]: ip_tables: per-netns FILTER, MANGLE, RAW"
added per-netns _view_ of iptables rules. They were shown to user, but
ignored by filtering code. Now that it's possible to at least ping loopback,
per-netns tables can affect filtering decisions.
netns is taken in case of
PRE_ROUTING, LOCAL_IN -- from in device,
POST_ROUTING, LOCAL_OUT -- from out device,
FORWARD -- from in device which should be equal to out device's netns.
This code is relatively new, so BUG_ON was plugged.
Wrappers were added to a) keep code the same from CONFIG_NET_NS=n users
(overwhelming majority), b) consolidate code in one place -- similar
changes will be done in ipv6 and arp netfilter code.
Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Dump the mark value in log messages similar to nfnetlink_log. This
is useful for debugging complex setups where marks are used for
routing or traffic classification.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Consider we are putting a clusterip_config entry with the "entries"
count == 1, and on the other CPU there's a clusterip_config_find_get
in progress:
CPU1: CPU2:
clusterip_config_entry_put: clusterip_config_find_get:
if (atomic_dec_and_test(&c->entries)) {
/* true */
read_lock_bh(&clusterip_lock);
c = __clusterip_config_find(clusterip);
/* found - it's still in list */
...
atomic_inc(&c->entries);
read_unlock_bh(&clusterip_lock);
write_lock_bh(&clusterip_lock);
list_del(&c->list);
write_unlock_bh(&clusterip_lock);
...
dev_put(c->dev);
Oops! We have an entry returned by the clusterip_config_find_get,
which is a) not in list b) has a stale dev pointer.
The problems will happen when the CPU2 will release the entry - it
will remove it from the list for the 2nd time, thus spoiling it, and
will put a stale dev pointer.
The fix is to make atomic_dec_and_test under the clusterip_lock.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
This expresses __skb_append in terms of __skb_queue_after, exploiting that
__skb_append(old, new, list) = __skb_queue_after(list, old, new).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace seq_open with seq_open_net and remove tcp_seq_release
completely. seq_release_net will do this job just fine.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
No need to create seq_operations for each instance of 'netstat'.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_proc_register/tcp_proc_unregister are called with a static pointer only.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk_reuse is declared as "unsigned char", but is set as type valbool in net/core/sock.c.
There is no other place in net/ where sk->sk_reuse is set to a value > 1, so the test
"sk_reuse > 1" can not be true.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'docs' of git://git.lwn.net/linux-2.6:
Add additional examples in Documentation/spinlocks.txt
Move sched-rt-group.txt to scheduler/
Documentation: move rpc-cache.txt to filesystems/
Documentation: move nfsroot.txt to filesystems/
Spell out behavior of atomic_dec_and_lock() in kerneldoc
Fix a typo in highres.txt
Fixes to the seq_file document
Fill out information on patch tags in SubmittingPatches
Add the seq_file documentation
Documentation/ is a little large, and filesystems/ seems an obvious
place for this file.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Copy the network namespace from the socket to the timewait socket.
Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Acked-by: Mark Lord <mlord@pobox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The comparison in ip_route_input is a hot path, by recoding the C
"and" as bit operations, fewer conditional branches get generated
so the code should be faster. Maybe someday Gcc will be smart
enough to do this?
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Avoid unneeded test in the case where object to be freed
has to be a leaf. Don't need to use the generic tnode_free()
function, instead just setup leaf to be freed.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The trie pointer is passed down to flush_list and flush_leaf
but never used.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow the use of SACK and window scaling when syncookies are used
and the client supports tcp timestamps. Options are encoded into
the timestamp sent in the syn-ack and restored from the timestamp
echo when the ack is received.
Based on earlier work by Glenn Griffin.
This patch avoids increasing the size of structs by encoding TCP
options into the least significant bits of the timestamp and
by not using any 'timestamp offset'.
The downside is that the timestamp sent in the packet after the synack
will increase by several seconds.
changes since v1:
don't duplicate timestamp echo decoding function, put it into ipv4/syncookie.c
and have ipv6/syncookies.c use it.
Feedback from Glenn Griffin: fix line indented with spaces, kill redundant if ()
Reviewed-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use vmalloc rather than alloc_pages to avoid wasting memory.
The problem is that tnode structure has a power of 2 sized array,
plus a header. So the current code wastes almost half the memory
allocated because it always needs the next bigger size to hold
that small header.
This is similar to an earlier patch by Eric, but instead of a list
and lock, I used a workqueue to handle the fact that vfree can't
be done in interrupt context.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
No urgency on the rehash interval timer, so mark it as deferrable.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since route hash is a triple, use jhash_3words rather doing the mixing
directly. This should be as fast and give better distribution.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Don't mark functions that are large as inline, let compiler decide.
Also, use inline rather than __inline__.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes kernel bugzilla 10371.
As reported by M.Piechaczek@osmosys.tv, if we try to grab a
char sized socket option value, as in:
unsigned char ttl = 255;
socklen_t len = sizeof(ttl);
setsockopt(socket, IPPROTO_IP, IP_MULTICAST_TTL, &ttl, &len);
getsockopt(socket, IPPROTO_IP, IP_MULTICAST_TTL, &ttl, &len);
The ttl returned will be wrong on big-endian, and on both little-
endian and big-endian the next three bytes in userspace are written
with garbage.
It's because of this test in do_ip_getsockopt():
if (len < sizeof(int) && len > 0 && val>=0 && val<255) {
It should allow a 'val' of 255 to pass here, but it doesn't so it
copies a full 'int' back to userspace.
On little-endian that will write the correct value into the location
but it spams on the next three bytes in userspace. On big endian it
writes the wrong value into the location and spams the next three
bytes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Without this patch, the generic L3 tracker would kick in
if nf_conntrack_ipv4 was not loaded before nf_nat, which
would lead to translation problems with ICMP errors.
NAT does not make sense without IPv4 connection tracking
anyway, so just add a call to need_ipv4_conntrack().
Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
MTU probe can cause some remedies for FRTO because the normal
packet ordering may be violated allowing FRTO to make a wrong
decision (it might not be that serious threat for anything
though). Thus it's safer to not run FRTO while MTU probe is
underway.
It seems that the basic FRTO variant should also look for an
skb at probe_seq.start to check if that's retransmitted one
but I didn't implement it now (plain seqno in window check
isn't robust against wraparounds).
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes Bugzilla #10384
tcp_simple_retransmit does L increment without any checking
whatsoever for overflowing S+L when Reno is in use.
The simplest scenario I can currently think of is rather
complex in practice (there might be some more straightforward
cases though). Ie., if mss is reduced during mtu probing, it
may end up marking everything lost and if some duplicate ACKs
arrived prior to that sacked_out will be non-zero as well,
leading to S+L > packets_out, tcp_clean_rtx_queue on the next
cumulative ACK or tcp_fastretrans_alert on the next duplicate
ACK will fix the S counter.
More straightforward (but questionable) solution would be to
just call tcp_reset_reno_sack() in tcp_simple_retransmit but
it would negatively impact the probe's retransmission, ie.,
the retransmissions would not occur if some duplicate ACKs
had arrived.
So I had to add reno sacked_out reseting to CA_Loss state
when the first cumulative ACK arrives (this stale sacked_out
might actually be the explanation for the reports of left_out
overflows in kernel prior to 2.6.23 and S+L overflow reports
of 2.6.24). However, this alone won't be enough to fix kernel
before 2.6.24 because it is building on top of the commit
1b6d427bb7 ([TCP]: Reduce sacked_out with reno when purging
write_queue) to keep the sacked_out from overflowing.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Reported-by: Alessandro Suardi <alessandro.suardi@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes a long-standing bug which makes NewReno recovery crippled.
With GSO the whole head skb was marked as LOST which is in
violation of NewReno procedure that only wants to mark one packet
and ended up breaking our TCP code by causing counter overflow
because our code was built on top of assumption about valid
NewReno procedure. This manifested as triggering a WARN_ON for
the overflow in a number of places.
It seems relatively safe alternative to just do nothing if
tcp_fragment fails due to oom because another duplicate ACK is
likely to be received soon and the fragmentation will be retried.
Special thanks goes to Soeren Sonnenburg <kernel@nn7.de> who was
lucky enough to be able to reproduce this so that the warning
for the overflow was hit. It's not as easy task as it seems even
if this bug happens quite often because the amount of outstanding
data is pretty significant for the mismarkings to lead to an
overflow.
Because it's very late in 2.6.25-rc cycle (if this even makes in
time), I didn't want to touch anything with SACK enabled here.
Fragmenting might be useful for it as well but it's more or less
a policy decision rather than mandatory fix. Thus there's no need
to rush and we can postpone considering tcp_fragment with SACK
for 2.6.26.
In 2.6.24 and earlier, this very same bug existed but the effect
is slightly different because of a small changes in the if
conditions that fit to the patch's context. With them nothing
got lost marker and thus no retransmissions happened.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
The fast retransmission can be forced locally to the rfc3517
branch in tcp_update_scoreboard instead of making such fragile
constructs deeper in tcp_mark_head_lost.
This is necessary for the next patch which must not have
loopholes for cnt > packets check. As one can notice,
readability got some improvements too because of this :-).
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
uc_ttl is initialized in inet(6)_create and never changed except
setsockopt ioctl. Remove this assignment.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace sock_create_kern with inet_ctl_sock_create.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a generic requirement, so make inet_ctl_sock_create namespace
aware and create a inet_ctl_sock_destroy wrapper around
sk_release_kernel.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
All upper protocol layers are already use sock internally.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This call is nothing common with INET connection sockets code. It
simply creates an unhashes kernel sockets for protocol messages.
Move the new call into af_inet.c after the rename.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace tcp_socket with tcp_sock. This is more effective (less
derefferences on fast paths). Additionally, the approach is unified to
one used in ICMP.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ICMP relookup path is only meant to modify behaviour when
appropriate IPsec policies are in place and marked as requiring
relookups. It is certainly not meant to modify behaviour when
IPsec policies don't exist at all.
However, due to an oversight on the error paths existing behaviour
may in fact change should one of the relookup steps fail.
This patch corrects this by redirecting all errors on relookup
failures to the previous code path. That is, if the initial
xfrm_lookup let the packet pass, we will stand by that decision
should the relookup fail due to an error.
This should be safe from a security point-of-view because compliant
systems must install a default deny policy so the packet would'nt
have passed in that case.
Many thanks to Julian Anastasov for pointing out this error.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Besides, now we can see per-net fragments statistics in the
same file, since this stats is already per-net.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently they live in init_net only, but now almost all the info
they can provide is available per-net.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This counter is about to become per-proto-and-per-net, so we'll need
two arguments to determine which cell in this "table" to work with.
All the places, but proc already pass proper net to it - proc will be
tuned a bit later.
Some indentation with spaces in proc files is done to keep the file
coding style consistent.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace seq_open with seq_open_net and remove udp_seq_release
completely. seq_release_net will do this job just fine.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
No need to create seq_operations for each instance of 'netstat'.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
udp_proc_register/udp_proc_unregister are called with a static pointer only.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
An uppercut - do not use the pcounter on struct proto.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On Fri, 2008-03-28 at 03:24 -0700, Andrew Morton wrote:
> they should all be renamed.
Done for include/net and net
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 9af3912ec9 ("[NET] Move DF check
to ip_forward") added a new check to send ICMP fragmentation needed
for large packets.
Unlike the check in ip_finish_output(), it doesn't check for GSO.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This elliminates infamous race during module loading when one could lookup
proc entry without proc_fops assigned.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ESP does not account for the IV size when calling pskb_may_pull() to
ensure everything it accesses directly is within the linear part of a
potential fragment. This results in a BUG() being triggered when the
both the IPv4 and IPv6 ESP stack is fed with an skb where the first
fragment ends between the end of the esp header and the end of the IV.
This bug was found by Dirk Nehring <dnehring@gmx.net> .
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
The IPv6 BEET output function is incorrectly including the inner
header in the payload to be protected. This causes a crash as
the packet doesn't actually have that many bytes for a second
header.
The IPv4 BEET output on the other hand is broken when it comes
to handling an inner IPv6 header since it always assumes an
inner IPv4 header.
This patch fixes both by making sure that neither BEET output
function touches the inner header at all. All access is now
done through the protocol-independent cb structure. Two new
attributes are added to make this work, the IP header length
and the IPv4 option length. They're filled in by the inner
mode's output function.
Thanks to Joakim Koskela for finding this problem.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 8b7817f3a9 ([IPSEC]: Add ICMP host
relookup support) introduced some dst leaks on error paths: the rt
pointer can be forgotten to be put. Fix it bu going to a proper label.
Found after net namespace's lo refused to unregister :) Many thanks to
Den for valuable help during debugging.
Herbert pointed out, that xfrm_lookup() will put the rtable in case
of error itself, so the first goto fix is redundant.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This mostly re-uses the net, used in icmp netnsization patches from Denis.
After this ICMP sysctls are completely virtualized.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add some flesh to ipv4_sysctl_init_net and ipv4_sysctl_exit_net,
i.e. copy the table, alter .data pointers and register it per-net.
Other ipv4_table's sysctls are now global, but this is going to
change once sysctl permissions patches migrate from -mm tree to
mainline in 2.6.26 merge window :)
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Initialization is moved to icmp_sk_init, all the places, that
refer to them use init_net for now.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This includes adding pernet_operations, empty init and exit
hooks and a bit of changes in sysctl_ipv4_init just not to
have this part in next patches.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Optimize call routing between NATed endpoints: when an external
registrar sends a media description that contains an existing RTP
expectation from a different SNATed connection, the gatekeeper
is trying to route the call directly between the two endpoints.
We assume both endpoints can reach each other directly and
"un-NAT" the addresses, which makes the media stream go between
the two endpoints directly.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The SDP connection addresses may be contained in the payload multiple
times (in the session description and/or once per media description),
currently only the session description is properly updated. Split up
SDP mangling so the function setting up expectations only updates the
media port, update connection addresses from media descriptions while
parsing them and at the end update the session description when the
final addresses are known.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Create expectations for the RTCP connections in addition to RTP connections.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Create expectations for incoming signalling connections when seeing
a REGISTER request. This is needed when the registrar uses a
different source port number for signalling messages and for receiving
incoming calls from other endpoints than the registrar.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The SIP message may contain multiple Contact: addresses referring to
the NATed endpoint, translate all of them.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Update maddr=, received= and rport= Via-header parameters refering to
the signalling connection.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Perform NAT last after parsing the packet. This makes no difference
currently, but is needed when dealing with registrations to make
sure we seen the unNATed addresses.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the URI parsing helper to get the numerical addresses and get rid of the
text based header translation.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce new function for SIP header parsing that properly deals with
continuation lines and whitespace in headers and use it.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The request URI is not a header and needs to be treated differently than
real SIP headers. Add a seperate function for parsing it and get rid of
the POS_REQ_URI/POS_REG_REQ_URI definitions.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
SDP and SIP headers are quite different, SIP can have continuation lines,
leading and trailing whitespace after the colon and is mostly case-insensitive
while SDP headers always begin on a new line and are followed by an equal
sign and the value, without any whitespace.
Introduce new SDP header parsing function and convert all users that used
the SIP header parsing function. This will allow to properly deal with the
special SIP cases in the SIP header parsing function later.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace sizeof/memcmp by strlen/strcmp. Use case-insensitive comparison
for SIP methods and the SIP/2.0 string, as specified in RFC 3261.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The conntrack reference and ctinfo can be derived from the packet.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
After mangling the packet, the pointer to the data and the length of the data
portion may change and need to be adjusted.
Use double data pointers and a pointer to the length everywhere and add a
helper function to the NAT helper for performing the adjustments.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to set up the destination NAT mapping before the source NAT
mapping, so the NAT core gets to see the final tuple and can decide
whether the source port needs to be remapped.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce expectation classes and policies. An expectation class
is used to distinguish different types of expectations by the
same helper (for example audio/video/t.120). The expectation
policy is used to hold the maximum number of expectations and
the initial timeout for each class.
The individual classes are isolated from each other, which means
that for example an audio expectation will only evict other audio
expectations.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
With nf_conntrack DUMP_TUPLE got renamed to NF_CT_DUMP_TUPLE, fix
CLUSTERIP to use the proper macro name.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce an inline net_eq() to compare two namespaces.
Without CONFIG_NET_NS, since no namespace other than &init_net
exists, it is always 1.
We do not need to convert 1) inline vs inline and
2) inline vs &init_net comparisons.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Without CONFIG_NET_NS, no namespace other than &init_net exists,
no need to store net in seq_net_private.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Introduce per-sock inlines: sock_net(), sock_net_set()
and per-inet_timewait_sock inlines: twsk_net(), twsk_net_set().
Without CONFIG_NET_NS, no namespace other than &init_net exists.
Let's explicitly define them to help compiler optimizations.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Introduce per-net_device inlines: dev_net(), dev_net_set().
Without CONFIG_NET_NS, no namespace other than &init_net exists.
Let's explicitly define them to help compiler optimizations.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Allow to create sockets in the namespace if the protocol ok with this.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
IP layer now can handle multiple namespaces normally. So, process such
packets normally and drop them only if the transport layer is not
aware about namespaces.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace all the reast of the init_net with a proper net on the socket
layer.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace all the rest of the init_net with a proper net on the IP layer.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_options_compile uses inet_addr_type which requires a namespace. The
packet argument is optional, so parameter is the only way to obtain
it. Pass the init_net there for now.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Seqfile operation showing /proc/net/arp are already namespace
aware. All we need is to register this file for each namespace.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Get namespace from a device and pass it to the routing engine. Enable
ARP packet processing and device notifiers after that.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
UDP-Lite sockets are displayed in another files, rather than
UDP ones, so make the present in namespaces as well.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Just introduce a helper to remove ifdefs from inside the
udplite4_register function. This will help to make the next patch
nicer.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
After the commit f40c8174d3 ([NETNS][IPV4]
tcp - make proc handle the network namespaces) it is now possible to make
this file present in newly created namespaces.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
After the commit a91275eff4 ([NETNS][IPV6]
udp - make proc handle the network namespace) it is now possible to make
this file present in newly created namespaces.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make /proc/net/fib_trie and /proc/net/fib_triestat display
all routing tables, not just local and main.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
the first u32 copied from syncookie_secret is overwritten by the
minute-counter four lines below. After adjusting the destination
address, the size of syncookie_secret can be reduced accordingly.
AFAICS, the only other user of syncookie_secret[] is the ipv6
syncookie support. Because ipv6 syncookies only grab 44 bytes from
syncookie_secret[], this shouldn't affect them in any way.
With fixes from Glenn Griffin.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Glenn Griffin <ggriffin.kernel@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This gets rid of a warning caused by the test in rcu_assign_pointer.
I tried to fix rcu_assign_pointer, but that devolved into a long set
of discussions about doing it right that came to no real solution.
Since the test in rcu_assign_pointer for constant NULL would never
succeed in fib_trie, just open code instead.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The route table parameters are set based on system memory and sysctl
values that almost never change. Also the genid only changes every
10 minutes.
RTprint is defined by never used.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sorry for the patch sequence confusion :| but I found that the similar
thing can be done for raw sockets easily too late.
Expand the proto.h union with the raw_hashinfo member and use it in
raw_prot and rawv6_prot. This allows to drop the protocol specific
versions of hash and unhash callbacks.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
After this we have only udp_lib_get_port to get the port and two
stubs for ipv4 and ipv6. No difference in udp and udplite except
for initialized h.udp_hash member.
I tried to find a graceful way to drop the only difference between
udp_v4_get_port and udp_v6_get_port (i.e. the rcv_saddr comparison
routine), but adding one more callback on the struct proto didn't
appear such :( Maybe later.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Inspired by the commit ab1e0a13 ([SOCK] proto: Add hashinfo member to
struct proto) from Arnaldo, I made similar thing for UDP/-Lite IPv4
and -v6 protocols.
The result is not that exciting, but it removes some levels of
indirection in udpxxx_get_port and saves some space in code and text.
The first step is to union existing hashinfo and new udp_hash on the
struct proto and give a name to this union, since future initialization
of tcpxxx_prot, dccp_vx_protinfo and udpxxx_protinfo will cause gcc
warning about inability to initialize anonymous member this way.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This makes code a bit more uniform and straigthforward.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_options->is_data is assigned only and never checked. The structure is
not a part of kernel interface to the userspace. So, it is safe to remove
this field.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is the only way to reach ip_options compile with opt != NULL:
ip_options_get_finish
opt->is_data = 1;
ip_options_compile(opt, NULL)
So, checking for is_data inside opt != NULL branch is not needed.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
While testing the virtio-net driver on KVM with TSO I noticed
that TSO performance with a 1500 MTU is significantly worse
compared to the performance of non-TSO with a 16436 MTU. The
packet dump shows that most of the packets sent are smaller
than a page.
Looking at the code this actually is quite obvious as it always
stop extending the packet if it's the first packet yet to be
sent and if it's larger than the MSS. Since each extension is
bound by the page size, this means that (given a 1500 MTU) we're
very unlikely to construct packets greater than a page, provided
that the receiver and the path is fast enough so that packets can
always be sent immediately.
The fix is also quite obvious. The push calls inside the loop
is just an optimisation so that we don't end up doing all the
sending at the end of the loop. Therefore there is no specific
reason why it has to do so at MSS boundaries. For TSO, the
most natural extension of this optimisation is to do the pushing
once the skb exceeds the TSO size goal.
This is what the patch does and testing with KVM shows that the
TSO performance with a 1500 MTU easily surpasses that of a 16436
MTU and indeed the packet sizes sent are generally larger than
16436.
I don't see any obvious downsides for slower peers or connections,
but it would be prudent to test this extensively to ensure that
those cases don't regress.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change TCP_DEFER_ACCEPT implementation so that it transitions a
connection to ESTABLISHED after handshake is complete instead of
leaving it in SYN-RECV until some data arrvies. Place connection in
accept queue when first data packet arrives from slow path.
Benefits:
- established connection is now reset if it never makes it
to the accept queue
- diagnostic state of established matches with the packet traces
showing completed handshake
- TCP_DEFER_ACCEPT timeouts are expressed in seconds and can now be
enforced with reasonable accuracy instead of rounding up to next
exponential back-off of syn-ack retry.
Signed-off-by: Patrick McManus <mcmanus@ducksong.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
a socket in LISTEN that had completed its 3 way handshake, but not notified
userspace because of SO_DEFER_ACCEPT, would retransmit the already
acked syn-ack during the time it was waiting for the first data byte
from the peer.
Signed-off-by: Patrick McManus <mcmanus@ducksong.com>
Acked-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
timeout associated with SO_DEFER_ACCEPT wasn't being honored if it was
less than the timeout allowed by the maximum syn-recv queue size
algorithm. Fix by using the SO_DEFER_ACCEPT value if the ack has
arrived.
Signed-off-by: Patrick McManus <mcmanus@ducksong.com>
Acked-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commits f40c81 ([NETNS][IPV4] tcp - make proc handle the network
namespaces) and a91275 ([NETNS][IPV6] udp - make proc handle the
network namespace) both introduced bad checks on sockets and tw
buckets to belong to proper net namespace.
I.e. when checking for socket to belong to given net and family the
do {
sk = sk_next(sk);
} while (sk && sk->sk_net != net && sk->sk_family != family);
constructions were used. This is wrong, since as soon as the
sk->sk_net fits the net the socket is immediately returned, even if it
belongs to other family.
As the result four /proc/net/(udp|tcp)[6] entries show wrong info.
The udp6 entry even oopses when dereferencing inet6_sk(sk) pointer:
static void udp6_sock_seq_show(struct seq_file *seq, struct sock *sp, int bucket)
{
...
struct ipv6_pinfo *np = inet6_sk(sp);
...
dest = &np->daddr; /* will be NULL for AF_INET sockets */
...
seq_printf(...
dest->s6_addr32[0], dest->s6_addr32[1],
dest->s6_addr32[2], dest->s6_addr32[3],
...
Fix it by converting && to ||.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Been seeing occasional panics in my testing of 2.6.25-rc in ip_defrag.
Offending line in ip_defrag is here:
net = skb->dev->nd_net
where dev is NULL. Bisected the problem down to commit
ac18e7509e ([NETNS][FRAGS]: Make the
inet_frag_queue lookup work in namespaces).
Below patch (idea from Patrick McHardy) fixes the problem for me.
Signed-off-by: Phil Oester <kernel@linuxace.com>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The proc init/exit functions take a new network namespace parameter in
order to register/unregister /proc/net/udp6 for a namespace.
Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch, like udp proc, makes the proc functions to take care of
which namespace the socket belongs.
Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Copy the network namespace from the socket to the timewait socket.
Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes the common udp proc functions to take care of which
socket they should show taking into account the namespace it belongs.
Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Update: My mailer ate one of Jarek's feedback mails... Fixed the
parameter in netif_set_gso_max_size() to be u32, not u16. Fixed the
whitespace issue due to a patch import botch. Changed the types from
u32 to unsigned int to be more consistent with other variables in the
area. Also brought the patch up to the latest net-2.6.26 tree.
Update: Made gso_max_size container 32 bits, not 16. Moved the
location of gso_max_size within netdev to be less hotpath. Made more
consistent names between the sock and netdev layers, and added a
define for the max GSO size.
Update: Respun for net-2.6.26 tree.
Update: changed max_gso_frame_size and sk_gso_max_size from signed to
unsigned - thanks Stephen!
This patch adds the ability for device drivers to control the size of
the TSO frames being sent to them, per TCP connection. By setting the
netdevice's gso_max_size value, the socket layer will set the GSO
frame size based on that value. This will propogate into the TCP
layer, and send TSO's of that size to the hardware.
This can be desirable to help tune the bursty nature of TSO on a
per-adapter basis, where one may have 1 GbE and 10 GbE devices
coexisting in a system, one running multiqueue and the other not, etc.
This can also be desirable for devices that cannot support full 64 KB
TSO's, but still want to benefit from some level of segmentation
offloading.
Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When selecting a new window, tcp_select_window() tries not to shrink
the offered window by using the maximum of the remaining offered window
size and the newly calculated window size. The newly calculated window
size is always a multiple of the window scaling factor, the remaining
window size however might not be since it depends on rcv_wup/rcv_nxt.
This means we're effectively shrinking the window when scaling it down.
The dump below shows the problem (scaling factor 2^7):
- Window size of 557 (71296) is advertised, up to 3111907257:
IP 172.2.2.3.33000 > 172.2.2.2.33000: . ack 3111835961 win 557 <...>
- New window size of 514 (65792) is advertised, up to 3111907217, 40 bytes
below the last end:
IP 172.2.2.3.33000 > 172.2.2.2.33000: . 3113575668:3113577116(1448) ack 3111841425 win 514 <...>
The number 40 results from downscaling the remaining window:
3111907257 - 3111841425 = 65832
65832 / 2^7 = 514
65832 % 2^7 = 40
If the sender uses up the entire window before it is shrunk, this can have
chaotic effects on the connection. When sending ACKs, tcp_acceptable_seq()
will notice that the window has been shrunk since tcp_wnd_end() is before
tp->snd_nxt, which makes it choose tcp_wnd_end() as sequence number.
This will fail the receivers checks in tcp_sequence() however since it
is before it's tp->rcv_wup, making it respond with a dupack.
If both sides are in this condition, this leads to a constant flood of
ACKs until the connection times out.
Make sure the window is never shrunk by aligning the remaining window to
the window scaling factor.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a rule using ipt_recent is created with a hit count greater than
ip_pkt_list_tot, the rule will never match as it cannot keep track
of enough timestamps. This patch makes ipt_recent refuse to create such
rules.
With ip_pkt_list_tot's default value of 20, the following can be used
to reproduce the problem.
nc -u -l 0.0.0.0 1234 &
for i in `seq 1 100`; do echo $i | nc -w 1 -u 127.0.0.1 1234; done
This limits it to 20 packets:
iptables -A OUTPUT -p udp --dport 1234 -m recent --set --name test \
--rsource
iptables -A OUTPUT -p udp --dport 1234 -m recent --update --seconds \
60 --hitcount 20 --name test --rsource -j DROP
While this is unlimited:
iptables -A OUTPUT -p udp --dport 1234 -m recent --set --name test \
--rsource
iptables -A OUTPUT -p udp --dport 1234 -m recent --update --seconds \
60 --hitcount 21 --name test --rsource -j DROP
With the patch the second rule-set will throw an EINVAL.
Reported-by: Sean Kennedy <skennedy@vcn.com>
Signed-off-by: Daniel Hokka Zakrisson <daniel@hozac.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
With TSO it was possible to send past the receiver window when the skb
to be sent was the last in the write queue while the receiver window
is the limiting factor. One can notice that there's a loophole in the
tcp_mss_split_point that lacked a receiver window check for the
tcp_write_queue_tail() if also cwnd was smaller than the full skb.
Noticed by Thomas Gleixner <tglx@linutronix.de> in form of "Treason
uncloaked! Peer ... shrinks window .... Repaired." messages (the peer
didn't actually shrink its window as the message suggests, we had just
sent something past it without a permission to do so).
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Tested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit db1ed684f6 ("[IPV6]
UDP: Rename IPv6 UDP files."), commit
8be8af8fa4 ("[IPV4] UDP: Move
IPv4-specific bits to other file.") and commit
e898d4db27 ("[UDP]: Allow users to
configure UDP-Lite.").
First, udplite is of such small cost, and it is a core protocol just
like TCP and normal UDP are.
We spent enormous amounts of effort to make udplite share as much code
with core UDP as possible. All of that work is less valuable if we're
just going to slap a config option on udplite support.
It is also causing build failures, as reported on linux-next, showing
that the changeset was not tested very well. In fact, this is the
second build failure resulting from the udplite change.
Finally, the config options provided was a bool, instead of a modular
option. Meaning the udplite code does not even get build tested
by allmodconfig builds, and furthermore the user is not presented
with a reasonable modular build option which is particularly needed
by distribution vendors.
Signed-off-by: David S. Miller <davem@davemloft.net>
__FUNCTION__ is gcc-specific, use __func__
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(Anonymous) unions can help us to avoid ugly casts.
A common cast it the (struct rtable *)skb->dst one.
Defining an union like :
union {
struct dst_entry *dst;
struct rtable *rtable;
};
permits to use skb->rtable in place.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
From: Stephen Hemminger <shemminger@linux-foundation.org>
Based upon a patch by Marcel Wappler:
This patch fixes a DHCP issue of the kernel: some DHCP servers
(i.e. in the Linksys WRT54Gv5) are very strict about the contents
of the DHCPDISCOVER packet they receive from clients.
Table 5 in RFC2131 page 36 requests the fields 'ciaddr' and
'siaddr' MUST be set to '0'. These DHCP servers ignore Linux
kernel's DHCP discovery packets with these two fields set to
'255.255.255.255' (in contrast to popular DHCP clients, such as
'dhclient' or 'udhcpc'). This leads to a not booting system.
Signed-off-by: David S. Miller <davem@davemloft.net>
Now the ESP uses the AEAD interface even for algorithms which are
not combined mode, we need to select CONFIG_CRYPTO_AUTHENC as
otherwise only combined mode algorithms will work.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Updated to incorporate Eric's suggestion of using a per cpu buffer
rather than allocating on the stack. Just a two line change, but will
resend in it's entirety.
Signed-off-by: Glenn Griffin <ggriffin.kernel@gmail.com>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
400 bytes allocated on stack might be a litle bit too much. Using a
per_cpu var is more friendly.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
There are some place, that calculate the ARP header length. These
calculations are correct, but
a) some operate with "magic" constants,
b) enlarge the code length (sometimes at the cost of coding style),
c) are not informative from the first glance.
The proposal is to introduce a helper, that includes all the good
sides of these calculations.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
It makes fackets_out to grow too slowly compared with the
real write queue.
This shouldn't cause those BUG_TRAP(packets <= tp->packets_out)
to trigger but how knows how such inconsistent fackets_out
affects here and there around TCP when everything is nowadays
assuming accurate fackets_out. So lets see if this silences
them all.
Reported by Guillaume Chazarain <guichaz@gmail.com>.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>