linux/net/core
Eric Dumazet d3836f21b0 net: allow skb->head to be a page fragment
skb->head is currently allocated from kmalloc(). This is convenient but
has the drawback the data cannot be converted to a page fragment if
needed.

We have three spots were it hurts :

1) GRO aggregation

 When a linear skb must be appended to another skb, GRO uses the
frag_list fallback, very inefficient since we keep all struct sk_buff
around. So drivers enabling GRO but delivering linear skbs to network
stack aren't enabling full GRO power.

2) splice(socket -> pipe).

 We must copy the linear part to a page fragment.
 This kind of defeats splice() purpose (zero copy claim)

3) TCP coalescing.

 Recently introduced, this permits to group several contiguous segments
into a single skb. This shortens queue lengths and save kernel memory,
and greatly reduce probabilities of TCP collapses. This coalescing
doesnt work on linear skbs (or we would need to copy data, this would be
too slow)

Given all these issues, the following patch introduces the possibility
of having skb->head be a fragment in itself. We use a new skb flag,
skb->head_frag to carry this information.

build_skb() is changed to accept a frag_size argument. Drivers willing
to provide a page fragment instead of kmalloc() data will set a non zero
value, set to the fragment size.

Then, on situations we need to convert the skb head to a frag in itself,
we can check if skb->head_frag is set and avoid the copies or various
fallbacks we have.

This means drivers currently using frags could be updated to avoid the
current skb->head allocation and reduce their memory footprint (aka skb
truesize). (thats 512 or 1024 bytes saved per skb). This also makes
bpf/netfilter faster since the 'first frag' will be part of skb linear
part, no need to copy data.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Matt Carlson <mcarlson@broadcom.com>
Cc: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-30 21:35:11 -04:00
..
datagram.c net: cleanup unsigned to unsigned int 2012-04-15 12:44:40 -04:00
dev.c net: gro: GRO_MERGED_FREE consumes packets 2012-04-19 14:23:56 -04:00
dev_addr_lists.c net: addr_list: add exclusive dev_uc_add and dev_mc_add 2012-04-15 13:06:04 -04:00
drop_monitor.c drop_monitor: allow more events per second 2012-04-21 16:28:38 -04:00
dst.c net: Rename dst_get_neighbour{, _raw} to dst_get_neighbour_noref{, _raw}. 2011-12-05 15:20:19 -05:00
ethtool.c ethtool: Add a common function for drivers with transmit time stamping. 2012-04-04 05:28:46 -04:00
fib_rules.c fib_rules: Stop using NLA_PUT*(). 2012-04-02 04:33:44 -04:00
filter.c net: cleanup unsigned to unsigned int 2012-04-15 12:44:40 -04:00
flow.c net: Add a flow_cache_flush_deferred function 2011-12-21 16:48:08 -05:00
flow_dissector.c net: flow_dissector.c missing include linux/export.h 2012-01-24 16:03:33 -05:00
gen_estimator.c Remove all #inclusions of asm/system.h 2012-03-28 18:30:03 +01:00
gen_stats.c gen_stats: Stop using NLA_PUT*(). 2012-04-02 04:33:44 -04:00
iovec.c net: get rid of some pointless casts to sockaddr 2012-03-11 19:11:22 -07:00
link_watch.c net: linkwatch: allow vlans to get carrier changes faster 2011-09-15 15:36:34 -04:00
Makefile sock_diag: Move the sock_ code to net/core/ 2011-12-06 13:58:02 -05:00
neighbour.c net neighbour: Convert to use register_net_sysctl 2012-04-20 21:22:29 -04:00
net-sysfs.c net: cleanup unsigned to unsigned int 2012-04-15 12:44:40 -04:00
net-sysfs.h xps: Add CONFIG_XPS 2010-11-28 18:24:14 -08:00
net-traces.c net: Add export.h for EXPORT_SYMBOL/THIS_MODULE to non-modules 2011-10-31 19:30:30 -04:00
net_namespace.c netns: do not leak net_generic data on failed init 2012-04-18 00:05:33 -04:00
netevent.c net: Add export.h for EXPORT_SYMBOL/THIS_MODULE to non-modules 2011-10-31 19:30:30 -04:00
netpoll.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2012-02-19 16:03:15 -05:00
netprio_cgroup.c Merge branch 'for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup 2012-03-20 18:11:21 -07:00
pktgen.c net: cleanup unsigned to unsigned int 2012-04-15 12:44:40 -04:00
request_sock.c ipv4:correct description for tcp_max_syn_backlog 2011-12-06 13:02:28 -05:00
rtnetlink.c net: rtnetlink notify events for FDB NTF_SELF adds and deletes 2012-04-15 13:06:04 -04:00
scm.c Remove all #inclusions of asm/system.h 2012-03-28 18:30:03 +01:00
secure_seq.c net: fix some sparse errors 2012-01-17 10:31:12 -05:00
skbuff.c net: allow skb->head to be a page fragment 2012-04-30 21:35:11 -04:00
sock.c net: Fixed a coding style issue related to spaces. 2012-04-28 21:45:00 -04:00
sock_diag.c net: sock_diag_handler structs can be const 2012-04-25 20:46:59 -04:00
stream.c net: Fix the condition passed to sk_wait_event() 2010-10-03 20:41:32 -07:00
sysctl_net_core.c net: Delete all remaining instances of ctl_path 2012-04-20 21:22:30 -04:00
timestamping.c net: Add export.h for EXPORT_SYMBOL/THIS_MODULE to non-modules 2011-10-31 19:30:30 -04:00
user_dma.c net: Add export.h for EXPORT_SYMBOL/THIS_MODULE to non-modules 2011-10-31 19:30:30 -04:00
utils.c net: Fixed coding style issues relating to braces. 2012-04-12 16:35:48 -04:00