Commit Graph

276 Commits (02d13ed5f94af38c37d1abd53462fe48d78bcc9d)

Author SHA1 Message Date
Eric Dumazet 31e6d363ab net: correct off-by-one write allocations reports
commit 2b85a34e91
(net: No more expensive sock_hold()/sock_put() on each tx)
changed initial sk_wmem_alloc value.

We need to take into account this offset when reporting
sk_wmem_alloc to user, in PROC_FS files or various
ioctls (SIOCOUTQ/TIOCOUTQ)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-06-18 00:29:12 -07:00
Michał Mirosław a7b11d7382 genetlink: Introduce genl_register_family_with_ops()
This introduces genl_register_family_with_ops() that registers a genetlink
family along with operations from a table. This is used to kill copy'n'paste
occurrences in following patches.

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-05-21 16:50:22 -07:00
David S. Miller 08abe18af1 Merge branch 'master' of /home/davem/src/GIT/linux-2.6/
Conflicts:
	drivers/net/wimax/i2400m/usb-notif.c
2009-03-26 15:23:24 -07:00
Linus Torvalds 562f477a54 Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (29 commits)
  crypto: sha512-s390 - Add missing block size
  hwrng: timeriomem - Breaks an allyesconfig build on s390:
  nlattr: Fix build error with NET off
  crypto: testmgr - add zlib test
  crypto: zlib - New zlib crypto module, using pcomp
  crypto: testmgr - Add support for the pcomp interface
  crypto: compress - Add pcomp interface
  netlink: Move netlink attribute parsing support to lib
  crypto: Fix dead links
  hwrng: timeriomem - New driver
  crypto: chainiv - Use kcrypto_wq instead of keventd_wq
  crypto: cryptd - Per-CPU thread implementation based on kcrypto_wq
  crypto: api - Use dedicated workqueue for crypto subsystem
  crypto: testmgr - Test skciphers with no IVs
  crypto: aead - Avoid infinite loop when nivaead fails selftest
  crypto: skcipher - Avoid infinite loop when cipher fails selftest
  crypto: api - Fix crypto_alloc_tfm/create_create_tfm return convention
  crypto: api - crypto_alg_mod_lookup either tested or untested
  crypto: amcc - Add crypt4xx driver
  crypto: ansi_cprng - Add maintainer
  ...
2009-03-26 11:04:34 -07:00
Pablo Neira Ayuso 38938bfe34 netlink: add NETLINK_NO_ENOBUFS socket flag
This patch adds the NETLINK_NO_ENOBUFS socket flag. This flag can
be used by unicast and broadcast listeners to avoid receiving
ENOBUFS errors.

Generally speaking, ENOBUFS errors are useful to notify two things
to the listener:

a) You may increase the receiver buffer size via setsockopt().
b) You have lost messages, you may be out of sync.

In some cases, ignoring ENOBUFS errors can be useful. For example:

a) nfnetlink_queue: this subsystem does not have any sort of resync
method and you can decide to ignore ENOBUFS once you have set a
given buffer size.

b) ctnetlink: you can use this together with the socket flag
NETLINK_BROADCAST_SEND_ERROR to stop getting ENOBUFS errors as
you do not need to resync (packets whose event are not delivered
are drop to provide reliable logging and state-synchronization).

Moreover, the use of NETLINK_NO_ENOBUFS also reduces a "go up, go down"
effect in terms of performance which is due to the netlink congestion
control when the listener cannot back off. The effect is the following:

1) throughput rate goes up and netlink messages are inserted in the
receiver buffer.
2) Then, netlink buffer fills and overruns (set on nlk->state bit 0).
3) While the listener empties the receiver buffer, netlink keeps
dropping messages. Thus, throughput goes dramatically down.
4) Then, once the listener has emptied the buffer (nlk->state
bit 0 is set off), goto step 1.

This effect is easy to trigger with netlink broadcast under heavy
load, and it is more noticeable when using a big receiver buffer.
You can find some results in [1] that show this problem.

[1] http://1984.lsi.us.es/linux/netlink/

This patch also includes the use of sk_drop to account the number of
netlink messages drop due to overrun. This value is shown in
/proc/net/netlink.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-03-24 16:37:55 -07:00
David S. Miller b5bb14386e Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-next-2.6 2009-03-24 13:24:36 -07:00
Pablo Neira Ayuso dd5b6ce6fd nefilter: nfnetlink: add nfnetlink_set_err and use it in ctnetlink
This patch adds nfnetlink_set_err() to propagate the error to netlink
broadcast listener in case of memory allocation errors in the
message building.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2009-03-23 13:21:06 +01:00
David S. Miller 508827ff0a Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/tokenring/tmspci.c
	drivers/net/ucc_geth_mii.c
2009-03-05 02:06:47 -08:00
Pablo Neira Ayuso 4843b93c96 netlink: invert error code in netlink_set_err()
The callers of netlink_set_err() currently pass a negative value
as parameter for the error code. However, sk->sk_err wants a
positive error value. Without this patch, skb_recv_datagram() called
by netlink_recvmsg() may return a positive value to report an error.

Another choice to fix this is to change callers to pass a positive
error value, but this seems a bit inconsistent and error prone
to me. Indeed, the callers of netlink_set_err() assumed that the
(usual) negative value for error codes was fine before this patch :).

This patch also includes some documentation in docbook format
for netlink_set_err() to avoid this sort of confusion.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-03-03 23:37:30 -08:00
Geert Uytterhoeven e9cc8bddae netlink: Move netlink attribute parsing support to lib
Netlink attribute parsing may be used even if CONFIG_NET is not set.
Move it from net/netlink to lib and control its inclusion based on the new
config symbol CONFIG_NLATTR, which is selected by CONFIG_NET.

Signed-off-by: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2009-03-04 14:53:30 +08:00
Wei Yongjun 91744f6559 netlink: remove some pointless conditionals before kfree_skb()
Remove some pointless conditionals before kfree_skb().

Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-02-26 23:07:34 -08:00
Pablo Neira Ayuso 1ce85fe402 netlink: change nlmsg_notify() return value logic
This patch changes the return value of nlmsg_notify() as follows:

If NETLINK_BROADCAST_ERROR is set by any of the listeners and
an error in the delivery happened, return the broadcast error;
else if there are no listeners apart from the socket that
requested a change with the echo flag, return the result of the
unicast notification. Thus, with this patch, the unicast
notification is handled in the same way of a broadcast listener
that has set the NETLINK_BROADCAST_ERROR socket flag.

This patch is useful in case that the caller of nlmsg_notify()
wants to know the result of the delivery of a netlink notification
(including the broadcast delivery) and take any action in case
that the delivery failed. For example, ctnetlink can drop packets
if the event delivery failed to provide reliable logging and
state-synchronization at the cost of dropping packets.

This patch also modifies the rtnetlink code to ignore the return
value of rtnl_notify() in all callers. The function rtnl_notify()
(before this patch) returned the error of the unicast notification
which makes rtnl_set_sk_err() reports errors to all listeners. This
is not of any help since the origin of the change (the socket that
requested the echoing) notices the ENOBUFS error if the notification
fails and should resync itself.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-02-24 23:18:28 -08:00
Pablo Neira Ayuso be0c22a46c netlink: add NETLINK_BROADCAST_ERROR socket option
This patch adds NETLINK_BROADCAST_ERROR which is a netlink
socket option that the listener can set to make netlink_broadcast()
return errors in the delivery to the caller. This option is useful
if the caller of netlink_broadcast() do something with the result
of the message delivery, like in ctnetlink where it drops a network
packet if the event delivery failed, this is used to enable reliable
logging and state-synchronization. If this socket option is not set,
netlink_broadcast() only reports ESRCH errors and silently ignore
ENOBUFS errors, which is what most netlink_broadcast() callers
should do.

This socket option is based on a suggestion from Patrick McHardy.
Patrick McHardy can exchange this patch for a beer from me ;).

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-02-20 01:01:08 -08:00
Pablo Neira Ayuso ff491a7334 netlink: change return-value logic of netlink_broadcast()
Currently, netlink_broadcast() reports errors to the caller if no
messages at all were delivered:

1) If, at least, one message has been delivered correctly, returns 0.
2) Otherwise, if no messages at all were delivered due to skb_clone()
   failure, return -ENOBUFS.
3) Otherwise, if there are no listeners, return -ESRCH.

With this patch, the caller knows if the delivery of any of the
messages to the listeners have failed:

1) If it fails to deliver any message (for whatever reason), return
   -ENOBUFS.
2) Otherwise, if all messages were delivered OK, returns 0.
3) Otherwise, if no listeners, return -ESRCH.

In the current ctnetlink code and in Netfilter in general, we can add
reliable logging and connection tracking event delivery by dropping the
packets whose events were not successfully delivered over Netlink. Of
course, this option would be settable via /proc as this approach reduces
performance (in terms of filtered connections per seconds by a stateful
firewall) but providing reliable logging and event delivery (for
conntrackd) in return.

This patch also changes some clients of netlink_broadcast() that
may report ENOBUFS errors via printk. This error handling is not
of any help. Instead, the userspace daemons that are listening to
those netlink messages should resync themselves with the kernel-side
if they hit ENOBUFS.

BTW, netlink_broadcast() clients include those that call
cn_netlink_send(), nlmsg_multicast() and genlmsg_multicast() since they
internally call netlink_broadcast() and return its error value.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-02-05 23:56:36 -08:00
Inaky Perez-Gonzalez 3efb40c2c6 genetlink: export genl_unregister_mc_group()
Add an EXPORT_SYMBOL() to genl_unregister_mc_group(), to allow
unregistering groups on the run. EXPORT_SYMBOL_GPL() is not used as
the rest of the functions exported by this module (eg:
genl_register_mc_group) are also not _GPL().

Cleanup is currently done when unregistering a family, but there is
no way to unregister a single multicast group due to that function not
being exported. Seems to be a mistake as it is documented as for
external consumption.

This is needed by the WiMAX stack to be able to cleanup unused mc
groups.

Signed-off-by: Inaky Perez-Gonzalez <inaky@linux.intel.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-07 10:00:17 -08:00
Patrick McHardy ea5693ccc5 netlink: allow empty nested attributes
validate_nla() currently doesn't allow empty nested attributes. This
makes userspace code unnecessarily complicated when starting and ending
the nested attribute is done by generic upper level code and the inner
attributes are dumped by a module.

Add a special case to accept empty nested attributes. When the nested
attribute is non empty, the same checks as before are performed.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-28 03:05:19 -08:00
Eric Dumazet 3755810ceb net: Make sure BHs are disabled in sock_prot_inuse_add()
There is still a call to sock_prot_inuse_add() in af_netlink
while in a preemptable section. Add explicit BH disable around
this call.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-24 14:05:22 -08:00
David S. Miller 6f756a8c36 net: Make sure BHs are disabled in sock_prot_inuse_add()
The rule of calling sock_prot_inuse_add() is that BHs must
be disabled.  Some new calls were added where this was not
true and this tiggers warnings as reported by Ilpo.

Fix this by adding explicit BH disabling around those call sites.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-23 17:34:03 -08:00
Eric Dumazet c1fd3b9455 net: af_netlink should update its inuse counter
In order to have relevant information for NETLINK protocol, in
/proc/net/protocols, we should use sock_prot_inuse_add() to
update a (percpu and pernamespace) counter of inuse sockets.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-23 15:48:22 -08:00
Patrick McHardy b057efd4d2 netlink: constify struct nlattr * arg to parsing functions
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-28 11:59:11 -07:00
Johannes Berg 95a5afca4a net: Remove CONFIG_KMOD from net/ (towards removing CONFIG_KMOD entirely)
Some code here depends on CONFIG_KMOD to not try to load
protocol modules or similar, replace by CONFIG_MODULES
where more than just request_module depends on CONFIG_KMOD
and and also use try_then_request_module in ebtables.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-16 15:24:51 -07:00
Alan Cox 113aa838ec net: Rationalise email address: Network Specific Parts
Clean up the various different email addresses of mine listed in the code
to a single current and valid address. As Dave says his network merges
for 2.6.28 are now done this seems a good point to send them in where
they won't risk disrupting real changes.

Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-13 19:01:08 -07:00
Ilpo Järvinen 547b792cac net: convert BUG_TRAP to generic WARN_ON
Removes legacy reinvent-the-wheel type thing. The generic
machinery integrates much better to automated debugging aids
such as kerneloops.org (and others), and is unambiguous due to
better naming. Non-intuively BUG_TRAP() is actually equal to
WARN_ON() rather than BUG_ON() though some might actually be
promoted to BUG_ON() but I left that to future.

I could make at least one BUILD_BUG_ON conversion.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-25 21:43:18 -07:00
David S. Miller ea2aca084b Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:

	Documentation/feature-removal-schedule.txt
	drivers/net/wan/hdlc_fr.c
	drivers/net/wireless/iwlwifi/iwl-4965.c
	drivers/net/wireless/iwlwifi/iwl3945-base.c
2008-07-05 23:08:07 -07:00
Wang Chen 8487460720 netlink: Unneeded local variable
We already have a variable, which has the same capability.

Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-07-01 19:55:09 -07:00
David S. Miller 1b63ba8a86 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:

	drivers/net/wireless/iwlwifi/iwl4965-base.c
2008-06-28 01:19:40 -07:00
Julius Volz 10b595aff1 netlink: Fix some doc comments in net/netlink/attr.c
Fix some doc comments to match function and attribute names in
net/netlink/attr.c.

Signed-off-by: Julius Volz <juliusv@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-27 20:02:14 -07:00
David S. Miller 0344f1c66b Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:

	net/mac80211/tx.c
2008-06-19 16:00:04 -07:00
Patrick McHardy 6d1a3fb567 netlink: genl: fix circular locking
genetlink has a circular locking dependency when dumping the registered
families:

- dump start:
genl_rcv()            : take genl_mutex
genl_rcv_msg()        : call netlink_dump_start() while holding genl_mutex
netlink_dump_start(),
netlink_dump()        : take nlk->cb_mutex
ctrl_dumpfamily()     : try to detect this case and not take genl_mutex a
                        second time

- dump continuance:
netlink_rcv()         : call netlink_dump
netlink_dump          : take nlk->cb_mutex
ctrl_dumpfamily()     : take genl_mutex

Register genl_lock as callback mutex with netlink to fix this. This slightly
widens an already existing module unload race, the genl ops used during the
dump might go away when the module is unloaded. Thomas Graf is working on a
seperate fix for this.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-18 02:07:07 -07:00
David S. Miller 65b53e4cc9 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:

	drivers/net/tg3.c
	drivers/net/wireless/rt2x00/rt2x00dev.c
	net/mac80211/ieee80211_i.h
2008-06-10 02:22:26 -07:00
Denis V. Lunev 9457afee85 netlink: Remove nonblock parameter from netlink_attachskb
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-05 11:23:39 -07:00
Thomas Graf bc3ed28caa netlink: Improve returned error codes
Make nlmsg_trim(), nlmsg_cancel(), genlmsg_cancel(), and
nla_nest_cancel() void functions.

Return -EMSGSIZE instead of -1 if the provided message buffer is not
big enough.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-03 16:36:54 -07:00
Eric Paris 2532386f48 Audit: collect sessionid in netlink messages
Previously I added sessionid output to all audit messages where it was
available but we still didn't know the sessionid of the sender of
netlink messages.  This patch adds that information to netlink messages
so we can audit who sent netlink messages.

Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-28 06:18:03 -04:00
Linus Torvalds 3925e6fc1f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6:
  security: fix up documentation for security_module_enable
  Security: Introduce security= boot parameter
  Audit: Final renamings and cleanup
  SELinux: use new audit hooks, remove redundant exports
  Audit: internally use the new LSM audit hooks
  LSM/Audit: Introduce generic Audit LSM hooks
  SELinux: remove redundant exports
  Netlink: Use generic LSM hook
  Audit: use new LSM hooks instead of SELinux exports
  SELinux: setup new inode/ipc getsecid hooks
  LSM: Introduce inode_getsecid and ipc_getsecid hooks
2008-04-18 18:18:30 -07:00
Ahmed S. Darwish 0ce784ca72 Netlink: Use generic LSM hook
Don't use SELinux exported selinux_get_task_sid symbol.
Use the generic LSM equivalent instead.

Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Paul Moore <paul.moore@hp.com>
2008-04-19 09:52:35 +10:00
YOSHIFUJI Hideaki 878628fbf2 [NET] NETNS: Omit namespace comparision without CONFIG_NET_NS.
Introduce an inline net_eq() to compare two namespaces.
Without CONFIG_NET_NS, since no namespace other than &init_net
exists, it is always 1.

We do not need to convert 1) inline vs inline and
2) inline vs &init_net comparisons.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2008-03-26 04:40:00 +09:00
YOSHIFUJI Hideaki 1218854afa [NET] NETNS: Omit seq_net_private->net without CONFIG_NET_NS.
Without CONFIG_NET_NS, no namespace other than &init_net exists,
no need to store net in seq_net_private.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2008-03-26 04:39:56 +09:00
YOSHIFUJI Hideaki 3b1e0a655f [NET] NETNS: Omit sock->sk_net without CONFIG_NET_NS.
Introduce per-sock inlines: sock_net(), sock_net_set()
and per-inet_timewait_sock inlines: twsk_net(), twsk_net_set().
Without CONFIG_NET_NS, no namespace other than &init_net exists.
Let's explicitly define them to help compiler optimizations.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2008-03-26 04:39:55 +09:00
Stephen Hemminger b1153f29ee netlink: make socket filters work on netlink
Make socket filters work for netlink unicast and notifications.
This is useful for applications like Zebra that get overrun with
messages that are then ignored.

Note: netlink messages are in host byte order, but packet filter
state machine operations are done as network byte order.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-21 15:46:12 -07:00
Denis V. Lunev edf0208702 [NET]: Make netlink_kernel_release publically available as sk_release_kernel.
This staff will be needed for non-netlink kernel sockets, which should
also not pin a namespace like tcp_socket and icmp_socket.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-29 11:18:32 -08:00
Denis V. Lunev 9dfbec1fb2 [NETLINK]: No need for a separate __netlink_release call.
Merge it to netlink_kernel_release.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-29 11:17:56 -08:00
Pavel Emelyanov 910d6c320c [GENETLINK]: Relax dances with genl_lock.
The genl_unregister_family() calls the genl_unregister_mc_groups(), 
which takes and releases the genl_lock and then locks and releases
this lock itself.

Relax this behavior, all the more so the genl_unregister_mc_groups() 
is called from genl_unregister_family() only.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-12 22:16:33 -08:00
Al Viro 0c11b9428f [PATCH] switch audit_get_loginuid() to task_struct *
all callers pass something->audit_context

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-02-01 14:04:59 -05:00
Pavel Emelyanov 23fe18669e [NETNS]: Fix race between put_net() and netlink_kernel_create().
The comment about "race free view of the set of network
namespaces" was a bit hasty. Look (there even can be only
one CPU, as discovered by Alexey Dobriyan and Denis Lunev):

put_net()
  if (atomic_dec_and_test(&net->refcnt))
    /* true */
      __put_net(net);
        queue_work(...);

/*
 * note: the net now has refcnt 0, but still in
 * the global list of net namespaces
 */

== re-schedule ==

register_pernet_subsys(&some_ops);
  register_pernet_operations(&some_ops);
    (*some_ops)->init(net);
      /*
       * we call netlink_kernel_create() here
       * in some places
       */
      netlink_kernel_create();
         sk_alloc();
            get_net(net); /* refcnt = 1 */
         /*
          * now we drop the net refcount not to
          * block the net namespace exit in the
          * future (or this can be done on the
          * error path)
          */
         put_net(sk->sk_net);
             if (atomic_dec_and_test(&...))
                   /*
                    * true. BOOOM! The net is
                    * scheduled for release twice
                    */

When thinking on this problem, I decided, that getting and
putting the net in init callback is wrong. If some init
callback needs to have a refcount-less reference on the struct
net, _it_ has to be careful himself, rather than relying on
the infrastructure to handle this correctly.

In case of netlink_kernel_create(), the problem is that the
sk_alloc() gets the given namespace, but passing the info
that we don't want to get it inside this call is too heavy.

Instead, I propose to crate the socket inside an init_net
namespace and then re-attach it to the desired one right
after the socket is created.

After doing this, we also have to be careful on error paths
not to drop the reference on the namespace, we didn't get
the one on.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Denis Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-31 19:27:22 -08:00
Patrick McHardy 01480e1cf5 [NETLINK]: Add nla_append()
Used to append data to a message without a header or padding.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 15:11:09 -08:00
Denis V. Lunev 775516bfa2 [NETNS]: Namespace stop vs 'ip r l' race.
During network namespace stop process kernel side netlink sockets
belonging to a namespace should be closed. They should not prevent
namespace to stop, so they do not increment namespace usage
counter. Though this counter will be put during last sock_put.

The raplacement of the correct netns for init_ns solves the problem
only partial as socket to be stoped until proper stop is a valid
netlink kernel socket and can be looked up by the user processes. This
is not a problem until it resides in initial namespace (no processes
inside this net), but this is not true for init_net.

So, hold the referrence for a socket, remove it from lookup tables and
only after that change namespace and perform a last put.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Tested-by: Alexey Dobriyan <adobriyan@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 15:08:08 -08:00
Denis V. Lunev b7c6ba6eb1 [NETNS]: Consolidate kernel netlink socket destruction.
Create a specific helper for netlink kernel socket disposal. This just
let the code look better and provides a ground for proper disposal
inside a namespace.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Tested-by: Alexey Dobriyan <adobriyan@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 15:08:07 -08:00
Denis V. Lunev 869e58f870 [NETNS]: Double free in netlink_release.
Netlink protocol table is global for all namespaces. Some netlink
protocols have been virtualized, i.e. they have per/namespace netlink
socket. This difference can easily lead to double free if more than 1
namespace is started. Count the number of kernel netlink sockets to
track that this table is not used any more.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Tested-by: Alexey Dobriyan <adobriyan@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 15:08:05 -08:00
Ilpo Järvinen 3f25252675 [NETLINK] af_netlink: kill some bloat
net/netlink/af_netlink.c:
  netlink_realloc_groups        |  -46
  netlink_insert                |  -49
  netlink_autobind              |  -94
  netlink_clear_multicast_users |  -48
  netlink_bind                  |  -55
  netlink_setsockopt            |  -54
  netlink_release               |  -86
  netlink_kernel_create         |  -47
  netlink_change_ngroups        |  -56
 9 functions changed, 535 bytes removed, diff: -535

net/netlink/af_netlink.c:
  netlink_table_ungrab |  +53
 1 function changed, 53 bytes added, diff: +53

net/netlink/af_netlink.o:
 10 functions changed, 53 bytes added, 535 bytes removed, diff: -482

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 15:01:50 -08:00
Eric Dumazet 9a429c4983 [NET]: Add some acquires/releases sparse annotations.
Add __acquires() and __releases() annotations to suppress some sparse
warnings.

example of warnings :

net/ipv4/udp.c:1555:14: warning: context imbalance in 'udp_seq_start' - wrong
count at exit
net/ipv4/udp.c:1571:13: warning: context imbalance in 'udp_seq_stop' -
unexpected unlock

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 15:00:31 -08:00
Eric Dumazet ea72912c88 [NETLINK]: kzalloc() conversion
nl_pid_hash_alloc() is renamed to nl_pid_hash_zalloc().
It is now returning zeroed memory to its callers.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:57:06 -08:00
Patrick McHardy 6ac552fdc6 [NETLINK]: af_netlink.c checkpatch cleanups
Fix large number of checkpatch errors.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:55:50 -08:00
Denis V. Lunev e372c41401 [NET]: Consolidate net namespace related proc files creation.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:54:28 -08:00
Denis V. Lunev 022cbae611 [NET]: Move unneeded data to initdata section.
This patch reverts Eric's commit 2b008b0a8e

It diets .text & .data section of the kernel if CONFIG_NET_NS is not set.
This is safe after list operations cleanup.

Signed-of-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-13 03:23:50 -08:00
Patrick McHardy c3d8d1e30c [NETLINK]: Fix unicast timeouts
Commit ed6dcf4a in the history.git tree broke netlink_unicast timeouts
by moving the schedule_timeout() call to a new function that doesn't
propagate the remaining timeout back to the caller. This means on each
retry we start with the full timeout again.

ipc/mqueue.c seems to actually want to wait indefinitely so this
behaviour is retained.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-07 04:15:12 -08:00
Pavel Emelyanov 6257ff2177 [NET]: Forget the zero_it argument of sk_alloc()
Finally, the zero_it argument can be completely removed from
the callers and from the function prototype.

Besides, fix the checkpatch.pl warnings about using the
assignments inside if-s.

This patch is rather big, and it is a part of the previous one.
I splitted it wishing to make the patches more readable. Hope 
this particular split helped.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-01 00:39:31 -07:00
Eric W. Biederman 2b008b0a8e [NET]: Marking struct pernet_operations __net_initdata was inappropriate
It is not safe to to place struct pernet_operations in a special section.
We need struct pernet_operations to last until we call unregister_pernet_subsys.
Which doesn't happen until module unload.

So marking struct pernet_operations is a disaster for modules in two ways.
- We discard it before we call the exit method it points to.
- Because I keep struct pernet_operations on a linked list discarding
  it for compiled in code removes elements in the middle of a linked
  list and does horrible things for linked insert.

So this looks safe assuming __exit_refok is not discarded
for modules.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-26 22:54:53 -07:00
Denis V. Lunev 5c58298c25 [NETLINK]: Fix ACK processing after netlink_dump_start
Revert to original netlink behavior. Do not reply with ACK if the
netlink dump has bees successfully started.

libnl has been broken by the cd40b7d398
The following command reproduce the problem:
   /nl-route-get 192.168.1.1

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-23 21:27:51 -07:00
Jesper Juhl f937f1f46b [NETLINK]: Don't leak 'listeners' in netlink_kernel_create()
The Coverity checker spotted that we'll leak the storage allocated
to 'listeners' in netlink_kernel_create() when the
  if (!nl_table[unit].registered)
check is false.

This patch avoids the leak.

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-15 12:26:32 -07:00
Denis V. Lunev cd40b7d398 [NET]: make netlink user -> kernel interface synchronious
This patch make processing netlink user -> kernel messages synchronious.
This change was inspired by the talk with Alexey Kuznetsov about current
netlink messages processing. He says that he was badly wrong when introduced 
asynchronious user -> kernel communication.

The call netlink_unicast is the only path to send message to the kernel
netlink socket. But, unfortunately, it is also used to send data to the
user.

Before this change the user message has been attached to the socket queue
and sk->sk_data_ready was called. The process has been blocked until all
pending messages were processed. The bad thing is that this processing
may occur in the arbitrary process context.

This patch changes nlk->data_ready callback to get 1 skb and force packet
processing right in the netlink_unicast.

Kernel -> user path in netlink_unicast remains untouched.

EINTR processing for in netlink_run_queue was changed. It forces rtnl_lock
drop, but the process remains in the cycle until the message will be fully
processed. So, there is no need to use this kludges now.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 21:15:29 -07:00
Denis V. Lunev aed815601f [NET]: unify netlink kernel socket recognition
There are currently two ways to determine whether the netlink socket is a
kernel one or a user one. This patch creates a single inline call for
this purpose and unifies all the calls in the af_netlink.c

No similar calls are found outside af_netlink.c.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 21:14:32 -07:00
Denis V. Lunev 7ee015e0fa [NET]: cleanup 3rd argument in netlink_sendskb
netlink_sendskb does not use third argument. Clean it and save a couple of
bytes.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 21:14:03 -07:00
Denis V. Lunev 3b71535f35 [NET]: Make netlink processing routines semi-synchronious (inspired by rtnl) v2
The code in netfilter/nfnetlink.c and in ./net/netlink/genetlink.c looks
like outdated copy/paste from rtnetlink.c. Push them into sync with the
original.

Changes from v1:
- deleted comment in nfnetlink_rcv_msg by request of Patrick McHardy

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 21:13:32 -07:00
Pavel Emelyanov cf7732e4cc [NET]: Make core networking code use seq_open_private
This concerns the ipv4 and ipv6 code mostly, but also the netlink
and unix sockets.

The netlink code is an example of how to use the __seq_open_private()
call - it saves the net namespace on this private.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:55:33 -07:00
Pavel Emelyanov 4665079cbb [NETNS]: Move some code into __init section when CONFIG_NET_NS=n
With the net namespaces many code leaved the __init section,
thus making the kernel occupy more memory than it did before.
Since we have a config option that prohibits the namespace
creation, the functions that initialize/finalize some netns
stuff are simply not needed and can be freed after the boot.

Currently, this is almost not noticeable, since few calls
are no longer in __init, but when the namespaces will be
merged it will be possible to free more code. I propose to
use the __net_init, __net_exit and __net_initdata "attributes"
for functions/variables that are not used if the CONFIG_NET_NS
is not set to save more space in memory.

The exiting functions cannot just reside in the __exit section,
as noticed by David, since the init section will have
references on it and the compilation will fail due to modpost
checks. These references can exist, since the init namespace
never dies and the exit callbacks are never called. So I
introduce the __exit_refok attribute just like it is already
done with the __init_refok.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:54:58 -07:00
Denis Cheng 26ff5ddc5a [NETLINK]: the temp variable name max is ambiguous
with the macro max provided by <linux/kernel.h>, so changed its name
to a more proper one: limit

Signed-off-by: Denis Cheng <crquan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:51:25 -07:00
Denis Cheng 99406c885a [NETLINK]: use the macro min(x,y) provided by <linux/kernel.h> instead
Signed-off-by: Denis Cheng <crquan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:51:25 -07:00
Herbert Xu 0cfad07555 [NETLINK]: Avoid pointer in netlink_run_queue
I was looking at Patrick's fix to inet_diag and it occured
to me that we're using a pointer argument to return values
unnecessarily in netlink_run_queue.  Changing it to return
the value will allow the compiler to generate better code
since the value won't have to be memory-backed.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:51:24 -07:00
Eric W. Biederman 077130c0cf [NET]: Fix race when opening a proc file while a network namespace is exiting.
The problem:  proc_net files remember which network namespace the are
against but do not remember hold a reference count (as that would pin
the network namespace).   So we currently have a small window where
the reference count on a network namespace may be incremented when opening
a /proc file when it has already gone to zero.

To fix this introduce maybe_get_net and get_proc_net.

maybe_get_net increments the network namespace reference count only if it is
greater then zero, ensuring we don't increment a reference count after it
has gone to zero.

get_proc_net handles all of the magic to go from a proc inode to the network
namespace instance and call maybe_get_net on it.

PROC_NET the old accessor is removed so that we don't get confused and use
the wrong helper function.

Then I fix up the callers to use get_proc_net and handle the case case
where get_proc_net returns NULL.  In that case I return -ENXIO because
effectively the network namespace has already gone away so the files
we are trying to access don't exist anymore.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:49:22 -07:00
Thomas Graf 8f4c1f9b04 [NETLINK]: Introduce nested and byteorder flag to netlink attribute
This change allows the generic attribute interface to be used within
the netfilter subsystem where this flag was initially introduced.

The byte-order flag is yet unused, it's intended use is to
allow automatic byte order convertions for all atomic types.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:49:16 -07:00
Eric W. Biederman b4b510290b [NET]: Support multiple network namespaces with netlink
Each netlink socket will live in exactly one network namespace,
this includes the controlling kernel sockets.

This patch updates all of the existing netlink protocols
to only support the initial network namespace.  Request
by clients in other namespaces will get -ECONREFUSED.
As they would if the kernel did not have the support for
that netlink protocol compiled in.

As each netlink protocol is updated to be multiple network
namespace safe it can register multiple kernel sockets
to acquire a presence in the rest of the network namespaces.

The implementation in af_netlink is a simple filter implementation
at hash table insertion and hash table look up time.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:49:09 -07:00
Eric W. Biederman 1b8d7ae42d [NET]: Make socket creation namespace safe.
This patch passes in the namespace a new socket should be created in
and has the socket code do the appropriate reference counting.  By
virtue of this all socket create methods are touched.  In addition
the socket create methods are modified so that they will fail if
you attempt to create a socket in a non-default network namespace.

Failing if we attempt to create a socket outside of the default
network namespace ensures that as we incrementally make the network stack
network namespace aware we will not export functionality that someone
has not audited and made certain is network namespace safe.
Allowing us to partially enable network namespaces before all of the
exotic protocols are supported.

Any protocol layers I have missed will fail to compile because I now
pass an extra parameter into the socket creation code.

[ Integrated AF_IUCV build fixes from Andrew Morton... -DaveM ]

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:49:07 -07:00
Eric W. Biederman 457c4cbc5a [NET]: Make /proc/net per network namespace
This patch makes /proc/net per network namespace.  It modifies the global
variables proc_net and proc_net_stat to be per network namespace.
The proc_net file helpers are modified to take a network namespace argument,
and all of their callers are fixed to pass &init_net for that argument.
This ensures that all of the /proc/net files are only visible and
usable in the initial network namespace until the code behind them
has been updated to be handle multiple network namespaces.

Making /proc/net per namespace is necessary as at least some files
in /proc/net depend upon the set of network devices which is per
network namespace, and even more files in /proc/net have contents
that are relevant to a single network namespace.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:49:06 -07:00
Denis Cheng 32b21e034b [NETLINK]: use container_of instead
This could make future redesign of struct netlink_sock easier.

Signed-off-by: Denis Cheng <crquan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:48:35 -07:00
Thomas Graf 79d310d01e [GENETLINK]: Correctly report errors while registering a multicast group
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-07-24 15:34:53 -07:00
Thomas Graf 2c04ddb707 [GENETLINK]: Fix adjustment of number of multicast groups
The current calculation of the maximum number of genetlink
multicast groups seems odd, fix it.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-07-24 15:33:51 -07:00
Thomas Graf 79dc4386ae [GENETLINK]: Fix race in genl_unregister_mc_groups()
family->mcast_groups is protected by genl_lock so it must
be held while accessing the list in genl_unregister_mc_groups().
Requires adding a non-locking variant of genl_unregister_mc_group().

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-07-24 15:32:46 -07:00
Johannes Berg 2dbba6f773 [GENETLINK]: Dynamic multicast groups.
Introduce API to dynamically register and unregister multicast groups.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Patrick McHardy <kaber@trash.net>
Acked-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-07-18 15:47:52 -07:00
Johannes Berg 84659eb529 [NETLIKN]: Allow removing multicast groups.
Allow kicking listeners out of a multicast group when necessary
(for example if that group is going to be removed.)

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Patrick McHardy <kaber@trash.net>
Acked-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-07-18 15:47:05 -07:00
Johannes Berg b4ff4f0419 [NETLINK]: allocate group bitmaps dynamically
Allow changing the number of groups for a netlink family
after it has been created, use RCU to protect the listeners
bitmap keeping netlink_has_listeners() lock-free.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Patrick McHardy <kaber@trash.net>
Acked-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-07-18 15:46:06 -07:00
Johannes Berg eb49653449 [NETLINK]: negative groups in netlink_setsockopt
Reading netlink_setsockopt it's not immediately clear why there isn't a
bug when you pass in negative numbers, the reason being that the >=
comparison is really unsigned although 'val' is signed because
nlk->ngroups is unsigned. Make 'val' unsigned too.

[ Update the get_user() cast to match.  --DaveM ]

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-07-18 02:07:51 -07:00
Philippe De Muyter 56b3d975bb [NET]: Make all initialized struct seq_operations const.
Make all initialized struct seq_operations in net/ const

Signed-off-by: Philippe De Muyter <phdm@macqel.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-07-10 23:07:31 -07:00
Patrick McHardy 1092cb2197 [NETLINK]: attr: add nested compat attribute type
Add a nested compat attribute type that can be used to convert
attributes that contain a structure to nested attributes in a
backwards compatible way.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-07-10 22:15:38 -07:00
Patrick McHardy ef7c79ed64 [NETLINK]: Mark netlink policies const
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-06-07 13:40:10 -07:00
Randy Dunlap e63340ae6b header cleaning: don't include smp_lock.h when not used
Remove includes of <linux/smp_lock.h> where it is not used/needed.
Suggested by Al Viro.

Builds cleanly on x86_64, i386, alpha, ia64, powerpc, sparc,
sparc64, and arm (all 59 defconfigs).

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:07 -07:00
Patrick McHardy 9e71efcd6d [NETLINK]: Remove bogus BUG_ON
Remove bogus BUG_ON(mutex_is_locked(nlk_sk(sk)->cb_mutex)), when the
netlink_kernel_create caller specifies an external mutex it might
validly be locked.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-05-04 12:15:11 -07:00
Patrick McHardy 188ccb5583 [NETLINK]: Fix use after free in netlink_recvmsg
When the user passes in MSG_TRUNC the skb is used after getting freed.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-05-03 03:27:01 -07:00
Herbert Xu 3f660d66df [NETLINK]: Kill CB only when socket is unused
Since we can still receive packets until all references to the
socket are gone, we don't need to kill the CB until that happens.
This also aligns ourselves with the receive queue purging which
happens at that point.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-05-03 03:17:14 -07:00
Adrian Bunk 42bad1da50 [NETLINK]: Possible cleanups.
- make the following needlessly global variables static:
  - core/rtnetlink.c: struct rtnl_msg_handlers[]
  - netfilter/nf_conntrack_proto.c: struct nf_ct_protos[]
- make the following needlessly global functions static:
  - core/rtnetlink.c: rtnl_dump_all()
  - netlink/af_netlink.c: netlink_queue_skip()

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-26 00:57:41 -07:00
Patrick McHardy ffa4d7216e [NETLINK]: don't reinitialize callback mutex
Don't reinitialize the callback mutex the netlink_kernel_create caller
handed in, it is supposed to already be initialized and could already
be held by someone.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:29:06 -07:00
Patrick McHardy af65bdfce9 [NETLINK]: Switch cb_lock spinlock to mutex and allow to override it
Switch cb_lock to mutex and allow netlink kernel users to override it
with a subsystem specific mutex for consistent locking in dump callbacks.
All netlink_dump_start users have been audited not to rely on any
side-effects of the previously used spinlock.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:29:03 -07:00
Johannes Berg d30045a0bc [NETLINK]: introduce NLA_BINARY type
This patch introduces a new NLA_BINARY attribute policy type with the
verification of simply checking the maximum length of the payload.

It also fixes a small typo in the example.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:05 -07:00
Thomas Graf c702e8047f [NETLINK]: Directly return -EINTR from netlink_dump_start()
Now that all users of netlink_dump_start() use netlink_run_queue()
to process the receive queue, it is possible to return -EINTR from
netlink_dump_start() directly, therefore simplying the callers.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:33 -07:00
Thomas Graf 1d00a4eb42 [NETLINK]: Remove error pointer from netlink message handler
The error pointer argument in netlink message handlers is used
to signal the special case where processing has to be interrupted
because a dump was started but no error happened. Instead it is
simpler and more clear to return -EINTR and have netlink_run_queue()
deal with getting the queue right.

nfnetlink passed on this error pointer to its subsystem handlers
but only uses it to signal the start of a netlink dump. Therefore
it can be removed there as well.

This patch also cleans up the error handling in the affected
message handlers to be consistent since it had to be touched anyway.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:30 -07:00
Thomas Graf 45e7ae7f71 [NETLINK]: Ignore control messages directly in netlink_run_queue()
Changes netlink_rcv_skb() to skip netlink controll messages and don't
pass them on to the message handler.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:29 -07:00
Thomas Graf d35b685640 [NETLINK]: Ignore !NLM_F_REQUEST messages directly in netlink_run_queue()
netlink_rcv_skb() is changed to skip messages which don't have the
NLM_F_REQUEST bit to avoid every netlink family having to perform this
check on their own.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:29 -07:00
Thomas Graf 33a0543cd9 [NETLINK]: Remove unused groups variable
Leftover from dynamic multicast groups allocation work.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:27 -07:00
Arnaldo Carvalho de Melo b529ccf279 [NETLINK]: Introduce nlmsg_hdr() helper
For the common "(struct nlmsghdr *)skb->data" sequence, so that we reduce the
number of direct accesses to skb->data and for consistency with all the other
cast skb member helpers.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:34 -07:00
Arnaldo Carvalho de Melo 4305b54135 [SK_BUFF]: Convert skb->end to sk_buff_data_t
Now to convert the last one, skb->data, that will allow many simplifications
and removal of some of the offset helpers.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:29 -07:00
Arnaldo Carvalho de Melo 27a884dc3c [SK_BUFF]: Convert skb->tail to sk_buff_data_t
So that it is also an offset from skb->head, reduces its size from 8 to 4 bytes
on 64bit architectures, allowing us to combine the 4 bytes hole left by the
layer headers conversion, reducing struct sk_buff size to 256 bytes, i.e. 4
64byte cachelines, and since the sk_buff slab cache is SLAB_HWCACHE_ALIGN...
:-)

Many calculations that previously required that skb->{transport,network,
mac}_header be first converted to a pointer now can be done directly, being
meaningful as offsets or pointers.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:28 -07:00
Arnaldo Carvalho de Melo badff6d01a [SK_BUFF]: Introduce skb_reset_transport_header(skb)
For the common, open coded 'skb->h.raw = skb->data' operation, so that we can
later turn skb->h.raw into a offset, reducing the size of struct sk_buff in
64bit land while possibly keeping it as a pointer on 32bit.

This one touches just the most simple cases:

skb->h.raw = skb->data;
skb->h.raw = {skb_push|[__]skb_pull}()

The next ones will handle the slightly more "complex" cases.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:15 -07:00
Adrian Bunk cb69cc5236 [TCP/DCCP/RANDOM]: Remove unused exports.
This patch removes the following not or no longer used exports:
- drivers/char/random.c: secure_tcp_sequence_number
- net/dccp/options.c: sysctl_dccp_feat_sequence_window
- net/netlink/af_netlink.c: netlink_set_err

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:03 -07:00
David S. Miller b558ff7999 [NETLINK]: Mirror UDP MSG_TRUNC semantics.
If the user passes MSG_TRUNC in via msg_flags, return
the full packet size not the truncated size.

Idea from Herbert Xu and Thomas Graf.

Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:35 -07:00
Denis Lunev ac57b3a9ce [NETLINK]: Don't attach callback to a going-away netlink socket
There is a race between netlink_dump_start() and netlink_release()
that can lead to the situation when a netlink socket with non-zero
callback is freed.

Here it is:

CPU1:                           CPU2
netlink_release():              netlink_dump_start():

                                sk = netlink_lookup(); /* OK */

netlink_remove();

spin_lock(&nlk->cb_lock);
if (nlk->cb) { /* false */
  ...
}
spin_unlock(&nlk->cb_lock);

                                spin_lock(&nlk->cb_lock);
                                if (nlk->cb) { /* false */
                                         ...
                                }
                                nlk->cb = cb;
                                spin_unlock(&nlk->cb_lock);
                                ...
sock_orphan(sk);
/*
 * proceed with releasing
 * the socket
 */

The proposal it to make sock_orphan before detaching the callback
in netlink_release() and to check for the sock to be SOCK_DEAD in
netlink_dump_start() before setting a new callback.

Signed-off-by: Denis Lunev <den@openvz.org>
Signed-off-by: Kirill Korotaev <dev@openvz.org>
Signed-off-by: Pavel Emelianov <xemul@openvz.org>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-18 17:05:58 -07:00
Arjan van de Ven da7071d7e3 [PATCH] mark struct file_operations const 8
Many struct file_operations in the kernel can be "const".  Marking them const
moves these to the .rodata section, which avoids false sharing with potential
dirty data.  In addition it'll catch accidental writes at compile time to
these shared resources.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:46 -08:00
YOSHIFUJI Hideaki 746fac4dcd [NET] NETLINK: Fix whitespace errors.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-02-10 23:19:58 -08:00
Mariusz Kozlowski 5e7c001c62 [AF_NETLINK]: module_put cleanup
This patch removes redundant argument check for module_put().

Signed-off-by: Mariusz Kozlowski <m.kozlowski@tuxland.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-01-03 18:38:15 -08:00
Josef Sipek 6db5fc5d53 [PATCH] struct path: convert netlink
Signed-off-by: Josef Sipek <jsipek@fsl.cs.sunysb.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:48 -08:00
Jamal Hadi Salim 48d4ed7a86 [GENETLINK]: Fix misplaced command flags.
The command flags for dump and do were swapped..

Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-12-06 20:06:25 -08:00
Jamal Hadi Salim 334c29a645 [GENETLINK]: Move command capabilities to flags.
This patch moves command capabilities to command flags. Other than
being cleaner, saves several bytes.
We increment the nlctrl version so as to signal to user space that
to not expect the attributes. We will try to be careful
not to do this too often ;->

Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-12-06 18:38:41 -08:00
Jamal Hadi Salim a4d1366d50 [GENETLINK]: Add cmd dump completion.
Remove assumption that generic netlink commands cannot have dump
completion callbacks.

Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-12-02 21:32:09 -08:00
Thomas Graf 4e9b826935 [NETLINK]: Remove unused dst_pid field in netlink_skb_parms
The destination PID is passed directly to netlink_unicast()
respectively netlink_multicast().

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-12-02 21:30:43 -08:00
Thomas Graf e94ef68205 [GENETLINK] ctrl: Avoid empty CTRL_ATTR_OPS attribute when dumping
Based on Jamal's patch but compiled and even tested. :-)

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-12-02 21:30:23 -08:00
Thomas Graf 17c157c889 [GENL]: Add genlmsg_put_reply() to simplify building reply headers
By modyfing genlmsg_put() to take a genl_family and by adding
genlmsg_put_reply() the process of constructing the netlink
and generic netlink headers is simplified.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Paul Moore <paul.moore@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-12-02 21:22:42 -08:00
Thomas Graf 81878d27fd [GENL]: Add genlmsg_reply() to simply unicast replies to requests
A generic netlink user has no interest in knowing how to
address the source of the original request.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Paul Moore <paul.moore@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-12-02 21:22:41 -08:00
Thomas Graf 339bf98ffc [NETLINK]: Do precise netlink message allocations where possible
Account for the netlink message header size directly in nlmsg_new()
instead of relying on the caller calculate it correctly.

Replaces error handling of message construction functions when
constructing notifications with bug traps since a failure implies
a bug in calculating the size of the skb.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Paul Moore <paul.moore@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-12-02 21:22:11 -08:00
Heiko Carstens a27b58fed9 [NET]: fix uaccess handling
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-10-30 15:24:41 -08:00
Thomas Graf eb328111ef [GENL]: Provide more information to userspace about registered genl families
Additionaly exports the following information when providing
the list of registered generic netlink families:
  - protocol version
  - header size
  - maximum number of attributes
  - list of available operations including
      - id
      - flags
      - avaiability of policy and doit/dumpit function

libnl HEAD provides a utility to read this new information:

	0x0010 nlctrl version 1
	    hdrsize 0 maxattr 6
	      op GETFAMILY (0x03) [POLICY,DOIT,DUMPIT]
	0x0011 NLBL_MGMT version 1
	    hdrsize 0 maxattr 0
	      op unknown (0x02) [DOIT]
	      op unknown (0x03) [DOIT]
	      ....

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-09-22 15:18:51 -07:00
Thomas Graf 5176f91ea8 [NETLINK]: Make use of NLA_STRING/NLA_NUL_STRING attribute validation
Converts existing NLA_STRING attributes to use the new
validation features, saving a couple of temporary buffers.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-09-22 15:18:25 -07:00
Thomas Graf a5531a5d85 [NETLINK]: Improve string attribute validation
Introduces a new attribute type NLA_NUL_STRING to support NUL
terminated strings. Attributes of this kind require to carry
a terminating NUL within the maximum specified in the policy.

The `old' NLA_STRING which is not required to be NUL terminated
is extended to provide means to specify a maximum length of the
string.

Aims at easing the pain with using nla_strlcpy() on temporary
buffers.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-09-22 15:18:24 -07:00
YOSHIFUJI Hideaki ef047f5e10 [NET]: Use BUILD_BUG_ON() for checking size of skb->cb.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-09-22 15:18:15 -07:00
Thomas Graf d387f6ad10 [NETLINK]: Add notification message sending interface
Adds nlmsg_notify() implementing proper notification logic. The
message is multicasted to all listeners in the group. The
applications the requests orignates from can request a unicast
back report in which case said socket will be excluded from the
multicast to avoid duplicated notifications.

nlmsg_multicast() is extended to take allocation flags to
allow notification in atomic contexts.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-09-22 14:54:49 -07:00
Thomas Graf bf8b79e444 [NETLINK]: Convert core netlink handling to new netlink api
Fixes a theoretical memory and locking leak when the size of
the netlink header would exceed the skb tailroom.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-09-22 14:53:44 -07:00
Thomas Graf fe4944e59c [NETLINK]: Extend netlink messaging interface
Adds:
 nlmsg_get_pos()                 return current position in message
 nlmsg_trim()                    trim part of message
 nla_reserve_nohdr(skb, len)     reserve room for an attribute w/o hdr
 nla_put_nohdr(skb, len, data)   add attribute w/o hdr
 nla_find_nested()               find attribute in nested attributes

Fixes nlmsg_new() to take allocation flags and consider size.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-09-22 14:53:43 -07:00
Akinobu Mita fab2caf62e [NETLINK]: Call panic if nl_table allocation fails
This patch makes crash happen if initialization of nl_table fails
in initcalls. It is better than getting use after free crash later.

Signed-off-by: Akinobu Mita <mita@miraclelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-08-29 21:22:18 -07:00
Panagiotis Issaris 0da974f4f3 [NET]: Conversions from kmalloc+memset to k(z|c)alloc.
Signed-off-by: Panagiotis Issaris <takis@issaris.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-07-21 14:51:30 -07:00
Arjan van de Ven 6abd219c6e [PATCH] bcm43xx: netlink deadlock fix
reported by Jure Repinc:

> > http://bugzilla.kernel.org/show_bug.cgi?id=6773

> > checked out dmesg output and found the message
> >
> > ======================================================
> > [ BUG: hard-safe -> hard-unsafe lock order detected! ]
> > ------------------------------------------------------
> >
> > starting at line 660 of the dmesg.txt that I will attach.

The patch below should fix the deadlock, albeit I suspect it's not the
"right" fix; the right fix may well be to move the rx processing in bcm43xx
to softirq context.  [it's debatable, ipw2200 hit this exact same bug; at
some point it's better to bite the bullet and move this to the common layer
as my patch below does]

Make the nl_table_lock irq-safe; it's taken for read in various netlink
functions, including functions that several wireless drivers (ipw2200,
bcm43xx) want to call from hardirq context.

The deadlock was found by the lock validator.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Michael Buesch <mb@bu3sch.de>
Cc: "John W. Linville" <linville@tuxdriver.com>
Cc: Jeff Garzik <jeff@garzik.org>
Acked-by: "David S. Miller" <davem@davemloft.net>
Cc: jamal <hadi@cyberus.ca>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:26:58 -07:00
Jörn Engel 6ab3d5624e Remove obsolete #include <linux/config.h>
Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-06-30 19:25:36 +02:00
Darrel Goeddel c7bdb545d2 [NETLINK]: Encapsulate eff_cap usage within security framework.
This patch encapsulates the usage of eff_cap (in netlink_skb_params) within
the security framework by extending security_netlink_recv to include a required
capability parameter and converting all direct usage of eff_caps outside
of the lsm modules to use the interface.  It also updates the SELinux
implementation of the security_netlink_send and security_netlink_recv
hooks to take advantage of the sid in the netlink_skb_params struct.
This also enables SELinux to perform auditing of netlink capability checks.
Please apply, for 2.6.18 if possible.

Signed-off-by: Darrel Goeddel <dgoeddel@trustedcs.com>
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by:  James Morris <jmorris@namei.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-29 16:57:55 -07:00
Linus Torvalds 532f57da40 Merge branch 'audit.b10' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current
* 'audit.b10' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current:
  [PATCH] Audit Filter Performance
  [PATCH] Rework of IPC auditing
  [PATCH] More user space subject labels
  [PATCH] Reworked patch for labels on user space messages
  [PATCH] change lspp ipc auditing
  [PATCH] audit inode patch
  [PATCH] support for context based audit filtering, part 2
  [PATCH] support for context based audit filtering
  [PATCH] no need to wank with task_lock() and pinning task down in audit_syscall_exit()
  [PATCH] drop task argument of audit_syscall_{entry,exit}
  [PATCH] drop gfp_mask in audit_log_exit()
  [PATCH] move call of audit_free() into do_exit()
  [PATCH] sockaddr patch
  [PATCH] deal with deadlocks in audit_free()
2006-05-01 21:43:05 -07:00
Steve Grubb e7c3497013 [PATCH] Reworked patch for labels on user space messages
The below patch should be applied after the inode and ipc sid patches.
This patch is a reworking of Tim's patch that has been updated to match
the inode and ipc patches since its similar.

[updated:
>  Stephen Smalley also wanted to change a variable from isec to tsec in the
>  user sid patch.                                                              ]

Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-05-01 06:09:58 -04:00
Soyoung Park 09493abfdb [NETLINK]: cleanup unused macro in net/netlink/af_netlink.c
1 line removal, of unused macro.
ran 'egrep -r' from linux-2.6.16/ for Nprintk and
didn't see it anywhere else but here, in #define...

Signed-off-by: Soyoung Park <speattle@yahoo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-04-29 18:33:13 -07:00
Alan Stern e041c68341 [PATCH] Notifier chain update: API changes
The kernel's implementation of notifier chains is unsafe.  There is no
protection against entries being added to or removed from a chain while the
chain is in use.  The issues were discussed in this thread:

    http://marc.theaimsgroup.com/?l=linux-kernel&m=113018709002036&w=2

We noticed that notifier chains in the kernel fall into two basic usage
classes:

	"Blocking" chains are always called from a process context
	and the callout routines are allowed to sleep;

	"Atomic" chains can be called from an atomic context and
	the callout routines are not allowed to sleep.

We decided to codify this distinction and make it part of the API.  Therefore
this set of patches introduces three new, parallel APIs: one for blocking
notifiers, one for atomic notifiers, and one for "raw" notifiers (which is
really just the old API under a new name).  New kinds of data structures are
used for the heads of the chains, and new routines are defined for
registration, unregistration, and calling a chain.  The three APIs are
explained in include/linux/notifier.h and their implementation is in
kernel/sys.c.

With atomic and blocking chains, the implementation guarantees that the chain
links will not be corrupted and that chain callers will not get messed up by
entries being added or removed.  For raw chains the implementation provides no
guarantees at all; users of this API must provide their own protections.  (The
idea was that situations may come up where the assumptions of the atomic and
blocking APIs are not appropriate, so it should be possible for users to
handle these things in their own way.)

There are some limitations, which should not be too hard to live with.  For
atomic/blocking chains, registration and unregistration must always be done in
a process context since the chain is protected by a mutex/rwsem.  Also, a
callout routine for a non-raw chain must not try to register or unregister
entries on its own chain.  (This did happen in a couple of places and the code
had to be changed to avoid it.)

Since atomic chains may be called from within an NMI handler, they cannot use
spinlocks for synchronization.  Instead we use RCU.  The overhead falls almost
entirely in the unregister routine, which is okay since unregistration is much
less frequent that calling a chain.

Here is the list of chains that we adjusted and their classifications.  None
of them use the raw API, so for the moment it is only a placeholder.

  ATOMIC CHAINS
  -------------
arch/i386/kernel/traps.c:		i386die_chain
arch/ia64/kernel/traps.c:		ia64die_chain
arch/powerpc/kernel/traps.c:		powerpc_die_chain
arch/sparc64/kernel/traps.c:		sparc64die_chain
arch/x86_64/kernel/traps.c:		die_chain
drivers/char/ipmi/ipmi_si_intf.c:	xaction_notifier_list
kernel/panic.c:				panic_notifier_list
kernel/profile.c:			task_free_notifier
net/bluetooth/hci_core.c:		hci_notifier
net/ipv4/netfilter/ip_conntrack_core.c:	ip_conntrack_chain
net/ipv4/netfilter/ip_conntrack_core.c:	ip_conntrack_expect_chain
net/ipv6/addrconf.c:			inet6addr_chain
net/netfilter/nf_conntrack_core.c:	nf_conntrack_chain
net/netfilter/nf_conntrack_core.c:	nf_conntrack_expect_chain
net/netlink/af_netlink.c:		netlink_chain

  BLOCKING CHAINS
  ---------------
arch/powerpc/platforms/pseries/reconfig.c:	pSeries_reconfig_chain
arch/s390/kernel/process.c:		idle_chain
arch/x86_64/kernel/process.c		idle_notifier
drivers/base/memory.c:			memory_chain
drivers/cpufreq/cpufreq.c		cpufreq_policy_notifier_list
drivers/cpufreq/cpufreq.c		cpufreq_transition_notifier_list
drivers/macintosh/adb.c:		adb_client_list
drivers/macintosh/via-pmu.c		sleep_notifier_list
drivers/macintosh/via-pmu68k.c		sleep_notifier_list
drivers/macintosh/windfarm_core.c	wf_client_list
drivers/usb/core/notify.c		usb_notifier_list
drivers/video/fbmem.c			fb_notifier_list
kernel/cpu.c				cpu_chain
kernel/module.c				module_notify_list
kernel/profile.c			munmap_notifier
kernel/profile.c			task_exit_notifier
kernel/sys.c				reboot_notifier_list
net/core/dev.c				netdev_chain
net/decnet/dn_dev.c:			dnaddr_chain
net/ipv4/devinet.c:			inetaddr_chain

It's possible that some of these classifications are wrong.  If they are,
please let us know or submit a patch to fix them.  Note that any chain that
gets called very frequently should be atomic, because the rwsem read-locking
used for blocking chains is very likely to incur cache misses on SMP systems.
(However, if the chain's callout routines may sleep then the chain cannot be
atomic.)

The patch set was written by Alan Stern and Chandra Seetharaman, incorporating
material written by Keith Owens and suggestions from Paul McKenney and Andrew
Morton.

[jes@sgi.com: restructure the notifier chain initialization macros]
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Jes Sorensen <jes@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:44:50 -08:00
Ingo Molnar 14cc3e2b63 [PATCH] sem2mutex: misc static one-file mutexes
Semaphore to mutex conversion.

The conversion was generated via scripts, and the result was validated
automatically via a script as well.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Jens Axboe <axboe@suse.de>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Cc: Greg KH <greg@kroah.com>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Adam Belay <ambx1@neo.rr.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26 08:56:55 -08:00
Patrick McHardy 4277a083ec [NETLINK]: Add netlink_has_listeners for avoiding unneccessary event message generation
Keep a bitmask of multicast groups with subscribed listeners to let
netlink users check for listeners before generating multicast
messages.

Queries don't perform any locking, which may result in false
positives, it is guaranteed however that any new subscriptions are
visible before bind() or setsockopt() return.

Signed-off-by: Patrick McHardy <kaber@trash.net>
ACKed-by: Jamal Hadi Salim<hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-20 18:52:01 -08:00
Patrick McHardy cc9a06cd8d [NETLINK]: Fix use-after-free in netlink_recvmsg
The skb given to netlink_cmsg_recv_pktinfo is already freed, move it up
a few lines.

Coverity #948

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-03-12 20:39:38 -08:00
Jamal Hadi Salim e200bd8065 [NETLINK] genetlink: Fix bugs spotted by Andrew Morton.
- panic() doesn't return.

- Don't forget to unlock on genl_register_family() error path

- genl_rcv_msg() is called via pointer so there's no point in declaring it
  `inline'.

Notes:

genl_ctrl_event() ignores the genlmsg_multicast() return value.

lots of things ignore the genl_ctrl_event() return value.

Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-02-13 15:51:24 -08:00
Alexey Kuznetsov a70ea994a0 [NETLINK]: Fix a severe bug
netlink overrun was broken while improvement of netlink.
Destination socket is used in the place where it was meant to be source socket,
so that now overrun is never sent to user netlink sockets, when it should be,
and it even can be set on kernel socket, which results in complete deadlock
of rtnetlink.

Suggested fix is to restore status quo passing source socket as additional
argument to netlink_attachskb().

A little explanation: overrun is set on a socket, when it failed
to receive some message and sender of this messages does not or even
have no way to handle this error. This happens in two cases:
1. when kernel sends something. Kernel never retransmits and cannot
   wait for buffer space.
2. when user sends a broadcast and the message was not delivered
   to some recipients.

Signed-off-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-02-09 16:43:38 -08:00
Per Liden 23b0ca5bf5 [PATCH] genetlink: don't touch module ref count
Increasing the module ref count at registration will block the module from
ever being unloaded. In fact, genetlink should not care about the owner at
all. This patch removes the owner field from the struct registered with
genetlink.

Signed-off-by: Per Liden <per.liden@ericsson.com>
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-13 13:06:40 -08:00
Randy Dunlap 4fc268d24c [PATCH] capable/capability.h (net/)
net: Use <linux/capability.h> where capable() is used.

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-11 18:42:14 -08:00
Martin Murray ad8e4b75c8 [AF_NETLINK]: Fix DoS in netlink_rcv_skb()
From: Martin Murray <murrayma@citi.umich.edu>

Sanity check nlmsg_len during netlink_rcv_skb.  An nlmsg_len == 0 can
cause infinite loop in kernel, effectively DoSing machine.  Noted by
Matin Murray.

Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-10 13:02:29 -08:00
Kirill Korotaev 14591de147 [PATCH] netlink oops fix due to incorrect error code
Fixed oops after failed netlink socket creation.

Wrong parathenses in if() statement caused err to be 1,
instead of negative value.

Trivial fix, not trivial to find though.

Signed-Off-By: Dmitry Mishin <dim@sw.ru>
Signed-Off-By: Kirill Korotaev <dev@openvz.org>
Signed-Off-By: Linus Torvalds <torvalds@osdl.org>
2006-01-09 09:36:52 -08:00
Per Liden b461d2f218 [NETLINK] genetlink: fix cmd type in genl_ops to be consistent to u8
Signed-off-by: Per Liden <per.liden@ericsson.com>
ACKed-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-03 14:13:29 -08:00
Eric Dumazet 90ddc4f047 [NET]: move struct proto_ops to const
I noticed that some of 'struct proto_ops' used in the kernel may share
a cache line used by locks or other heavily modified data. (default
linker alignement is 32 bytes, and L1_CACHE_LINE is 64 or 128 at
least)

This patch makes sure a 'struct proto_ops' can be declared as const,
so that all cpus can share all parts of it without false sharing.

This is not mandatory : a driver can still use a read/write structure
if it needs to (and eventually a __read_mostly)

I made a global stubstitute to change all existing occurences to make
them const.

This should reduce the possibility of false sharing on SMP, and
speedup some socket system calls.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-03 13:11:15 -08:00
Herbert Xu c27bd492fd [NETLINK]: Use tgid instead of pid for nlmsg_pid
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-22 14:41:50 -08:00
Thomas Graf 482a8524f8 [NETLINK]: Generic netlink family
The generic netlink family builds on top of netlink and provides
simplifies access for the less demanding netlink users. It solves
the problem of protocol numbers running out by introducing a so
called controller taking care of id management and name resolving.

Generic netlink modules register themself after filling out their
id card (struct genl_family), after successful registration the
modules are able to register callbacks to command numbers by
filling out a struct genl_ops and calling genl_register_op(). The
registered callbacks are invoked with attributes parsed making
life of simple modules a lot easier.

Although generic netlink modules can request static identifiers,
it is recommended to use GENL_ID_GENERATE and to let the controller
assign a unique identifier to the module. Userspace applications
will then ask the controller and lookup the idenfier by the module
name.

Due to the current multicast implementation of netlink, the number
of generic netlink modules is restricted to 1024 to avoid wasting
memory for the per socket multiacst subscription bitmask.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-10 02:26:41 +01:00
Thomas Graf 82ace47a72 [NETLINK]: Generic netlink receive queue processor
Introduces netlink_run_queue() to handle the receive queue of
a netlink socket in a generic way. Processes as much as there
was in the queue upon entry and invokes a callback function
for each netlink message found. The callback function may
refuse a message by returning a negative error code but setting
the error pointer to 0 in which case netlink_run_queue() will
return with a qlen != 0.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-10 02:26:40 +01:00
Thomas Graf a8f74b2288 [NETLINK]: Make netlink_callback->done() optional
Most netlink families make no use of the done() callback, making
it optional gets rid of all unnecessary dummy implementations.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-10 02:26:40 +01:00
Thomas Graf bfa83a9e03 [NETLINK]: Type-safe netlink messages/attributes interface
Introduces a new type-safe interface for netlink message and
attributes handling. The interface is fully binary compatible
with the old interface towards userspace. Besides type safety,
this interface features attribute validation capabilities,
simplified message contstruction, and documentation.

The resulting netlink code should be smaller, less error prone
and easier to understand.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-10 02:26:40 +01:00
Linus Torvalds 236fa08168 Merge master.kernel.org:/pub/scm/linux/kernel/git/acme/net-2.6.15 2005-10-28 08:50:37 -07:00