This patch enhances the synchronization of the closing connections
between the master and the backup director. It prevents the closed
connections to expire with the 15 min timeout of the ESTABLISHED
state on the backup and makes them expire as they would do on the
master with much shorter timeouts.
Signed-off-by: Rumen G. Bogdanovski <rumen@voicecho.com>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
So that kthread_stop() can wake up the thread and we don't have to wait one
second in the worst case for the daemon to actually stop.
Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Acked-by: Simon Horman <horms@verge.net.au>
Instead of doing an endless loop with sleeping for one second, we now put the
backup thread onto the mcast socket wait queue and it gets woken up as soon as
we have data to process.
Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Acked-by: Simon Horman <horms@verge.net.au>
This also moves the setup code out of the daemons, so that we're able to
return proper error codes to user space. The current code will return success
to user space when the daemon is started with an invald mcast interface. With
these changes we get an appropriate "No such device" error.
We longer need our own completion to be sure the daemons are actually running,
because they no longer contain code that can fail and kthread_run() takes care
of the rest.
Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Acked-by: Simon Horman <horms@verge.net.au>
The additional information we now return to the caller is currently not used,
but will be used to return errors to user space.
Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Acked-by: Simon Horman <horms@verge.net.au>
There's no need to do it at runtime, the values are constant.
Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Acked-by: Simon Horman <horms@verge.net.au>
This patch removes CVS keywords that weren't updated for a long time
from comments.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes bug http://bugzilla.kernel.org/show_bug.cgi?id=10556
where conn templates with protocol=IPPROTO_IP can oops backup box.
Result from ip_vs_proto_get() should be checked because
protocol value can be invalid or unsupported in backup. But
for valid message we should not fail for templates which use
IPPROTO_IP. Also, add checks to validate message limits and
connection state. Show state NONE for templates using IPPROTO_IP.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
__FUNCTION__ is gcc-specific, use __func__
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
No available servers is more an error message than something informational. It
should also be rate-limited, else we're going to flood our logs on a busy
director, if all real servers are out of order with a weight of zero.
Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Needed to propagate it down to the ip_route_output_flow.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The patch extends the inet_addr_type and inet_dev_addr_type with the
network namespace pointer. That allows to access the different tables
relatively to the network namespace.
The modification of the signature function is reported in all the
callers of the inet_addr_type using the pointer to the well known
init_net.
Acked-by: Benjamin Thery <benjamin.thery@bull.net>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The feature of ipvs ctls is that the net/ipv4/vs path
is common for core ipvs ctls and for two schedulers,
so I make it exported and re-use it in modules.
Two other .c files required linux/sysctl.h to make the
extern declaration of this path compile well.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Both try_module_get/module_put already handle the module == NULL
case, so no need in manual checking.
This patch fits both net-2.6 and net-2.6.25.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
With this patch the synced connections are created with their real state,
which can be changed on the next synchronizations if necessary. This way
on fail-over all the connections will be treated according to their actual
state, causing no scheduling problems (the active and the nonactive
connections have different weights in the schedulers).
The backwards compatibility is preserved and the existing tools will show
the true connection states even on the backup director.
Signed-off-by: Rumen G. Bogdanovski <rumen@voicecho.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch labels the sync-created connections with IP_VS_CONN_F_SYNC
flag and creates /proc/net/ip_vs_conn_sync to enable monitoring of the
origin of the connections, if they are local or created by the
synchronization.
Signed-off-by: Rumen G. Bogdanovski <rumen@voicecho.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The IPv4 and IPv6 hook values are identical, yet some code tries to figure
out the "correct" value by looking at the address family. Introduce NF_INET_*
values for both IPv4 and IPv6. The old values are kept in a #ifndef __KERNEL__
section for userspace compatibility.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Most callers of the LOCAL_OUT chain will set the IP packet length and
header checksum before doing so. They also share the same output
function dst_output.
This patch creates a new function called ip_local_out which does all
of that and converts the appropriate users over to it.
Apart from removing duplicate code, it will also help in merging the
IPsec output path once the same thing is done for IPv6.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Many-many code in the kernel initialized the timer->function
and timer->data together with calling init_timer(timer). There
is already a helper for this. Use it for networking code.
The patch is HUGE, but makes the code 130 lines shorter
(98 insertions(+), 228 deletions(-)).
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The register_ip_vs_scheduler() checks for the scheduler with the
same name under the read-locked __ip_vs_sched_lock, then drops,
takes it for writing and puts the scheduler in list.
This is racy, since we can have a race window between the lock
being re-locked for writing.
The fix is to search the scheduler with the given name right under
the write-locked __ip_vs_sched_lock.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case we load lblc or lblcr module we can leak some sysctl
tables if the call to register_ip_vs_scheduler() fails.
I've looked at the register_ip_vs_scheduler() code and saw, that
the only reason to fail is the name collision, so I think that
with some 3rd party schedulers this becomes a relevant issue. No?
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is silly, but I have turned the CONFIG_IP_VS to m,
to check the compilation of one (recently sent) fix
and set all the CONFIG_IP_VS_PROTO_XXX options to n to
speed up the compilation.
In this configuration the compiler warns me about
CC [M] net/ipv4/ipvs/ip_vs_proto.o
net/ipv4/ipvs/ip_vs_proto.c:49: warning: 'register_ip_vs_protocol' defined but not used
Indeed. With no protocols selected there are no
calls to this function - all are compiled out with
ifdefs.
Maybe the best fix would be to surround this call with
ifdef-s or tune the Kconfig dependences, but I think that
marking this register function as __used is enough. No?
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Switch the remaining IPVS sysctl entries over to to use CTL_UNNUMBERED,
I stronly doubt that anyone is using the sys_sysctl interface to
these variables.
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
sysctl table check failed: /net/ipv4/vs/lblc_expiration .3.5.21.19 Missing strategy
[...]
sysctl table check failed: /net/ipv4/vs/lblcr_expiration .3.5.21.20 Missing strategy
Switch these entried over to use CTL_UNNUMBERED as clearly
the sys_syscal portion wasn't working.
This is along the same lines as Christian Borntraeger's patch that fixes
up entries with no stratergy in net/ipv4/ipvs/ip_vs_ctl.c
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Running the latest git code I get the following messages during boot:
sysctl table check failed: /net/ipv4/vs/drop_entry .3.5.21.4 Missing strategy
[...]
sysctl table check failed: /net/ipv4/vs/drop_packet .3.5.21.5 Missing strategy
[...]
sysctl table check failed: /net/ipv4/vs/secure_tcp .3.5.21.6 Missing strategy
[...]
sysctl table check failed: /net/ipv4/vs/sync_threshold .3.5.21.24 Missing strategy
I removed the binary sysctl handler for those messages and also removed
the definitions in ip_vs.h. The alternative would be to implement a
proper strategy handler, but syscall sysctl is deprecated.
There are other sysctl definitions that are commented out or work with
the default sysctl_data strategy. I did not touch these.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes the following unused EXPORT_SYMBOL's:
- ip_vs_try_bind_dest
- ip_vs_find_dest
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes the master daemon to sync the connection when it is about
to close. This makes the connections on the backup to close or timeout
according their state. Before the sync was performed only if the
connection is in ESTABLISHED state which always made the connections to
timeout in the hard coded 3 minutes. However the Andy Gospodarek's patch
([IPVS]: use proper timeout instead of fixed value) effectively did nothing
more than increasing this to 15 minutes (Established state timeout). So
this patch makes use of proper timeout since it syncs the connections on
status changes to FIN_WAIT (2min timeout) and CLOSE (10sec timeout).
However if the backup misses CLOSE hopefully it did not miss FIN_WAIT.
Otherwise we will just have to wait for the ESTABLISHED state timeout. As
it is without this patch. This way the number of the hanging connections
on the backup is kept to minimum. And very few of them will be left to
timeout with a long timeout.
This is important if we want to make use of the fix for the real server
overcommit on master/backup fail-over.
Signed-off-by: Rumen G. Bogdanovski <rumen@voicecho.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes the problem with node overload on director fail-over.
Given the scenario: 2 nodes each accepting 3 connections at a time and 2
directors, director failover occurs when the nodes are fully loaded (6
connections to the cluster) in this case the new director will assign
another 6 connections to the cluster, If the same real servers exist
there.
The problem turned to be in not binding the inherited connections to
the real servers (destinations) on the backup director. Therefore:
"ipvsadm -l" reports 0 connections:
root@test2:~# ipvsadm -l
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP test2.local:5999 wlc
-> node473.local:5999 Route 1000 0 0
-> node484.local:5999 Route 1000 0 0
while "ipvs -lnc" is right
root@test2:~# ipvsadm -lnc
IPVS connection entries
pro expire state source virtual destination
TCP 14:56 ESTABLISHED 192.168.0.10:39164 192.168.0.222:5999
192.168.0.51:5999
TCP 14:59 ESTABLISHED 192.168.0.10:39165 192.168.0.222:5999
192.168.0.52:5999
So the patch I am sending fixes the problem by binding the received
connections to the appropriate service on the backup director, if it
exists, else the connection will be handled the old way. So if the
master and the backup directors are synchronized in terms of real
services there will be no problem with server over-committing since
new connections will not be created on the nonexistent real services
on the backup. However if the service is created later on the backup,
the binding will be performed when the next connection update is
received. With this patch the inherited connections will show as
inactive on the backup:
root@test2:~# ipvsadm -l
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP test2.local:5999 wlc
-> node473.local:5999 Route 1000 0 1
-> node484.local:5999 Route 1000 0 1
rumen@test2:~$ cat /proc/net/ip_vs
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP C0A800DE:176F wlc
-> C0A80033:176F Route 1000 0 1
-> C0A80032:176F Route 1000 0 1
Regards,
Rumen Bogdanovski
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Rumen G. Bogdanovski <rumen@voicecho.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
It's under CONFIG_IP_VS_LBLCR_DEBUG option which never existed.
Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of using the default timeout of 3 minutes, this uses the timeout
specific to the protocol used for the connection. The 3 minute timeout
seems somewhat arbitrary (though I know it is used other places in the
ipvs code) and when failing over it would be much nicer to use one of
the configured timeout values.
Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
In some places, the result of skb_headroom() is compared to an unsigned
integer, and in others, the result is compared to a signed integer. Make
the comparisons consistent and correct.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The sync_master_pid and sync_backup_pid are set in set_sync_pid() and are
used later for set/not-set checks and in printk. So it is safe to use the
global pid value in this case.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The task_struct->pid member is going to be deprecated, so start
using the helpers (task_pid_nr/task_pid_vnr/task_pid_nr_ns) in
the kernel.
The first thing to start with is the pid, printed to dmesg - in
this case we may safely use task_pid_nr(). Besides, printks produce
more (much more) than a half of all the explicit pid usage.
[akpm@linux-foundation.org: git-drm went and changed lots of stuff]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Dave Airlie <airlied@linux.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With all the users of the double pointers removed, this patch mops up by
finally replacing all occurances of sk_buff ** in the netfilter API by
sk_buff *.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes the IPVS-specific version of skb_make_writable and
replaces it with the netfilter one.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that ip_frag always returns the packet given to it on input, we can
change it to return an integer indicating error instead. This patch does
that and updates all its callers accordingly.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This concerns the ipv4 and ipv6 code mostly, but also the netlink
and unix sockets.
The netlink code is an example of how to use the __seq_open_private()
call - it saves the net namespace on this private.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that multiple loopback devices are becoming possible it makes
the code a little cleaner and more maintainable to test if a deivice
is th a loopback device by testing dev->flags & IFF_LOOPBACK instead
of dev == loopback_dev.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch replaces all occurences to the static variable
loopback_dev to a pointer loopback_dev. That provides the
mindless, trivial, uninteressting change part for the dynamic
allocation for the loopback.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Acked-By: Kirill Korotaev <dev@sw.ru>
Acked-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes most of the generic device layer network
namespace safe. This patch makes dev_base_head a
network namespace variable, and then it picks up
a few associated variables. The functions:
dev_getbyhwaddr
dev_getfirsthwbytype
dev_get_by_flags
dev_get_by_name
__dev_get_by_name
dev_get_by_index
__dev_get_by_index
dev_ioctl
dev_ethtool
dev_load
wireless_process_ioctl
were modified to take a network namespace argument, and
deal with it.
vlan_ioctl_set and brioctl_set were modified so their
hooks will receive a network namespace argument.
So basically anthing in the core of the network stack that was
affected to by the change of dev_base was modified to handle
multiple network namespaces. The rest of the network stack was
simply modified to explicitly use &init_net the initial network
namespace. This can be fixed when those components of the network
stack are modified to handle multiple network namespaces.
For now the ifindex generator is left global.
Fundametally ifindex numbers are per namespace, or else
we will have corner case problems with migration when
we get that far.
At the same time there are assumptions in the network stack
that the ifindex of a network device won't change. Making
the ifindex number global seems a good compromise until
the network stack can cope with ifindex changes when
you change namespaces, and the like.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes /proc/net per network namespace. It modifies the global
variables proc_net and proc_net_stat to be per network namespace.
The proc_net file helpers are modified to take a network namespace argument,
and all of their callers are fixed to pass &init_net for that argument.
This ensures that all of the /proc/net files are only visible and
usable in the initial network namespace until the code behind them
has been updated to be handle multiple network namespaces.
Making /proc/net per namespace is necessary as at least some files
in /proc/net depend upon the set of network devices which is per
network namespace, and even more files in /proc/net have contents
that are relevant to a single network namespace.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
So I've had a deadlock reported to me. I've found that the sequence of
events goes like this:
1) process A (modprobe) runs to remove ip_tables.ko
2) process B (iptables-restore) runs and calls setsockopt on a netfilter socket,
increasing the ip_tables socket_ops use count
3) process A acquires a file lock on the file ip_tables.ko, calls remove_module
in the kernel, which in turn executes the ip_tables module cleanup routine,
which calls nf_unregister_sockopt
4) nf_unregister_sockopt, seeing that the use count is non-zero, puts the
calling process into uninterruptible sleep, expecting the process using the
socket option code to wake it up when it exits the kernel
4) the user of the socket option code (process B) in do_ipt_get_ctl, calls
ipt_find_table_lock, which in this case calls request_module to load
ip_tables_nat.ko
5) request_module forks a copy of modprobe (process C) to load the module and
blocks until modprobe exits.
6) Process C. forked by request_module process the dependencies of
ip_tables_nat.ko, of which ip_tables.ko is one.
7) Process C attempts to lock the request module and all its dependencies, it
blocks when it attempts to lock ip_tables.ko (which was previously locked in
step 3)
Theres not really any great permanent solution to this that I can see, but I've
developed a two part solution that corrects the problem
Part 1) Modifies the nf_sockopt registration code so that, instead of using a
use counter internal to the nf_sockopt_ops structure, we instead use a pointer
to the registering modules owner to do module reference counting when nf_sockopt
calls a modules set/get routine. This prevents the deadlock by preventing set 4
from happening.
Part 2) Enhances the modprobe utilty so that by default it preforms non-blocking
remove operations (the same way rmmod does), and add an option to explicity
request blocking operation. So if you select blocking operation in modprobe you
can still cause the above deadlock, but only if you explicity try (and since
root can do any old stupid thing it would like.... :) ).
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
For architectures that don't have a volatile atomic_ts constructs like
while (atomic_read(&something)); might result in endless loops since a
barrier() is missing which forces the compiler to generate code that
actually reads memory contents.
Fix this in ipvs by using the IP_VS_WAIT_WHILE macro which resolves to
while (expr) { cpu_relax(); }
(why isn't this open coded btw?)
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch cleans up duplicate includes in
net/ipv4/
Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>