Every file should include the headers containing the prototypes for
it's global functions.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Many struct file_operations in the kernel can be "const". Marking them const
moves these to the .rodata section, which avoids false sharing with potential
dirty data. In addition it'll catch accidental writes at compile time to
these shared resources.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Replace appropriate pairs of "kmem_cache_alloc()" + "memset(0)" with the
corresponding "kmem_cache_zalloc()" call.
Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Roland McGrath <roland@redhat.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Greg KH <greg@kroah.com>
Acked-by: Joel Becker <Joel.Becker@oracle.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Jan Kara <jack@ucw.cz>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch stops the dlm_recv workqueue from busy-waiting when a node
disconnects. This can cause soft lockup errors on debug systems and bad
performance generally.
Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
A new lvb for a userland lock wasn't being initialized to zero.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
On Sun, Jan 28, 2007 at 11:08:18AM +0100, Jiri Slaby wrote:
> Andrew Morton napsal(a):
> >Temporarily at
> >
> > http://userweb.kernel.org/~akpm/2.6.20-rc6-mm1/
>
> Unable to select IPV6. Menuconfig doesn't offer it when INET is selected.
> When it's not it appears in the menu, but after state change it gets away.
> The same behaviour in xconfig, gconfig.
>
> $ mkdir ../a/tst
> $ make O=../a/tst menuconfig
> HOSTCC scripts/basic/fixdep
> [...]
> HOSTLD scripts/kconfig/mconf
> scripts/kconfig/mconf arch/i386/Kconfig
> Warning! Found recursive dependency: INET GFS2_FS_LOCKING_DLM SYSFS
> OCFS2_FS INET
>
> Maybe this is the problem?
Yes, patch below.
> regards,
cu
Adrian
<-- snip -->
This patch fixes a circular dependency by letting GFS2_FS_LOCKING_DLM
and DLM depend on instead of select SYSFS.
Since SYSFS depends on EMBEDDED this change shouldn't cause any problems
for users.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
With CONFIG_DLM=m, CONFIG_PROC_FS=n, and CONFIG_SYSFS=n, kernel build
fails with:
WARNING: "kernel_subsys" [fs/gfs2/locking/dlm/lock_dlm.ko] undefined!
WARNING: "kernel_subsys" [fs/dlm/dlm.ko] undefined!
WARNING: "kernel_subsys" [fs/configfs/configfs.ko] undefined!
make[1]: *** [__modpost] Error 1
make: *** [modules] Error 2
Since fs/dlm/lockspace.c and fs/gfs2/locking/dlm/sysfs.c use
kernel_subsys, they should either DEPEND on it or SELECT it.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
A long, complicated sequence of events, beginning with the RESEND flag not
being cleared on an lkb, can result in an unlock never completing.
- lkb on waiters list for remote lookup
- the remote node is both the dir node and the master node, so
it optimizes the lookup into a request and sends a request
reply back
- the request reply is saved on the requestqueue to be processed
after recovery
- recovery runs dlm_recover_waiters_pre() which sets RESEND flag
so the lookup will be resent after recovery
- end of recovery: process_requestqueue takes saved request reply
which removes the lkb off the waitesr list, _without_ clearing
the RESEND flag
- end of recovery: dlm_recover_waiters_post() doesn't do anything
with the now completed lookup lkb (would usually clear RESEND)
- later, the node unmounts, unlocks this lkb that still has RESEND
flag set
- the lkb is on the waiters list again, now for unlock, when recovery
occurs, dlm_recover_waiters_pre() shows the lkb for unlock with RESEND
set, doesn't do anything since the master still exists
- end of recovery: dlm_recover_waiters_post() takes this lkb off
the waiters list because it has the RESEND flag set, then reports
an error because unlocks are never supposed to be handled in
recover_waiters_post().
- later, the unlock reply is received, doesn't find the lkb on
the waiters list because recover_waiters_post() has wrongly
removed it.
- the unlock operation has been lost, and we're left with a
stray granted lock
- unmount spins waiting for the unlock to complete
The visible evidence of this problem will be a node where gfs umount is
spinning, the dlm waiters list will be empty, and the dlm locks list will
show a granted lock.
The fix is simply to clear the RESEND flag when taking an lkb off the
waiters list.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
dlm_receive_message() returns 0 instead of returning 'error'. What would
happen is that process_requestqueue would take a saved message off the
requestqueue and call receive_message on it. receive_message would then
see that recovery had been aborted, set error to EINTR, and 'goto out',
expecting that the error would be returned. Instead, 0 was always
returned, so process_requestqueue would think that the message had been
processed and delete it instead of saving it to process next time. This
means the message (usually an unlock in my tests) would be lost.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Now that there can be multiple dlm_recv threads running we need to prevent two
recvs running for the same connection - it's unlikely but it can happen and it
causes message corruption.
Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch fixes a bug whereby data on a newly accepted connection would be
ignored if it arrived soon after the accept.
Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch removes some redundant fields from the connection structure and adds
some lockdep annotation to remove spurious warnings.
Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
If master recovery happens on an rsb in one recovery sequence, then that
sequence is aborted before lock recovery happens, then in the next
sequence, we rely on the previous master recovery (which may now be
invalid due to another node ignoring a lookup result) and go on do to the
lock recovery where we get stuck due to an invalid master value.
recovery cycle begins: master of rsb X has left
nodes A and B send node C an rcom lookup for X to find the new master
C gets lookup from B first, sets B as new master, and sends reply back to B
C gets lookup from A next, and sends reply back to A saying B is master
A gets lookup reply from C and sets B as the new master in the rsb
recovery cycle on A, B and C is aborted to start a new recovery
B gets lookup reply from C and ignores it since there's a new recovery
recovery cycle begins: some other node has joined
B doesn't think it's the master of X so it doesn't rebuild it in the directory
C looks up the master of X, no one is master, so it becomes new master
B looks up the master of X, finds it's C
A believes that B is the master of X, so it sends its lock to B
B sends an error back to A
A resends
this repeats forever, the incorrect master value on A is never corrected
The fix is to do master recovery on an rsb that still has the NEW_MASTER
flag set from an earlier recovery sequence, and therefore didn't complete
lock recovery.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
When a user process exits, we clear all the locks it holds. There is a
problem, though, with locks that the process had begun unlocking before it
exited. We couldn't find the lkb's that were in the process of being
unlocked remotely, to flag that they are DEAD. To solve this, we move
lkb's being unlocked onto a new list in the per-process structure that
tracks what locks the process is holding. We can then go through this
list to flag the necessary lkb's when clearing locks for a process when it
exits.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch converts the DLM TCP lowcomms to use workqueues rather than using its
own daemon functions. Simultaneously removing a lot of code and making it more
scalable on multi-processor machines.
Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Make the dlm_config_info values readable and writeable via configfs
entries.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Add a new dlm_config_info field to enable log_debug output and change
log_debug() to use it.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Add a "ci_" prefix to the fields in the dlm_config_info struct so that we
can use macros to add configfs functions to access them (in a later
patch). No functional changes in this patch, just naming changes.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Some common, non-error messages should use log_debug instead of log_error
so they can be turned off.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
I just noticed this message when testing some other changes I'd made to
lowcomms (to use workqueues) but the problem seems to be in the current
git trees too. I'm amazed no-one has seen it.
BUG: spinlock already unlocked on CPU#1, dlm_recoverd/16868
Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
I was a little over-enthusiastic turning schedule() calls int cond_sched() when fixing the DLM for Andrew Morton.
These four should really be calls to schedule() or the dlm can busy-wait.
Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Remove the following unused functions:
- lowcomms_send_message()
- lowcomms_max_buffer_size()
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
When the dlm fakes an unlock/cancel reply from a failed node using a stub
message struct, it wasn't setting the flags in the stub message. So, in
the process of receiving the fake message the lkb flags would be updated
and cleared from the zero flags in the message. The problem observed in
tests was the loss of the USER flag which caused the dlm to think a user
lock was a kernel lock and subsequently fail an assertion checking the
validity of the ast/callback field.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
LVB's are not sent as part of new requests, but the code receiving the
request was copying data into the lvb anyway. The space in the message
where it mistakenly thought the lvb lived actually contained the resource
name, so it wound up incorrectly copying this name data into the lvb. Fix
is to just create the lvb, not copy junk into it.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The send_args() function is used to copy parameters into a message for a
number different message types. Only some of those types are set up
beforehand (in create_message) to include space for sending lvb data.
send_args was wrongly copying the lvb for all message types as long as the
lock had an lvb. This means that the lvb data was being written past the
end of the message into unknown space.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Check if we receive a message from another lockspace member running a
version of the dlm with an incompatible inter-node message protocol.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
A reply to a recovery message will often be received after the relevant
recovery sequence has aborted and the next recovery sequence has begun.
We need to ignore replies to these old messages from the previous
recovery. There's already a way to do this for synchronous recovery
requests using the rc_id number, but not for async.
Each recovery sequence already has a locally unique sequence number
associated with it. This patch adds a field to the rcom (recovery
message) structure where this recovery sequence number can be placed,
rc_seq. When a node sends a reply to a recovery request, it copies the
rc_seq number it received into rc_seq_reply. When the first node receives
the reply to its recovery message, it will check whether rc_seq_reply
matches the current recovery sequence number, ls_recover_seq, and if not
then it ignores the old reply.
An old, inadequate approach to filtering out old replies (checking if the
current stage of recovery has moved back to the start) has been removed
from two spots.
The protocol version number is changed to reflect the different rcom
structures.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
There's a chance the new master of resource hasn't learned it's the new
master before another node sends it a lock during recovery. The node
sending the lock needs to resend if this happens.
- A sends a master lookup for resource R to C
- B sends a master lookup for resource R to C
- C receives A's lookup, assigns A to be master of R and
sends a reply back to A
- C receives B's lookup and sends a reply back to B saying
that A is the master
- B receives lookup reply from C and sends its lock for R to A
- A receives lock from B, doesn't think it's the master of R
and sends an error back to B
- A receives lookup reply from C and becomes master of R
- B gets error back from A and resends its lock back to A
(this resending is what this patch does)
- A receives lock from B, it now sees it's the master of R
and takes the lock
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch fixes a compile warning in lowcomms-tcp.c indicating that
kmem_cache_t is deprecated.
Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
* master.kernel.org:/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw: (73 commits)
[DLM] Clean up lowcomms
[GFS2] Change gfs2_fsync() to use write_inode_now()
[GFS2] Fix indent in recovery.c
[GFS2] Don't flush everything on fdatasync
[GFS2] Add a comment about reading the super block
[GFS2] Mount problem with the GFS2 code
[GFS2] Remove gfs2_check_acl()
[DLM] fix format warnings in rcom.c and recoverd.c
[GFS2] lock function parameter
[DLM] don't accept replies to old recovery messages
[DLM] fix size of STATUS_REPLY message
[GFS2] fs/gfs2/log.c:log_bmap() fix printk format warning
[DLM] fix add_requestqueue checking nodes list
[GFS2] Fix recursive locking in gfs2_getattr
[GFS2] Fix recursive locking in gfs2_permission
[GFS2] Reduce number of arguments to meta_io.c:getbuf()
[GFS2] Move gfs2_meta_syncfs() into log.c
[GFS2] Fix journal flush problem
[GFS2] mark_inode_dirty after write to stuffed file
[GFS2] Fix glock ordering on inode creation
...
Replace all uses of kmem_cache_t with struct kmem_cache.
The patch was generated using the following script:
#!/bin/sh
#
# Replace one string by another in all the kernel sources.
#
set -e
for file in `find * -name "*.c" -o -name "*.h"|xargs grep -l $1`; do
quilt add $file
sed -e "1,\$s/$1/$2/g" $file >/tmp/$$
mv /tmp/$$ $file
quilt refresh
done
The script was run like this
sh replace kmem_cache_t "struct kmem_cache"
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This fixes up most of the things pointed out by akpm and Pavel Machek
with comments below indicating why some things have been left:
Andrew Morton wrote:
>
>> +static struct nodeinfo *nodeid2nodeinfo(int nodeid, gfp_t alloc)
>> +{
>> + struct nodeinfo *ni;
>> + int r;
>> + int n;
>> +
>> + down_read(&nodeinfo_lock);
>
> Given that this function can sleep, I wonder if `alloc' is useful.
>
> I see lots of callers passing in a literal "0" for `alloc'. That's in fact
> a secret (GFP_ATOMIC & ~__GFP_HIGH). I doubt if that's what you really
> meant. Particularly as the code could at least have used __GFP_WAIT (aka
> GFP_NOIO) which is much, much more reliable than "0". In fact "0" is the
> least reliable mode possible.
>
> IOW, this is all bollixed up.
When 0 is passed into nodeid2nodeinfo the function does not try to allocate a
new structure at all. it's an indication that the caller only wants the nodeinfo
struct for that nodeid if there actually is one in existance.
I've tidied the function itself so it's more obvious, (and tidier!)
>> +/* Data received from remote end */
>> +static int receive_from_sock(void)
>> +{
>> + int ret = 0;
>> + struct msghdr msg;
>> + struct kvec iov[2];
>> + unsigned len;
>> + int r;
>> + struct sctp_sndrcvinfo *sinfo;
>> + struct cmsghdr *cmsg;
>> + struct nodeinfo *ni;
>> +
>> + /* These two are marginally too big for stack allocation, but this
>> + * function is (currently) only called by dlm_recvd so static should be
>> + * OK.
>> + */
>> + static struct sockaddr_storage msgname;
>> + static char incmsg[CMSG_SPACE(sizeof(struct sctp_sndrcvinfo))];
>
> whoa. This is globally singly-threaded code??
Yes. it is only ever run in the context of dlm_recvd.
>>
>> +static void initiate_association(int nodeid)
>> +{
>> + struct sockaddr_storage rem_addr;
>> + static char outcmsg[CMSG_SPACE(sizeof(struct sctp_sndrcvinfo))];
>
> Another static buffer to worry about. Globally singly-threaded code?
Yes. Only ever called by dlm_sendd.
>> +
>> +/* Send a message */
>> +static int send_to_sock(struct nodeinfo *ni)
>> +{
>> + int ret = 0;
>> + struct writequeue_entry *e;
>> + int len, offset;
>> + struct msghdr outmsg;
>> + static char outcmsg[CMSG_SPACE(sizeof(struct sctp_sndrcvinfo))];
>
> Singly-threaded?
Yep.
>>
>> +static void dealloc_nodeinfo(void)
>> +{
>> + int i;
>> +
>> + for (i=1; i<=max_nodeid; i++) {
>> + struct nodeinfo *ni = nodeid2nodeinfo(i, 0);
>> + if (ni) {
>> + idr_remove(&nodeinfo_idr, i);
>
> Didn't that need locking?
Not. it's only ever called at DLM shutdown after all the other threads
have been stopped.
>>
>> +static int write_list_empty(void)
>> +{
>> + int status;
>> +
>> + spin_lock_bh(&write_nodes_lock);
>> + status = list_empty(&write_nodes);
>> + spin_unlock_bh(&write_nodes_lock);
>> +
>> + return status;
>> +}
>
> This function's return value is meaningless. As soon as the lock gets
> dropped, the return value can get out of sync with reality.
>
> Looking at the caller, this _might_ happen to be OK, but it's a nasty and
> dangerous thing. Really the locking should be moved into the caller.
It's just an optimisation to allow the caller to schedule if there is no work
to do. if something arrives immediately afterwards then it will get picked up
when the process re-awakes (and it will be woken by that arrival).
The 'accepting' atomic has gone completely. as Andrew pointed out it didn't
really achieve much anyway. I suspect it was a plaster over some other
startup or shutdown bug to be honest.
Signed-off-by: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Pavel Machek <pavel@ucw.cz>
This fixes the following gcc warnings generated on
the architectures where uint64_t != unsigned long long (e.g. ppc64).
fs/dlm/rcom.c:154: warning: format '%llx' expects type 'long long unsigned int', but argument 4 has type 'uint64_t'
fs/dlm/rcom.c:154: warning: format '%llx' expects type 'long long unsigned int', but argument 5 has type 'uint64_t'
fs/dlm/recoverd.c:48: warning: format '%llx' expects type 'long long unsigned int', but argument 3 has type 'uint64_t'
fs/dlm/recoverd.c:202: warning: format '%llx' expects type 'long long unsigned int', but argument 3 has type 'uint64_t'
fs/dlm/recoverd.c:210: warning: format '%llx' expects type 'long long unsigned int', but argument 3 has type 'uint64_t'
Signed-off-by: Ryusuke Konishi <ryusuke@osrg.net>
Signed-off-by: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
We often abort a recovery after sending a status request to a remote node.
We want to ignore any potential status reply we get from the remote node.
If we get one of these unwanted replies, we've often moved on to the next
recovery message and incremented the message sequence counter, so the
reply will be ignored due to the seq number. In some cases, we've not
moved on to the next message so the seq number of the reply we want to
ignore is still correct, causing the reply to be accepted. The next
recovery message will then mistake this old reply as a new one.
To fix this, we add the flag RCOM_WAIT to indicate when we can accept a
new reply. We clear this flag if we abort recovery while waiting for a
reply. Before the flag is set again (to allow new replies) we know that
any old replies will be rejected due to their sequence number. We also
initialize the recovery-message sequence number to a random value when a
lockspace is first created. This makes it clear when messages are being
rejected from an old instance of a lockspace that has since been
recreated.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
When the not_ready routine sends a "fake" status reply with blank status
flags, it needs to use the correct size for a normal STATUS_REPLY by
including the size of the would-be config parameters. We also fill in the
non-existant config parameters with an invalid lvblen value so it's easier
to notice if these invalid paratmers are ever being used.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Requests that arrive after recovery has started are saved in the
requestqueue and processed after recovery is done. Some of these requests
are purged during recovery if they are from nodes that have been removed.
We move the purging of the requests (dlm_purge_requestqueue) to later in
the recovery sequence which allows the routine saving requests
(dlm_add_requestqueue) to avoid filtering out requests by nodeid since the
same will be done by the purge. The current code has add_requestqueue
filtering by nodeid but doesn't hold any locks when accessing the list of
current nodes. This also means that we need to call the purge routine
when the lockspace is being shut down since the add routine will not be
rejecting requests itself any more.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The attached patch fixes the DLM config so that it selects the chosen network
transport. It should fix the bug where DLM can be left selected when NET gets
unselected. This incorporates all the comments received about this patch.
Cc: Adrian Bunk <bunk@stusta.de>
Cc: Andrew Morton <akpm@osdl.org>
Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
RH BZ 211622
The ALTMODE flag can be set in the lock master's copy of the lock but
never cleared, so ALTMODE will also be returned in a subsequent conversion
of the lock when it shouldn't be. This results in lock_dlm incorrectly
switching to the alternate lock mode when returning the result to gfs
which then asserts when it sees the wrong lock state. The fix is to
propagate the cleared sbflags value to the master node when the lock is
requested. QA's d_rwrandirectlarge test triggers this bug very quickly.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Red Hat BZ 211914
The previous patch "[DLM] fix aborted recovery during
node removal" was incomplete as discovered with further testing. It set
the bit for the RS_LOCKS barrier but did not then wait for the barrier.
This is often ok, but sometimes it will cause yet another recovery hang.
If it's a new node that also has the lowest nodeid that skips the barrier
wait, then it misses the important step of collecting and reporting the
barrier status from the other nodes (which is the job of the low nodeid in
the barrier wait routine).
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Red Hat BZ 211914
When many nodes are joining a lockspace simultaneously, the dlm gets a
quick sequence of stop/start events, a pair for adding each node.
dlm_controld in user space sends dlm_recoverd in the kernel each stop and
start event. dlm_controld will sometimes send the stop before
dlm_recoverd has had a chance to take up the previously queued start. The
stop aborts the processing of the previous start by setting the
RECOVERY_STOP flag. dlm_recoverd is erroneously clearing this flag and
ignoring the stop/abort if it happens to take up the start after the stop
meant to abort it. The fix is to check the sequence number that's
incremented for each stop/start before clearing the flag.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Red Hat BZ 211914
With the new cluster infrastructure, dlm recovery for a node removal can
be aborted and restarted for a node addition. When this happens, the
restarted recovery isn't aware that it's doing recovery for the earlier
removal as well as the addition. So, it then skips the recovery steps
only required when nodes are removed. This can result in locks not being
purged for failed/removed nodes. The fix is to check for removed nodes
for which recovery has not been completed at the start of a new recovery
sequence.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Red Hat BZ 211914
There's a race between dlm_recoverd (1) enabling locking and (2) clearing
out the requestqueue, and dlm_recvd (1) checking if locking is enabled and
(2) adding a message to the requestqueue. An order of recoverd(1),
recvd(1), recvd(2), recoverd(2) will result in a message being left on the
requestqueue. The fix is to have dlm_recvd check if dlm_recoverd has
enabled locking after taking the mutex for the requestqueue and if it has
processing the message instead of queueing it.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Red Hat BZ 213682
If two nodes leave the lockspace (while unmounting the fs in the case of
gfs) after one has sent a STATUS message to the other, STATUS/STATUS_REPLY
messages will then ping-pong between the nodes when neither of them can
find the lockspace in question any longer. We kill this by not sending
another STATUS message when we get a STATUS_REPLY for an unknown
lockspace.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Red Hat BZ 213684
If a node sends an lkb to the new master (RCOM_LOCK message) during
recovery and recovery is then aborted on both nodes before it gets a
reply, the res_recover_locks_count needs to be reset to 0 so that when the
subsequent recovery comes along and sends the lkb to the new master again
the assertion doesn't trigger that checks that counter is zero.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The following patch adds a TCP based communications layer
to the DLM which is compile time selectable. The existing SCTP
layer gives the advantage of allowing multihoming, whereas
the TCP layer has been heavily tested in previous versions of
the DLM and is known to be robust and therefore can be used as
a baseline for performance testing.
Signed-off-by: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Now that the lockspace struct is freed when the last sysfs object is released
this patch prevents use of that lockspace by sysfs. We attempt to re-get the
lockspace from the lockspace list and fail the request if it has been removed.
Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch fixes the recounting on the lockspace kobject. Previously the lockspace was freed while userspace could have had a
reference to one of its sysfs files, causing an oops in kref_put.
Now the lockspace kfree is moved into the kobject release() function
Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
I didn't spot that the msg_iovlen was set to 2 if there
were two elements in the iovec but left at zero if not :(
I think this might be why bob was still seeing trouble.
Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>