Commit Graph

8770 Commits (9c37066d888bf6e1b96ad12304971b3ddeabbad0)

Author SHA1 Message Date
Jeff Layton 06e02d66fa NFS: don't let nfs_callback_svc exit on unexpected svc_recv errors (try #2)
When svc_recv returns an unexpected error, nfs_callback_svc will print a
warning and exit. This problematic for several reasons. In particular,
it will cause the reference counts for the thread to be wrong, and no
new thread will be started until all nfs4 mounts are unmounted.

Rather than exiting on error from svc_recv, have the thread do a 1s
sleep and then retry the loop. This is unlikely to cause any harm, and
if the error turns out to be something temporary then it may be able to
recover.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-04-23 16:13:42 -04:00
Olga Kornievskaia ff7d9756b5 nfsd: use static memory for callback program and stats
There's no need to dynamically allocate this memory, and doing so may
create the possibility of races on shutdown of the rpc client.  (We've
witnessed it only after adding rpcsec_gss support to the server, after
which the rpc code can send destroys calls that expect to still be able
to access the rpc_stats structure after it has been destroyed.)

Such races are in theory possible if the module containing this "static"
memory is removed very quickly after an rpc client is destroyed, but
we haven't seen that happen.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-04-23 16:13:42 -04:00
J. Bruce Fields e1ba1ab76e nfsd: fix comment
Obvious comment nit.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-04-23 16:13:42 -04:00
J. Bruce Fields 3c61eecb60 lockd: Fix stale nlmsvc_unlink_block comment
As of 5996a298da ("NLM: don't unlock on
cancel requests") we no longer unlock in this case, so the comment is no
longer accurate.

Thanks to Stuart Friedberg for pointing out the inconsistency.

Cc: Stuart Friedberg <sfriedberg@hp.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-04-23 16:13:42 -04:00
Chuck Lever 1a448fdb3c NFSD: Remove NFSv4 dependency on NFSv3
Clean up: Because NFSD_V4 "depends on" NFSD_V3, it appears as a child of
the NFSD_V3 menu entry, and is not visible if NFSD_V3 is unselected.

Replace the dependency on NFSD_V3 with a "select NFSD_V3".  This makes
NFSD_V4 look and work just like NFS_V3, while ensuring that NFSD_V3 is
enabled if NFSD_V4 is.

Sam Ravnborg adds:

"This use of select is questionable. In general it is bad to select
a symbol with dependencies.

In this case the dependencies of NFSD_V3 are duplicated for NFSD_V4
so we will not se erratic configurations but do you remember to
update NFSD_V4 when you add a depends on NFSD_V3?

But I see no other clean way to do it right now."

Later he said:

"My comment was more to say we have things to address in kconfig.
This is abuse in the acceptable range."

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-04-23 16:13:41 -04:00
Chuck Lever 3329ba0523 SUNRPC: Remove PROC_FS dependency
Recently, commit 440bcc59 added a reverse dependency to fs/Kconfig to
ensure that PROC_FS was enabled if SUNRPC_GSS was enabled.

Apparently this isn't necessary because the auth_gss components under
net/sunrpc will build correctly even if PROC_FS is disabled, though
RPCSEC_GSS will not work without /proc.

It also violates the guideline in Documentation/kbuild/kconfig-language.txt
that states "In general use select only for non-visible symbols (no prompts
anywhere) and for symbols with no dependencies."

To address these issues, remove the dependency.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-04-23 16:13:41 -04:00
Chuck Lever 6ffd4be633 NFSD: Use "depends on" for PROC_FS dependency
Recently, commit 440bcc59 added a reverse dependency to fs/Kconfig to
ensure that PROC_FS was enabled if NFSD_V4 was enabled.

There is a guideline in Documentation/kbuild/kconfig-language.txt that
states "In general use select only for non-visible symbols (no prompts
anywhere) and for symbols with no dependencies."

A quick grep around other Kconfig files reveals that no entry currently
uses "select PROC_FS" -- every one uses "depends on".  Thus CONFIG_NFSD_V4
should use "depends on PROC_FS" as well.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-04-23 16:13:41 -04:00
J. Bruce Fields 03550fac06 nfsd: move most of fh_verify to separate function
Move the code that actually parses the filehandle and looks up the
dentry and export to a separate function.  This simplifies the reference
counting a little and moves fh_verify() a little closer to the kernel
ideal of small, minimally-indentended functions.  Clean up a few other
minor style sins along the way.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Cc: Neil Brown <neilb@suse.de>
2008-04-23 16:13:41 -04:00
Felix Blyakher 9167f501c6 nfsd: initialize lease type in nfs4_open_delegation()
While lease is correctly checked by supplying the type argument to
vfs_setlease(), it's stored with fl_type uninitialized. This breaks the
logic when checking the type of the lease.  The fix is to initialize
fl_type.

The old code still happened to function correctly since F_RDLCK is zero,
and we only implement read delegations currently (nor write
delegations).  But that's no excuse for not fixing this.

Signed-off-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-04-23 16:13:40 -04:00
Jeff Layton a277e33cbe NFS: convert nfs4 callback thread to kthread API
There's a general push to convert kernel threads to use the (much
cleaner) kthread API. This patch converts the NFSv4 callback kernel
thread to the kthread API. In addition to being generally cleaner this
also removes the dependency on signals when shutting down the thread.

Note that this patch depends on the recent patches to svc_recv() to
make it check kthread_should_stop() periodically. Those patches are
in Bruce's tree at the moment and are slated for 2.6.26 along with
the lockd conversion, so this conversion is probably also appropriate
for 2.6.26.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-04-23 16:13:40 -04:00
Harvey Harrison 3ba1514815 nfsd: fix sparse warning in vfs.c
fs/nfsd/vfs.c:991:27: warning: Using plain integer as NULL pointer

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-04-23 16:13:39 -04:00
Harvey Harrison a254b246ee nfsd: fix sparse warnings
Add extern to nfsd/nfsd.h
fs/nfsd/nfssvc.c:146:5: warning: symbol 'nfsd_nrthreads' was not declared. Should it be static?
fs/nfsd/nfssvc.c:261:5: warning: symbol 'nfsd_nrpools' was not declared. Should it be static?
fs/nfsd/nfssvc.c:269:5: warning: symbol 'nfsd_get_nrthreads' was not declared. Should it be static?
fs/nfsd/nfssvc.c:281:5: warning: symbol 'nfsd_set_nrthreads' was not declared. Should it be static?
fs/nfsd/export.c:1534:23: warning: symbol 'nfs_exports_op' was not declared. Should it be static?

Add include of auth.h
fs/nfsd/auth.c:27:5: warning: symbol 'nfsd_setuser' was not declared. Should it be static?

Make static, move forward declaration closer to where it's needed.
fs/nfsd/nfs4state.c:1877:1: warning: symbol 'laundromat_main' was not declared. Should it be static?

Make static, forward declaration was already marked static.
fs/nfsd/nfs4idmap.c:206:1: warning: symbol 'idtoname_parse' was not declared. Should it be static?
fs/nfsd/vfs.c:1156:1: warning: symbol 'nfsd_create_setattr' was not declared. Should it be static?

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-04-23 16:13:39 -04:00
J. Bruce Fields d842120212 lockd: convert nsm_mutex to a spinlock
There's no reason for a mutex here, except to allow an allocation under
the lock, which we can avoid with the usual trick of preallocating
memory for the new object and freeing it if it turns out to be
unnecessary.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-04-23 16:13:39 -04:00
J. Bruce Fields a95e56e72c lockd: clean up __nsm_find()
Use list_for_each_entry().  Also, in keeping with kernel style, make the
normal case (kzalloc succeeds) unindented and handle the abnormal case
with a goto.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-04-23 16:13:39 -04:00
J. Bruce Fields 164f98adbb lockd: fix race in nlm_release()
The sm_count is decremented to zero but left on the nsm_handles list.
So in the space between decrementing sm_count and acquiring nsm_mutex,
it is possible for another task to find this nsm_handle, increment the
use count and then enter nsm_release itself.

Thus there's nothing to prevent the nsm being freed before we acquire
nsm_mutex here.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-04-23 16:13:39 -04:00
Harvey Harrison 93245d11fc lockd: fix sparse warning in svcshare.c
fs/lockd/svcshare.c:74:50: warning: Using plain integer as NULL pointer

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-04-23 16:13:39 -04:00
Adrian Bunk f2b0dee2ec make nfsd_create_setattr() static
This patch makes the needlessly global nfsd_create_setattr() static.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-04-23 16:13:38 -04:00
Chuck Lever 6aaa67b5f3 NFSD: Remove redundant "select" clauses in fs/Kconfig
As far as I can tell, selecting the CRYPTO and CRYPTO_MD5 entries under
CONFIG_NFSD is redundant, since CONFIG_NFSD_V4 already selects
RPCSEC_GSS_KRB5, which selects these entries.

Testing with "make menuconfig" shows that the entries under CRYPTO still
properly reflect "Y" or "M" based on the setting of CONFIG_NFSD after this
change is applied.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-04-23 16:13:38 -04:00
Chuck Lever 78dd0992a3 NFSD: Move "select NFSD_V2_ACL if NFSD_V3_ACL"
Clean up: since NFSD_V2_ACL is a boolean, it can be selected safely
under the NFSD_V3_ACL entry (also a boolean).

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-04-23 16:13:38 -04:00
Chuck Lever 892069552e NFSD: Move "select FS_POSIX_ACL if NFSD_V4"
Clean up: since FS_POSIX_ACL is a non-visible boolean entry, it can be
selected safely under the NFSD_V4 entry (also a boolean).

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-04-23 16:13:38 -04:00
Chuck Lever d24455b5ff NFSD: Update help text for CONFIG_NFSD
Clean up: refresh the help text for Kconfig items related to the NFS
server.  Remove obsolete URLs, and make the language consistent among
the options.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-04-23 16:13:38 -04:00
Chuck Lever 5ea0dd61f2 NFSD: Remove NFSD_TCP kernel build option
Likewise, distros usually leave CONFIG_NFSD_TCP enabled.

TCP support in the Linux NFS server is stable enough that we can leave it
on always.  CONFIG_NFSD_TCP adds about 10 lines of code, and defaults to
"Y" anyway.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-04-23 16:13:38 -04:00
J. Bruce Fields c0ce6ec87c nfsd: clarify readdir/mountpoint-crossing code
The code here is difficult to understand; attempt to clarify somewhat by
pulling out one of the more mystifying conditionals into a separate
function.

While we're here, also add lease_time to the list of attributes that we
don't really need to cross a mountpoint to fetch.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Cc: Peter Staubach <staubach@redhat.com>
2008-04-23 16:13:38 -04:00
J. Bruce Fields 6a85fa3add nfsd4: kill unnecessary check in preprocess_stateid_op
This condition is always true.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-04-23 16:13:37 -04:00
J. Bruce Fields 0836f58725 nfsd4: simplify stateid sequencing checks
Pull this common code into a separate function.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-04-23 16:13:37 -04:00
J. Bruce Fields f3362737be nfsd4: remove unnecessary CHECK_FH check in preprocess_seqid_op
Every caller sets this flag, so it's meaningless.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-04-23 16:13:37 -04:00
J. Bruce Fields 065f30ec14 nfs: remove unnecessary NFS_NEED_* defines
Thanks to Robert Day for pointing out that these two defines are unused.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Cc: Trond Myklebust <trond@netapp.com>Trond Myklebust <trond@netapp.com>
Cc: Neil Brown <neilb@suse.de>
Cc: "Robert P. J. Day" <rpjday@crashcourse.ca>
2008-04-23 16:13:37 -04:00
Aurélien Charbon f15364bd4c IPv6 support for NFS server export caches
This adds IPv6 support to the interfaces that are used to express nfsd
exports.  All addressed are stored internally as IPv6; backwards
compatibility is maintained using mapped addresses.

Thanks to Bruce Fields, Brian Haley, Neil Brown and Hideaki Joshifuji
for comments

Signed-off-by: Aurelien Charbon <aurelien.charbon@bull.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Brian Haley <brian.haley@hp.com>
Cc:  YOSHIFUJI Hideaki / 吉藤英明 <yoshfuji@linux-ipv6.org>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-04-23 16:13:36 -04:00
Jeff Layton d751a7cd06 NLM: Convert lockd to use kthreads
Have lockd_up start lockd using kthread_run. With this change,
lockd_down now blocks until lockd actually exits, so there's no longer
need for the waitqueue code at the end of lockd_down. This also means
that only one lockd can be running at a time which simplifies the code
within lockd's main loop.

This also adds a check for kthread_should_stop in the main loop of
nlmsvc_retry_blocked and after that function returns. There's no sense
continuing to retry blocks if lockd is coming down anyway.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-04-23 16:13:36 -04:00
NeilBrown 1447d25eb3 knfsd: Remove NLM_HOST_MAX and associated logic.
Lockd caches information about hosts that have recently held locks to
expedite the taking of further locks.

It periodically discards this information for hosts that have not been
used for a few minutes.

lockd currently has a value NLM_HOST_MAX, and changes the 'garbage
collection' behaviour when the number of hosts exceeds this threshold.

However its behaviour is strange, and likely not what was intended.
When the number of hosts exceeds the max, it scans *less* often (every
2 minutes vs every minute) and allows unused host information to
remain around longer (5 minutes instead of 2).

Having this limit is of dubious value anyway, and we have not
suffered from the code not getting the limit right, so remove the
limit altogether.  We go with the larger values (discard 5 minute old
hosts every 2 minutes) as they are probably safer.

Maybe the periodic garbage collection should be replace to with
'shrinker' handler so we just respond to memory pressure....

Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-04-23 16:13:35 -04:00
David Woodhouse 2c61cb250c [JFFS2] Introduce dbg_readinode2 log level, use it to shut read_dnode() up
We haven't seen bugs in this for a while now, since the rewrite. No need
to be _quite_ so verbose...

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2008-04-23 16:43:15 +01:00
David Woodhouse 422b120238 [JFFS2] Fix jffs2_reserve_space() when all blocks are pending erasure.
When _all_ the blocks were on the erase_pending_list, we could't find a
block to GC from but there was no _actually_ free space, and
jffs2_reserve_space() would get a little unhappy.

Handle this case by returning -EAGAIN from jffs2_garbage_collect_pass().
There are two callers of that function -- jffs2_flush_wbuf_gc(), which
will interpret it as an error and flush the writebuffer by other means,
and jffs2_reserve_space(), which we modify to respond to -EAGAIN with an
immediate call to jffs2_erase_pending_blocks() and another run round the
loop.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2008-04-23 16:01:37 +01:00
David Woodhouse e2bc322bf0 [JFFS2] Add erase_checking_list to hold blocks being marked.
Just to keep the debug code happy when it's adding all the blocks up.
Otherwise, they disappear for a while while the locks are dropped to
check them and write the cleanmarker.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2008-04-23 14:15:24 +01:00
Anders Grafström 8a0f572397 [JFFS2] Return values of jffs2_block_check_erase error paths
It looks the error paths in jffs2_block_check_erase() have wrong return 
values. A block that failed to be erased never gets marked as bad.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2008-04-23 10:06:46 +01:00
Miklos Szeredi 97e7e0f71d [patch 7/7] vfs: mountinfo: show dominating group id
Show peer group ID of nearest dominating group that has intersection
with the mount's namespace.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-23 00:05:09 -04:00
Ram Pai 2d4d4864ac [patch 6/7] vfs: mountinfo: add /proc/<pid>/mountinfo
[mszeredi@suse.cz] rewrite and split big patch into managable chunks

/proc/mounts in its current form lacks important information:

 - propagation state
 - root of mount for bind mounts
 - the st_dev value used within the filesystem
 - identifier for each mount and it's parent

It also suffers from the following problems:

 - not easily extendable
 - ambiguity of mountpoints within a chrooted environment
 - doesn't distinguish between filesystem dependent and independent options
 - doesn't distinguish between per mount and per super block options

This patch introduces /proc/<pid>/mountinfo which attempts to address
all these deficiencies.

Code shared between /proc/<pid>/mounts and /proc/<pid>/mountinfo is
extracted into separate functions.

Thanks to Al Viro for the help in getting the design right.

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-23 00:05:03 -04:00
Miklos Szeredi a1a2c409b6 [patch 5/7] vfs: mountinfo: allow using process root
Allow /proc/<pid>/mountinfo to use the root of <pid> to calculate
mountpoints.

 - move definition of 'struct proc_mounts' to <linux/mnt_namespace.h>
 - add the process's namespace and root to this structure
 - pass a pointer to 'struct proc_mounts' into seq_operations

In addition the following cleanups are made:

 - use a common open function for /proc/<pid>/{mounts,mountstat}
 - surround namespace.c part of these proc files with #ifdef CONFIG_PROC_FS
 - make the seq_operations structures const

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-23 00:04:57 -04:00
Miklos Szeredi 719f5d7f0b [patch 4/7] vfs: mountinfo: add mount peer group ID
Add a unique ID to each peer group using the IDR infrastructure.  The
identifiers are reused after the peer group dissolves.

The IDR structures are protected by holding namepspace_sem for write
while allocating or deallocating IDs.

IDs are allocated when a previously unshared vfsmount becomes the
first member of a peer group.  When a new member is added to an
existing group, the ID is copied from one of the old members.

IDs are freed when the last member of a peer group is unshared.

Setting the MNT_SHARED flag on members of a subtree is done as a
separate step, after all the IDs have been allocated.  This way an
allocation failure can be cleaned up easilty, without affecting the
propagation state.

Based on design sketch by Al Viro.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-23 00:04:51 -04:00
Miklos Szeredi 73cd49ecdd [patch 3/7] vfs: mountinfo: add mount ID
Add a unique ID to each vfsmount using the IDR infrastructure.  The
identifiers are reused after the vfsmount is freed.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-23 00:04:45 -04:00
Miklos Szeredi 9d1bc60138 [patch 2/7] vfs: mountinfo: add seq_file_root()
Add a new function:

  seq_file_root()

This is similar to seq_path(), but calculates the path relative to the
given root, instead of current->fs->root.  If the path was unreachable
from root, then modify the root parameter to reflect this.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-23 00:04:38 -04:00
Ram Pai 6092d04818 [patch 1/7] vfs: mountinfo: add dentry_path()
[mszeredi@suse.cz] split big patch into managable chunks

Add the following functions:

  dentry_path()
  seq_dentry()

These are similar to d_path() and seq_path().  But instead of
calculating the path within a mount namespace, they calculate the path
from the root of the filesystem to a given dentry, ignoring mounts
completely.

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-23 00:04:32 -04:00
Al Viro 934b25c597 [PATCH] remove unused label in xattr.c (noise from ro-bind)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-23 00:04:04 -04:00
Linus Torvalds 94bc891b00 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
  [PATCH] get rid of __exit_files(), __exit_fs() and __put_fs_struct()
  [PATCH] proc_readfd_common() race fix
  [PATCH] double-free of inode on alloc_file() failure exit in create_write_pipe()
  [PATCH] teach seq_file to discard entries
  [PATCH] umount_tree() will unhash everything itself
  [PATCH] get rid of more nameidata passing in namespace.c
  [PATCH] switch a bunch of LSM hooks from nameidata to path
  [PATCH] lock exclusively in collect_mounts() and drop_collected_mounts()
  [PATCH] move a bunch of declarations to fs/internal.h
2008-04-22 18:28:34 -07:00
David Woodhouse 19e56ceae7 [JFFS2] Finally remove redundant ref->__totlen field.
Haven't had any complaints about it recently, despite having the test
code enabled to verify that the calculated length is correct.

Kill it off, just by #undef TEST_TOTLEN for now; removing it for real
can come a little later.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2008-04-23 01:26:12 +01:00
David Woodhouse 27e6b8e388 [JFFS2] Honour TEST_TOTLEN macro in debugging code. ref->__totlen is going!
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2008-04-23 01:25:33 +01:00
David Woodhouse 85a62db624 [JFFS2] Add paranoia debugging for superblock counts
The problem fixed in commit 014b164e13
(space leak with in-band cleanmarkers) would have been caught a lot
quicker if our paranoid debugging mode had included adding up the size
counts from all the eraseblocks and comparing the totals with the counts
in the superblock. Add that.

Make jffs2_mark_erased_block() file the newly-erased block on the
free_list before calling the debug function, to make it happy.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2008-04-23 01:24:17 +01:00
Al Viro 9b4f526cdc [PATCH] proc_readfd_common() race fix
Since we drop the rcu_read_lock inside the loop, we can't assume
that files->fdt will remain unchanged (and not freed) between
iterations.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-22 19:55:03 -04:00
Al Viro ed15243717 [PATCH] double-free of inode on alloc_file() failure exit in create_write_pipe()
Duh...  Fortunately, the bug is quite recent (post-2.6.25) and, embarrassingly,
mine ;-/

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-22 19:54:57 -04:00
David Woodhouse 014b164e13 [JFFS2] Fix free space leak with in-band cleanmarkers
We were accounting for the cleanmarker by calling jffs2_link_node_ref()
(without locking!), which adjusted both superblock and per-eraseblock
accounting, subtracting the size of the cleanmarker from {jeb,c}->free_size
and adding it to {jeb,c}->used_size.

But only _then_ were we adding the size of the newly-erased block back
to the superblock counts, and we were adding each of jeb->{free,used}_size
to the corresponding superblock counts. Thus, the size of the cleanmarker
was effectively subtracted from the superblock's free_size _twice_.

Fix this, by always adding a full eraseblock size to c->free_size when
we've erased a block. And call jffs2_link_node_ref() under the proper
lock, while we're at it.

Thanks to Alexander Yurchenko and/or Damir Shayhutdinov for (almost)
pinpointing the problem.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2008-04-22 23:54:38 +01:00
David Woodhouse cf9d1e428c [JFFS2] Self-sufficient #includes in jffs2_fs_i.h: include <linux/mutex.h>
... instead of <linux/semaphore.h> which we don't need any more anyway.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2008-04-22 23:53:26 +01:00
David Sterba 16abef0e9e fs: use loff_t type instead of long long
Use offset type consistently.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-22 15:17:11 -07:00
Linus Torvalds 03b883840c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm:
  dlm: linux/{dlm,dlm_device}.h: cleanup for userspace
  dlm: common max length definitions
  dlm: move plock code from gfs2
  dlm: recover nodes that are removed and re-added
  dlm: save master info after failed no-queue request
  dlm: make dlm_print_rsb() static
  dlm: match signedness between dlm_config_info and cluster_set
2008-04-22 13:44:23 -07:00
Linus Torvalds 62429f4340 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-udf-2.6
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-udf-2.6: (41 commits)
  udf: use crc_itu_t from lib instead of udf_crc
  udf: Fix compilation warnings when UDF debug is on
  udf: Fix bug in VAT mapping code
  udf: Add read-only support for 2.50 UDF media
  udf: Fix handling of multisession media
  udf: Mount filesystem read-only if it has pseudooverwrite partition
  udf: Handle VAT packed inside inode properly
  udf: Allow loading of VAT inode
  udf: Fix detection of VAT version
  udf: Silence warning about accesses beyond end of device
  udf: Improve anchor block detection
  udf: Cleanup anchor block detection.
  udf: Move processing of virtual partitions
  udf: Move filling of partition descriptor info into a separate function
  udf: Improve error recovery on mount
  udf: Cleanup volume descriptor sequence processing
  udf: fix anchor point detection
  udf: Remove declarations of arrays of size UDF_NAME_LEN (256 bytes)
  udf: Remove checking of existence of filename in udf_add_entry()
  udf: Mark udf_process_sequence() as noinline
  ...
2008-04-22 13:40:47 -07:00
James Bottomley 0f4238958d [SCSI] sysfs: make group is_valid return a mode_t
We have a problem in scsi_transport_spi in that we need to customise
not only the visibility of the attributes, but also their mode.  Fix
this by making the is_visible() callback return a mode, with 0
indicating is not visible.

Also add a sysfs_update_group() API to allow us to change either the
visibility or mode of the files at any time on the fly.

Acked-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
2008-04-22 15:16:31 -05:00
David Woodhouse ced2207036 [JFFS2] semaphore->mutex conversion
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2008-04-22 15:13:40 +01:00
michael cca1584171 [JFFS2] add write verify on dataflash.
Add the write verification buffer to the dataflash.  The mtd_dataflash has
the CONFIG_DATAFLASH_WRITE_VERIFY so is better a change to Kconfig.

Signed-off-by: Michael Trimarchi <trimarchimichael@yahoo.it>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2008-04-22 12:35:50 +01:00
David Woodhouse 25dc30b4cd [JFFS2] fix sparse warnings in gc.c
fs/jffs2/gc.c:1147:29: warning: symbol 'jeb' shadows an earlier one
fs/jffs2/gc.c:1084:89: originally declared here
fs/jffs2/gc.c:1197:29: warning: symbol 'jeb' shadows an earlier one
fs/jffs2/gc.c:1084:89: originally declared here

Rename the unused 'jeb' argument to avoid this. We could potentially
remove the argument, but GCC should be doing that anyway.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2008-04-22 12:35:47 +01:00
Harvey Harrison bf66737ca8 [JFFS2] fix sparse warning in write.c
fs/jffs2/write.c:585:28: warning: symbol 'fd' shadows an earlier one
fs/jffs2/write.c:536:27: originally declared here

No need to redeclare fd, use the original one, after this point,
fd is always reassigned before it used again.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2008-04-22 12:35:46 +01:00
David Woodhouse 8ca646abb4 [JFFS2] Fix sparse warning in nodemgmt.c
fs/jffs2/nodemgmt.c:60:8: warning: symbol 'ret' shadows an earlier one
fs/jffs2/nodemgmt.c:45:6: originally declared here

(reported by Harvey Harrison)

Just remove the offending declaration of 'int ret' and use the earlier one.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2008-04-22 12:35:44 +01:00
Harvey Harrison f876a59dae [JFFS2] include function prototype for jffs2_ioctl
fs/jffs2/ioctl.c:14:5: warning: symbol 'jffs2_ioctl' was not declared.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2008-04-22 12:35:42 +01:00
David Woodhouse f838bad1b3 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6 2008-04-22 12:34:25 +01:00
Al Viro 521b5d0c40 [PATCH] teach seq_file to discard entries
Allow ->show() return SEQ_SKIP; that will discard all
output from that element and move on.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-21 23:14:02 -04:00
Al Viro 4e1b36fb48 [PATCH] umount_tree() will unhash everything itself
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-21 23:13:54 -04:00
Al Viro 8c3ee42e80 [PATCH] get rid of more nameidata passing in namespace.c
Further reduction of stack footprint (sys_pivot_root());
lose useless BKL in there, while we are at it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-21 23:13:47 -04:00
Al Viro b5266eb4c8 [PATCH] switch a bunch of LSM hooks from nameidata to path
Namely, ones from namespace.c

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-21 23:13:23 -04:00
Al Viro 1a60a28077 [PATCH] lock exclusively in collect_mounts() and drop_collected_mounts()
Taking namespace_sem shared there isn't worth the trouble, especially with
vfsmount ID allocation about to be added.  That way we know that umount_tree(),
copy_tree() and clone_mnt() are _always_ serialized by namespace_sem.
umount_tree() still needs vfsmount_lock (it manipulates hash chains, among
other things), but that's a separate story.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-21 23:11:09 -04:00
Al Viro 6d59e7f582 [PATCH] move a bunch of declarations to fs/internal.h
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-21 23:11:01 -04:00
Linus Torvalds 8a32272688 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-2.6:
  [SPARC]: Remove SunOS and Solaris binary support.
2008-04-21 17:20:53 -07:00
Linus Torvalds e9b62693ae Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/juhl/trivial
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/juhl/trivial: (24 commits)
  DOC:  A couple corrections and clarifications in USB doc.
  Generate a slightly more informative error msg for bad HZ
  fix typo "is" -> "if" in Makefile
  ext*: spelling fix prefered -> preferred
  DOCUMENTATION:  Use newer DEFINE_SPINLOCK macro in docs.
  KEYS:  Fix the comment to match the file name in rxrpc-type.h.
  RAID: remove trailing space from printk line
  DMA engine: typo fixes
  Remove unused MAX_NODES_SHIFT
  MAINTAINERS: Clarify access to OCFS2 development mailing list.
  V4L: Storage class should be before const qualifier (sn9c102)
  V4L: Storage class should be before const qualifier
  sonypi: Storage class should be before const qualifier
  intel_menlow: Storage class should be before const qualifier
  DVB: Storage class should be before const qualifier
  arm: Storage class should be before const qualifier
  ALSA: Storage class should be before const qualifier
  acpi: Storage class should be before const qualifier
  firmware_sample_driver.c: fix coding style
  MAINTAINERS: Add ati_remote2 driver
  ...

Fixed up trivial conflicts in firmware_sample_driver.c
2008-04-21 16:36:46 -07:00
Linus Torvalds 548453fd10 Merge branch 'for-2.6.26' of git://git.kernel.dk/linux-2.6-block
* 'for-2.6.26' of git://git.kernel.dk/linux-2.6-block:
  block: fix blk_register_queue() return value
  block: fix memory hotplug and bouncing in block layer
  block: replace remaining __FUNCTION__ occurrences
  Kconfig: clean up block/Kconfig help descriptions
  cciss: fix warning oops on rmmod of driver
  cciss: Fix race between disk-adding code and interrupt handler
  block: move the padding adjustment to blk_rq_map_sg
  block: add bio_copy_user_iov support to blk_rq_map_user_iov
  block: convert bio_copy_user to bio_copy_user_iov
  loop: manage partitions in disk image
  cdrom: use kmalloced buffers instead of buffers on stack
  cdrom: make unregister_cdrom() return void
  cdrom: use list_head for cdrom_device_info list
  cdrom: protect cdrom_device_info list by mutex
  cdrom: cleanup hardcoded error-code
  cdrom: remove ifdef CONFIG_SYSCTL
2008-04-21 16:03:40 -07:00
Linus Torvalds e80ab411e5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-2.6: (36 commits)
  SCSI: convert struct class_device to struct device
  DRM: remove unused dev_class
  IB: rename "dev" to "srp_dev" in srp_host structure
  IB: convert struct class_device to struct device
  memstick: convert struct class_device to struct device
  driver core: replace remaining __FUNCTION__ occurrences
  sysfs: refill attribute buffer when reading from offset 0
  PM: Remove destroy_suspended_device()
  Firmware: add iSCSI iBFT Support
  PM: Remove legacy PM (fix)
  Kobject: Replace list_for_each() with list_for_each_entry().
  SYSFS: Explicitly include required header file slab.h.
  Driver core: make device_is_registered() work for class devices
  PM: Convert wakeup flag accessors to inline functions
  PM: Make wakeup flags available whenever CONFIG_PM is set
  PM: Fix misuse of wakeup flag accessors in serial core
  Driver core: Call device_pm_add() after bus_add_device() in device_add()
  PM: Handle device registrations during suspend/resume
  block: send disk "change" event for rescan_partitions()
  sysdev: detect multiple driver registrations
  ...

Fixed trivial conflict in include/linux/memory.h due to semaphore header
file change (made irrelevant by the change to mutex).
2008-04-21 15:49:58 -07:00
Benoit Boissinot 1cc8dcf569 ext*: spelling fix prefered -> preferred
Spelling fix: prefered -> preferred

Signed-off-by: Benoit Boissinot <benoit.boissinot@ens-lyon.org>
Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
2008-04-21 22:45:55 +00:00
Linus Torvalds 429f731dea Merge branch 'semaphore' of git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc
* 'semaphore' of git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc:
  Deprecate the asm/semaphore.h files in feature-removal-schedule.
  Convert asm/semaphore.h users to linux/semaphore.h
  security: Remove unnecessary inclusions of asm/semaphore.h
  lib: Remove unnecessary inclusions of asm/semaphore.h
  kernel: Remove unnecessary inclusions of asm/semaphore.h
  include: Remove unnecessary inclusions of asm/semaphore.h
  fs: Remove unnecessary inclusions of asm/semaphore.h
  drivers: Remove unnecessary inclusions of asm/semaphore.h
  net: Remove unnecessary inclusions of asm/semaphore.h
  arch: Remove unnecessary inclusions of asm/semaphore.h
2008-04-21 15:41:27 -07:00
Pavel Machek f5264481c8 trivial: small cleanups
These are small cleanups all over the tree.

Trivial style and comment changes to
  fs/select.c, kernel/signal.c, kernel/stop_machine.c & mm/pdflush.c

Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
2008-04-21 22:15:06 +00:00
David S. Miller ec98c6b9b4 [SPARC]: Remove SunOS and Solaris binary support.
As per Documentation/feature-removal-schedule.txt

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-04-21 15:10:15 -07:00
David Teigland 3d564fa347 dlm: common max length definitions
Add central definitions for max lockspace name length and max resource
name length.  The lack of central definitions has resulted in scattered
private definitions which we can now clean up, including an unused one
in dlm_device.h.

Signed-off-by: David Teigland <teigland@redhat.com>
2008-04-21 11:22:29 -05:00
David Teigland 2402211a83 dlm: move plock code from gfs2
Move the code that handles cluster posix locks from gfs2 into the dlm
so that it can be used by both gfs2 and ocfs2.

Signed-off-by: David Teigland <teigland@redhat.com>
2008-04-21 11:22:28 -05:00
David Teigland d44e0fc704 dlm: recover nodes that are removed and re-added
If a node is removed from a lockspace, and then added back before the
dlm is notified of the removal, the dlm will not detect the removal
and won't clear the old state from the node.  This is fixed by using a
list of added nodes so the membership recovery can detect when a newly
added node is already in the member list.

Signed-off-by: David Teigland <teigland@redhat.com>
2008-04-21 11:18:01 -05:00
David Teigland 761b9d3ffc dlm: save master info after failed no-queue request
When a NOQUEUE request fails, the rsb res_master field is unnecessarily
reset to -1, instead of leaving the valid master setting in place.  We
want to save the looked-up master values while the rsb is on the "toss
list" so that another lookup can be avoided if the rsb is soon reused.
The fix is to simply leave res_master value alone.

Signed-off-by: David Teigland <teigland@redhat.com>
2008-04-21 11:18:01 -05:00
Adrian Bunk 170e19ab29 dlm: make dlm_print_rsb() static
dlm_print_rsb() can now become static.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: David Teigland <teigland@redhat.com>
2008-04-21 11:18:01 -05:00
Harvey Harrison 5416b704ae dlm: match signedness between dlm_config_info and cluster_set
cluster_set is only called from the macro CLUSTER_ATTR which defines read/write
access functions.  Make the signedness match to avoid sparse warnings every time
CLUSTER_ATTR is used (lines 149-159) all of the form:

fs/dlm/config.c:149:1: warning: incorrect type in argument 3 (different signedness)
fs/dlm/config.c:149:1:    expected unsigned int *info_field
fs/dlm/config.c:149:1:    got int extern [toplevel] *<noident>

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2008-04-21 11:18:01 -05:00
FUJITA Tomonori c5dec1c303 block: convert bio_copy_user to bio_copy_user_iov
This patch enables bio_copy_user to take struct sg_iovec (renamed
bio_copy_user_iov). bio_copy_user uses bio_copy_user_iov internally as
bio_map_user uses bio_map_user_iov.

The major changes are:

- adds sg_iovec array to struct bio_map_data

- adds __bio_copy_iov that copy data between bio and
sg_iovec. bio_copy_user_iov and bio_uncopy_user use it.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-04-21 09:50:08 +02:00
Dan Williams 2424b5dd06 sysfs: refill attribute buffer when reading from offset 0
Requiring userspace to close and re-open sysfs attributes has been the
policy since before 2.6.12.  It allows userspace to get a consistent
snapshot of kernel state and consume it with incremental reads and seeks.

Now, if the file position is zero the kernel assumes userspace wants to see
the new value.  The application for this change is to allow a userspace
RAID metadata handler to check the state of an array without causing any
memory allocations.  Thus not causing writeback to a raid array that might
be blocked waiting for userspace to take action.

Cc: Neil Brown <neilb@suse.de>
Acked-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-04-19 19:10:29 -07:00
Robert P. J. Day c6f8773382 SYSFS: Explicitly include required header file slab.h.
After an experimental deletion of the unnecessary inclusion of
<linux/slab.h> from the header file <linux/percpu.h>, the following
files under fs/sysfs were exposed as needing to explicitly include
<linux/slab.h>.

Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-04-19 19:10:27 -07:00
Kay Sievers 6bcf19d02a block: send disk "change" event for rescan_partitions()
Userspace likes to get notified that the disk may have changed, when
rescan_partitions() is called after partitioning or media change. It will
make it possible to update the state of the disk with the "change" event,
before the following partition "add" events are handled.

Cc: David Zeuthen <david@fubar.dk>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-04-19 19:10:24 -07:00
Adrian Bunk a3dab29353 make nfs_automount_list static
nfs_automount_list can now become static.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-04-19 16:55:29 -04:00
Jeff Layton daa7da5fd3 NFS: remove duplicate flags assignment from nfs_validate_mount_data
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-04-19 16:55:25 -04:00
Cyrill Gorcunov 63649bd708 NFS - fix potential NULL pointer dereference v2
There is possible NULL pointer dereference if kstr[n]dup failed.
So fix them for safety.

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-04-19 16:55:22 -04:00
Trond Myklebust a2b2bb8822 NFSv4: Attempt to use machine credentials in SETCLIENTID calls
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-04-19 16:54:59 -04:00
Trond Myklebust 7c67db3a8a NFSv4: Reintroduce machine creds
We need to try to ensure that we always use the same credentials whenever
we re-establish the clientid on the server. If not, the server won't
recognise that we're the same client, and so may not allow us to recover
state.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-04-19 16:54:56 -04:00
Trond Myklebust 78ea323be6 NFSv4: Don't use cred->cr_ops->cr_name in nfs4_proc_setclientid()
With the recent change to generic creds, we can no longer use
cred->cr_ops->cr_name to distinguish between RPCSEC_GSS principals and
AUTH_SYS/AUTH_NULL identities. Replace it with the rpc_authops->au_name
instead...

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-04-19 16:54:53 -04:00
Fred Isaman 4410924157 nfs: fix printout of multiword bitfields
Benny points out that zero-padding of multiword bitfields is necessary,
and that delimiting each word is nice to avoid endianess confusion.

bhalevy: without zero padding output can be ambiguous. Also,
since the printed array of two 32-bit unsigned integers is not a
64-bit number, delimiting the output with a semicolon makes more sense.

Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-04-19 16:54:50 -04:00
Benny Halevy 856dff3d38 nfs: return negative error value from nfs{,4}_stat_to_errno
All use sites for nfs{,4}_stat_to_errno negate their return value.
It's more efficient to return a negative error from the stat_to_errno convertors
rather than negating its return value everywhere. This also produces slightly
smaller code.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-04-19 16:54:47 -04:00
Trond Myklebust d11d10cc05 NLM/lockd: Ensure client locking calls use correct credentials
Now that we've added the 'generic' credentials (that are independent of the
rpc_client) to the nfs_open_context, we can use those in the NLM client to
ensure that the lock/unlock requests are authenticated to whoever
originally opened the file.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-04-19 16:54:43 -04:00
Trond Myklebust c4d7c402b7 NFS: Remove the buggy lock-if-signalled case from do_setlk()
Both NLM and NFSv4 should be able to clean up adequately in the case where
the user interrupts the RPC call...

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-04-19 16:53:52 -04:00
Trond Myklebust 5f50c0c6d6 NLM/lockd: Fix a race when cancelling a blocking lock
We shouldn't remove the lock from the list of blocked locks until the
CANCEL call has completed since we may be racing with a GRANTED callback.

Also ensure that we send an UNLOCK if the CANCEL request failed. Normally
that should only happen if the process gets hit with a fatal signal.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-04-19 16:53:49 -04:00
Trond Myklebust 6b4b3a752b NLM/lockd: Ensure that nlmclnt_cancel() returns results of the CANCEL call
Currently, it returns success as long as the RPC call was sent. We'd like
to know if the CANCEL operation succeeded on the server.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-04-19 16:53:45 -04:00
Trond Myklebust 8ec7ff7444 NLM: Remove the signal masking in nlmclnt_proc/nlmclnt_cancel
The signal masks have been rendered obsolete by the preceding patch.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-04-19 16:53:42 -04:00
Trond Myklebust dc9d8d0481 NLM/lockd: convert __nlm_async_call to use rpc_run_task()
Peter Staubach comments:

> In the course of investigating testing failures in the locking phase of
> the Connectathon testsuite, I discovered a couple of things.  One was
> that one of the tests in the locking tests was racy when it didn't seem
> to need to be and two, that the NFS client asynchronously releases locks
> when a process is exiting.
...
> The Single UNIX Specification Version 3 specifies that:  "All locks
> associated with a file for a given process shall be removed when a file
> descriptor for that file is closed by that process or the process holding
> that file descriptor terminates.".
>
> This does not specify whether those locks must be released prior to the
> completion of the exit processing for the process or not.  However,
> general assumptions seem to be that those locks will be released.  This
> leads to more deterministic behavior under normal circumstances.

The following patch converts the NFSv2/v3 locking code to use the same
mechanism as NFSv4 for sending asynchronous RPC calls and then waiting for
them to complete. This ensures that the UNLOCK and CANCEL RPC calls will
complete even if the user interrupts the call, yet satisfies the
above request for synchronous behaviour on process exit.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-04-19 16:53:39 -04:00
Trond Myklebust 5e7f37a76f NLM/lockd: Add a reference counter to struct nlm_rqst
When we replace the existing synchronous RPC calls with asynchronous calls,
the reference count will be needed in order to allow us to examine the
result of the RPC call.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-04-19 16:53:36 -04:00
Trond Myklebust 536ff0f809 NFSv4: Ensure we don't corrupt fl->fl_flags in nfs4_proc_unlck
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-04-19 16:53:33 -04:00
Trond Myklebust 4a9af59fee NLM/lockd: Ensure we don't corrupt fl->fl_flags in nlmclnt_unlock()
Also fix up nlmclnt_lock() so that it doesn't pass modified versions of
fl->fl_flags to nlmclnt_cancel() and other helpers.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-04-19 16:53:30 -04:00
Trond Myklebust c1d519312d NFSv4: Only increment the sequence id if the server saw it
It is quite possible that the OPEN, CLOSE, LOCK, LOCKU,... compounds fail
before the actual stateful operation has been executed (for instance in the
PUTFH call). There is no way to tell from the overall status result which
operations were executed from the COMPOUND.

The fix is to move incrementing of the sequence id into the XDR layer,
so that we do it as we process the results from the stateful operation.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-04-19 16:53:15 -04:00
Trond Myklebust 35d05778e2 NFSv4: Remove bogus call to nfs4_drop_state_owner() in _nfs4_open_expired()
There should be no need to invalidate a perfectly good state owner just
because of a stale filehandle. Doing so can cause the state recovery code
to break, since nfs4_get_renew_cred() and nfs4_get_setclientid_cred() rely
on finding active state owners.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-04-19 16:53:12 -04:00
Trond Myklebust dbae4c73f0 NFS: Ensure that rpc_run_task() errors are propagated back to the caller
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-04-19 16:53:08 -04:00
Trond Myklebust c9d8f89d98 NFS: Ensure that the write code cleans up properly when rpc_run_task() fails
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-04-19 16:53:05 -04:00
Trond Myklebust fdd1e74c89 NFS: Ensure that the read code cleans up properly when rpc_run_task() fails
In the case of readpage() we need to ensure that the pages get unlocked,
and that the error is flagged.

In the case of O_DIRECT, we need to ensure that the pages are all released.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-04-19 16:53:01 -04:00
Trond Myklebust 73e3302f60 NFS: Fix nfs_wb_page() to always exit with an error or a clean page
It is possible for nfs_wb_page() to sometimes exit with 0 return value, yet
the page is left in a dirty state.
For instance in the case where the server rebooted, and the COMMIT request
failed, then all the previously "clean" pages which were cached by the
server, but were not guaranteed to have been writted out to disk,
have to be redirtied and resent to the server.
The fix is to have nfs_wb_page_priority() check that the page is clean
before it exits...

This fixes a condition that triggers the BUG_ON(PagePrivate(page)) in
nfs_create_request() when we're in the nfs_readpage() path.

Also eliminate a redundant BUG_ON(!PageLocked(page)) while we're at it. It
turns out that clear_page_dirty_for_io() has the exact same test.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2008-04-19 16:52:58 -04:00
Dave Hansen ad775f5a8f [PATCH] r/o bind mounts: debugging for missed calls
There have been a few oopses caused by 'struct file's with NULL f_vfsmnts.
There was also a set of potentially missed mnt_want_write()s from
dentry_open() calls.

This patch provides a very simple debugging framework to catch these kinds of
bugs.  It will WARN_ON() them, but should stop us from having any oopses or
mnt_writer count imbalances.

I'm quite convinced that this is a good thing because it found bugs in the
stuff I was working on as soon as I wrote it.

[hch: made it conditional on a debug option.
      But it's still a little bit too ugly]

[hch: merged forced remount r/o fix from Dave and akpm's fix for the fix]

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-19 00:29:28 -04:00
Dave Hansen 2e4b7fcd92 [PATCH] r/o bind mounts: honor mount writer counts at remount
Originally from: Herbert Poetzl <herbert@13thfloor.at>

This is the core of the read-only bind mount patch set.

Note that this does _not_ add a "ro" option directly to the bind mount
operation.  If you require such a mount, you must first do the bind, then
follow it up with a 'mount -o remount,ro' operation:

If you wish to have a r/o bind mount of /foo on bar:

	mount --bind /foo /bar
	mount -o remount,ro /bar

Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-19 00:29:27 -04:00
Dave Hansen 3d733633a6 [PATCH] r/o bind mounts: track numbers of writers to mounts
This is the real meat of the entire series.  It actually
implements the tracking of the number of writers to a mount.
However, it causes scalability problems because there can be
hundreds of cpus doing open()/close() on files on the same mnt at
the same time.  Even an atomic_t in the mnt has massive scalaing
problems because the cacheline gets so terribly contended.

This uses a statically-allocated percpu variable.  All want/drop
operations are local to a cpu as long that cpu operates on the same
mount, and there are no writer count imbalances.  Writer count
imbalances happen when a write is taken on one cpu, and released
on another, like when an open/close pair is performed on two

Upon a remount,ro request, all of the data from the percpu
variables is collected (expensive, but very rare) and we determine
if there are any outstanding writers to the mount.

I've written a little benchmark to sit in a loop for a couple of
seconds in several cpus in parallel doing open/write/close loops.

http://sr71.net/~dave/linux/openbench.c

The code in here is a a worst-possible case for this patch.  It
does opens on a _pair_ of files in two different mounts in parallel.
This should cause my code to lose its "operate on the same mount"
optimization completely.  This worst-case scenario causes a 3%
degredation in the benchmark.

I could probably get rid of even this 3%, but it would be more
complex than what I have here, and I think this is getting into
acceptable territory.  In practice, I expect writing more than 3
bytes to a file, as well as disk I/O to mask any effects that this
has.

(To get rid of that 3%, we could have an #defined number of mounts
in the percpu variable.  So, instead of a CPU getting operate only
on percpu data when it accesses only one mount, it could stay on
percpu data when it only accesses N or fewer mounts.)

[AV] merged fix for __clear_mnt_mount() stepping on freed vfsmount

Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-19 00:29:27 -04:00
Dave Hansen 2c463e9548 [PATCH] r/o bind mounts: check mnt instead of superblock directly
If we depend on the inodes for writeability, we will not catch the r/o mounts
when implemented.

This patches uses __mnt_want_write().  It does not guarantee that the mount
will stay writeable after the check.  But, this is OK for one of the checks
because it is just for a printk().

The other two are probably unnecessary and duplicate existing checks in the
VFS.  This won't make them better checks than before, but it will make them
detect r/o mounts.

Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-19 00:29:27 -04:00
Dave Hansen ec82687f29 [PATCH] r/o bind mounts: elevate count for xfs timestamp updates
Elevate the write count during the xfs m/ctime updates.

XFS has to do it's own timestamp updates due to an unfortunate VFS
design limitation, so it will have to track writers by itself aswell.

[hch: split out from the touch_atime patch as it's not related to it at all]

Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-19 00:29:26 -04:00
Dave Hansen 2f676cbc0d [PATCH] r/o bind mounts: make access() use new r/o helper
It is OK to let access() go without using a mnt_want/drop_write() pair because
it doesn't actually do writes to the filesystem, and it is inherently racy
anyway.  This is a rare case when it is OK to use __mnt_is_readonly()
directly.

Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-19 00:29:26 -04:00
Dave Hansen 9ac9b8474c [PATCH] r/o bind mounts: write counts for truncate()
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-19 00:29:25 -04:00
Dave Hansen 2af482a7ed [PATCH] r/o bind mounts: elevate write count for chmod/chown callers
chown/chmod,etc...  don't call permission in the same way that the normal
"open for write" calls do.  They still write to the filesystem, so bump the
write count during these operations.

Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-19 00:29:25 -04:00
Dave Hansen 4a3fd211cc [PATCH] r/o bind mounts: elevate write count for open()s
This is the first really tricky patch in the series.  It elevates the writer
count on a mount each time a non-special file is opened for write.

We used to do this in may_open(), but Miklos pointed out that __dentry_open()
is used as well to create filps.  This will cover even those cases, while a
call in may_open() would not have.

There is also an elevated count around the vfs_create() call in open_namei().
See the comments for more details, but we need this to fix a 'create, remount,
fail r/w open()' race.

Some filesystems forego the use of normal vfs calls to create
struct files.   Make sure that these users elevate the mnt
writer count because they will get __fput(), and we need
to make sure they're balanced.

Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-19 00:29:25 -04:00
Dave Hansen 42a74f206b [PATCH] r/o bind mounts: elevate write count for ioctls()
Some ioctl()s can cause writes to the filesystem.  Take these, and make them
use mnt_want/drop_write() instead.

[AV: updated]

Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-19 00:29:24 -04:00
Dave Hansen 20ddee2c75 [PATCH] r/o bind mounts: write count for file_update_time()
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-19 00:29:24 -04:00
Dave Hansen 74f9fdfa1f [PATCH] r/o bind mounts: elevate write count for do_utimes()
Now includes fix for oops seen by akpm.

"never let a libc developer write your kernel code" - hch

"nor, apparently, a kernel developer" - akpm

Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-19 00:29:24 -04:00
Dave Hansen cdb70f3f74 [PATCH] r/o bind mounts: write counts for touch_atime()
Remove handling of NULL mnt while we are at it - that can't happen these days.

Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-19 00:29:23 -04:00
Dave Hansen a761a1c03a [PATCH] r/o bind mounts: elevate write count for ncp_ioctl()
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-19 00:29:23 -04:00
Dave Hansen 18f335aff8 [PATCH] r/o bind mounts: elevate write count for xattr_permission() callers
This basically audits the callers of xattr_permission(), which calls
permission() and can perform writes to the filesystem.

[AV: add missing parts - removexattr() and nfsd posix acls, plug for a leak
spotted by Miklos]

Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-19 00:29:15 -04:00
Dave Hansen 9079b1eb17 [PATCH] r/o bind mounts: get write access for vfs_rename() callers
This also uses the little helper in the NFS code to make an if() a little bit
less ugly.  We introduced the helper at the beginning of the series.

Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-19 00:25:34 -04:00
Dave Hansen 75c3f29de7 [PATCH] r/o bind mounts: write counts for link/symlink
[AV: add missing nfsd pieces]

Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-19 00:25:34 -04:00
Dave Hansen 463c319726 [PATCH] r/o bind mounts: get callers of vfs_mknod/create/mkdir()
This takes care of all of the direct callers of vfs_mknod().
Since a few of these cases also handle normal file creation
as well, this also covers some calls to vfs_create().

So that we don't have to make three mnt_want/drop_write()
calls inside of the switch statement, we move some of its
logic outside of the switch and into a helper function
suggested by Christoph.

This also encapsulates a fix for mknod(S_IFREG) that Miklos
found.

[AV: merged mkdir handling, added missing nfsd pieces]

Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-19 00:25:34 -04:00
Dave Hansen 0622753b80 [PATCH] r/o bind mounts: elevate write count for rmdir and unlink.
Elevate the write count during the vfs_rmdir() and vfs_unlink().

[AV: merged rmdir and unlink parts, added missing pieces in nfsd]

Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-19 00:25:33 -04:00
Dave Hansen 49e0d02cf0 [PATCH] r/o bind mounts: drop write during emergency remount
The emergency remount code forcibly removes FMODE_WRITE from
filps.  The r/o bind mount code notices that this was done
without a proper mnt_drop_write() and properly gives a
warning.

This patch does a mnt_drop_write() to keep everything
balanced.

Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-19 00:25:33 -04:00
Dave Hansen aceaf78da9 [PATCH] r/o bind mounts: create helper to drop file write access
If someone decides to demote a file from r/w to just
r/o, they can use this same code as __fput().

NFS does just that, and will use this in the next
patch.

AV: drop write access in __fput() only after we evict from file list.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Cc: Erez Zadok <ezk@cs.sunysb.edu>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J Bruce Fields" <bfields@fieldses.org>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-19 00:25:32 -04:00
Dave Hansen 8366025eb8 [PATCH] r/o bind mounts: stub functions
This patch adds two function mnt_want_write() and mnt_drop_write().  These are
used like a lock pair around and fs operations that might cause a write to the
filesystem.

Before these can become useful, we must first cover each place in the VFS
where writes are performed with a want/drop pair.  When that is complete, we
can actually introduce code that will safely check the counts before allowing
r/w<->r/o transitions to occur.

Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-19 00:25:32 -04:00
Christoph Hellwig a70e65df88 [PATCH] merge open_namei() and do_filp_open()
open_namei() will, in the future, need to take mount write counts
over its creation and truncation (via may_open()) operations.  It
needs to keep these write counts until any potential filp that is
created gets __fput()'d.

This gets complicated in the error handling and becomes very murky
as to how far open_namei() actually got, and whether or not that
mount write count was taken.  That makes it a bad interface.

All that the current do_filp_open() really does is allocate the
nameidata on the stack, then call open_namei().

So, this merges those two functions and moves filp_open() over
to namei.c so it can be close to its buddy: do_filp_open().  It
also gets a kerneldoc comment in the process.

Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-19 00:25:32 -04:00
Dave Hansen d57999e152 [PATCH] do namei_flags calculation inside open_namei()
My end goal here is to make sure all users of may_open()
return filps.  This will ensure that we properly release
mount write counts which were taken for the filp in
may_open().

This patch moves the sys_open flags to namei flags
calculation into fs/namei.c.  We'll shortly be moving
the nameidata_to_filp() calls into namei.c, and this
gets the sys_open flags to a place where we can get
at them when we need them.

Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-04-19 00:25:31 -04:00
Matthew Wilcox 6188e10d38 Convert asm/semaphore.h users to linux/semaphore.h
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
2008-04-18 22:22:54 -04:00
Matthew Wilcox cb688371e2 fs: Remove unnecessary inclusions of asm/semaphore.h
None of these files use any of the functionality promised by
asm/semaphore.h.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
2008-04-18 22:16:44 -04:00
Linus Torvalds 334d094504 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6.26
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6.26: (1090 commits)
  [NET]: Fix and allocate less memory for ->priv'less netdevices
  [IPV6]: Fix dangling references on error in fib6_add().
  [NETLABEL]: Fix NULL deref in netlbl_unlabel_staticlist_gen() if ifindex not found
  [PKT_SCHED]: Fix datalen check in tcf_simp_init().
  [INET]: Uninline the __inet_inherit_port call.
  [INET]: Drop the inet_inherit_port() call.
  SCTP: Initialize partial_bytes_acked to 0, when all of the data is acked.
  [netdrvr] forcedeth: internal simplifications; changelog removal
  phylib: factor out get_phy_id from within get_phy_device
  PHY: add BCM5464 support to broadcom PHY driver
  cxgb3: Fix __must_check warning with dev_dbg.
  tc35815: Statistics cleanup
  natsemi: fix MMIO for PPC 44x platforms
  [TIPC]: Cleanup of TIPC reference table code
  [TIPC]: Optimized initialization of TIPC reference table
  [TIPC]: Remove inlining of reference table locking routines
  e1000: convert uint16_t style integers to u16
  ixgb: convert uint16_t style integers to u16
  sb1000.c: make const arrays static
  sb1000.c: stop inlining largish static functions
  ...
2008-04-18 18:02:35 -07:00
Steve French 076d8423a9 [CIFS] Fix UNC path prefix on QueryUnixPathInfo to have correct slash
When a share was in DFS and the server was Unix/Linux, we were sending paths of the form
    \\server\share/dir/file
rather than
    //server/share/dir/file

There was some discussion between me and jra over whether we should use
    /server/share/dir/file
as MS sometimes says - but the documentation for this claims it should be
doubleslash for this type of UNC-like path format and that works, so leaving
it as doubleslash but converting the \ to / in the the //server/share portion.

This gets Samba to now correctly return STATUS_PATH_NOT_COVERED when it is
supposed to (Windows already did since the direction of the slash was not an issue
for them).  Still need another minor change to fully enable DFS (need to finish
some chages to SMBGetDFSRefer

Signed-off-by: Steve French <sfrench@us.ibm.com>
2008-04-18 23:26:26 +00:00
Linus Torvalds e675349e2b Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2: (64 commits)
  ocfs2/net: Add debug interface to o2net
  ocfs2: Only build ocfs2/dlm with the o2cb stack module
  ocfs2/cluster: Get rid of arguments to the timeout routines
  ocfs2: Put tree in MAINTAINERS
  ocfs2: Use BUG_ON
  ocfs2: Convert ocfs2 over to unlocked_ioctl
  ocfs2: Improve rename locking
  fs/ocfs2/aops.c: test for IS_ERR rather than 0
  ocfs2: Add inode stealing for ocfs2_reserve_new_inode
  ocfs2: Add ac_alloc_slot in ocfs2_alloc_context
  ocfs2: Add a new parameter for ocfs2_reserve_suballoc_bits
  ocfs2: Enable cross extent block merge.
  ocfs2: Add support for cross extent block
  ocfs2: Move /sys/o2cb to /sys/fs/o2cb
  sysfs: Allow removal of symlinks in the sysfs root
  ocfs2:  Reconnect after idle time out.
  ocfs2/dlm: Cleanup lockres print
  ocfs2/dlm: Fix lockname in lockres print function
  ocfs2/dlm: Move dlm_print_one_mle() from dlmmaster.c to dlmdebug.c
  ocfs2/dlm: Dumps the purgelist into a debugfs file
  ...
2008-04-18 10:15:22 -07:00
Linus Torvalds ef38ff9d37 Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw: (49 commits)
  [GFS2] fix assertion in log_refund()
  [GFS2] fix GFP_KERNEL misuses
  [GFS2] test for IS_ERR rather than 0
  [GFS2] Invalidate cache at correct point
  [GFS2] fs/gfs2/recovery.c: suppress warnings
  [GFS2] Faster gfs2_bitfit algorithm
  [GFS2] Streamline quota lock/check for no-quota case
  [GFS2] Remove drop of module ref where not needed
  [GFS2] gfs2_adjust_quota has broken unstuffing code
  [GFS2] possible null pointer dereference fixup
  [GFS2] Need to ensure that sector_t is 64bits for GFS2
  [GFS2] re-support special inode
  [GFS2] remove gfs2_dev_iops
  [GFS2] fix file_system_type leak on gfs2meta mount
  [GFS2] Allow bmap to allocate extents
  [GFS2] Fix a page lock / glock deadlock
  [GFS2] proper extern for gfs2/locking/dlm/mount.c:gdlm_ops
  [GFS2] gfs2/ops_file.c should #include "ops_inode.h"
  [GFS2] be*_add_cpu conversion
  [GFS2] Fix bug where we called drop_bh incorrectly
  ...
2008-04-18 10:02:46 -07:00
Steve French 2302aca850 [CIFS] Reserve new proxy cap for WAFS
New WAFS filer uses ioctls which are shown to be available
on a share by querying this info level

Acked-by: Sam Liddicott <sam@liddicott.com>
Signed-off-by: Stevef French <sfrench@us.ibm.com>
2008-04-18 16:40:32 +00:00
Sunil Mushran 2309e9e040 ocfs2/net: Add debug interface to o2net
This patch exposes o2net information via debugfs. The information includes
the list of sockets (sock_containers) as well as the list of outstanding
messages (send_tracking). Useful for o2dlm debugging.

(This patch is derived from an earlier one written by Zach Brown that
exposed the same information via /proc.)

[Mark: checkpatch fixes]

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Reviewed-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:20 -07:00
Mark Fasheh 93b06edb51 ocfs2: Only build ocfs2/dlm with the o2cb stack module
fs/ocfs2/dlm/ocfs2_dlm.ko and fs/ocfs2/dlm/ocfs2_dlmfs.ko get built if
CONFIG_FS_OCFS2 is specified. This isn't quite how it should happen any more
- the "o2cb" dlm modules should only be built if CONFIG_FS_OCFS2_O2CB is
set, so update the dlm Makefile accordingly.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
2008-04-18 08:56:12 -07:00
Jeff Mahoney 409753bf6d ocfs2/cluster: Get rid of arguments to the timeout routines
We keep seeing bug reports related to NULL pointer derefs in
o2net_set_nn_state(). When I originally wrote up the configurable timeout
patch, I had tried to plan for multiple clusters. This was silly.

The timeout routines all use o2nm_single_cluster so there's no point in
passing an argument at all. This patch removes the arguments and kills those
bugs dead.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:12 -07:00
Julia Lawall b1f3550fa1 ocfs2: Use BUG_ON
if (...) BUG(); should be replaced with BUG_ON(...) when the test has no
side-effects to allow a definition of BUG_ON that drops the code completely.

The semantic patch that makes this change is as follows:
(http://www.emn.fr/x-info/coccinelle/)

// <smpl>
@ disable unlikely @ expression E,f; @@

(
  if (<... f(...) ...>) { BUG(); }
|
- if (unlikely(E)) { BUG(); }
+ BUG_ON(E);
)

@@ expression E,f; @@

(
  if (<... f(...) ...>) { BUG(); }
|
- if (E) { BUG(); }
+ BUG_ON(E);
)
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:11 -07:00
Andi Kleen c9ec14884d ocfs2: Convert ocfs2 over to unlocked_ioctl
As far as I can see there is nothing in ocfs2_ioctl that requires the BKL,
so use unlocked_ioctl

Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:11 -07:00
Jan Kara 5dabd69515 ocfs2: Improve rename locking
ocfs2_rename() was being too aggressive with the rename lock - we only need
it for certain forms of directory rename.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:11 -07:00
Julia Lawall 58dadcdbc2 fs/ocfs2/aops.c: test for IS_ERR rather than 0
The function ocfs2_start_trans always returns either a valid pointer or a
value made with ERR_PTR, so its result should be tested with IS_ERR, not
with a test for 0.

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:11 -07:00
Tao Ma 4d0ddb2ce2 ocfs2: Add inode stealing for ocfs2_reserve_new_inode
Inode allocation is modified to look in other nodes allocators during
extreme out of space situations. We retry our own slot when space is freed
back to the global bitmap, or whenever we've allocated more than 1024 inodes
from another slot.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:10 -07:00
Tao Ma a4a4891164 ocfs2: Add ac_alloc_slot in ocfs2_alloc_context
In inode stealing, we no longer restrict the allocation to
happen in the local node. So it is neccessary for us to add
a new member in ocfs2_alloc_context to indicate which slot
we are using for allocation. We also modify the process of
local alloc so that this member can be used there also.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:10 -07:00
Tao Ma ffda89a3bf ocfs2: Add a new parameter for ocfs2_reserve_suballoc_bits
In some cases(Inode stealing from other nodes), we may not want
ocfs2_reserve_suballoc_bits to allocate new groups from the
global_bitmap since it may already be full. So add a new parameter
for this.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:10 -07:00
Tao Ma ad5a4d7093 ocfs2: Enable cross extent block merge.
In ocfs2_figure_merge_contig_type, we judge whether there exists
a cross extent block merge and enable it by setting CONTIG_LEFT
and CONTIG_RIGHT accordingly.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:10 -07:00
Tao Ma 677b975282 ocfs2: Add support for cross extent block
In ocfs2_merge_rec_left, when we find the merge extent is "CONTIG_RIGHT"
with the first extent record of the next extent block, we will merge it to
the next extent block and change all the related extent blocks accordingly.

In ocfs2_merge_rec_right, when we find the merge extent is "CONTIG_LEFT"
with the last extent record of the previous extent block, we will merge
it to the prevoius extent block and change all the related extent blocks
accordingly.

As for CONTIG_LEFTRIGHT, we will handle CONTIG_RIGHT first so that when
the index is zero, the merge process will be more efficient and easier.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:10 -07:00
Mark Fasheh 52f7c21b61 ocfs2: Move /sys/o2cb to /sys/fs/o2cb
/sys/fs is where we really want file system specific sysfs objects.

Ocfs2-tools has been updated to look in /sys/fs/o2cb. We can maintain
backwards compatibility with old ocfs2-tools by using a sysfs symlink. After
some time (2 years), the symlink can be safely removed. This patch also adds
documentation to make it easier for people to figure out what /sys/fs/o2cb
is used for.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:10 -07:00
Mark Fasheh a839c5afcd sysfs: Allow removal of symlinks in the sysfs root
Allow callers of sysfs_remove_link() to pass a NULL kobj, in which case
sysfs_root will be used as the parent directory. This allows us to tear down
top level symlinks created via sysfs_create_link(), which already has
similar handling of a NULL parent object.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-04-18 08:56:10 -07:00
Tao Ma 5cc3bf2786 ocfs2: Reconnect after idle time out.
Currently, o2net connects to a node on hb_up and disconnects on
hb_down and net timeout.

It disconnects on net timeout is ok, but it should attempt to
reconnect back. This is because sometimes nodes get overloaded
enough that the network connection breaks but the disk hb does not.
And if we get into that situation, we either fence (unnecessarily)
or wait for its disk hb to die (and sometimes hang in the process).

So in this updated scheme, when the network disconnects, we keep
attempting to reconnect till we succeed or we get a disk hb down
event.

If the other node is really dead, then we will eventually get a
node down event. If not, we should be able to connect again and
continue.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:10 -07:00
Sunil Mushran 8f50eb9789 ocfs2/dlm: Cleanup lockres print
A previous patch added KERN_NOTICE to printks printing the lockres that
cluttered the output. This patch removes the log level. For people concerned
with syslog clutter, please note we now use this facility to print lockres
only during an error.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:09 -07:00
Sunil Mushran c834cdb157 ocfs2/dlm: Fix lockname in lockres print function
__dlm_print_one_lock_resource was printing lockname incorrectly.
Also, we now use printk directly instead of mlog as the latter prints
the line context which is not useful for this print.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:09 -07:00
Sunil Mushran e5a0334cbd ocfs2/dlm: Move dlm_print_one_mle() from dlmmaster.c to dlmdebug.c
This patch helps in consolidating debugging related functions in dlmdebug.c.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:09 -07:00
Sunil Mushran 7209300a9b ocfs2/dlm: Dumps the purgelist into a debugfs file
This patch dumps all the lockres' on the purgelist it can fit in one page
into a debugfs file. Useful for debugging.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:09 -07:00
Sunil Mushran d0129aceae ocfs2/dlm: Dumps the mles into a debugfs file
This patch dumps all mles it can fit in one page into a debugfs file.
Useful for debugging.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:09 -07:00
Sunil Mushran 751155a953 ocfs2/dlm: Move struct dlm_master_list_entry to dlmcommon.h
This patch moves some mle related definitions from dlmmaster.c
to dlmcommon.h. Future patches need these definitions to dump mle
debugging information.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.beckeroracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:09 -07:00
Sunil Mushran 4e3d24ed1a ocfs2/dlm: Dumps the lockres' into a debugfs file
This patch dumps all the lockres' alongwith all the locks into
a debugfs file. Useful for debugging.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:09 -07:00
Sunil Mushran 007dce53a2 ocfs2/dlm: Dump the dlm state in a debugfs file
This patch dumps the dlm state (dlm_ctxt) into a debugfs file.
Useful for debugging.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:08 -07:00
Sunil Mushran 6325b4a22b ocfs2/dlm: Create debugfs dirs
This patch creates the debugfs directories that will hold the
files to be used to dump the dlm state.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:08 -07:00
Sunil Mushran 29576f8bb5 ocfs2/dlm: Link all lockres' to a tracking list
This patch links all the lockres' to a tracking list in dlm_ctxt.
We will use this in an upcoming patch that will walk the entire
list and to dump the lockres states to a debugfs file.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:08 -07:00
Sunil Mushran 724bdca9b8 ocfs2/dlm: Create slabcaches for lock and lockres
This patch makes the o2dlm allocate memory for lockres, lockname and lock
structures from slabcaches rather than kmalloc. This allows us to not only
make these allocs more efficient but also allows us to track the memory being
consumed by these structures.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:08 -07:00
Sunil Mushran 12eb0035d6 ocfs2/dlm: Rename slabcache dlm_mle_cache to o2dlm_mle
This patch renames dlm_mle_slabcache to prevent namespace clashes with fs/dlm.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:08 -07:00
Joel Becker 9341d22942 ocfs2: Allow selection of cluster plug-ins.
ocfs2 now supports plug-ins for the classic O2CB stack as well as
userspace cluster stacks in conjunction with fs/dlm.  This allows zero,
one, or both of the plug-ins to be selected in Kconfig.  For local mounts
(non-clustered), neither plug-in is needed.  Both plugins can be loaded
at one time, the runtime will select the one needed for the cluster
systme in use.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:07 -07:00
Joel Becker b92eccdd28 ocfs2: Add kbuild for ocfs2_stack_user.ko
Add ocfs2_stack_user.ko to the Makefile so that it builds.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:07 -07:00
Joel Becker 8f318311fa ocfs2: Change mlog_bug_on to BUG_ON in ocfs2_lockid.h
The masklog code is in the o2cb stack, but ocfs2_lockid.h now needs to
be included by the user stack.  The BUG() in ocfs2_lock_type_string()
does not need masklog support, so change it to a regular BUG_ON().

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:07 -07:00
David Teigland cf4d8d75d8 ocfs2: add fsdlm to stackglue
Add code to use fs/dlm.

[ Modified to be part of the stack_user module -- Joel ]

Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:07 -07:00
Joel Becker d4b95eef4d ocfs2: Add the 'set version' message to the ocfs2_control device.
The "SETV" message sets the filesystem locking protocol version as
negotiated by the client.  The client negotiates based on the maximum
version advertised in /sys/fs/ocfs2/max_locking_protocol.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:07 -07:00
Joel Becker 3cfd4ab6b6 ocfs2: Add the local node id to the handshake.
This is the second part of the ocfs2_control handshake.  After
negotiating the ocfs2_control protocol, the daemon tells the filesystem
what the local node id is via the SETN message.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:06 -07:00
Joel Becker de870ef022 ocfs2: Introduce the DOWN message to ocfs2_control
When the control daemon sees a node go down, it sends a DOWN message
through the ocfs2_control device.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:06 -07:00
Joel Becker 462c7e6a25 ocfs2: Start the ocfs2_control handshake.
When a control daemon opens the ocfs2_control device, it must perform a
handshake to tell the filesystem it is something capable of monitoring
cluster status.  Only after the handshake is complete will the filesystem
allow mounts.

This is the first part of the handshake.  The daemon reads all supported
ocfs2_control protocols, then writes in the protocol it will use.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:06 -07:00
Joel Becker 6427a72755 ocfs2: Add the ocfs2_control misc device.
The ocfs2_control misc device is how a userspace control daemon (controld)
talks to the filesystem.  Introduce the bare-bones filesystem ops.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:06 -07:00
Joel Becker 8adf0536c9 ocfs2: Add the user stack module.
Add a skeleton for the stack_user module.  It's just the barebones module
code.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:06 -07:00
Joel Becker 9c6c877c04 ocfs2: Add the 'cluster_stack' sysfs file.
Userspace can now query and specify the cluster stack in use via the
/sys/fs/ocfs2/cluster_stack file.  By default, it is 'o2cb', which is
the classic stack.  Thus, old tools that do not know how to modify this
file will work just fine.  The stack cannot be modified if there is a
live filesystem.

ocfs2_cluster_connect() now takes the expected cluster stack as an
argument.  This way, the filesystem and the stack glue ensure they are
speaking to the same backend.

If the stack is 'o2cb', the o2cb stack plugin is used.  For any other
value, the fsdlm stack plugin is selected.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:05 -07:00
Joel Becker b61817e116 ocfs2: Add the USERSPACE_STACK incompat bit.
The filesystem gains the USERSPACE_STACK incomat bit and the
s_cluster_info field on the superblock.  When a userspace stack is in
use, the name of the stack is stored on-disk for mount-time
verification.

The "cluster_stack" option is added to mount(2) processing.  The mount
process needs to pass the matching stack name.  If the passed name and
the on-disk name do not match, the mount is failed.

When using the classic o2cb stack, the incompat bit is *not* set and no
mount option is used other than the usual heartbeat=local.  Thus, the
filesystem is compatible with older tools.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:05 -07:00
Joel Becker 74ae4e104d ocfs2: Create stack glue sysfs files.
Introduce a set of sysfs files that describe the current stack glue
state.  The files live under /sys/fs/ocfs2.  The locking_protocol file
displays the version of ocfs2's locking code.  The
loaded_cluster_plugins file displays all of the currently loaded stack
plugins.  When filesystems are mounted, the active_cluster_plugin file
will display the plugin in use.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:05 -07:00
Joel Becker 286eaa95c5 ocfs2: Break out stackglue into modules.
We define the ocfs2_stack_plugin structure to represent a stack driver.
The o2cb stack code is split into stack_o2cb.c.  This becomes the
ocfs2_stack_o2cb.ko module.

The stackglue generic functions are similarly split into the
ocfs2_stackglue.ko module.  This module now provides an interface to
register drivers.  The ocfs2_stack_o2cb driver registers itself.  As
part of this interface, ocfs2_stackglue can load drivers on demand.
This is accomplished in ocfs2_cluster_connect().

ocfs2_cluster_disconnect() is now notified when a _hangup() is pending.
If a hangup is pending, it will not release the driver module and will
let _hangup() do that.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2008-04-18 08:56:05 -07:00
Joel Becker e3dad42bf9 ocfs2: Create ocfs2_stack_operations and split out the o2cb stack.
Define the ocfs2_stack_operations structure.  Build o2cb_stack_ops from
all of the o2cb-specific stack functions.  Change the generic stack glue
functions to call the stack_ops instead of the o2cb functions directly.

The o2cb functions are moved to stack_o2cb.c.  The headers are cleaned up
to where only needed headers are included.

In this code, stackglue.c and stack_o2cb.c refer to some shared
extern variables.  When they become modules, that will change.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:05 -07:00
Joel Becker 553aa7e408 ocfs2: Split o2cb code from generic stack functions.
Split off the o2cb-specific funtionality from the generic stack glue
calls.  This is a precurser to wrapping the o2cb functionality in an
operations vector.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:05 -07:00
Joel Becker 63e0c48ae6 ocfs2: Clean up stackglue initialization
The stack glue initialization function needs a better name so that it can be
used cleanly when stackglue becomes a module.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:05 -07:00
Joel Becker cf0acdcd64 ocfs2: Abstract out a debugging function for underlying dlms.
dlmglue.c was still referencing a raw o2dlm lksb in one instance.  Let's
create a generic ocfs2_dlm_dump_lksb() function.  This allows underlying
DLMs to print whatever they want about their lock.

We then move the o2dlm dump into stackglue.c where it belongs.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:04 -07:00
David Teigland 1693a5c011 ocfs2: handle async EAGAIN from NOQUEUE request
When using fsdlm, -EAGAIN is returned in the async callback for NOQUEUE
requests. Fix up dlmglue to expect this.

Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:04 -07:00
Joel Becker de551246e7 ocfs2: Remove CANCELGRANT from the view of dlmglue.
o2dlm has the non-standard behavior of providing a cancel callback
(unlock_ast) even when the cancel has failed (the locking operation
succeeded without canceling).  This is called CANCELGRANT after the
status code sent to the callback.  fs/dlm does not provide this
callback, so dlmglue must be changed to live without it.
o2dlm_unlock_ast_wrapper() in stackglue now ignores CANCELGRANT calls.

Because dlmglue no longer sees CANCELGRANT, ocfs2_unlock_ast() no longer
needs to check for it.  ocfs2_locking_ast() must catch that a cancel was
tried and clear the cancel state.

Making these changes opens up a locking race.  dlmglue uses the the
OCFS2_LOCK_BUSY flag to ensure only one thread is calling the dlm at any
one time.  But dlmglue must unlock the lockres before calling into the
dlm.  In the small window of time between unlocking the lockres and
calling the dlm, the downconvert thread can try to cancel the lock.  The
downconvert thread is checking the OCFS2_LOCK_BUSY flag - it doesn't
know that ocfs2_dlm_lock() has not yet been called.

Because ocfs2_dlm_lock() has not yet been called, the cancel operation
will just be a no-op.  There's nothing to cancel.  With CANCELGRANT,
dlmglue uses the CANCELGRANT callback to clear up the cancel state.
When it comes around again, it will retry the cancel.  Eventually, the
first thread will have called into ocfs2_dlm_lock(), and either the
lock or the cancel will succeed.  The downconvert thread can then do its
downconvert.

Without CANCELGRANT, there is nothing to clean up the cancellation
state.  The downconvert thread does not know to retry its operations.
More importantly, the original lock may be blocking on the other node
that is trying to cancel us.  With neither able to make progress, the
ast is never called and the cancellation state is never cleaned up that
way.  dlmglue is deadlocked.

The OCFS2_LOCK_PENDING flag is introduced to remedy this window.  It is
set at the same time OCFS2_LOCK_BUSY is.  Thus, the downconvert thread
can check whether the lock is cancelable.  If not, it just loops around
to try again.  Once ocfs2_dlm_lock() is called, the thread then clears
OCFS2_LOCK_PENDING and wakes the downconvert thread.  Now, if the
downconvert thread finds the lock BUSY, it can safely try to cancel it.
Whether the cancel works or not, the state will be properly set and the
lock processing can continue.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:04 -07:00
Mark Fasheh 0abd6d1803 ocfs2: Fill node number during cluster stack init
It doesn't make sense to query for a node number before connecting to the
cluster stack. This should be safe to do because node_num is only just
printed,
and we're actually only moving the setting of node num a small amount
further in the mount process.

[ Disconnect when node query fails -- Joel ]

Reviewed-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:04 -07:00
Joel Becker 6953b4c008 ocfs2: Move o2hb functionality into the stack glue.
The last bit of classic stack used directly in ocfs2 code is o2hb.
Specifically, the check for heartbeat during mount and the call to
ocfs2_hb_ctl during unmount.

We create an extra API, ocfs2_cluster_hangup(), to encapsulate the call
to ocfs2_hb_ctl.  Other stacks will just leave hangup() empty.

The check for heartbeat is moved into ocfs2_cluster_connect().  It will
be matched by a similar check for other stacks.

With this change, only stackglue.c includes cluster/ headers.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:04 -07:00
Joel Becker 19fdb624dc ocfs2: Abstract out node number queries.
ocfs2 asks the cluster stack for the local node's node number for two
reasons; to fill the slot map and to print it. While the slot map isn't
necessary for userspace cluster stacks, the printing is very nice for
debugging. Thus we add ocfs2_cluster_this_node() as a generic API to get
this value. It is anticipated that the slot map will not be used under a
userspace cluster stack, so validity checks of the node num only need to
exist in the slot map code. Otherwise, it just gets used and printed as an
opaque value.

[ Fixed up some "int" versus "unsigned int" issues and made osb->node_num
  truly opaque. --Mark ]

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:04 -07:00
Joel Becker 4670c46ded ocfs2: Introduce the new ocfs2_cluster_connect/disconnect() API.
This step introduces a cluster stack agnostic API for initializing and
exiting.  fs/ocfs2/dlmglue.c no longer uses o2cb/o2dlm knowledge to
connect to the stack.  It is all handled in stackglue.c.

heartbeat.c no longer needs to know how it gets called.
ocfs2_do_node_down() is now a clean recovery trigger.

The big gotcha is the ordering of initializations and de-initializations done
underneath ocfs2_cluster_connect().  ocfs2_dlm_init() used to do all
o2dlm initialization in one block.  Thus, the o2dlm functionality of
ocfs2_cluster_connect() is very straightforward.  ocfs2_dlm_shutdown(),
however, did a few things between de-registration of the eviction
callback and actually shutting down the domain.  Now de-registration and
shutdown of the domain are wrapped within the single
ocfs2_cluster_disconnect() call.  I've checked the code paths to make
sure we can safely tear down things in ocfs2_dlm_shutdown() before
calling ocfs2_cluster_disconnect().  The filesystem has already set
itself to ignore the callback.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:04 -07:00
Joel Becker 8f2c9c1b16 ocfs2: Create the lock status block union.
Wrap the lock status block (lksb) in a union.  Later we will add a union
element for the fs/dlm lksb.  Create accessors for the status and lvb
fields.

Other than a debugging function, dlmglue.c does not directly reference
the o2dlm locking path anymore.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:04 -07:00
Joel Becker 7431cd7e8d ocfs2: Use -errno instead of dlm_status for ocfs2_dlm_lock/unlock() API.
Change the ocfs2_dlm_lock/unlock() functions to return -errno values.
This is the first step towards elminiating dlm_status in
fs/ocfs2/dlmglue.c.  The change also passes -errno values to
->unlock_ast().

[ Fix a return code in dlmglue.c and change the error translation table into
  an array of ints. --Mark ]

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:03 -07:00
Joel Becker bd3e76105d ocfs2: Use global DLM_ constants in generic code.
The ocfs2 generic code should use the values in <linux/dlmconstants.h>.
stackglue.c will convert them to o2dlm values.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:03 -07:00
Joel Becker 24ef1815e5 ocfs2: Separate out dlm lock functions.
This is the first in a series of patches to isolate ocfs2 from the
underlying cluster stack. Here we wrap the dlm locking functions with
ocfs2-specific calls. Because ocfs2 always uses the same dlm lock status
callbacks, we can eliminate the callbacks from the filesystem visible
functions.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:03 -07:00
Joel Becker 386a2ef857 ocfs2: New slot map format
The old slot map had a few limitations:

- It was limited to one block, so the maximum slot count was 255.
- Each slot was signed 16bits, limiting node numbers to INT16_MAX.
- An empty slot was marked by the magic 0xFFFF (-1).

The new slot map format provides 32bit node numbers (UINT32_MAX), a
separate space to mark a slot in use, and extra room to grow.  The slot
map is now bounded by i_size, not a block.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:03 -07:00
Joel Becker fb86b1f071 ocfs2: Define the contents of the slot_map file.
The slot map file is merely an array of __le16.  Wrap it in a structure for
cleaner reference.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:03 -07:00
Joel Becker fc881fa0d5 ocfs2: De-magic the in-memory slot map.
The in-memory slot map uses the same magic as the on-disk one.  There is
a special value to mark a slot as invalid.  It relies on the size of
certain types and so on.

Write a new in-memory map that keeps validity as a separate field.  Outside
of the I/O functions, OCFS2_INVALID_SLOT now means what it is supposed to.
It also is no longer tied to the type size.

This also means that only the I/O functions refer to 16bit quantities.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:03 -07:00
Joel Becker 1c8d9a6a33 ocfs2: slot_map I/O based on max_slots.
The slot map code assumed a slot_map file has one block allocated.
This changes the code to I/O as many blocks as will cover max_slots.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:02 -07:00
Joel Becker 553abd046a ocfs2: Change the recovery map to an array of node numbers.
The old recovery map was a bitmap of node numbers.  This was sufficient
for the maximum node number of 254.  Going forward, we want node numbers
to be UINT32.  Thus, we need a new recovery map.

Note that we can't keep track of slots here.  We must write down the
node number to recovery *before* we get the locks needed to convert a
node number into a slot number.

The recovery map is now an array of unsigned ints, max_slots in size.
It moves to journal.c with the rest of recovery.

Because it needs to be initialized, we move all of recovery initialization
into a new function, ocfs2_recovery_init().  This actually cleans up
ocfs2_initialize_super() a little as well.  Following on, recovery cleaup
becomes part of ocfs2_recovery_exit().

A number of node map functions are rendered obsolete and are removed.

Finally, waiting on recovery is wrapped in a function rather than naked
checks on the recovery_event.  This is a cleanup from Mark.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:02 -07:00
Joel Becker d85b20e4b3 ocfs2: Make ocfs2_slot_info private.
Just use osb_lock around the ocfs2_slot_info data.  This allows us to
take the ocfs2_slot_info structure private in slot_info.c.  All access
is now via accessors.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-04-18 08:56:02 -07:00