Remove the allocation of struct nfsd4_compound_state.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Fix a build warning about leaking CONFIG_NFSD to userspace.
The nfsd_stats data structure does not need to be available to
userspace; no kernel interface uses it. So move it inside #ifdef
__KERNEL__ and the warning goes away.
Signed-off-by: Greg Banks <gnb@sgi.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Document the format and semantics of the /proc/fs/nfsd/pool_stats file.
Signed-off-by: Greg Banks <gnb@sgi.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
There is no need to set rqstp->rq_server to serv, while serv is initialized as rqstp->rq_server at previous line. And between these two lines, there is no change to rqstp->rq_server.
Signed-off-by: ideawu <ideawu@163.com>
Reviewed-by: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Clean up: Enable the use of const arguments in higher level svc_ APIs
by adding const to the arguments of the helper functions in svc_xprt.h
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
There is an inconsistency seen in the behaviour of nfs compared to other local
filesystems on linux when changing owner or group of a directory. If the
directory has SUID/SGID flags set, on changing owner or group on the directory,
the flags are stripped off on nfs. These flags are maintained on other
filesystems such as ext3.
To reproduce on a nfs share or local filesystem, run the following commands
mkdir test; chmod +s+g test; chown user1 test; ls -ld test
On the nfs share, the flags are stripped and the output seen is
drwxr-xr-x 2 user1 root 4096 Feb 23 2009 test
On other local filesystems(ex: ext3), the flags are not stripped and the output
seen is
drwsr-sr-x 2 user1 root 4096 Feb 23 13:57 test
chown_common() called from sys_chown() will only strip the flags if the inode is
not a directory.
static int chown_common(struct dentry * dentry, uid_t user, gid_t group)
{
..
if (!S_ISDIR(inode->i_mode))
newattrs.ia_valid |=
ATTR_KILL_SUID | ATTR_KILL_SGID | ATTR_KILL_PRIV;
..
}
See: http://www.opengroup.org/onlinepubs/7990989775/xsh/chown.html
"If the path argument refers to a regular file, the set-user-ID (S_ISUID) and
set-group-ID (S_ISGID) bits of the file mode are cleared upon successful return
from chown(), unless the call is made by a process with appropriate privileges,
in which case it is implementation-dependent whether these bits are altered. If
chown() is successfully invoked on a file that is not a regular file, these
bits may be cleared. These bits are defined in <sys/stat.h>."
The behaviour as it stands does not appear to violate POSIX. However the
actions performed are inconsistent when comparing ext3 and nfs.
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Allow the NFSv4 server to make use of TCP autotuning behaviour, which
was previously disabled by setting the sk_userlocks variable.
Set the receive buffers to be big enough to receive the whole RPC
request, and set this for the listening socket, not the accept socket.
Remove the code that readjusts the receive/send buffer sizes for the
accepted socket. Previously this code was used to influence the TCP
window management behaviour, which is no longer needed when autotuning
is enabled.
This can improve IO bandwidth on networks with high bandwidth-delay
products, where a large tcp window is required. It also simplifies
performance tuning, since getting adequate tcp buffers previously
required increasing the number of nfsd threads.
Signed-off-by: Olga Kornievskaia <aglo@citi.umich.edu>
Cc: Jim Rees <rees@umich.edu>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
The spec allows clients to change ip address, so we shouldn't be
requiring that setclientid always come from the same address. For
example, a client could reboot and get a new dhcpd address, but still
present the same clientid to the server. In that case the server should
revoke the client's previous state and allow it to continue, instead of
(as it currently does) returning a CLID_INUSE error.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Add /proc/fs/nfsd/pool_stats to export to userspace various
statistics about the operation of rpc server thread pools.
This patch is based on a forward-ported version of
knfsd-add-pool-thread-stats which has been shipping in the SGI
"Enhanced NFS" product since 2006 and which was previously
posted:
http://article.gmane.org/gmane.linux.nfs/10375
It has also been updated thus:
* moved EXPORT_SYMBOL() to near the function it exports
* made the new struct struct seq_operations const
* used SEQ_START_TOKEN instead of ((void *)1)
* merged fix from SGI PV 990526 "sunrpc: use dprintk instead of
printk in svc_pool_stats_*()" by Harshula Jayasuriya.
* merged fix from SGI PV 964001 "Crash reading pool_stats before
nfsds are started".
Signed-off-by: Greg Banks <gnb@sgi.com>
Signed-off-by: Harshula Jayasuriya <harshula@sgi.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Avoid overloading the CPU scheduler with enormous load averages
when handling high call-rate NFS loads. When the knfsd bottom half
is made aware of an incoming call by the socket layer, it tries to
choose an nfsd thread and wake it up. As long as there are idle
threads, one will be woken up.
If there are lot of nfsd threads (a sensible configuration when
the server is disk-bound or is running an HSM), there will be many
more nfsd threads than CPUs to run them. Under a high call-rate
low service-time workload, the result is that almost every nfsd is
runnable, but only a handful are actually able to run. This situation
causes two significant problems:
1. The CPU scheduler takes over 10% of each CPU, which is robbing
the nfsd threads of valuable CPU time.
2. At a high enough load, the nfsd threads starve userspace threads
of CPU time, to the point where daemons like portmap and rpc.mountd
do not schedule for tens of seconds at a time. Clients attempting
to mount an NFS filesystem timeout at the very first step (opening
a TCP connection to portmap) because portmap cannot wake up from
select() and call accept() in time.
Disclaimer: these effects were observed on a SLES9 kernel, modern
kernels' schedulers may behave more gracefully.
The solution is simple: keep in each svc_pool a counter of the number
of threads which have been woken but have not yet run, and do not wake
any more if that count reaches an arbitrary small threshold.
Testing was on a 4 CPU 4 NIC Altix using 4 IRIX clients, each with 16
synthetic client threads simulating an rsync (i.e. recursive directory
listing) workload reading from an i386 RH9 install image (161480
regular files in 10841 directories) on the server. That tree is small
enough to fill in the server's RAM so no disk traffic was involved.
This setup gives a sustained call rate in excess of 60000 calls/sec
before being CPU-bound on the server. The server was running 128 nfsds.
Profiling showed schedule() taking 6.7% of every CPU, and __wake_up()
taking 5.2%. This patch drops those contributions to 3.0% and 2.2%.
Load average was over 120 before the patch, and 20.9 after.
This patch is a forward-ported version of knfsd-avoid-nfsd-overload
which has been shipping in the SGI "Enhanced NFS" product since 2006.
It has been posted before:
http://article.gmane.org/gmane.linux.nfs/10374
Signed-off-by: Greg Banks <gnb@sgi.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Stop gathering the data that feeds the 'th' line in /proc/net/rpc/nfsd
because the questionable data provided is not worth the scalability
impact of calculating it. Instead, always report zeroes. The current
approach suffers from three major issues:
1. update_thread_usage() increments buckets by call service
time or call arrival time...in jiffies. On lightly loaded
machines, call service times are usually < 1 jiffy; on
heavily loaded machines call arrival times will be << 1 jiffy.
So a large portion of the updates to the buckets are rounded
down to zero, and the histogram is undercounting.
2. As seen previously on the nfs mailing list, the format in which
the histogram is presented is cryptic, difficult to explain,
and difficult to use.
3. Updating the histogram requires taking a global spinlock and
dirtying the global variables nfsd_last_call, nfsd_busy, and
nfsdstats *twice* on every RPC call, which is a significant
scaling limitation.
Testing on a 4 CPU 4 NIC Altix using 4 IRIX clients each doing
1K streaming reads at full line rate, shows the stats update code
(inlined into nfsd()) takes about 1.7% of each CPU. This patch drops
the contribution from nfsd() into the profile noise.
This patch is a forward-ported version of knfsd-remove-nfsd-threadstats
which has been shipping in the SGI "Enhanced NFS" product since 2006.
In that time, exactly one customer has noticed that the threadstats
were missing. It has been previously posted:
http://article.gmane.org/gmane.linux.nfs/10376
and more recently requested to be posted again.
Signed-off-by: Greg Banks <gnb@sgi.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
The main nfsd code was recently modified to no longer do lookups from
withing the readdir callback, to avoid locking problems on certain
filesystems.
This (rather hacky, and overdue for replacement) NFSv4 recovery code has
the same problem. Fix it to build up a list of names (instead of
dentries) and do the lookups afterwards.
Reported symptoms were a deadlock in the xfs code (called from
nfsd4_recdir_load), with /var/lib/nfs on xfs.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Reported-by: David Warren <warren@atmos.washington.edu>
Currently putpubfh returns NFSERR_OPNOTSUPP, which isn't actually
allowed for v4. The right error is probably NFSERR_NOTSUPP.
But let's just implement it; though rarely seen, it can be used by
Solaris (with a special mount option), is mandated by the rfc, and is
trivial for us to support.
Thanks to Yang Hongyang for pointing out the original problem, and to
Mike Eisler, Tom Talpey, Trond Myklebust, and Dave Noveck for further
argument....
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
If a filesystem being written to via NFS returns a short write count
(as opposed to an error) to nfsd, nfsd treats that as a success for
the entire write, rather than the short count that actually succeeded.
For example, given a 8192 byte write, if the underlying filesystem
only writes 4096 bytes, nfsd will ack back to the nfs client that all
8192 bytes were written. The nfs client does have retry logic for
short writes, but this is never called as the client is told the
complete write succeeded.
There are probably other ways it could happen, but in my case it
happened with a fuse (filesystem in userspace) filesystem which can
rather easily have a partial write.
Here is a patch to properly return the short write count to the
client.
Signed-off-by: David Shaw <dshaw@jabberwocky.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Thanks for Bill Baker at sun.com for catching this
at Connectathon 2009.
This bug was introduced in 2.6.27
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
The errors returned aren't used. Just return 0 and make them available
to a dprintk(). Also, consistently use -ERRNO errors instead of nfs
errors.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Reviewed-by: Benny Halevy <bhalevy@panasas.com>
As part of reducing the scope of the client_mutex, and in order to
remove the need for mutexes from the callback code (so that callbacks
can be done as asynchronous rpc calls), move manipulations of the
file_hashtable under the recall_lock.
Update the relevant comments while we're here.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Cc: Alexandros Batsakis <batsakis@netapp.com>
Reviewed-by: Benny Halevy <bhalevy@panasas.com>
Since free_client() is guaranteed to only be called once, and to only
touch the client structure itself (not any common data structures), it
has no need for the state lock.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Cc: Alexandros Batsakis <batsakis@netapp.com>
Previous cleanup reveals an obvious (though harmless) bug: when
delegreturn gets a stateid that isn't for a delegation, it should return
an error rather than doing nothing.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Delegreturn is enough a special case for preprocess_stateid_op to
warrant just open-coding it in delegreturn.
There should be no change in behavior here; we're just reshuffling code.
Thanks to Yang Hongyang for catching a critical typo.
Reviewed-by: Yang Hongyang <yanghy@cn.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
I can't recall ever seeing these printk's used to debug a problem. I'll
happily put them back if we see a case where they'd be useful. (Though
if we do that the find_XXX() errors would probably be better
reported in find_XXX() functions themselves.)
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Note that we exit this first big "if" with stp == NULL if and only if we
took the first branch; therefore, the second "if" is redundant, and we
can just combine the two, simplifying the logic.
Reviewed-by: Yang Hongyang <yanghy@cn.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
The base versions handle constant folding now, none of these headers
are exported to userspace, so the __ prefixed versions are not
necessary.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
sometimes HPUX nfs client sends a create request to linux nfs server(v2/v3).
the dump of the request is like:
obj_attributes
mode: value follows
set_it: value follows (1)
mode: 00
uid: no value
set_it: no value (0)
gid: value follows
set_it: value follows (1)
gid: 8030
size: value follows
set_it: value follows (1)
size: 0
atime: don't change
set_it: don't change (0)
mtime: don't change
set_it: don't change (0)
note that mode is 00(havs no rwx privilege even for the owner) and it requires
to set size to 0.
as current nfsd(v2/v3) implementation, the server does mainly 2 steps:
1) creates the file in mode specified by calling vfs_create().
2) sets attributes for the file by calling nfsd_setattr().
at step 2), it finally calls file system specific setattr() function which may
fail when checking permission because changing size needs WRITE privilege but
it has none since mode is 000.
for this case, a new file created, we may simply ignore the request of
setting size to 0, so that WRITE privilege is not needed and the open
succeeds.
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
--
vfs.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
No change in behavior, just rearranging the switch so that we break out
of the switch if and only if we're in the wait case.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
not having the state locked before putting the client/delegation causes a bug.
Also removed the comment from the function header about the state being already locked
Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
The use of |= is confusing--the bitmask is always initialized to zero in
this case, so we're effectively just doing an assignment here.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Enable NFSD only when FILE_LOCKING is enabled, since we don't want to
support NFSD without FILE_LOCKING.
Signed-off-by: Manish Katiyar <mkatiyar@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
MSDOS_SUPER_MAGIC is defined in <linux/magic.h>,
so use MSDOS_SUPER_MAGIC directly.
Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
The caller always knows specifically whether it's releasing a lockowner
or an openowner, and the code is simpler if we use separate functions
(and the apparent recursion is gone).
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
The flags here attempt to make the code more general, but I find it
actually just adds confusion.
I think it's clearer to separate the logic for the open and lock cases
entirely. And eventually we may want to separate the stateowner and
stateid types as well, as many of the fields aren't shared between the
lock and open cases.
Also move to eliminate forward references.
Start with the stateid's.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Reviewed-by: Benny Halevy <bhalevy@panasas.com>
The benet driver is now in the proper place in drivers/net/benet, so we
can remove the staging version.
Acked-by: Sathya Perla <sathyap@serverengines.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>