Commit graph

17033 commits

Author SHA1 Message Date
Christoph Hellwig
8c4e4acd66 quota: clean up Q_XQUOTASYNC
Currently Q_XQUOTASYNC calls into the quota_sync method, but XFS does something
entirely different in it than the rest of the filesystems.  xfs_quota which
calls Q_XQUOTASYNC expects an asynchronous data writeout to flush delayed
allocations, while the "VFS" quota support wants to flush changes to the quota
file.

So make Q_XQUOTASYNC call into the writeback code directly and make the
quota_sync method optional as XFS doesn't need in the sense expected by the
rest of the quota code.

GFS2 was using limited XFS-style quota and has a quota_sync method fitting
neither the style used by vfs_quota_sync nor xfs_fs_quota_sync.  I left it
in for now as per discussion with Steve it expects to be called from the
sync path this way.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:24 +01:00
Christoph Hellwig
c988afb5fa quota: simplify permission checking
Stop having complicated different routines for checking permissions for
XQM vs "VFS" quotas.  Instead do the checks for having sb->s_qcop and
a valid type directly in do_quotactl, and munge the *quotactl_valid functions
into a check_quotactl_permission helper that only checks for permissions.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:22 +01:00
Christoph Hellwig
6ae09575b3 quota: special case Q_SYNC without device name
The Q_SYNC command can be called without the path to a device, in which case
it iterates over all superblocks.  Special case this variant directly in
sys_quotactl so that the other code always gets a superblock and doesn't
need to deal with this case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:22 +01:00
Christoph Hellwig
f450d4fee4 quota: clean up checks for supported quota methods
Move the checks for sb->s_qcop->foo next to the actual calls for them, same
for sb_has_quota_active checks where applicable.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:21 +01:00
Christoph Hellwig
c411e5f66a quota: split do_quotactl
Split out a helper for each non-trivial command from do_quotactl.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:21 +01:00
Jan Kara
0a5a9c7255 quota: Fix warning when a delayed write happens before quota is enabled
If a delayed-allocation write happens before quota is enabled, the
kernel spits out a warning:
WARNING: at fs/quota/dquot.c:988 dquot_claim_space+0x77/0x112()

because the fact that user has some delayed allocation is not recorded
in quota structure.

Make dquot_initialize() update amount of reserved space for user if it sees
inode has some space reserved. Also make sure that reserved quota space does
not go negative and we warn about the filesystem bug just once.

Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:21 +01:00
Dmitry Monakhov
c469070aea quota: manage reserved space when quota is not active [v2]
Since we implemented generic reserved space management interface,
then it is possible to account reserved space even when quota
is not active (similar to i_blocks/i_bytes).

Without this patch following testcase result in massive comlain from
WARN_ON in dquot_claim_space()

TEST_CASE:
mount /dev/sdb /mnt -oquota
dd if=/dev/zero of=/mnt/test bs=1M count=1
quotaon /mnt
# fs_reserved_spave == 1Mb
# quota_reserved_space == 0, because quota was disabled
dd if=/dev/zero of=/mnt/test seek=1 bs=1M count=1
# fs_reserved_spave == 2Mb
# quota_reserved_space == 1Mb
sync  # ->dquot_claim_space() -> WARN_ON

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:21 +01:00
Dmitry Monakhov
e1f5c67a19 ext3: trivial quota cleanup
The patch is aimed to reorganize and simplify quota code a bit.
Quota code is itself complex enouth, but we can make it more readable
in some places:
- Move quota option parsing to separate functions.
- Simplify old-quota and journaled-quota mix check.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:20 +01:00
Dmitry Monakhov
e3c9643597 ext3: mount flags manipulation cleanup
Replace intermediate EXT3_MOUNT_XXX flags manipulation to
corresponding macro.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:20 +01:00
Jan Kara
9df93939b7 ext3: Use bitops to read/modify EXT3_I(inode)->i_state
At several places we modify EXT3_I(inode)->i_state without holding i_mutex
(ext3_release_file, ext3_bmap, ext3_journalled_writepage, ext3_do_update_inode,
...). These modifications are racy and we can lose updates to i_state. So
convert handling of i_state to use bitops which are atomic.

Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:20 +01:00
Jan Kara
26245c949c quota: Cleanup S_NOQUOTA handling
Cleanup handling of S_NOQUOTA inode flag and document it a bit. The flag
does not have to be set under dqptr_sem. Only functions modifying inode's
dquot pointers have to check the flag under dqptr_sem before going forward
with the modification. This way we are sure that we cannot add new dquot
pointers to the inode which is just becoming a quota file.

The good thing about this cleanup is that there are no more places in quota
code which enforce i_mutex vs. dqptr_sem lock ordering (in particular that
dqptr_sem -> i_mutex of quota file). This should silence some (false) lockdep
warnings with ext4 + quota and generally make life of some filesystems easier.

Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:19 +01:00
Joern Engel
c6d3830140 [LogFS] Only write journal if dirty
This prevents unnecessary journal writes.  More importantly it prevents
an oops due to a journal write on failed mount.
2010-03-04 21:36:19 +01:00
Joern Engel
9421502b4f [LogFS] Fix bdev erases
Erases for block devices were always just emulated by writing 0xff.
Some time back the write was removed and only the page cache was
changed to 0xff.  Superficialy a good idea with two problems:
1. Touching the page cache isn't necessary either.
2. However, writing out 0xff _is_ necessary for the journal.  As the
   journal is scanned linearly, an old non-overwritten commit entry
   can be used on next mount and cause havoc.

This should fix both aspects.
2010-03-04 21:30:58 +01:00
J. Bruce Fields
4ea41e2de5 Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs into for-2.6.34-incoming
Resolve merge conflict in fs/xfs/linux-2.6/xfs_export.c.
2010-03-04 12:04:51 -05:00
Linus Torvalds
64ba992675 Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd
* 'for-linus' of git://git.open-osd.org/linux-open-osd:
  exofs: groups support
  exofs: Prepare for groups
  exofs: Error recovery if object is missing from storage
  exofs: convert io_state to use pages array instead of bio at input
  exofs: RAID0 support
  exofs: Define on-disk per-inode optional layout attribute
  exofs: unindent exofs_sbi_read
  exofs: Move layout related members to a layout structure
  exofs: Recover in the case of read-passed-end-of-file
  exofs: Micro-optimize exofs_i_info
  exofs: debug print even less
2010-03-04 08:26:08 -08:00
Linus Torvalds
0f2cc4ecd8 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits)
  init: Open /dev/console from rootfs
  mqueue: fix typo "failues" -> "failures"
  mqueue: only set error codes if they are really necessary
  mqueue: simplify do_open() error handling
  mqueue: apply mathematics distributivity on mq_bytes calculation
  mqueue: remove unneeded info->messages initialization
  mqueue: fix mq_open() file descriptor leak on user-space processes
  fix race in d_splice_alias()
  set S_DEAD on unlink() and non-directory rename() victims
  vfs: add NOFOLLOW flag to umount(2)
  get rid of ->mnt_parent in tomoyo/realpath
  hppfs can use existing proc_mnt, no need for do_kern_mount() in there
  Mirror MS_KERNMOUNT in ->mnt_flags
  get rid of useless vfsmount_lock use in put_mnt_ns()
  Take vfsmount_lock to fs/internal.h
  get rid of insanity with namespace roots in tomoyo
  take check for new events in namespace (guts of mounts_poll()) to namespace.c
  Don't mess with generic_permission() under ->d_lock in hpfs
  sanitize const/signedness for udf
  nilfs: sanitize const/signedness in dealing with ->d_name.name
  ...

Fix up fairly trivial (famous last words...) conflicts in
drivers/infiniband/core/uverbs_main.c and security/tomoyo/realpath.c
2010-03-04 08:15:33 -08:00
Akira Fujita
c437b27335 ext4: Code cleanup for EXT4_IOC_MOVE_EXT ioctl
a) Fix sparse warning in ext4_ioctl()
b) Remove unneeded variable in mext_leaf_block()
c) Fix spelling typo in mext_check_arguments()

Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-04 00:39:24 -05:00
Akira Fujita
7247c0caa2 ext4: Fix the NULL reference in double_down_write_data_sem()
If EXT4_IOC_MOVE_EXT ioctl is called with NULL donor_fd, fget() in
ext4_ioctl() gets inappropriate file structure for donor; so we need
to do this check earlier, before calling double_down_write_data_sem().

Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-04 00:34:58 -05:00
Akira Fujita
5fd5249aa3 ext4: Fix insertion point of extent in mext_insert_across_blocks()
If the leaf node has 2 extent space or fewer and EXT4_IOC_MOVE_EXT
ioctl is called with the file offset where after the 2nd extent
covers, mext_insert_across_blocks() always tries to insert extent into
the first extent.  As a result, the file gets corrupted because of
wrong extent order.  The patch fixes this problem.

Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-04 00:31:06 -05:00
Akinobu Mita
731eb1a03a ext4: consolidate in_range() definitions
There are duplicate macro definitions of in_range() in mballoc.h and
balloc.c.  This consolidates these two definitions into ext4.h, and
changes extents.c to use in_range() as well.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger@sun.com>
2010-03-03 23:55:01 -05:00
Akinobu Mita
bda00de7e8 ext4: cleanup to use ext4_grp_offs_to_block()
More cleanup to convert open-coded calculations of the first block
number of a free extent to use ext4_grp_offs_to_block() instead.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger@sun.com>
2010-03-03 23:53:25 -05:00
Akinobu Mita
5661bd6861 ext4: cleanup to use ext4_group_first_block_no()
This is a cleanup and simplification patch which takes some open-coded
calculations to calculate the first block number of a group and
converts them to use the (already defined) ext4_group_first_block_no()
function.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger@sun.com>
2010-03-03 23:53:39 -05:00
Al Viro
9643f5d94a Merge branch 'for-fsnotify' into for-linus 2010-03-03 17:12:40 -05:00
Jan Kara
9b1d0998d2 ext4: Release page references acquired in ext4_da_block_invalidatepages
We forget to release page references we acquire in
ext4_da_block_invalidatepages.  Luckily, this function gets called only if we
are not able to allocate blocks for delay-allocated data so that function
should better never be called.

Also cleanup handling of index variable.

Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-03 16:19:32 -05:00
J. Bruce Fields
8d75da8afd nfsd4: fix minor memory leak
There's no need to allocate this cred more than once.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2010-03-03 16:13:29 -05:00
Al Viro
4919c5e45a fix race in d_splice_alias()
rehashing the negative placeholder opens a race with d_lookup();
we unhash it almost immediately (by d_move()), but the race
window is there.  Since d_move() doesn't rely on target being
hashed, we don't need that d_rehash() at all.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:13:08 -05:00
Al Viro
bec1052e5b set S_DEAD on unlink() and non-directory rename() victims
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:12:08 -05:00
Miklos Szeredi
db1f05bb85 vfs: add NOFOLLOW flag to umount(2)
Add a new UMOUNT_NOFOLLOW flag to umount(2).  This is needed to prevent
symlink attacks in unprivileged unmounts (fuse, samba, ncpfs).

Additionally, return -EINVAL if an unknown flag is used (and specify
an explicitly unused flag: UMOUNT_UNUSED).  This makes it possible for
the caller to determine if a flag is supported or not.

CC: Eugene Teo <eugene@redhat.com>
CC: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:08:00 -05:00
Al Viro
0ceeca5a08 hppfs can use existing proc_mnt, no need for do_kern_mount() in there
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:08:00 -05:00
Al Viro
8089352a13 Mirror MS_KERNMOUNT in ->mnt_flags
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:08:00 -05:00
Al Viro
d498b25a4f get rid of useless vfsmount_lock use in put_mnt_ns()
It hadn't been needed since we'd sanitized the logics in
mark_mounts_for_expiry() (which, in turn, used to be a
rudiment of bad old times when namespace_sem was per-ns).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:07:59 -05:00
Al Viro
47cd813f29 Take vfsmount_lock to fs/internal.h
no more users left outside of fs/*.c (and very few outside of
fs/namespace.c, actually)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:07:59 -05:00
Al Viro
9f5596af44 take check for new events in namespace (guts of mounts_poll()) to namespace.c
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:07:59 -05:00
Al Viro
e21e7095a7 Don't mess with generic_permission() under ->d_lock in hpfs
Just use dentry_unhash() there

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:07:58 -05:00
Al Viro
391e8bbd38 sanitize const/signedness for udf
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:07:58 -05:00
Al Viro
072f98b463 nilfs: sanitize const/signedness in dealing with ->d_name.name
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:07:58 -05:00
Al Viro
0319003d0d nilfs really shouldn't slap struct dentry on stack...
... especially when it only needs (and initializes) .d_name of it

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:07:58 -05:00
Al Viro
89031bc797 sanitize const/signedness of ufs a bit
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:07:57 -05:00
Al Viro
7e7742ee00 sanitize signedness/const for pointers to char in hpfs a bit
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:07:57 -05:00
Al Viro
1f707137b5 new helper: iterate_mounts()
apply function to vfsmounts in set returned by collect_mounts(),
stop if it returns non-zero.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:07:57 -05:00
Al Viro
462d60577a fix NFS4 handling of mountpoint stat
RFC says we need to follow the chain of mounts if there's more
than one stacked on that point.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:07:57 -05:00
Al Viro
3088dd7080 Clean follow_dotdot() up a bit
No need to open-code follow_up() in it and locking can be lighter.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:07:56 -05:00
Al Viro
f694869709 a couple of mntget+dget -> path_get in nfs4proc
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:07:56 -05:00
Al Viro
6eae7974d0 Switch alloc_nfs_open_context() to struct path
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:07:56 -05:00
Al Viro
2096f759ab New helper: path_is_under(path1, path2)
Analog of is_subdir for vfsmount,dentry pairs, moved from audit_tree.c

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:07:55 -05:00
Valerie Aurora
495d6c9c65 VFS: Clean up shared mount flag propagation
The handling of mount flags in set_mnt_shared() got a little tangled
up during previous cleanups, with the following problems:

* MNT_PNODE_MASK is defined as a literal constant when it should be a
bitwise xor of other MNT_* flags
* set_mnt_shared() clears and then sets MNT_SHARED (part of MNT_PNODE_MASK)
* MNT_PNODE_MASK could use a comment in mount.h
* MNT_PNODE_MASK is a terrible name, change to MNT_SHARED_MASK

This patch fixes these problems.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:07:55 -05:00
Al Viro
5b7e934d88 Use kill_litter_super() in autofs4 ->kill_sb()
... and get rid of open-coding its guts (i.e. RIP autofs4_force_release())

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:07:54 -05:00
Al Viro
3899167dbd Get rid of mnt_mountpoint abuses in ext4
path to mnt/mnt->mnt_root is no worse than that to
mnt->mnt_parent/mnt->mnt_mountpoint *and* needs no
pinning the sucker down (mnt is not going away and
mnt->mnt_root won't change)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:07:54 -05:00
Al Viro
f598f9f125 Sanitize autofs_dev_ioctl_ismountpoint()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:07:53 -05:00
Al Viro
796a6b521d Kill CL_PROPAGATION, sanitize fs/pnode.c:get_source()
First of all, get_source() never results in CL_PROPAGATION
alone.  We either get CL_MAKE_SHARED (for the continuation
of peer group) or CL_SLAVE (slave that is not shared) or both
(beginning of peer group among slaves).  Massage the code to
make that explicit, kill CL_PROPAGATION test in clone_mnt()
(nothing sets CL_MAKE_SHARED without CL_PROPAGATION and in
clone_mnt() we are checking CL_PROPAGATION after we'd found
that there's no CL_SLAVE, so the check for CL_MAKE_SHARED
would do just as well).

Fix comments, while we are at it...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 13:00:22 -05:00
Al Viro
c177c2ac8c Switch gfs2 to nd_set_link()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 13:00:22 -05:00
Al Viro
8737c9305b Switch may_open() and break_lease() to passing O_...
... instead of mixing FMODE_ and O_

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 13:00:21 -05:00
Nick Piggin
d208bbdda9 fs: improve remount,ro vs buffercache coherency
Invalidate sb->s_bdev on remount,ro.

Fixes a problem reported by Jorge Boncompte who is seeing corruption
trying to snapshot a minix filesystem image.  Some filesystems modify
their metadata via a path other than the bdev buffer cache (eg.  they may
use a private linear mapping for their metadata, or implement directories
in pagecache, etc).  Also, file data modifications usually go to the bdev
via their own mappings.

These updates are not coherent with buffercache IO (eg.  via /dev/bdev)
and never have been.  However there could be a reasonable expectation that
after a mount -oremount,ro operation then the buffercache should
subsequently be coherent with previous filesystem modifications.

So invalidate the bdev mappings on a remount,ro operation to provide a
coherency point.

The problem was exposed when we switched the old rd to brd because old rd
didn't really function like a normal block device and updates to rd via
mappings other than the buffercache would still end up going into its
buffercache.  But the same problem has always affected other "normal"
block devices, including loop.

[akpm@linux-foundation.org: repair comment layout]
Reported-by: "Jorge Boncompte [DTI2]" <jorge@dti2.net>
Tested-by: "Jorge Boncompte [DTI2]" <jorge@dti2.net>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 13:00:20 -05:00
H Hartley Sweeten
ec4f860597 fs/dcache.c: CodingStyle cleanup
Cleanup EXPORT* macros according to Documantation/CodingStyle.

Move EXPORT* macros to the line immediately after the closing
function brace.

Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 13:00:19 -05:00
Helight.Xu
587d4a17d8 some clean up in fs/proc
EXPORT_SYMBOL(proc_symlink);
EXPORT_SYMBOL(proc_mkdir);
EXPORT_SYMBOL(create_proc_entry);
EXPORT_SYMBOL(proc_create_data);
EXPORT_SYMBOL(remove_proc_entry);

Those EXPORT_SYMBOL shouldn't be in fs/proc/root.c,
should be in fs/proc/generic.c.

Signed-off-by: Helight.Xu <helight.xu@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 13:00:18 -05:00
Boaz Harrosh
193cf4b991 libfs: Unexport and kill simple_prepare_write
Remove the EXPORT_UNUSED_SYMBOL of simple_prepare_write

Collapse simple_prepare_write into it's only caller, though
making it simpler and clearer to understand.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 13:00:17 -05:00
Boaz Harrosh
ad2a722f19 libfs: Open code simple_commit_write into only user
* simple_commit_write was only called by simple_write_end.
  Open coding it makes it tiny bit less heavy on the arithmetic and
  much more readable.

* While at it use zero_user() for clearing a partial page.
* While at it add a docbook comment for simple_write_end.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 13:00:16 -05:00
Al Viro
4b1ae27a96 Revert "autofs4: always use lookup for lookup"
This reverts commit 213614d583.

Alas, ->d_revalidate() can't rely on ->lookup() finishing what
it's started; if d_alloc() in do_lookup() fails, we are not going
to call ->lookup() at all.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 12:58:31 -05:00
Linus Torvalds
feaf77d51a Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
  nilfs2: add reader's lock for cno in nilfs_ioctl_sync
  nilfs2: delete unnecessary condition in load_segment_summary
  nilfs2: move iterator to write log into segment buffer
  nilfs2: get rid of s_dirt flag use
  nilfs2: get rid of nilfs_segctor_req struct
  nilfs2: delete unnecessary condition in nilfs_dat_translate
  nilfs2: fix potential hang in nilfs_error on errors=remount-ro
  nilfs2: use mnt_want_write in ioctls where write access is needed
  nilfs2: issue discard request after cleaning segments
2010-03-03 08:53:49 -08:00
Linus Torvalds
eca281aad0 Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: (36 commits)
  Ocfs2: Move ocfs2 ioctl definitions from ocfs2_fs.h to newly added ocfs2_ioctl.h
  ocfs2: send SIGXFSZ if new filesize exceeds limit -v2
  ocfs2/userdlm: Add tracing in userdlm
  ocfs2: Use a separate masklog for AST and BASTs
  dlm: allow dlm do recovery during shutdown
  ocfs2: Only bug out in direct io write for reflinked extent.
  ocfs2: fix warning in ocfs2_file_aio_write()
  ocfs2_dlmfs: Enable the use of user cluster stacks.
  ocfs2_dlmfs: Use the stackglue.
  ocfs2_dlmfs: Don't honor truncate.  The size of a dlmfs file is LVB_LEN
  ocfs2: Pass the locking protocol into ocfs2_cluster_connect().
  ocfs2: Remove the ast pointers from ocfs2_stack_plugins
  ocfs2: Hang the locking proto on the cluster conn and use it in asts.
  ocfs2: Attach the connection to the lksb
  ocfs2: Pass lksbs back from stackglue ast/bast functions.
  ocfs2_dlmfs: Move to its own directory
  ocfs2_dlmfs: Use poll() to signify BASTs.
  ocfs2_dlmfs: Add capabilities parameter.
  ocfs2: Handle errors while setting external xattr values.
  ocfs2: Set inline xattr entries with ocfs2_xa_set()
  ...
2010-03-03 08:53:17 -08:00
Linus Torvalds
60f8a8d4c6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: fix large stack use
  fuse: cleanup in fuse_notify_inval_...()
2010-03-03 08:08:21 -08:00
Linus Torvalds
0a135ba14d Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
  percpu: add __percpu sparse annotations to what's left
  percpu: add __percpu sparse annotations to fs
  percpu: add __percpu sparse annotations to core kernel subsystems
  local_t: Remove leftover local.h
  this_cpu: Remove pageset_notifier
  this_cpu: Page allocator conversion
  percpu, x86: Generic inc / dec percpu instructions
  local_t: Move local.h include to ringbuffer.c and ring_buffer_benchmark.c
  module: Use this_cpu_xx to dynamically allocate counters
  local_t: Remove cpu_local_xx macros
  percpu: refactor the code in pcpu_[de]populate_chunk()
  percpu: remove compile warnings caused by __verify_pcpu_ptr()
  percpu: make accessors check for percpu pointer in sparse
  percpu: add __percpu for sparse.
  percpu: make access macros universal
  percpu: remove per_cpu__ prefix.
2010-03-03 07:34:18 -08:00
Linus Torvalds
4850f524b2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw:
  GFS2: print glock numbers in hex
  GFS2: ordered writes are backwards
  GFS2: Remove old, unused linked list code from quota
  GFS2: Remove loopy umount code
  GFS2: Metadata address space clean up
2010-03-03 07:33:50 -08:00
Linus Torvalds
4846546f7e Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  [CIFS] pSesInfo->sesSem is used as mutex. Rename it to session_mutex and
  [CIFS] Use unsigned ea length for clarity
  cifs: set server_eof in cifs_fattr_to_inode
  [CIFS] Minor cleanup to EA patch
  cifs: merge CIFSSMBQueryEA with CIFSSMBQAllEAs
  cifs: verify lengths of QueryAllEAs reply
  cifs: increase maximum buffer size in CIFSSMBQAllEAs
  cifs: rename name_len to list_len in CIFSSMBQAllEAs
  cifs: clean up indentation in CIFSSMBQAllEAs
  cifs: add parens around smb_var in BCC macros
2010-03-03 07:32:10 -08:00
Linus Torvalds
832d30ca72 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (38 commits)
  SELinux: Make selinux_kernel_create_files_as() shouldn't just always return 0
  TOMOYO: Protect find_task_by_vpid() with RCU.
  Security: add static to security_ops and default_security_ops variable
  selinux: libsepol: remove dead code in check_avtab_hierarchy_callback()
  TOMOYO: Remove __func__ from tomoyo_is_correct_path/domain
  security: fix a couple of sparse warnings
  TOMOYO: Remove unneeded parameter.
  TOMOYO: Use shorter names.
  TOMOYO: Use enum for index numbers.
  TOMOYO: Add garbage collector.
  TOMOYO: Add refcounter on domain structure.
  TOMOYO: Merge headers.
  TOMOYO: Add refcounter on string data.
  TOMOYO: Reduce lines by using common path for addition and deletion.
  selinux: fix memory leak in sel_make_bools
  TOMOYO: Extract bitfield
  syslog: clean up needless comment
  syslog: use defined constants instead of raw numbers
  syslog: distinguish between /proc/kmsg and syscalls
  selinux: allow MLS->non-MLS and vice versa upon policy reload
  ...
2010-03-02 14:47:24 -08:00
Tristan Ye
9df5778ece Ocfs2: Move ocfs2 ioctl definitions from ocfs2_fs.h to newly added ocfs2_ioctl.h
Currently we were adding ioctl cmds/structures for ocfs2 into ocfs2_fs.h
which was used for define ocfs2 on-disk layout. That sounds a little bit
confusing, and it may be quickly polluted espcially when growing the
ocfs2_info_request ioctls afterwards(it will grow i bet).

As a result, such OCFS2 IOCs do need to be placed somewhere other than
ocfs2_fs.h, a separated ocfs2_ioctl.h will be added to store such ioctl
structures and definitions which could also be used from userspace to
invoke ioctls call.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-03-02 14:10:20 -08:00
Andy Adamson
180b62a3d8 nfs41 fix NFS4ERR_CLID_INUSE for exchange id
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-03-02 13:45:33 -05:00
Linus Torvalds
6c0ad5dfd3 Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  Revert "blkdev: fix merge_bvec_fn return value checks"
2010-03-02 10:33:36 -08:00
Jens Axboe
9599945bac Revert "blkdev: fix merge_bvec_fn return value checks"
This reverts commit 9f7cdbc33f.

It's causing oopses om dm setups, so revert it until we investigate.

Reported-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Tested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-03-02 19:17:34 +01:00
Trond Myklebust
ebed9203b6 NFS: Fix an allocation-under-spinlock bug
sunrpc_cache_update() will always call detail->update() from inside the
detail->hash_lock, so it cannot allocate memory.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
2010-03-02 13:06:22 -05:00
Trond Myklebust
0f79fd6f5c NFSv4.1: Various fixes to the sequence flag error handling
Ensure that we change the EXCHANGE_ID verifier (i.e. clp->cl_boot_time)
when we want to reset all state. This is mainly needed when the server
tells us that it is revoking our open or lock stateids.

Handle revoking of recallable state by expiring the delegations.

Handle callback path issues by expiring the delegations and then resetting
the session.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-03-02 13:06:21 -05:00
Alexandros Batsakis
0851de0617 nfs4: renewd renew operations should take/put a client reference
renewd sends RENEW requests to the NFS server in order to renew state.
As the request is asynchronous, renewd should take a reference to the
nfs_client to prevent concurrent umounts from freeing the client

Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-03-02 13:00:03 -05:00
Alexandros Batsakis
7135840fc7 nfs41: renewd sequence operations should take/put client reference
renewd sends SEQUENCE requests to the NFS server in order to renew state.
As the request is asynchronous, renewd should take a reference to the
nfs_client to prevent concurrent umounts from freeing the session/client

Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-03-02 12:54:30 -05:00
Alexandros Batsakis
dc96aef96a nfs: prevent backlogging of renewd requests
If the renewd send queue gets backlogged (e.g., if the server goes down),
we will keep filling the queue with periodic RENEW/SEQUENCE requests.

This patch schedules a new renewd request if and only if the previous one
returns (either success or failure)

Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
[Trond.Myklebust@netapp.com: moved nfs4_schedule_state_renewal() into
separate nfs4_renew_release() and nfs41_sequence_release() callbacks
to ensure correct behaviour on call setup failure]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-03-02 12:44:07 -05:00
Alexandros Batsakis
888ef2e3f8 nfs: kill renewd before clearing client minor version
renewd should be synchronously killed before we destroy the session in
nfs4_clear_minor_version

Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
[Trond.Myklebust@netapp.com: clean up to remove 'unused function
warning when !CONFIG_NFS_V4]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-03-02 12:16:12 -05:00
Linus Torvalds
6d6b89bd2e Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1341 commits)
  virtio_net: remove forgotten assignment
  be2net: fix tx completion polling
  sis190: fix cable detect via link status poll
  net: fix protocol sk_buff field
  bridge: Fix build error when IGMP_SNOOPING is not enabled
  bnx2x: Tx barriers and locks
  scm: Only support SCM_RIGHTS on unix domain sockets.
  vhost-net: restart tx poll on sk_sndbuf full
  vhost: fix get_user_pages_fast error handling
  vhost: initialize log eventfd context pointer
  vhost: logging thinko fix
  wireless: convert to use netdev_for_each_mc_addr
  ethtool: do not set some flags, if others failed
  ipoib: returned back addrlen check for mc addresses
  netlink: Adding inode field to /proc/net/netlink
  axnet_cs: add new id
  bridge: Make IGMP snooping depend upon BRIDGE.
  bridge: Add multicast count/interval sysfs entries
  bridge: Add hash elasticity/max sysfs entries
  bridge: Add multicast_snooping sysfs toggle
  ...

Trivial conflicts in Documentation/feature-removal-schedule.txt
2010-03-02 07:55:08 -08:00
Dmitry Monakhov
67eeb5685d ext4: Fix ext4_quota_write cross block boundary behaviour
We always assume what dquot update result in changes in one data block
But ext4_quota_write() function may handle cross block boundary writes
In fact if this ever happen it will result in incorrect journal
credits reservation, and later a BUG_ON.  As soon this never happen
the boundary cross loop is NOOP.  In order to make things straight
let's remove this loop and assert cross boundary condition.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-02 08:08:51 -05:00
Frank Mayhar
273df556b6 ext4: Convert BUG_ON checks to use ext4_error() instead
Convert a bunch of BUG_ONs to emit a ext4_error() message and return
EIO.  This is a first pass and most notably does _not_ cover
mballoc.c, which is a morass of void functions.

Signed-off-by: Frank Mayhar <fmayhar@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-02 11:46:09 -05:00
Jiaying Zhang
b7adc1f363 ext4: Use direct_IO_no_locking in ext4 dio read
Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-02 13:26:36 -05:00
Jiaying Zhang
744692dc05 ext4: use ext4_get_block_write in buffer write
Allocate uninitialized extent before ext4 buffer write and
convert the extent to initialized after io completes.
The purpose is to make sure an extent can only be marked
initialized after it has been written with new data so
we can safely drop the i_mutex lock in ext4 DIO read without
exposing stale data. This helps to improve multi-thread DIO
read performance on high-speed disks.

Skip the nobh and data=journal mount cases to make things simple for now.

Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-04 16:14:02 -05:00
Jiaying Zhang
c7064ef13b ext4: mechanical rename some of the direct I/O get_block's identifiers
This commit renames some of the direct I/O's block allocation flags,
variables, and functions introduced in Mingming's "Direct IO for holes
and fallocate" patches so that they can be used by ext4's buffered
write path as well.  Also changed the related function comments
accordingly to cover both direct write and buffered write cases.

Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-02 13:28:44 -05:00
Toshiyuki Okajima
b8b8afe236 ext4: make "offset" consistent in ext4_check_dir_entry()
The callers of ext4_check_dir_entry() usually pass in the "file
offset" (ext4_readdir, htree_dirblock_to_tree, search_dirblock,
ext4_dx_find_entry, empty_dir), but a few callers (add_dirent_to_buf,
ext4_delete_entry) only pass in the buffer offset.

To accomodate those last two (which would be hard to fix otherwise),
this patch changes ext4_check_dir_entry() to print the physical block
number and the relative offset as well as the passed-in offset.

Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-02 00:21:35 -05:00
Dmitry Monakhov
6e3617e579 ext4: Handle non empty on-disk orphan link
In case of truncate errors we explicitly remove inode from in-core
orphan list via orphan_del(NULL, inode) without modifying the on-disk list.

But later on, the same inode may be inserted in the orphan list again
which will result the on-disk linked list getting corrupted.  If inode
i_dtime contains valid value, then skip on-disk list modification.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-01 23:29:39 -05:00
Dmitry Monakhov
da1dafca84 ext4: explicitly remove inode from orphan list after failed direct io
Otherwise non-empty orphan list will be triggered on umount.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-01 23:15:02 -05:00
Dmitry Monakhov
f39490bcd1 ext4: fix error handling in migrate
Set i_nlink to zero for temporary inode from very beginning.
otherwise we may fail to start new journal handle and this
inode will be unreferenced but with i_nlink == 1
Since we hold inode reference it can not be pruned.

Also add missed journal_start retval check.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-01 23:14:36 -05:00
Dmitry Monakhov
437ca0fda3 ext4: deprecate obsoleted mount options
Declare following list of mount options as deprecated:
 - bsddf, miniddf
 - grpid, bsdgroups, nogrpid, sysvgroups

Declare following list of default mount options as deprecated:
 - bsdgroups

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-01 22:29:21 -05:00
Christoph Hellwig
f1f724e4b5 xfs: fix locking for inode cache radix tree tag updates
The radix-tree code requires it's users to serialize tag updates
against other updates to the tree.  While XFS protects tag updates
against each other it does not serialize them against updates of the
tree contents, which can lead to tag corruption.  Fix the inode
cache to always take pag_ici_lock in exclusive mode when updating
radix tree tags.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Patrick Schreurs <patrick@news-service.com>
Tested-by: Patrick Schreurs <patrick@news-service.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-01 19:14:36 -06:00
Tao Ma
cc483f102c ext4: Fix fencepost error in chosing choosing group vs file preallocation.
The ext4 multiblock allocator decides whether to use group or file
preallocation based on the file size.  When the file size reaches
s_mb_stream_request (default is 16 blocks), it changes to use a
file-specific preallocation. This is cool, but it has a tiny problem.

See a simple script:
mkfs.ext4 -b 1024 /dev/sda8 1000000
mount -t ext4 -o nodelalloc /dev/sda8 /mnt/ext4
for((i=0;i<5;i++))
do
cat /mnt/4096>>/mnt/ext4/a	#4096 is a file with 4096 characters.
cat /mnt/4096>>/mnt/ext4/b
done
debuge4fs -R 'stat a' /dev/sda8|grep BLOCKS -A 1

And you get
BLOCKS:
(0-14):8705-8719, (15):2356, (16-19):8465-8468

So there are 3 extents, a bit strange for the lonely 15th logical
block.  As we write to the 16 blocks, we choose file preallocation in
ext4_mb_group_or_file, but in ext4_mb_normalize_request, we meet with
the 16*1024 range, so no preallocation will be carried. file b then
reserves the space after '2356', so when when write 16, we start from
another part.

This patch just change the check in ext4_mb_group_or_file, so
that for the lonely 15 we will still use group preallocation.
After the patch, we will get:
debuge4fs -R 'stat a' /dev/sda8|grep BLOCKS -A 1
BLOCKS:
(0-15):8705-8720, (16-19):8465-8468

Looks more sane. Thanks.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-01 19:06:35 -05:00
Christoph Hellwig
a14a5ab58f xfs: remove xfs_ipin/xfs_iunpin
Inodes are only pinned/unpinned via the inode item methods, and lots of
code relies on that fact.  So remove the separate xfs_ipin/xfs_iunpin
helpers and merge them into their only callers.  This also fixes up
various duplicate and/or incorrect comments.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-01 16:35:56 -06:00
Christoph Hellwig
60ec678371 xfs: cleanup xfs_iunpin_wait/xfs_iunpin_nowait
Remove the inode item pointer and ili_last_lsn checks in
__xfs_iunpin_wait as any pinned inode is guaranteed to have them
valid.  After this the xfs_iunpin_nowait case is nothing more than a
xfs_log_force_lsn, as we know that the caller has already checked
the pincount.

Make xfs_iunpin_nowait the new low-level routine just doing the log
force and rewrite xfs_iunpin_wait around it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-01 16:35:50 -06:00
Christoph Hellwig
d7658d487f xfs: kill xfs_lrw.h
Move the two declarations to better fitting headers now that
xfs_lrw.c is gone.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-01 16:35:44 -06:00
Christoph Hellwig
d7e84f4137 xfs: factor common xfs_trans_bjoin code
Most of xfs_trans_bjoin is duplicated in xfs_trans_get_buf,
xfs_trans_getsb and xfs_trans_read_buf.  Add a new _xfs_trans_bjoin
which can be called by all four functions.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-01 16:35:37 -06:00
Christoph Hellwig
35a8a72f06 xfs: stop passing opaque handles to xfs_log.c routines
Currenly we pass opaque xfs_log_ticket_t handles instead of
struct xlog_ticket pointers, and void pointers instead of
struct xlog_in_core pointers to various log manager functions.
Instead pass properly typed pointers after adding forward
declarations for them to xfs_log.h, and adjust the touched
function prototypes to the standard XFS style while at it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-01 16:35:32 -06:00
Christoph Hellwig
c467c049e7 xfs: split xfs_bmap_btalloc
Split out the nullfb case into a separate function to reduce the stack
footprint and make the code more readable.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-01 16:35:25 -06:00
Christoph Hellwig
f7008d0aeb xfs: fix xfs_fsblock_t tracing
Using a static buffer in xfs_fmtfsblock means we can corrupt traces if
multiple CPUs hit this code path at the same.  Just remove xfs_fmtfsblock
for now and print the block number purely numerical.  If we want the
NULLFSBLOCK and NULLSTARTBLOCK formatting back the best way would be
a decoding plugin in the trace-cmd userspace command.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-01 16:35:17 -06:00
Christoph Hellwig
024910cbac xfs: fix inode pincount check in fsync
We need to hold the ilock to check the inode pincount safely.  While
we're at it also remove the check for ip->i_itemp->ili_last_lsn, a
pinned inode always has it set.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-01 16:35:10 -06:00
Dave Chinner
77d7a0c2ee xfs: Non-blocking inode locking in IO completion
The introduction of barriers to loop devices has created a new IO
order completion dependency that XFS does not handle. The loop
device implements barriers using fsync and so turns a log IO in the
XFS filesystem on the loop device into a data IO in the backing
filesystem. That is, the completion of log IOs in the loop
filesystem are now dependent on completion of data IO in the backing
filesystem.

This can cause deadlocks when a flush daemon issues a log force with
an inode locked because the IO completion of IO on the inode is
blocked by the inode lock. This in turn prevents further data IO
completion from occuring on all XFS filesystems on that CPU (due to
the shared nature of the completion queues). This then prevents the
log IO from completing because the log is waiting for data IO
completion as well.

The fix for this new completion order dependency issue is to make
the IO completion inode locking non-blocking. If the inode lock
can't be grabbed, simply requeue the IO completion back to the work
queue so that it can be processed later. This prevents the
completion queue from being blocked and allows data IO completion on
other inodes to proceed, hence avoiding completion order dependent
deadlocks.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-01 16:34:52 -06:00
Christoph Hellwig
66d834ea60 xfs: implement optimized fdatasync
Allow us to track the difference between timestamp and size updates
by using mark_inode_dirty from the I/O completion code, and checking
the VFS inode flags in xfs_file_fsync.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-01 16:34:45 -06:00
Christoph Hellwig
fd3200bef7 xfs: remove wrapper for the fsync file operation
Currently the fsync file operation is divided into a low-level
routine doing all the work and one that implements the Linux file
operation and does minimal argument wrapping.  This is a leftover
from the days of the vnode operations layer and can be removed to
simplify the code a bit, as well as preparing for the implementation
of an optimized fdatasync which needs to look at the Linux inode
state.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-01 16:34:38 -06:00
Christoph Hellwig
00258e36b2 xfs: remove wrappers for read/write file operations
Currently the aio_read, aio_write, splice_read and splice_write file
operations are divided into a low-level routine doing all the work
and one that implements the Linux file operations and does minimal
argument wrapping.  This is a leftover from the days of the vnode
operations layer and can be removed to simplify the code a lot.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-01 16:34:29 -06:00
Christoph Hellwig
dda35b8f84 xfs: merge xfs_lrw.c into xfs_file.c
Currently the code to implement the file operations is split over
two small files.  Merge the content of xfs_lrw.c into xfs_file.c to
have it in one place.  Note that I haven't done various cleanups
that are possible after this yet, they will follow in the next
patch.  Also the function xfs_dev_is_read_only which was in
xfs_lrw.c before really doesn't fit in here at all and was moved to
xfs_mount.c.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-01 16:34:18 -06:00
Christoph Hellwig
b262e5dfd9 xfs: fix dquota trace format
The be32_to_cpu in the TP_printk output breaks automatic parsing of
the trace format by the trace-cmd tools, so we have to move it into
the TP_assign block.  While we're at it also fix the format for the
quota limits to more regular and easier parseable.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-01 16:34:11 -06:00
Eric Sandeen
a9cc799eca xfs: increase readdir buffer size
While doing some testing of readdir perf a while back,
I noticed that the buffer size we're using internally is
smaller than what glibc gives us by default.  Upping this
size helped a bit, and seems safe.

glibc's __alloc_dir() does:

  const size_t default_allocation = (4 * BUFSIZ < sizeof (struct dirent64)
                                     ? sizeof (struct dirent64) : 4 * BUFSIZ);
  const size_t small_allocation = (BUFSIZ < sizeof (struct dirent64)
                                   ? sizeof (struct dirent64) : BUFSIZ);
  size_t allocation = default_allocation;
#ifdef _STATBUF_ST_BLKSIZE
  if (statp != NULL && default_allocation < statp->st_blksize)
    allocation = statp->st_blksize;
#endif

and

#define _G_BUFSIZ 8192
#define _IO_BUFSIZ _G_BUFSIZ
# define BUFSIZ _IO_BUFSIZ

so the default buffer is 4 * 8192 = 32768
(except in the unlikely case of blocks > 32k....)

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-01 16:33:41 -06:00
Linus Torvalds
b1bf936840 Merge branch 'for-2.6.34' of git://git.kernel.dk/linux-2.6-block
* 'for-2.6.34' of git://git.kernel.dk/linux-2.6-block: (38 commits)
  block: don't access jiffies when initialising io_context
  cfq: remove 8 bytes of padding from cfq_rb_root on 64 bit builds
  block: fix for "Consolidate phys_segment and hw_segment limits"
  cfq-iosched: quantum check tweak
  blktrace: perform cleanup after setup error
  blkdev: fix merge_bvec_fn return value checks
  cfq-iosched: requests "in flight" vs "in driver" clarification
  cciss: Fix problem with scatter gather elements in the scsi half of the driver
  cciss: eliminate unnecessary pointer use in cciss scsi code
  cciss: do not use void pointer for scsi hba data
  cciss: factor out scatter gather chain block mapping code
  cciss: fix scatter gather chain block dma direction kludge
  cciss: simplify scatter gather code
  cciss: factor out scatter gather chain block allocation and freeing
  cciss: detect bad alignment of scsi commands at build time
  cciss: clarify command list padding calculation
  cfq-iosched: rethink seeky detection for SSDs
  cfq-iosched: rework seeky detection
  block: remove padding from io_context on 64bit builds
  block: Consolidate phys_segment and hw_segment limits
  ...
2010-03-01 09:00:29 -08:00
Bob Peterson
4818972efb GFS2: print glock numbers in hex
This patch changes glock numbers from printing in decimal to hex.
Since DLM prints corresponding resource IDs in hex, it makes debugging
easier.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-03-01 14:09:04 +00:00
Dave Chinner
e5884636da GFS2: ordered writes are backwards
When we queue data buffers for ordered write, the buffers are added
to the head of the ordered write list. When the log needs to push
these buffers to disk, it also walks the list from the head. The
result is that the the ordered buffers are submitted to disk in
reverse order.

For large writes, this means that whenever the log flushes large
streams of reverse sequential order buffers are pushed down into the
block layers. The elevators don't handle this particularly well, so
IO rates tend to be significantly lower than if the IO was issued in
ascending block order.

Queue new ordered buffers to the tail of the ordered buffer list to
ensure that IO is dispatched in the order it was submitted. This
should significantly improve large sequential write speeds. On a
disk capable of 85MB/s, speeds increase from 50MB/s to 65MB/s for
noop and from 38MB/s to 50MB/s for cfq.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-03-01 14:08:26 +00:00
Steven Whitehouse
c1184f8ab7 GFS2: Remove loopy umount code
As a consequence of the previous patch, we can now remove the
loop which used to be required due to the circular dependency
between the inodes and glocks. Instead we can just invalidate
the inodes, and then clear up any glocks which are left.

Also we no longer need the rwsem since there is no longer any
danger of the inode invalidation calling back into the glock
code (and from there back into the inode code).

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-03-01 14:07:53 +00:00
Steven Whitehouse
009d851837 GFS2: Metadata address space clean up
Since the start of GFS2, an "extra" inode has been used to store
the metadata belonging to each inode. The only reason for using
this inode was to have an extra address space, the other fields
were unused. This means that the memory usage was rather inefficient.

The reason for keeping each inode's metadata in a separate address
space is that when glocks are requested on remote nodes, we need to
be able to efficiently locate the data and metadata which relating
to that glock (inode) in order to sync or sync and invalidate it
(depending on the remotely requested lock mode).

This patch adds a new type of glock, which has in addition to
its normal fields, has an address space. This applies to all
inode and rgrp glocks (but to no other glock types which remain
as before). As a result, we no longer need to have the second
inode.

This results in three major improvements:
 1. A saving of approx 25% of memory used in caching inodes
 2. A removal of the circular dependency between inodes and glocks
 3. No confusion between "normal" and "metadata" inodes in super.c

Although the first of these is the more immediately apparent, the
second is just as important as it now enables a number of clean
ups at umount time. Those will be the subject of future patches.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-03-01 14:07:37 +00:00
David S. Miller
47871889c6 Merge branch 'master' of /home/davem/src/GIT/linux-2.6/
Conflicts:
	drivers/firmware/iscsi_ibft.c
2010-02-28 19:23:06 -08:00
James Morris
b4ccebdd37 Merge branch 'next' into for-linus 2010-03-01 09:36:31 +11:00
Dmitry Monakhov
9f7cdbc33f blkdev: fix merge_bvec_fn return value checks
merge_bvec_fn() returns bvec->bv_len on success. So we have to check
against this value. But in case of fs_optimization merge we compare
with wrong value. This patch must be included in
 b428cd6da7e6559aca69aa2e3a526037d3f20403
But accidentally i've forgot to add this in the initial patch.
To make things straight let's replace all such checks.
In fact this makes code easy to understand.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-02-28 19:47:18 +01:00
Linus Torvalds
642c4c75a7 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (44 commits)
  rcu: Fix accelerated GPs for last non-dynticked CPU
  rcu: Make non-RCU_PROVE_LOCKING rcu_read_lock_sched_held() understand boot
  rcu: Fix accelerated grace periods for last non-dynticked CPU
  rcu: Export rcu_scheduler_active
  rcu: Make rcu_read_lock_sched_held() take boot time into account
  rcu: Make lockdep_rcu_dereference() message less alarmist
  sched, cgroups: Fix module export
  rcu: Add RCU_CPU_STALL_VERBOSE to dump detailed per-task information
  rcu: Fix rcutorture mod_timer argument to delay one jiffy
  rcu: Fix deadlock in TREE_PREEMPT_RCU CPU stall detection
  rcu: Convert to raw_spinlocks
  rcu: Stop overflowing signed integers
  rcu: Use canonical URL for Mathieu's dissertation
  rcu: Accelerate grace period if last non-dynticked CPU
  rcu: Fix citation of Mathieu's dissertation
  rcu: Documentation update for CONFIG_PROVE_RCU
  security: Apply lockdep-based checking to rcu_dereference() uses
  idr: Apply lockdep-based diagnostics to rcu_dereference() uses
  radix-tree: Disable RCU lockdep checking in radix tree
  vfs: Abstract rcu_dereference_check for files-fdtable use
  ...
2010-02-28 10:13:16 -08:00
Boaz Harrosh
50a76fd3c3 exofs: groups support
* _calc_stripe_info() changes to accommodate for grouping
  calculations. Returns additional information

* old _prepare_pages() becomes _prepare_one_group()
  which stores pages belonging to one device group.

* New _prepare_for_striping iterates on all groups calling
  _prepare_one_group().

* Enable mounting of groups data_maps (group_width != 0)

[QUESTION]
what is faster A or B;
A.	x += stride;
	x = x % width + first_x;

B	x += stride
	if (x < last_x)
		x = first_x;

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2010-02-28 03:55:53 -08:00
Boaz Harrosh
b367e78bd1 exofs: Prepare for groups
* Rename _offset_dev_unit_off() to _calc_stripe_info()
  and recieve a struct for the output params

* In _prepare_for_striping we only need to call
  _calc_stripe_info() once. The other componets
  are easy to calculate from that. This code
  was inspired by what's done in truncate.

* Some code shifts that make sense now but will make
  more sense when group support is added.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2010-02-28 03:44:44 -08:00
Boaz Harrosh
96391e2bae exofs: Error recovery if object is missing from storage
If an object is referenced by a directory but does not
exist on a target, it is a very serious corruption that
means:
1. Either a power failure with very slim chance of it
  happening. Because the directory update is always submitted
  much after object creation, but if a directory is written
  to one device and the object creation to another it might
  theoretically happen.
2. It only ever happened to me while developing with BUGs
  causing file corruption. Crashes could also cause it but
  they are more like case 1.

In any way the object does not exist, so data is surely lost.
If there is a mix-up in the obj-id or data-map, then lost objects
can be salvaged by off-line fsck. The only recoverable information
is the directory name. By letting it appear as a regular empty file,
with date==0 (1970 Jan 1st) ownership to root, we enable recovery
of the only useful information. And also enable deletion or over-write.
I can see how this can hurt.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2010-02-28 03:44:43 -08:00
Boaz Harrosh
86093aaff5 exofs: convert io_state to use pages array instead of bio at input
* inode.c operations are full-pages based, and not actually
  true scatter-gather
* Lets us use more pages at once upto 512 (from 249) in 64 bit
* Brings us much much closer to be able to use exofs's io_state engine
  from objlayout driver. (Once I decide where to put the common code)

After RAID0 patch the outer (input) bio was never used as a bio, but
was simply a page carrier into the raid engine. Even in the simple
mirror/single-dev arrangement pages info was copied into a second bio.
It is now easer to just pass a pages array into the io_state and prepare
bio(s) once.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2010-02-28 03:44:42 -08:00
Boaz Harrosh
5d952b8391 exofs: RAID0 support
We now support striping over mirror devices. Including variable sized
stripe_unit.

Some limits:
* stripe_unit must be a multiple of PAGE_SIZE
* stripe_unit * stripe_count is maximum upto 32-bit (4Gb)

Tested RAID0 over mirrors, RAID0 only, mirrors only. All check.

Design notes:
* I'm not using a vectored raid-engine mechanism yet. Following the
  pnfs-objects-layout data-map structure, "Mirror" is just a private
  case of "group_width" == 1, and RAID0 is a private case of
  "Mirrors" == 1. The performance lose of the general case over the
  particular special case optimization is totally negligible, also
  considering the extra code size.

* In general I added a prepare_stripes() stage that divides the
  to-be-io pages to the participating devices, the previous
  exofs_ios_write/read, now becomes _write/read_mirrors and a new
  write/read upper layer loops on all devices calling
  _write/read_mirrors. Effectively the prepare_stripes stage is the all
  secret.
  Also truncate need fixing to accommodate for striping.

* In a RAID0 arrangement, in a regular usage scenario, if all inode
  layouts will start at the same device, the small files fill up the
  first device and the later devices stay empty, the farther the device
  the emptier it is.

  To fix that, each inode will start at a different stripe_unit,
  according to it's obj_id modulus number-of-stripe-units. And
  will then span all stripe-units in the same incrementing order
  wrapping back to the beginning of the device table. We call it
  a stripe-units moving window.

  Special consideration was taken to keep all devices in a mirror
  arrangement identical. So a broken osd-device could just be cloned
  from one of the mirrors and no FS scrubbing is needed. (We do that
  by rotating stripe-unit at a time and not a single device at a time.)

TODO:
 We no longer verify object_length == inode->i_size in exofs_iget.
 (since i_size is stripped on multiple objects now).
 I should introduce a multiple-device attribute reading, and use
 it in exofs_iget.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2010-02-28 03:43:08 -08:00
Boaz Harrosh
d9c740d225 exofs: Define on-disk per-inode optional layout attribute
* Layouts describe the way a file is spread on multiple devices.
  The layout information is stored in the objects attribute introduced
  in this patch.

* There can be multiple generating function for the layout.
  Currently defined:
    - No attribute present - use below moving-window on global
      device table, all devices.
      (This is the only one currently used in exofs)
    - an obj_id generated moving window - the obj_id is a randomizing
      factor in the otherwise global map layout.
    - An explicit layout stored, including a data_map and a device
      index list.
    - More might be defined in future ...

* There are two attributes defined of the same structure:
  A-data-files-layout - This layout is used by data-files. If present
                        at a directory, all files of that directory will
                        be created with this layout.
  A-meta-data-layout - This layout is used by a directory and other
                       meta-data information. Also inherited at creation
                       of subdirectories.

* At creation time inodes are created with the layout specified above.
  A usermode utility may change the creation layout on a give directory
  or file. Which in the case of directories, will also apply to newly
  created files/subdirectories, children of that directory.
  In the simple unaltered case of a newly created exofs, no layout
  attributes are present, and all layouts adhere to the layout specified
  at the device-table.

* In case of a future file system loaded in an old exofs-driver.
  At iget(), the generating_function is inspected and if not supported
  will return an IO error to the application and the inode will not
  be loaded. So not to damage any data.
  Note: After this patch we do not yet support any type of layout
        only the RAID0 patch that enables striping at the super-block
        level will add support for RAID0 layouts above. This way we
        are past and future compatible and fully bisectable.

* Access to the device table is done by an accessor since
  it will change according to above information.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2010-02-28 03:35:28 -08:00
Boaz Harrosh
46f4d973f6 exofs: unindent exofs_sbi_read
The original idea was that a mirror read can be sub-divided
to multiple devices. But this has very little gain and only
at very large IOes so it's not going to be implemented soon.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2010-02-28 03:35:27 -08:00
Boaz Harrosh
45d3abcb1a exofs: Move layout related members to a layout structure
* Abstract away those members in exofs_sb_info that are related/needed
  by a layout into a new exofs_layout structure. Embed it in exofs_sb_info.

* At exofs_io_state receive/keep a pointer to an exofs_layout. No need for
  an exofs_sb_info pointer, all we need is at exofs_layout.

* Change any usage of above exofs_sb_info members to their new name.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2010-02-28 03:35:27 -08:00
Boaz Harrosh
22ddc55638 exofs: Recover in the case of read-passed-end-of-file
In check_io, implement the case of reading passed end of
file, by clearing the pages and recover with no error. In
a raid arrangement this can become a legitimate situation
in case of holes in the file.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2010-02-28 03:35:26 -08:00
Boaz Harrosh
518f167a37 exofs: Micro-optimize exofs_i_info
optimize the exofs_i_info struct usage by moving the embedded
vfs_inode to be first. A compiler might optimize away an "add"
operation with constant zero. (Which it cannot with other constants)

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2010-02-28 03:35:25 -08:00
Boaz Harrosh
34ce4e7c23 exofs: debug print even less
* Last debug trimming left in some stupid print, remove them.
  Fixup some other prints
* Shift printing from inode.c to ios.c
* Add couple of prints when memory allocation fails.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2010-02-28 03:35:25 -08:00
Wengang Wang
5051f76883 ocfs2: send SIGXFSZ if new filesize exceeds limit -v2
This patch makes ocfs2 send SIGXFSZ if new file size exceeds the rlimit.
Processes may get SIGXFSZ on one node (in the cluster) while others will
not on another if file size limits are different on the two nodes.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-27 20:08:51 -08:00
Sunil Mushran
6fcef3f04a ocfs2/userdlm: Add tracing in userdlm
Make use of the newly added BASTS masklog to trace ASTs and BASTs in userdlm.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-27 19:57:07 -08:00
Sunil Mushran
9b915181af ocfs2: Use a separate masklog for AST and BASTs
This patch adds a new masklog and uses it allow tracing ASTs and BASTs
in the dlmglue layer. This has been found to be very useful in debugging
cluster locking issues.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-27 19:57:06 -08:00
Christian Kujau
4912002fff Remove EXPERIMENTAL from NFS_FSCACHE
There's currently an open Ubuntu bug[0], with the intent to compile NFS_FSCACHE
(and possibly AFS_FSCACHE, 9P_FSCACHE) into the standard Ubuntu kernel.
However, since *_FSCACHE still depends on EXPERIMENTAL, this won't happen.

As Arjan van de Ven pointed out[1], the EXPERIMENTAL flag doesn't mean that
much any more, I propose the following patch to fs/nfs/Kconfig.  I'd do the
same for fs/9p/Kconfig and fs/afs/Kconfig, but as I did not test 9p or AFS, I
feel it would not be appropriate for me to remove the flag.

[0] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/440522/comments/5
[1] http://lkml.org/lkml/2010/1/23/145

Signed-off-by: Christian Kujau <lists@nerdbynature.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-02-26 17:22:35 -08:00
Linus Torvalds
4cbd55188f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm:
  dlm: use bastmode in debugfs output
  dlm: Send lockspace name with uevents
  dlm: send reply before bast
  dlm: fix ordering of bast and cast
2010-02-26 17:19:30 -08:00
Linus Torvalds
b305956abc Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs: (52 commits)
  fs/xfs: Correct NULL test
  xfs: optimize log flushing in xfs_fsync
  xfs: only clear the suid bit once in xfs_write
  xfs: kill xfs_bawrite
  xfs: log changed inodes instead of writing them synchronously
  xfs: remove invalid barrier optimization from xfs_fsync
  xfs: kill the unused XFS_QMOPT_* flush flags V2
  xfs: Use delay write promotion for dquot flushing
  xfs: Sort delayed write buffers before dispatch
  xfs: Don't issue buffer IO direct from AIL push V2
  xfs: Use delayed write for inodes rather than async V2
  xfs: Make inode reclaim states explicit
  xfs: more reserved blocks fixups
  xfs: turn off sign warnings
  xfs: don't hold onto reserved blocks on remount,ro
  xfs: quota limit statvfs available blocks
  xfs: replace KM_LARGE with explicit vmalloc use
  xfs: cleanup up xfs_log_force calling conventions
  xfs: kill XLOG_VEC_SET_TYPE
  xfs: remove duplicate buffer flags
  ...
2010-02-26 17:18:52 -08:00
Linus Torvalds
f24407d2bd Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/xfs-vipt
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/xfs-vipt:
  xfs: fix xfs to work with Virtually Indexed architectures
  sh: add mm API for DMA to vmalloc/vmap areas
  arm: add mm API for DMA to vmalloc/vmap areas
  parisc: add mm API for DMA to vmalloc/vmap areas
  mm: add coherence API for DMA to vmalloc/vmap areas
2010-02-26 17:05:10 -08:00
Srinivas Eeda
bc9838c4d4 dlm: allow dlm do recovery during shutdown
If a node down event happens while dlm shutdown in progress, dlm recovery
should be done before dlm is shutdown.  We can't migrate unrecovered locks,
obviously.  But dlm_reco_thread only does recovery if the dlm_state is
in DLM_CTXT_JOINED.

dlm_reco_thread should do recovery if dlm_state is in DLM_CTXT_JOINED or
DLM_CTXT_IN_SHUTDOWN.

Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:19 -08:00
Tao Ma
cbaee472f2 ocfs2: Only bug out in direct io write for reflinked extent.
In ocfs2_direct_IO_get_blocks, we only need to bug out
in case of we are going to write a recounted extent rec.

What a silly bug introduced by me!

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Cc: stable@kernel.org
2010-02-26 15:41:19 -08:00
Coly Li
66b116c9d8 ocfs2: fix warning in ocfs2_file_aio_write()
This patch fixes a compiling warning in ocfs2_file_aio_write().

Signed-off-by: Coly Li <coly.li@suse.de>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:18 -08:00
Joel Becker
cbe0e331fd ocfs2_dlmfs: Enable the use of user cluster stacks.
Unlike ocfs2, dlmfs has no permanent storage.  It can't store off a
cluster stack it is supposed to be using.  So it can't specify the stack
name in ocfs2_cluster_connect().

Instead, we create ocfs2_cluster_connect_agnostic(), which simply uses
the stack that is currently enabled.  This is find for dlmfs, which will
rely on the stack initialization.

We add the "stackglue" capability to dlmfs's capability list.  This lets
userspace know dlmfs can be used with all cluster stacks.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:18 -08:00
Joel Becker
0016eedc41 ocfs2_dlmfs: Use the stackglue.
Rather than directly using o2dlm, dlmfs can now use the stackglue.  This
allows it to use userspace cluster stacks and fs/dlm.  This commit
forces o2cb for now.  A latter commit will bump the protocol version and
allow non-o2cb stacks.

This is one big sed, really.  LKM_xxMODE becomes DLM_LOCK_xx.  LKM_flag
becomes DLM_LKF_flag.

We also learn to check that the LVB is valid before reading it.  Any DLM
can lose the contents of the LVB during a complicated recovery.  userdlm
should be checking this.  Now it does.  dlmfs will return 0 from read(2)
if the LVB was invalid.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:18 -08:00
Joel Becker
e8fce482f3 ocfs2_dlmfs: Don't honor truncate. The size of a dlmfs file is LVB_LEN
We want folks using dlmfs to be able to use the LVB in places other than
just write(2)/read(2).  By ignoring truncate requests, we allow 'echo
"contents" > /dlm/space/lockname' to work.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:18 -08:00
Joel Becker
553b5eb91a ocfs2: Pass the locking protocol into ocfs2_cluster_connect().
Inside the stackglue, the locking protocol structure is hanging off of
the ocfs2_cluster_connection.  This takes it one further; the locking
protocol is passed into ocfs2_cluster_connect().  Now different cluster
connections can have different locking protocols with distinct asts.
Note that all locking protocols have to keep their maximum protocol
version in lock-step.

With the protocol structure set in ocfs2_cluster_connect(), there is no
need for the stackglue to have a static pointer to a specific protocol
structure.  We can change initialization to only pass in the maximum
protocol version.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:17 -08:00
Joel Becker
e603cfb074 ocfs2: Remove the ast pointers from ocfs2_stack_plugins
With the full ocfs2_locking_protocol hanging off of the
ocfs2_cluster_connection, ast wrappers can get the ast/bast pointers
there.  They don't need to get them from their plugin structure.

The user plugin still needs the maximum locking protocol version,
though.  This changes the plugin structure so that it only holds the max
version, not the entire ocfs2_locking_protocol pointer.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:16 -08:00
Joel Becker
110946c8fb ocfs2: Hang the locking proto on the cluster conn and use it in asts.
With the ocfs2_cluster_connection hanging off of the ocfs2_dlm_lksb, we
have access to it in the ast and bast wrapper functions.  Attach the
ocfs2_locking_protocol to the conn.

Now, instead of refering to a static variable for ast/bast pointers, the
wrappers can look at the connection.  This means different connections
can have different ast/bast pointers, and it reduces the need for the
static pointer.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:16 -08:00
Joel Becker
c0e4133851 ocfs2: Attach the connection to the lksb
We're going to want it in the ast functions, so we convert union
ocfs2_dlm_lksb to struct ocfs2_dlm_lksb and let it carry the connection.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:14 -08:00
Joel Becker
a796d2862a ocfs2: Pass lksbs back from stackglue ast/bast functions.
The stackglue ast and bast functions tried to maintain the fiction that
their arguments were void pointers.  In reality, stack_user.c had to
know that the argument was an ocfs2_lock_res in order to get the status
off of the lksb.  That's ugly.

This changes stackglue to always pass the lksb as the argument to ast
and bast functions.  The caller can always use container_of() to get the
ocfs2_lock_res or user_dlm_lock_res.  The net effect to the caller is
zero.  They still get back the lockres in their ast.  stackglue gets
cleaner, and now can use the lksb itself.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:14 -08:00
Joel Becker
34a9dd7e29 ocfs2_dlmfs: Move to its own directory
We're going to remove the tie between ocfs2_dlmfs and o2dlm.
ocfs2_dlmfs doesn't belong in the fs/ocfs2/dlm directory anymore.  Here
we move it to fs/ocfs2/dlmfs.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:14 -08:00
Joel Becker
65b6f34034 ocfs2_dlmfs: Use poll() to signify BASTs.
o2dlm's userspace filesystem is an easy way to use the DLM from
userspace.  It is intentionally simple. For example, it does not allow
for asynchronous behavior or lock conversion.  This is intentional to
keep the interface simple.

Because there is no asynchronous notification, there is no way for a
process holding a lock to know another node needs the lock.  This is the
number one complaint of ocfs2_dlmfs users.  Turns out, we can solve this
very easily.  We add poll() support to ocfs2_dlmfs.  When a BAST is
received, the lock's file descriptor will receive POLLIN.

This is trivial to implement.  Userdlm already has an appropriate
waitqueue, and the lock knows when it is blocked.

We add the "bast" capability to tell userspace this is available.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:14 -08:00
Joel Becker
14a437c2b6 ocfs2_dlmfs: Add capabilities parameter.
Over time, dlmfs has added some features that were not part of the
initial ABI.  Unfortunately, some of these features are not detectable
via standard usage.  For example, Linux's default poll always returns
POLLIN, so there is no way for a caller of poll(2) to know when dlmfs
added poll support.  Instead, we provide this list of new capabilities.

Capabilities is a read-only attribute.  We do it as a module parameter
so we can discover it whether dlmfs is built in, loaded, or even not
loaded (via modinfo).

The ABI features are local to this machine's dlmfs mount.  This is
distinct from the locking protocol, which is concerned with inter-node
interaction.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:13 -08:00
Joel Becker
399ff3a748 ocfs2: Handle errors while setting external xattr values.
ocfs2 can store extended attribute values as large as a single file.  It
does this using a standard ocfs2 btree for the large value.  However,
the previous code did not handle all error cases cleanly.

There are multiple problems to have.

1) We have trouble allocating space for a new xattr.  This leaves us
   with an empty xattr.
2) We overwrote an existing local xattr with a value root, and now we
   have an error allocating the storage.  This leaves us an empty xattr.
   where there used to be a value.  The value is lost.
3) We have trouble truncating a reused value.  This leaves us with the
   original entry pointing to the truncated original value.  The value
   is lost.
4) We have trouble extending the storage on a reused value.  This leaves
   us with the original value safely in place, but with more storage
   allocated when needed.

This doesn't consider storing local xattrs (values that don't require a
btree).  Those only fail when the journal fails.

Case (1) is easy.  We just remove the xattr we added.  We leak the
storage because we can't safely remove it, but otherwise everything is
happy.  We'll print a warning about the leak.

Case (4) is easy.  We still have the original value in place.  We can
just leave the extra storage attached to this xattr.  We return the
error, but the old value is untouched.  We print a warning about the
storage.

Case (2) and (3) are hard because we've lost the original values.  In
the old code, we ended up with values that could be partially read.
That's not good.  Instead, we just wipe the xattr entry and leak the
storage.  It stinks that the original value is lost, but now there isn't
a partial value to be read.  We'll print a big fat warning.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:13 -08:00
Joel Becker
139ffacebf ocfs2: Set inline xattr entries with ocfs2_xa_set()
ocfs2_xattr_ibody_set() is the only remaining user of
ocfs2_xattr_set_entry().  ocfs2_xattr_set_entry() actually does two
things: it calls ocfs2_xa_set(), and it initializes the inline xattrs.
Initializing the inline space really belongs in its own call.

We lift the initialization to ocfs2_xattr_ibody_init(), called from
ocfs2_xattr_ibody_set() only when necessary.  Now
ocfs2_xattr_ibody_set() can call ocfs2_xa_set() directly.
ocfs2_xattr_set_entry() goes away.

Another nice fact is that ocfs2_init_dinode_xa_loc() can trust
i_xattr_inline_size.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:13 -08:00
Joel Becker
d3981544d7 ocfs2: Set xattr block entries with ocfs2_xa_set()
ocfs2_xattr_block_set() calls into ocfs2_xattr_set_entry() with just the
HAS_XATTR flag.  Most of the machinery of ocfs2_xattr_set_entry() is
skipped.  All that really happens other than the call to ocfs2_xa_set()
is making sure the HAS_XATTR flag is set on the inode.

But HAS_XATTR should be set when we also set di->i_xattr_loc.  And
that's done in ocfs2_create_xattr_block().  So let's move it there, and
then ocfs2_xattr_block_set() can just call ocfs2_xa_set().

While we're there, ocfs2_create_xattr_block() can take the set_ctxt for
a smaller argument list.  It also learns to set HAS_XATTR_FL, because it
knows for sure.  ocfs2_create_empty_xatttr_block() in the reflink path
fakes a set_ctxt to call ocfs2_create_xattr_block().

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:13 -08:00
Joel Becker
c5d95df5f7 ocfs2: Let ocfs2_xa_prepare_entry() do space checks.
ocfs2_xattr_set_in_bucket() doesn't need to do its own hacky space
checking.  Let's let ocfs2_xa_prepare_entry() (via ocfs2_xa_set()) do
the more accurate work.  Whenever it doesn't have space,
ocfs2_xattr_set_in_bucket() can try to get more space.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:12 -08:00
Joel Becker
bca5e9bd1e ocfs2: Gell into ocfs2_xa_set()
ocfs2_xa_set() wraps the ocfs2_xa_prepare_entry()/ocfs2_xa_store_value()
logic.  Both callers can now use the same routine.  ocfs2_xa_remove()
moves directly into ocfs2_xa_set().

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:11 -08:00
Joel Becker
73857ee0b5 ocfs2: Allocation in ocfs2_xa_prepare_entry(), values in ocfs2_xa_store_value()
ocfs2_xa_prepare_entry() gets all the logic to add, remove, or modify
external value trees.  Now, when it exits, the entry is ready to receive
a value of any size.

ocfs2_xa_remove() is added to handle the complete removal of an entry.
It truncates the external value tree before calling
ocfs2_xa_remove_entry().

ocfs2_xa_store_inline_value() becomes ocfs2_xa_store_value().  It can
store any value.

ocfs2_xattr_set_entry() loses all the allocation logic and just uses
these functions.  ocfs2_xattr_set_value_outside() disappears.

ocfs2_xattr_set_in_bucket() uses these functions and makes
ocfs2_xattr_set_entry_in_bucket() obsolete.  That goes away, as does
ocfs2_xattr_bucket_set_value_outside() and
ocfs2_xattr_bucket_value_truncate().

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:11 -08:00