Commit Graph

21441 Commits (af06216a8ef1c430cc6ad22b562f3a11a512c5dd)

Author SHA1 Message Date
bpm@sgi.com 24446fc66f xfs: xfs_bmap_add_extent_delay_real should init br_startblock
When filling in the middle of a previous delayed allocation in
xfs_bmap_add_extent_delay_real, set br_startblock of the new delay
extent to the right to nullstartblock instead of 0 before inserting
the extent into the ifork (xfs_iext_insert), rather than setting
br_startblock afterward.

Adding the extent into the ifork with br_startblock=0 can lead to
the extent being copied into the btree by xfs_bmap_extent_to_btree
if we happen to convert from extents format to btree format before
updating br_startblock with the correct value.  The unexpected
addition of this delay extent to the btree can cause subsequent
XFS_WANT_CORRUPTED_GOTO filesystem shutdown in several
xfs_bmap_add_extent_delay_real cases where we are converting a delay
extent to real and unexpectedly find an extent already inserted.
For example:

911         case BMAP_LEFT_FILLING:
912                 /*
913                  * Filling in the first part of a previous delayed allocation.
914                  * The left neighbor is not contiguous.
915                  */
916                 trace_xfs_bmap_pre_update(ip, idx, state, _THIS_IP_);
917                 xfs_bmbt_set_startoff(ep, new_endoff);
918                 temp = PREV.br_blockcount - new->br_blockcount;
919                 xfs_bmbt_set_blockcount(ep, temp);
920                 xfs_iext_insert(ip, idx, 1, new, state);
921                 ip->i_df.if_lastex = idx;
922                 ip->i_d.di_nextents++;
923                 if (cur == NULL)
924                         rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
925                 else {
926                         rval = XFS_ILOG_CORE;
927                         if ((error = xfs_bmbt_lookup_eq(cur, new->br_startoff,
928                                         new->br_startblock, new->br_blockcount,
929                                         &i)))
930                                 goto done;
931                         XFS_WANT_CORRUPTED_GOTO(i == 0, done);

With the bogus extent in the btree we shutdown the filesystem at
931.  The conversion from extents to btree format happens when the
number of extents in the inode increases above ip->i_df.if_ext_max.
xfs_bmap_extent_to_btree copies extents from the ifork into the
btree, ignoring all delalloc extents which are denoted by
br_startblock having some value of nullstartblock.

SGI-PV: 1013221

Signed-off-by: Ben Myers <bpm@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2011-01-28 09:13:29 -06:00
Dave Chinner 0fbca4d1c3 xfs: fix dquot shaker deadlock
Commit 368e136 ("xfs: remove duplicate code from dquot reclaim") fails
to unlock the dquot freelist when the number of loop restarts is
exceeded in xfs_qm_dqreclaim_one(). This causes hangs in memory
reclaim.

Rework the loop control logic into an unwind stack that all the
different cases jump into. This means there is only one set of code
that processes the loop exit criteria, and simplifies the unlocking
of all the items from different points in the loop. It also fixes a
double increment of the restart counter from the qi_dqlist_lock
case.

Reported-by: Malcolm Scott <lkml@malc.org.uk>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
2011-01-28 09:05:36 -06:00
Dave Chinner c6f990d1ff xfs: handle CIl transaction commit failures correctly
Failure to commit a transaction into the CIL is not handled
correctly. This currently can only happen when racing with a
shutdown and requires an explicit shutdown check, so it rare and can
be avoided. Remove the shutdown check and make the CIL commit a void
function to indicate it will always succeed, thereby removing the
incorrectly handled failure case.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
2011-01-28 09:05:36 -06:00
Dave Chinner 5315837dae xfs: limit extsize to size of AGs and/or MAXEXTLEN
The extent size hint can be set to larger than an AG. This means
that the alignment process can push the range to be allocated
outside the bounds of the AG, resulting in assert failures or
corrupted bmbt records. Similarly, if the extsize is larger than the
maximum extent size supported, the alignment process will produce
extents that are too large to fit into the bmbt records, resulting
in a different type of assert/corruption failure.

Fix this by limiting extsize at the time іt is set firstly to be
less than MAXEXTLEN, then to be a maximum of half the size of the
AGs in the filesystem for non-realtime inodes. Realtime inodes do
not allocate out of AGs, so don't have to be restricted by the size
of AGs.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
2011-01-28 09:05:36 -06:00
Dave Chinner 4ce159890c xfs: prevent extsize alignment from exceeding maximum extent size
When doing delayed allocation, if the allocation size is for a
maximally sized extent, extent size alignment can push it over this
limit. This results in an assert failure in xfs_bmbt_set_allf() as
the extent length is too large to find in the extent record.

Fix this by ensuring that we allow for space that extent size
alignment requires (up to 2 * (extsize -1) blocks as we have to
handle both head and tail alignment) when limiting the maximum size
of the extent.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
2011-01-28 09:05:36 -06:00
Dave Chinner 14b064ceaa xfs: limit extent length for allocation to AG size
Delayed allocation extents can be larger than AGs, so when trying to
convert a large range we may scan every AG inside
xfs_bmap_alloc_nullfb() trying to find an AG with a size larger than
an AG. We should stop when we find the first AG with a maximum
possible allocation size. This causes excessive CPU usage when there
are lots of AGs.

The same problem occurs when doing preallocation of a range larger
than an AG.

Fix the problem by limiting real allocation lengths to the maximum
that an AG can support. This means if we have empty AGs, we'll stop
the search at the first of them. If there are no empty AGs, we'll
still scan them all, but that is a different problem....

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
2011-01-28 09:05:35 -06:00
Dave Chinner b8fc82630a xfs: speculative delayed allocation uses rounddown_power_of_2 badly
rounddown_power_of_2() returns an undefined result when passed a
value of zero. The specualtive delayed allocation code is doing this
when the inode is zero length. Hence occasionally the preallocation
is much, much larger than is necessary (e.g. 8GB for a 270 _byte_
file). Ensure we don't even pass a zero value to this function so
the result of preallocation is always the desired size.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
2011-01-28 09:05:35 -06:00
Dave Chinner e34a314c5e xfs: fix efi item leak on forced shutdown
After test 139, kmemleak shows:

unreferenced object 0xffff880078b405d8 (size 400):
  comm "xfs_io", pid 4904, jiffies 4294909383 (age 1186.728s)
  hex dump (first 32 bytes):
    60 c1 17 79 00 88 ff ff 60 c1 17 79 00 88 ff ff  `..y....`..y....
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff81afb04d>] kmemleak_alloc+0x2d/0x60
    [<ffffffff8115c6cf>] kmem_cache_alloc+0x13f/0x2b0
    [<ffffffff814aaa97>] kmem_zone_alloc+0x77/0xf0
    [<ffffffff814aab2e>] kmem_zone_zalloc+0x1e/0x50
    [<ffffffff8147cd6b>] xfs_efi_init+0x4b/0xb0
    [<ffffffff814a4ee8>] xfs_trans_get_efi+0x58/0x90
    [<ffffffff81455fab>] xfs_bmap_finish+0x8b/0x1d0
    [<ffffffff814851b4>] xfs_itruncate_finish+0x2c4/0x5d0
    [<ffffffff814a970f>] xfs_setattr+0x8df/0xa70
    [<ffffffff814b5c7b>] xfs_vn_setattr+0x1b/0x20
    [<ffffffff8117dc00>] notify_change+0x170/0x2e0
    [<ffffffff81163bf6>] do_truncate+0x66/0xa0
    [<ffffffff81163d0b>] sys_ftruncate+0xdb/0xe0
    [<ffffffff8103a002>] system_call_fastpath+0x16/0x1b
    [<ffffffffffffffff>] 0xffffffffffffffff

The cause of the leak is that the "remove" parameter of IOP_UNPIN()
is never set when a CIL push is aborted. This means that the EFI
item is never freed if it was in the push being cancelled. The
problem is specific to delayed logging, but has uncovered a couple
of problems with the handling of IOP_UNPIN(remove).

Firstly, we cannot safely call xfs_trans_del_item() from IOP_UNPIN()
in the CIL commit failure path or the iclog write failure path
because for delayed loging we have no transaction context. Hence we
must only call xfs_trans_del_item() if the log item being unpinned
has an active log item descriptor.

Secondly, xfs_trans_uncommit() does not handle log item descriptor
freeing during the traversal of log items on a transaction. It can
reference a freed log item descriptor when unpinning an EFI item.
Hence it needs to use a safe list traversal method to allow items to
be removed from the transaction during IOP_UNPIN().

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
2011-01-28 09:01:33 -06:00
Linus Torvalds b12ece7d85 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: avoid picking MDS that is not active
  ceph: avoid immediate cap check after import
  ceph: fix flushing of caps vs cap import
  ceph: fix erroneous cap flush to non-auth mds
  ceph: fix cap_wanted_delay_{min,max} mount option initialization
  ceph: fix xattr rbtree search
  ceph: fix getattr on directory when using norbytes
2011-01-28 12:12:58 +10:00
Shirish Pargaonkar ee2c925850 cifs: More crypto cleanup (try #2)
Replaced md4 hashing function local to cifs module with kernel crypto APIs.
As a result, md4 hashing function and its supporting functions in
file md4.c are not needed anymore.

Cleaned up function declarations, removed forward function declarations,
and removed a header file that is being deleted from being included.

Verified that sec=ntlm/i, sec=ntlmv2/i, and sec=ntlmssp/i work correctly.

Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-27 19:58:13 +00:00
Dave Chinner 7db37c5e65 xfs: fix log ticket leak on forced shutdown.
The kmemleak detector shows this after test 139:

unreferenced object 0xffff880079b88bb0 (size 264):
  comm "xfs_io", pid 4904, jiffies 4294909382 (age 276.824s)
  hex dump (first 32 bytes):
    00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00  .....N..........
    ff ff ff ff ff ff ff ff 48 7b c9 82 ff ff ff ff  ........H{......
  backtrace:
    [<ffffffff81afb04d>] kmemleak_alloc+0x2d/0x60
    [<ffffffff8115c6cf>] kmem_cache_alloc+0x13f/0x2b0
    [<ffffffff814aaa97>] kmem_zone_alloc+0x77/0xf0
    [<ffffffff814aab2e>] kmem_zone_zalloc+0x1e/0x50
    [<ffffffff8148f394>] xlog_ticket_alloc+0x34/0x170
    [<ffffffff81494444>] xlog_cil_push+0xa4/0x3f0
    [<ffffffff81494eca>] xlog_cil_force_lsn+0x15a/0x160
    [<ffffffff814933a5>] _xfs_log_force_lsn+0x75/0x2d0
    [<ffffffff814a264d>] _xfs_trans_commit+0x2bd/0x2f0
    [<ffffffff8148bfdd>] xfs_iomap_write_allocate+0x1ad/0x350
    [<ffffffff814ac17f>] xfs_map_blocks+0x21f/0x370
    [<ffffffff814ad1b7>] xfs_vm_writepage+0x1c7/0x550
    [<ffffffff8112200a>] __writepage+0x1a/0x50
    [<ffffffff81122df2>] write_cache_pages+0x1c2/0x4c0
    [<ffffffff81123117>] generic_writepages+0x27/0x30
    [<ffffffff814aba5d>] xfs_vm_writepages+0x5d/0x80

By inspection, the leak occurs when xlog_write() returns and error
and we jump to the abort path without dropping the reference on the
active ticket.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
2011-01-27 12:02:00 +11:00
Li Zefan 4d728ec7ae Btrfs: Fix file clone when source offset is not 0
Suppose:
- the source extent is: [0, 100]
- the src offset is 10
- the clone length is 90
- the dest offset is 0

This statement:

	new_key.offset = key.offset + destoff - off

will produce such an extent for the dest file:

	[ino, BTRFS_EXTENT_DATA_KEY, -10]

, which is obviously wrong.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-01-27 01:11:18 +08:00
Miao Xie b897abec03 Btrfs: Fix memory leak in writepage fixup work
fixup, which is allocated when starting page write to fix up the
extent without ORDERED bit set, should be freed after this work
is done.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-01-27 01:10:30 +08:00
Miao Xie d0f69686c2 Btrfs: Don't return acl info when mounting with noacl option
Steps to reproduce:

  # mkfs.btrfs /dev/sda2
  # mount /dev/sda2 /mnt
  # touch /mnt/file0
  # setfacl -m 'u:root:x,g::x,o::x' /mnt/file0
  # umount /mnt
  # mount /dev/sda2 -o noacl /mnt
  # getfacl /mnt/file0
  ...
  user::rw-
  user:root:--x
  group::--x
  mask::--x
  other::--x

The output should be:

  user::rw-
  group::--x
  other::--x

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-01-27 01:05:16 +08:00
Tero Roponen 3f3d0bc0df Btrfs: Free correct pointer after using strsep
We must save and free the original kstrdup()'ed pointer
because strsep() modifies its first argument.

Signed-off-by: Tero Roponen <tero.roponen@gmail.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-01-27 01:05:11 +08:00
Ian Kent bdc924bb4c Btrfs: Fix memory leak on finding existing super
We missed a memory deallocation in commit 450ba0ea.

If an existing super block is found at mount and there is no
error condition then the pre-allocated tree_root and fs_info
are no not used and are not freeded.

Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-01-27 01:05:07 +08:00
Li Zefan 83a4d54840 Btrfs: Fix memory leak at umount
fs_info, which is allocated in open_ctree(), should be freed
in close_ctree().

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-01-27 01:05:02 +08:00
Li Zefan f333adb5d6 btrfs: Check mergeable free space when removing a cluster
After returing extents from a cluster to the block group, some
extents in the block group may be mergeable.

Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-01-27 01:04:57 +08:00
Li Zefan 120d66eec0 btrfs: Add a helper try_merge_free_space()
When adding a new extent, we'll firstly see if we can merge
this extent to the left or/and right extent. Extract this as
a helper try_merge_free_space().

As a side effect, we fix a small bug that if the new extent
has non-bitmap left entry but is unmergeble, we'll directly
link the extent without trying to drop it into bitmap.

This also prepares for the next patch.

Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-01-27 01:04:50 +08:00
Li Zefan 5e71b5d5ec btrfs: Update stats when allocating from a cluster
When allocating extent entry from a cluster, we should update
the free_space and free_extents fields of the block group.

Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-01-27 01:04:46 +08:00
Li Zefan 70b7da304f btrfs: Free fully occupied bitmap in cluster
If there's no more free space in a bitmap, we should free it.

Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-01-27 01:04:41 +08:00
Li Zefan edf6e2d1dd btrfs: Add helper function free_bitmap()
Remove some duplicated code.

This prepares for the next patch.

Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-01-27 01:04:37 +08:00
Li Zefan 8eb2d829ff btrfs: Fix threshold calculation for block groups smaller than 1GB
If a block group is smaller than 1GB, the extent entry threadhold
calculation will always set the threshold to 0.

So as free space gets fragmented, btrfs will switch to use bitmap
to manage free space, but then will never switch back to extents
due to this bug.

Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-01-27 01:04:31 +08:00
Andy Adamson c7a360b05b NFS construct consistent co_ownerid for v4.1
As stated in section 2.4 of RFC 5661, subsequent instances of the client need
to present the same co_ownerid. Concatinate the client's IP dot address,
host name, and the rpc_auth pseudoflavor to form the co_ownerid.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-01-25 22:49:14 -05:00
Torben Hohn ac751efa6a console: rename acquire/release_console_sem() to console_lock/unlock()
The -rt patches change the console_semaphore to console_mutex.  As a
result, a quite large chunk of the patches changes all
acquire/release_console_sem() to acquire/release_console_mutex()

This commit makes things use more neutral function names which dont make
implications about the underlying lock.

The only real change is the return value of console_trylock which is
inverted from try_acquire_console_sem()

This patch also paves the way to switching console_sem from a semaphore to
a mutex.

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: make console_trylock return 1 on success, per Geert]
Signed-off-by: Torben Hohn <torbenh@gmx.de>
Cc: Thomas Gleixner <tglx@tglx.de>
Cc: Greg KH <gregkh@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-26 10:50:06 +10:00
Phillip Lougher 3689456b4b squashfs: fix use of uninitialised variable in zlib & xz decompressors
Fix potential use of uninitialised variable caused by recent
decompressor code optimisations.

In zlib_uncompress (zlib_wrapper.c) we have

	int zlib_err, zlib_init = 0;
	...
	do {
		...
			if (avail == 0) {
				offset = 0;
				put_bh(bh[k++]);
				continue;
			}
		...
		zlib_err = zlib_inflate(stream, Z_SYNC_FLUSH);
		...
	} while (zlib_err == Z_OK);

If continue is executed (avail == 0) then the while condition will be
evaluated testing zlib_err, which is uninitialised first time around the
loop.

Fix this by getting rid of the 'if (avail == 0)' condition test, this
edge condition should not be being handled in the decompressor code, and
instead handle it generically in the caller code.

Similarly for xz_wrapper.c.

Incidentally, on most architectures (bar Mips and Parisc), no
uninitialised variable warning is generated by gcc, this is because the
while condition test on continue is optimised out and not performed
(when executing continue zlib_err has not been changed since entering
the loop, and logically if the while condition was true previously, then
it's still true).

Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
Reported-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-26 10:50:05 +10:00
Linus Torvalds 3af03655e8 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
  nilfs2: fix crash after one superblock became unavailable
2011-01-26 09:03:36 +10:00
Trond Myklebust 27dc1cd3ad NFS: nfs_wcc_update_inode() should set nfsi->attr_gencount
If the call to nfs_wcc_update_inode() results in an attribute update, we
need to ensure that the inode's attr_gencount gets bumped too, otherwise
we are not protected against races with other GETATTR calls.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-01-25 15:28:21 -05:00
Andy Adamson b2a2897dc4 NFS improve pnfs_put_deviceid_cache debug print
What we really want to know is the ref count.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-01-25 15:26:51 -05:00
Andy Adamson 2c4cdf8f6d NFS fix cb_sequence error processing
Always assign the cb_process_state nfs_client pointer so a processing error
in cb_sequence after the nfs_client is found and referenced returns
a non-NULL cb_process_state nfs_client and the matching nfs_put_client in
nfs4_callback_compound dereferences the client.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-01-25 15:26:51 -05:00
Andy Adamson 778be232a2 NFS do not find client in NFSv4 pg_authenticate
The information required to find the nfs_client cooresponding to the incoming
back channel request is contained in the NFS layer. Perform minimal checking
in the RPC layer pg_authenticate method, and push more detailed checking into
the NFS layer where the nfs_client can be found.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-01-25 15:26:51 -05:00
Chuck Lever 80c30e8de4 NLM: Fix "kernel BUG at fs/lockd/host.c:417!" or ".../host.c:283!"
Nick Bowler <nbowler@elliptictech.com> reports:

> We were just having some NFS server troubles, and my client machine
> running 2.6.38-rc1+ (specifically, commit 2b1caf6ed7) crashed
> hard (syslog output appended to this mail).
>
> I'm not sure what the exact timeline was or how to reproduce this,
> but the server was rebooted during all this.  Since I've never seen
> this happen before, it is possibly a regression from previous kernel
> releases.  However, I recently updated my nfs-utils (on the client) to
> version 1.2.3, so that might be related as well.

  [ BUG output redacted ]

When done searching, the for_each_host loop in next_host_state() falls
through and returns the final host on the host chain without bumping
it's reference count.

Since the host's ref count is only one at that point, releasing the
host in nlm_host_rebooted() attempts to destroy the host prematurely,
and therefore hits a BUG().

Likely, the original intent of the for_each_host behavior in
next_host_state() was to handle the case when the host chain is empty.
Searching the chain and finding no suitable host to return needs to be
handled as well.

Defensively restructure next_host_state() always to return NULL when
the loop falls through.

Introduced by commit b10e30f6 "lockd: reorganize nlm_host_rebooted".

Cc: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-01-25 15:24:47 -05:00
Chuck Lever f61f6da0d5 NFS: Prevent memory allocation failure in nfsacl_encode()
nfsacl_encode() allocates memory in certain cases.  This of course
is not guaranteed to work.

Since commit 9f06c719 "SUNRPC: New xdr_streams XDR encoder API", the
kernel's XDR encoders can't return a result indicating possibly a
failure, so a memory allocation failure in nfsacl_encode() has become
fatal (ie, the XDR code Oopses) in some cases.

However, the allocated memory is a tiny fixed amount, on the order
of 40-50 bytes.  We can easily use a stack-allocated buffer for
this, with only a wee bit of nose-holding.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-01-25 15:24:47 -05:00
Chuck Lever 731f3f482a NFS: nfsacl_{encode,decode} should return signed integer
Clean up.

The nfsacl_encode() and nfsacl_decode() functions return negative
errno values, and each call site verifies that the returned value
is not negative.  Change the synopsis of both of these functions
to reflect this usage.

Document the synopsis and return values.

Reported-by: Trond Myklebust <trond.myklebust@netapp.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-01-25 15:24:47 -05:00
Chuck Lever ee5dc7732b NFS: Fix "kernel BUG at fs/nfs/nfs3xdr.c:1338!"
Milan Broz <mbroz@redhat.com> reports:

> on today Linus' tree I get OOps if using nfs.
>
> server (2.6.36) exports dir:
> /dir   172.16.1.0/24(rw,async,all_squash,no_subtree_check,anonuid=500,anongid=500)
>
> on client it is mounted  in fstab
> server:/dir  /mnt/tst  nfs  rw,soft 0 0
>
> and these commands OOpses it (simplified from a configure script):
>
> cd /dir
> touch x
> install x y
>
> [  105.327701] ------------[ cut here ]------------
> [  105.327979] kernel BUG at fs/nfs/nfs3xdr.c:1338!
> [  105.328075] invalid opcode: 0000 [#1] PREEMPT SMP
> [  105.328223] last sysfs file: /sys/devices/virtual/bdi/0:16/uevent
> [  105.328349] Modules linked in: usbcore dm_mod
> [  105.328553]
> [  105.328678] Pid: 3710, comm: install Not tainted 2.6.37+ #423 440BX Desktop Reference Platform/VMware Virtual Platform
> [  105.328853] EIP: 0060:[<c116c06c>] EFLAGS: 00010282 CPU: 0
> [  105.329152] EIP is at nfs3_xdr_enc_setacl3args+0x61/0x98
> [  105.329249] EAX: ffffffea EBX: ce941d98 ECX: 00000000 EDX: 00000004
> [  105.329340] ESI: ce941cd0 EDI: 000000a4 EBP: ce941cc0 ESP: ce941cb4
> [  105.329431]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
> [  105.329525] Process install (pid: 3710, ti=ce940000 task=ced36f20 task.ti=ce940000)
> [  105.336600] Stack:
> [  105.336693]  ce941cd0 ce9dc000 00000000 ce941cf8 c12ecd02 c12f43e0 c116c00b cf754158
> [  105.336982]  ce9dc004 cf754284 ce9dc004 cf7ffee8 ceff9978 ce9dc000 cf7ffee8 ce9dc000
> [  105.337182]  ce9dc000 ce941d14 c12e698d cf75412c ce941d98 cf7ffee8 cf7fff20 00000000
> [  105.337405] Call Trace:
> [  105.337695]  [<c12ecd02>] rpcauth_wrap_req+0x75/0x7f
> [  105.337806]  [<c12f43e0>] ? xdr_encode_opaque+0x12/0x15
> [  105.337898]  [<c116c00b>] ? nfs3_xdr_enc_setacl3args+0x0/0x98
> [  105.337988]  [<c12e698d>] call_transmit+0x17e/0x1e8
> [  105.338072]  [<c12ec307>] __rpc_execute+0x6d/0x1a6
> [  105.338155]  [<c12ec474>] rpc_execute+0x34/0x37
> [  105.338235]  [<c12e738d>] rpc_run_task+0xb5/0xbd
> [  105.338316]  [<c12e7474>] rpc_call_sync+0x3d/0x58
> [  105.338402]  [<c116d0c6>] nfs3_proc_setacls+0x18e/0x24f
> [  105.338493]  [<c10b3f76>] ? __kmalloc+0x148/0x1c4
> [  105.338579]  [<c10ecd01>] ? posix_acl_alloc+0x12/0x22
> [  105.338665]  [<c116d5c8>] nfs3_proc_setacl+0xa0/0xca
> [  105.338748]  [<c116d69c>] nfs3_setxattr+0x62/0x88
> [  105.338834]  [<c1317042>] ? sub_preempt_count+0x7c/0x89
> [  105.338926]  [<c116d63a>] ? nfs3_setxattr+0x0/0x88
> [  105.339026]  [<c10cfa79>] __vfs_setxattr_noperm+0x26/0x95
> [  105.339114]  [<c10cfb43>] vfs_setxattr+0x5b/0x76
> [  105.339211]  [<c10cfbfb>] setxattr+0x9d/0xc3
> [  105.339298]  [<c10a2ea8>] ? handle_pte_fault+0x258/0x5cb
> [  105.339428]  [<c1091ff6>] ? __free_pages+0x1a/0x23
> [  105.339517]  [<c10498ea>] ? up_read+0x16/0x2c
> [  105.339599]  [<c10b8365>] ? fget+0x0/0xa3
> [  105.339677]  [<c10b8365>] ? fget+0x0/0xa3
> [  105.339760]  [<c1025d23>] ? get_parent_ip+0xb/0x31
> [  105.339843]  [<c1317042>] ? sub_preempt_count+0x7c/0x89
> [  105.339931]  [<c10cfc72>] sys_fsetxattr+0x51/0x79
> [  105.340014]  [<c1002853>] sysenter_do_call+0x12/0x32
> [  105.340133] Code: 2e 76 18 00 58 31 d2 8b 7f 28 f6 43 04 01 74 03 8b 53 08 6a 00 8b 46 04 6a 01 8b 0b 52 89 fa e8 85 10 f8 ff 83 c4 0c 85 c0 79 04 <0f> 0b eb fe 31 c9 f6 43 04 04 74 03 8b 4b 0c 68 00 10 00 00 8d
> [  105.350321] EIP: [<c116c06c>] nfs3_xdr_enc_setacl3args+0x61/0x98 SS:ESP 0068:ce941cb4
> [  105.364385] ---[ end trace 01fcfe7f0f7f6e4a ]---

nfs3_xdr_enc_setacl3args() is not properly setting up the target
buffer before nfsacl_encode() attempts to encode the ACL.

Introduced by commit d9c407b1 "NFS: Introduce new-style XDR encoding
functions for NFSv3."

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-01-25 15:24:47 -05:00
Chuck Lever 839f7ad693 NFS: Fix "kernel BUG at fs/aio.c:554!"
Nick Piggin reports:

> I'm getting use after frees in aio code in NFS
>
> [ 2703.396766] Call Trace:
> [ 2703.396858]  [<ffffffff8100b057>] ? native_sched_clock+0x27/0x80
> [ 2703.396959]  [<ffffffff8108509e>] ? put_lock_stats+0xe/0x40
> [ 2703.397058]  [<ffffffff81088348>] ? lock_release_holdtime+0xa8/0x140
> [ 2703.397159]  [<ffffffff8108a2a5>] lock_acquire+0x95/0x1b0
> [ 2703.397260]  [<ffffffff811627db>] ? aio_put_req+0x2b/0x60
> [ 2703.397361]  [<ffffffff81039701>] ? get_parent_ip+0x11/0x50
> [ 2703.397464]  [<ffffffff81612a31>] _raw_spin_lock_irq+0x41/0x80
> [ 2703.397564]  [<ffffffff811627db>] ? aio_put_req+0x2b/0x60
> [ 2703.397662]  [<ffffffff811627db>] aio_put_req+0x2b/0x60
> [ 2703.397761]  [<ffffffff811647fe>] do_io_submit+0x2be/0x7c0
> [ 2703.397895]  [<ffffffff81164d0b>] sys_io_submit+0xb/0x10
> [ 2703.397995]  [<ffffffff8100307b>] system_call_fastpath+0x16/0x1b
>
> Adding some tracing, it is due to nfs completing the request then
> returning something other than -EIOCBQUEUED, so aio.c
> also completes the request.

To address this, prevent the NFS direct I/O engine from completing
async iocbs when the forward path returns an error without starting
any I/O.

This fix appears to survive ^C during both "xfstest no. 208" and "fsx
-Z."

It's likely this bug has existed for a very long while, as we are seeing
very similar symptoms in OEL 5.  Copying stable.

Cc: Stable <stable@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-01-25 15:24:47 -05:00
Jesper Juhl ad3d2eedf0 NFS4: Avoid potential NULL pointer dereference in decode_and_add_ds().
On Mon, 17 Jan 2011, Mi Jinlong wrote:

>
>
> Jesper Juhl:
> > strrchr() can return NULL if nothing is found. If this happens we'll
> > dereference a NULL pointer in
> > fs/nfs/nfs4filelayoutdev.c::decode_and_add_ds().
> >
> > I tried to find some other code that guarantees that this can never
> > happen but I was unsuccessful. So, unless someone else can point to some
> > code that ensures this can never be a problem, I believe this patch is
> > needed.
> >
> > While I was changing this code I also noticed that all the dprintk()
> > statements, except one, start with "%s:". The one missing the ":" I added
> > it to.
>
>   Maybe another one also should be changed at decode_and_add_ds() at line 243:
>
>    243  printk("%s Decoded address and port %s\n", __func__, buf);
>
Missed that one. Thanks.

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-01-25 15:24:46 -05:00
Pavel Shilovsky d39454ffe4 CIFS: Add strictcache mount option
Use for switching on strict cache mode. In this mode the
client reads from the cache all the time it has Oplock Level II,
otherwise - read from the server. As for write - the client stores
a data in the cache in Exclusive Oplock case, otherwise - write
directly to the server.

Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-25 19:31:38 +00:00
Pavel Shilovsky 72432ffcf5 CIFS: Implement cifs_strict_writev (try #4)
If we don't have Exclusive oplock we write a data to the server.
Also set invalidate_mapping flag on the inode if we wrote something
to the server. Add cifs_iovec_write to let the client write iovec
buffers through CIFSSMBWrite2.

Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-25 19:30:13 +00:00
Steve French 93c100c0b4 [CIFS] Replace cifs md5 hashing functions with kernel crypto APIs
Replace remaining use of md5 hash functions local to cifs module
with kernel crypto APIs.
Remove header and source file containing those local functions.

Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-25 19:28:43 +00:00
Sage Weil d66bbd441c ceph: avoid picking MDS that is not active
Ignore replication or auth frag data if it indicates an MDS that is not
active.  This can happen if the MDS shuts down and the client has stale
data about the namespace distribution across the MDS cluster.  If that's
the case, fall back to directing the request based on the auth cap (which
should always be accurate).

Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-25 08:16:37 -08:00
Linus Torvalds c723fdab8a Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  Make CIFS mount work in a container.
  CIFS: Remove pointless variable assignment in cifs_dfs_do_automount()
2011-01-25 14:23:54 +10:00
Rob Landley f1d0c99865 Make CIFS mount work in a container.
Teach cifs about network namespaces, so mounting uses adresses/routing
visible from the container rather than from init context.

A container is a chroot on steroids that changes more than just the root
filesystem the new processes see.  One thing containers can isolate is
"network namespaces", meaning each container can have its own set of
ethernet interfaces, each with its own own IP address and routing to the
outside world.  And if you open a socket in _userspace_ from processes
within such a container, this works fine.

But sockets opened from within the kernel still use a single global
networking context in a lot of places, meaning the new socket's address
and routing are correct for PID 1 on the host, but are _not_ what
userspace processes in the container get to use.

So when you mount a network filesystem from within in a container, the
mount code in the CIFS driver uses the host's networking context and not
the container's networking context, so it gets the wrong address, uses
the wrong routing, and may even try to go out an interface that the
container can't even access...  Bad stuff.

This patch copies the mount process's network context into the CIFS
structure that stores the rest of the server information for that mount
point, and changes the socket open code to use the saved network context
instead of the global network context.  I.E. "when you attempt to use
these addresses, do so relative to THIS set of network interfaces and
routing rules, not the old global context from back before we supported
containers".

The big long HOWTO sets up a test environment on the assumption you've
never used ocntainers before.  It basically says:

1) configure and build a new kernel that has container support
2) build a new root filesystem that includes the userspace container
control package (LXC)
3) package/run them under KVM (so you don't have to mess up your host
system in order to play with containers).
4) set up some containers under the KVM system
5) set up contradictory routing in the KVM system and the container so
that the host and the container see different things for the same address
6) try to mount a CIFS share from both contexts so you can both force it
to work and force it to fail.

For a long drawn out test reproduction sequence, see:

  http://landley.livejournal.com/47024.html
  http://landley.livejournal.com/47205.html
  http://landley.livejournal.com/47476.html

Signed-off-by: Rob Landley <rlandley@parallels.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-24 04:28:51 +00:00
Jesper Juhl 3f391c79b0 CIFS: Remove pointless variable assignment in cifs_dfs_do_automount()
In fs/cifs/cifs_dfs_ref.c::cifs_dfs_do_automount() we have this code:

	...
	mnt = ERR_PTR(-EINVAL);
	if (IS_ERR(tlink)) {
		mnt = ERR_CAST(tlink);
		goto free_full_path;
	}
	ses = tlink_tcon(tlink)->ses;

	rc = get_dfs_path(xid, ses, full_path + 1, cifs_sb->local_nls,
		&num_referrals, &referrals,
		cifs_sb->mnt_cifs_flags & CIFS_MOUNT_MAP_SPECIAL_CHR);

	cifs_put_tlink(tlink);

	mnt = ERR_PTR(-ENOENT);
	...

The assignment of 'mnt = ERR_PTR(-EINVAL);' is completely pointless. If we
take the 'if (IS_ERR(tlink))' branch we'll set 'mnt' again and we'll also
do so if we do not take the branch. There is no way we'll ever use 'mnt'
with the assigned 'ERR_PTR(-EINVAL)' value, so we may as well just remove
the pointless assignment.

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-24 03:32:01 +00:00
Randy Dunlap ff5fdb6149 fs: fix new dcache.c kernel-doc warnings
Fix new fs/dcache.c kernel-doc warnings:

  Warning(fs/dcache.c:184): No description found for parameter 'dentry'
  Warning(fs/dcache.c:296): No description found for parameter 'parent'
  Warning(fs/dcache.c:1985): No description found for parameter 'dparent'
  Warning(fs/dcache.c:1985): Excess function parameter 'parent' description in 'd_validate'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc:	Alexander Viro <viro@zeniv.linux.org.uk>
Cc:	Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-22 20:32:38 -08:00
Ryusuke Konishi 0ca7a5b9ac nilfs2: fix crash after one superblock became unavailable
Fixes the following kernel oops in nilfs_setup_super() which could
arise if one of two super-blocks is unavailable.

> BUG: unable to handle kernel NULL pointer dereference at   (null)
> Pid: 3529, comm: mount.nilfs2 Not tainted 2.6.37 #1 /
> EIP: 0060:[<c03196bc>] EFLAGS: 00010202 CPU: 3
> EIP is at memcpy+0xc/0x1b
> Call Trace:
>  [<f953720e>] ? nilfs_setup_super+0x6c/0xa5 [nilfs2]
>  [<f95369e9>] ? nilfs_get_root_dentry+0x81/0xcb [nilfs2]
>  [<f9537a08>] ? nilfs_mount+0x4f9/0x62c [nilfs2]
>  [<c02745cf>] ? kstrdup+0x36/0x3f
>  [<f953750f>] ? nilfs_mount+0x0/0x62c [nilfs2]
>  [<c0293940>] ? vfs_kern_mount+0x4d/0x12c
>  [<c02a5100>] ? get_fs_type+0x76/0x8f
>  [<c0293a68>] ? do_kern_mount+0x33/0xbf
>  [<c02a784a>] ? do_mount+0x2ed/0x714
>  [<c02a6171>] ? copy_mount_options+0x28/0xfc
>  [<c02a7ce3>] ? sys_mount+0x72/0xaf
>  [<c0473085>] ? syscall_call+0x7/0xb

Reported-by: Wakko Warner <wakko@animx.eu.org>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Tested-by: Wakko Warner <wakko@animx.eu.org>
Cc: stable <stable@kernel.org> [2.6.37, 2.6.36]
LKML-Reference: <20110121024918.GA29598@animx.eu.org>
2011-01-22 15:22:36 +09:00
Linus Torvalds 9093ba53b7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  cifs: fix up CIFSSMBEcho for unaligned access
  cifs: fix unaligned accesses in cifsConvertToUCS
  cifs: clean up unaligned accesses in cifs_unicode.c
  cifs: fix unaligned access in check2ndT2 and coalesce_t2
  cifs: clean up unaligned accesses in validate_t2
  cifs: use get/put_unaligned functions to access ByteCount
  cifs: move time field in cifsInodeInfo
  cifs: TCP_Server_Info diet
  CIFS: Implement cifs_strict_readv (try #4)
  CIFS: Implement cifs_file_strict_mmap (try #2)
  CIFS: Implement cifs_strict_fsync
  CIFS: Make cifsFileInfo_put work with strict cache mode
2011-01-21 13:44:07 -08:00
Linus Torvalds 4843456c5c Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
  quota: Fix deadlock during path resolution
2011-01-21 07:33:37 -08:00
Jeff Layton 99d86c8f1b cifs: fix up CIFSSMBEcho for unaligned access
Make sure that CIFSSMBEcho can handle unaligned fields. Also fix a minor
bug that causes this warning:

fs/cifs/cifssmb.c: In function 'CIFSSMBEcho':
fs/cifs/cifssmb.c:740: warning: large integer implicitly truncated to unsigned type

...WordCount is u8, not __le16, so no need to convert it.

This patch should apply cleanly on top of the rest of the patchset to
clean up unaligned access.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-21 02:23:27 +00:00
Steve French bf67b9be97 Merge branch 'for-next' 2011-01-21 02:19:30 +00:00
Linus Torvalds 8d99641f6c Merge branch 'akpm'
* akpm:
  kernel/smp.c: consolidate writes in smp_call_function_interrupt()
  kernel/smp.c: fix smp_call_function_many() SMP race
  memcg: correctly order reading PCG_USED and pc->mem_cgroup
  backlight: fix 88pm860x_bl macro collision
  drivers/leds/ledtrig-gpio.c: make output match input, tighten input checking
  MAINTAINERS: update Atmel AT91 entry
  mm: fix truncate_setsize() comment
  memcg: fix rmdir, force_empty with THP
  memcg: fix LRU accounting with THP
  memcg: fix USED bit handling at uncharge in THP
  memcg: modify accounting function for supporting THP better
  fs/direct-io.c: don't try to allocate more than BIO_MAX_PAGES in a bio
  mm: compaction: prevent division-by-zero during user-requested compaction
  mm/vmscan.c: remove duplicate include of compaction.h
  memblock: fix memblock_is_region_memory()
  thp: keep highpte mapped until it is no longer needed
  kconfig: rename CONFIG_EMBEDDED to CONFIG_EXPERT
2011-01-20 17:02:14 -08:00
David Dillow 20d9600cb4 fs/direct-io.c: don't try to allocate more than BIO_MAX_PAGES in a bio
When using devices that support max_segments > BIO_MAX_PAGES (256), direct
IO tries to allocate a bio with more pages than allowed, which leads to an
oops in dio_bio_alloc().  Clamp the request to the supported maximum, and
change dio_bio_alloc() to reflect that bio_alloc() will always return a
bio when called with __GFP_WAIT and a valid number of vectors.

[akpm@linux-foundation.org: remove redundant BUG_ON()]
Signed-off-by: David Dillow <dillowda@ornl.gov>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-20 17:02:05 -08:00
David Rientjes 6a108a14fa kconfig: rename CONFIG_EMBEDDED to CONFIG_EXPERT
The meaning of CONFIG_EMBEDDED has long since been obsoleted; the option
is used to configure any non-standard kernel with a much larger scope than
only small devices.

This patch renames the option to CONFIG_EXPERT in init/Kconfig and fixes
references to the option throughout the kernel.  A new CONFIG_EMBEDDED
option is added that automatically selects CONFIG_EXPERT when enabled and
can be used in the future to isolate options that should only be
considered for embedded systems (RISC architectures, SLOB, etc).

Calling the option "EXPERT" more accurately represents its intention: only
expert users who understand the impact of the configuration changes they
are making should enable it.

Reviewed-by: Ingo Molnar <mingo@elte.hu>
Acked-by: David Woodhouse <david.woodhouse@intel.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Greg KH <gregkh@suse.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Robin Holt <holt@sgi.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-20 17:02:05 -08:00
Linus Torvalds 5cdec1fca2 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  cifs: mangle existing header for SMB_COM_NT_CANCEL
  cifs: remove code for setting timeouts on requests
  [CIFS] cifs: reconnect unresponsive servers
  cifs: set up recurring workqueue job to do SMB echo requests
  cifs: add ability to send an echo request
  cifs: add cifs_call_async
  cifs: allow for different handling of received response
  cifs: clean up sync_mid_result
  cifs: don't reconnect server when we don't get a response
  cifs: wait indefinitely for responses
  cifs: Use mask of ACEs for SID Everyone to calculate all three permissions user, group, and other
  cifs: Fix regression during share-level security mounts (Repost)
  [CIFS] Update cifs version number
  cifs: move mid result processing into common function
  cifs: move locked sections out of DeleteMidQEntry and AllocMidQEntry
  cifs: clean up accesses to midCount
  cifs: make wait_for_free_request take a TCP_Server_Info pointer
  cifs: no need to mark smb_ses_list as cifs_demultiplex_thread is exiting
  cifs: don't fail writepages on -EAGAIN errors
  CIFS: Fix oplock break handling (try #2)
2011-01-20 16:36:38 -08:00
Linus Torvalds b06ae1ead2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes:
  GFS2: Fix error path in gfs2_lookup_by_inum()
  GFS2: remove iopen glocks from cache on failed deletes
2011-01-20 16:29:04 -08:00
Linus Torvalds 28e58ee8ce Fix broken "pipe: use event aware wakeups" optimization
Commit e462c448fd ("pipe: use event aware wakeups") optimized the pipe
event wakeup calls to avoid wakeups if the events do not match the
requested set.

However, the optimization was buggy, in that it didn't actually use the
correct sets for the events: when we make room for more data to be
written, the pipe poll() routine will return both the POLLOUT _and_
POLLWRNORM bits.  Similarly for read.

And most critically, when a pipe is released, that will potentially
result in POLLHUP|POLLERR (depending on whether it was the last reader
or writer), not just the regular POLLIN|POLLOUT.

This bug showed itself as a hung gnome-screensaver-dialog process, stuck
forever (or at least until it was poked by a signal or by being traced)
in a poll() system call.

Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-20 16:21:59 -08:00
Jeff Layton 84cdf74e80 cifs: fix unaligned accesses in cifsConvertToUCS
Move cifsConvertToUCS to cifs_unicode.c where all of the other unicode
related functions live. Have it store mapped characters in 'temp' and
then use put_unaligned_le16 to copy it to the target buffer. Also fix
the comments to match kernel coding style.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-20 21:48:00 +00:00
Jeff Layton ba2dbf30df cifs: clean up unaligned accesses in cifs_unicode.c
Make sure we use get/put_unaligned routines when accessing wide
character strings.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-20 21:47:57 +00:00
Jeff Layton 26ec254869 cifs: fix unaligned access in check2ndT2 and coalesce_t2
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-20 21:47:54 +00:00
Jeff Layton 12df83c9b9 cifs: clean up unaligned accesses in validate_t2
...and clean up function to reduce indentation.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-20 21:46:33 +00:00
Jeff Layton 690c522fa5 cifs: use get/put_unaligned functions to access ByteCount
It's possible that when we access the ByteCount that the alignment
will be off. Most CPUs deal with that transparently, but there's
usually some performance impact. Some CPUs raise an exception on
unaligned accesses.

Fix this by accessing the byte count using the get_unaligned and
put_unaligned inlined functions. While we're at it, fix the types
of some of the variables that end up getting returns from these
functions.

Acked-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-20 21:46:29 +00:00
Jeff Layton aae62fdb6b cifs: move time field in cifsInodeInfo
...and remove length qualifiers from bools.

Before:

	/* size: 1176, cachelines: 19, members: 13 */
	/* sum members: 1165, holes: 2, sum holes: 11 */
	/* bit holes: 1, sum bit holes: 4 bits */
	/* last cacheline: 24 bytes */

After:

	/* size: 1168, cachelines: 19, members: 13 */
	/* last cacheline: 16 bytes */

...savings of 8 bytes per inode.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-20 21:46:26 +00:00
Jeff Layton c3dccf4817 cifs: TCP_Server_Info diet
Remove fields that are completely unused, and rearrange struct
according to recommendations by "pahole".

Before:

	/* size: 1112, cachelines: 18, members: 49 */
	/* sum members: 1086, holes: 8, sum holes: 26 */
	/* bit holes: 1, sum bit holes: 7 bits */
	/* last cacheline: 24 bytes */

After:

	/* size: 1072, cachelines: 17, members: 42 */
	/* sum members: 1065, holes: 3, sum holes: 7 */
	/* last cacheline: 48 bytes */

...savings of 40 bytes per struct on x86_64. 21 bytes by field removal,
and 19 by reorganizing to eliminate holes.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-20 21:46:22 +00:00
Pavel Shilovsky a70307eeeb CIFS: Implement cifs_strict_readv (try #4)
Read from the cache if we have at least Level II oplock - otherwise
read from the server. Add cifs_user_readv to let the client read into
iovec buffers.

Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-20 21:42:29 +00:00
Pavel Shilovsky 7a6a19b17a CIFS: Implement cifs_file_strict_mmap (try #2)
Invalidate inode mapping if we don't have at least Level II oplock.

Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-20 21:42:25 +00:00
Pavel Shilovsky 8be7e6ba14 CIFS: Implement cifs_strict_fsync
Invalidate inode mapping if we don't have at least Level II oplock in
cifs_strict_fsync. Also remove filemap_write_and_wait call from cifs_fsync
because it is previously called from vfs_fsync_range. Add file operations'
structures for strict cache mode.

Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-20 21:42:21 +00:00
Pavel Shilovsky 4f8ba8a0c0 CIFS: Make cifsFileInfo_put work with strict cache mode
On strict cache mode when we close the last file handle of the inode we
should set invalid_mapping flag on this inode to prevent data coherency
problem when we open it again but it has been modified on the server.

Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-20 21:42:17 +00:00
Jeff Layton 76dcc26f1d cifs: mangle existing header for SMB_COM_NT_CANCEL
The NT_CANCEL command looks just like the original command, except for a
few small differences. The send_nt_cancel function however currently takes
a tcon, which we don't have in SendReceive and SendReceive2.

Instead of "respinning" the entire header for an NT_CANCEL, just mangle
the existing header by replacing just the fields we need. This means we
don't need a tcon and allows us to call it from other places.

Reviewed-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-20 18:08:36 +00:00
Jeff Layton 7749981ec3 cifs: remove code for setting timeouts on requests
Since we don't time out individual requests anymore, remove the code
that we used to use for setting timeouts on different requests.

Reviewed-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-20 18:07:55 +00:00
Steve French fda3594362 [CIFS] cifs: reconnect unresponsive servers
If the server isn't responding to echoes, we don't want to leave tasks
hung waiting for it to reply. At that point, we'll want to reconnect
so that soft mounts can return an error to userspace quickly.

If the client hasn't received a reply after a specified number of echo
intervals, assume that the transport is down and attempt to reconnect
the socket.

The number of echo_intervals to wait before attempting to reconnect is
tunable via a module parameter. Setting it to 0, means that the client
will never attempt to reconnect. The default is 5.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
2011-01-20 18:06:34 +00:00
Jeff Layton c74093b694 cifs: set up recurring workqueue job to do SMB echo requests
Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-20 17:48:10 +00:00
Jeff Layton 766fdbb57f cifs: add ability to send an echo request
Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-20 17:46:44 +00:00
Jeff Layton a6827c184e cifs: add cifs_call_async
Add a function that will send a request, and set up the mid for an
async reply.

Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-20 17:46:04 +00:00
Jeff Layton 2b84a36c55 cifs: allow for different handling of received response
In order to incorporate async requests, we need to allow for a more
general way to do things on receive, rather than just waking up a
process.

Turn the task pointer in the mid_q_entry into a callback function and a
generic data pointer. When a response comes in, or the socket is
reconnected, cifsd can call the callback function in order to wake up
the process.

The default is to just wake up the current process which should mean no
change in behavior for existing code.

Also, clean up the locking in cifs_reconnect. There doesn't seem to be
any need to hold both the srv_mutex and GlobalMid_Lock when walking the
list of mids.

Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-20 17:43:59 +00:00
Jeff Layton 74dd92a881 cifs: clean up sync_mid_result
Make it use a switch statement based on the value of the midStatus. If
the resp_buf is set, then MID_RESPONSE_RECEIVED is too.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-20 17:41:02 +00:00
Jeff Layton dad255b182 cifs: don't reconnect server when we don't get a response
We only want to force a reconnect to the server under very limited and
specific circumstances. Now that we have processes waiting indefinitely
for responses, we shouldn't reach this point unless a reconnect is
already in process. Thus, there's no reason to re-mark the server for
reconnect here.

Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de>
Reviewed-by:  Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-20 17:08:50 +00:00
Jeff Layton 0ade640e9c cifs: wait indefinitely for responses
The client should not be timing out on individual SMB requests. Too much
of the state between client and server is tied to the state of the
socket. If we time out requests and issue spurious disconnects then that
comprimises data integrity.

Instead of doing this complicated dance where we try to decide how long
to wait for a response for particular requests, have the client instead
wait indefinitely for a response. Also, use a TASK_KILLABLE sleep here
so that fatal signals will break out of this waiting.

Later patches will add support for detecting dead peers and forcing
reconnects based on that.

Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de>
Reviewed-by:  Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-20 17:07:49 +00:00
Shirish Pargaonkar 2fbc2f1729 cifs: Use mask of ACEs for SID Everyone to calculate all three permissions user, group, and other
If a DACL has entries for ACEs for SID Everyone and Authenticated Users,
factor in mask in respective entries during calculation of permissions
for all three, user, group, and other.

http://technet.microsoft.com/en-us/library/bb463216.aspx

Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-19 21:25:58 +00:00
Fred Isaman 0da2a4ac33 NFS: fix handling of malloc failure during nfs_flush_multi()
Cleanup of the allocated list entries should not call
put_nfs_open_context() on each entry, as the context will
always be NULL, causing an oops.

Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-01-19 15:37:49 -05:00
Shirish Pargaonkar 540b2e3777 cifs: Fix regression during share-level security mounts (Repost)
NTLM response length was changed to 16 bytes instead of 24 bytes
that are sent in Tree Connection Request during share-level security
share mounts.  Revert it back to 24 bytes.

Reported-and-Tested-by: Grzegorz Ozanski <grzegorz.ozanski@intel.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Acked-by: Suresh Jayaraman <sjayaraman@suse.de>
Cc: stable@kernel.org
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-19 18:11:18 +00:00
Steve French 1cd3508d5e [CIFS] Update cifs version number
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-19 17:53:44 +00:00
Jeff Layton 053d503445 cifs: move mid result processing into common function
Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de>
Reviewed-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-19 17:52:45 +00:00
Jeff Layton ddc8cf8fc7 cifs: move locked sections out of DeleteMidQEntry and AllocMidQEntry
In later patches, we're going to need to have finer-grained control
over the addition and removal of these structs from the pending_mid_q
and we'll need to be able to call the destructor while holding the
spinlock. Move the locked sections out of both routines and into
the callers. Fix up current callers of DeleteMidQEntry to call a new
routine that dequeues the entry and then destroys it.

Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de>
Reviewed-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-19 17:52:42 +00:00
Jeff Layton 8097531a5c cifs: clean up accesses to midCount
It's an atomic_t and the code accesses the "counter" field in it directly
instead of using atomic_read(). It also is sometimes accessed under a
spinlock and sometimes not. Move it out of the spinlock since we don't need
belt-and-suspenders for something that's just informational.

Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de>
Reviewed-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-19 17:52:38 +00:00
Jeff Layton c5797a945c cifs: make wait_for_free_request take a TCP_Server_Info pointer
The cifsSesInfo pointer is only used to get at the server.

Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de>
Reviewed-by:  Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-19 17:52:30 +00:00
Jeff Layton 9d78315b03 cifs: no need to mark smb_ses_list as cifs_demultiplex_thread is exiting
The TCP_Server_Info is refcounted and every SMB session holds a
reference to it. Thus, smb_ses_list is always going to be empty when
cifsd is coming down. This is dead code.

Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de>
Reviewed-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-19 17:52:30 +00:00
Jeff Layton 941b853d77 cifs: don't fail writepages on -EAGAIN errors
If CIFSSMBWrite2 returns -EAGAIN, then the error should be considered
temporary. CIFS should retry the write instead of setting an error on
the mapping and returning.

For WB_SYNC_ALL, just retry the write immediately. In the WB_SYNC_NONE
case, call redirty_page_for_writeback on all of the pages that didn't
get written out and then move on.

Also, fix up the handling of a short write with a successful return
code. MS-CIFS says that 0 bytes_written means ENOSPC or EFBIG. It
doesn't mention what a short, but non-zero write means, so for now
treat it as we would an -EAGAIN return.

Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de>
Reviewed-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-19 17:52:29 +00:00
Pavel Shilovsky 12fed00de9 CIFS: Fix oplock break handling (try #2)
When we get oplock break notification we should set the appropriate
value of OplockLevel field in oplock break acknowledge according to
the oplock level held by the client in this time. As we only can have
level II oplock or no oplock in the case of oplock break, we should be
aware only about clientCanCacheRead field in cifsInodeInfo structure.

Also fix bug connected with wrong interpretation of OplockLevel field
during oplock break notification processing.

Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-19 17:52:29 +00:00
Sage Weil 7e57b81c76 ceph: avoid immediate cap check after import
The NODELAY flag avoids the heuristics that delay cap (issued/wanted)
release.  There's no reason for that after we import a cap, and it kills
whatever benefit we get from those delays.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-19 09:23:26 -08:00
Sage Weil 088b3f5e9e ceph: fix flushing of caps vs cap import
If we are mid-flush and a cap is migrated to another node, we need to
resend the cap flush message to the new MDS, and do so with the original
flush_seq to avoid leaking across a sync boundary.  Previously we didn't
redo the flush (we only flushed newly dirty data), which would cause a
later sync to hang forever.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-19 09:23:25 -08:00
Sage Weil 24be0c4810 ceph: fix erroneous cap flush to non-auth mds
The int flushing is global and not clear on each iteration of the loop,
which can cause a second flush of caps to any MDSs with ids greater than
the auth.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-19 09:23:24 -08:00
Sage Weil 50aac4fec5 ceph: fix cap_wanted_delay_{min,max} mount option initialization
These were initialized to 0 instead of the default, fallout from the RBD
refactor in 3d14c5d2b6.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-19 09:23:22 -08:00
Steven Whitehouse 24d9765fc1 GFS2: Fix error path in gfs2_lookup_by_inum()
In the (impossible, except if there is fs corruption) error path
in gfs2_lookup_by_inum() if the call to gfs2_inode_refresh()
fails, it was leaving the function by calling iput() rather
than iget_failed(). This would cause future lookups of the same
inode to block forever.

This patch fixes the problem by moving the call to gfs2_inode_refresh()
into gfs2_inode_lookup() where iget_failed() is part of the error path
already. Also this cleans up some unreachable code and makes
gfs2_set_iop() static.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2011-01-18 14:49:08 +00:00
Benjamin Marzinski 23c3010808 GFS2: remove iopen glocks from cache on failed deletes
When a file gets deleted on GFS2, if a node can't get an exclusive lock on the
file's iopen glock, it punts on actually freeing up the space, because another
node is using the file.  When it does this, it needs to drop the iopen glock
from its cache so that the other node can get an exclusive lock on it. Now,
gfs2_delete_inode() sets GL_NOCACHE before dropping the shared lock on the
iopen glock in preparation for grabbing it in the exclusive state.  Since the
node needs the glock in the exclusive state, dropping the shared lock from the
cache doesn't slow down the case where no other nodes are using the file.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2011-01-18 14:28:29 +00:00
Al Viro b89b12b462 autofs4: clean ->d_release() and autofs4_free_ino() up
The latter is called only when both ino and dentry are about to
be freed, so cleaning ->d_fsdata and ->dentry is pointless.

Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-18 01:21:29 -05:00
Al Viro 26e6c91067 autofs4: split autofs4_init_ino()
split init_ino into new_ino and clean_ino; the former is
what used to be init_ino(NULL, sbi), the latter is for cases
where we passed non-NULL ino.  Lose unused arguments.

Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-18 01:21:28 -05:00
Al Viro 5a37db302e autofs4: mkdir and symlink always get a dentry that had passed lookup
... so ->d_fsdata will have been set up before we get there

Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-18 01:21:28 -05:00
Al Viro 726a5e0688 autofs4: autofs4_get_inode() doesn't need autofs_info * argument anymore
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-18 01:21:28 -05:00
Al Viro 0bf71d4d00 autofs4: kill ->size in autofs_info
It's used only to pass the length of symlink body to
autofs4_get_inode() in autofs4_dir_symlink().  We can
bloody well set inode->i_size in autofs4_dir_symlink()
directly and be done with that.

Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-18 01:21:28 -05:00
Al Viro 09f12c03fa autofs4: pass mode to autofs4_get_inode() explicitly
In all cases we'd set inf->mode to know value just before
passing it to autofs4_get_inode().  That kills the need
to store it in autofs_info and pass it to autofs_init_ino()

Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-18 01:21:27 -05:00
Al Viro 14a2f00bde autofs4: autofs4_mkroot() is not different from autofs4_init_ino()
Kill it.  Mind you, it's been an obfuscated call of autofs4_init_ino()
ever since 2.3.99pre6-4...

Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-18 01:21:27 -05:00
Al Viro 292c5ee802 autofs4: keep symlink body in inode->i_private
gets rid of all ->free()/->u.symlink machinery in autofs; we simply
keep symlink bodies in inode->i_private and free them in ->evict_inode().

Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-18 01:21:27 -05:00
Ian Kent c0bcc9d552 autofs4 - fix debug print in autofs4_lookup()
oz_mode isn't defined any more, use autofs4_oz_mode(sbi) instead.

Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-18 01:21:27 -05:00
Ian Kent 8931221411 vfs - fix dentry ref count in do_lookup()
There is a ref count problem in fs/namei.c:do_lookup().

When walking in ref-walk mode, if follow_managed() returns a fail we
need to drop dentry and possibly vfsmount.  Clean up properly,
as we do in the other caller of follow_managed().

Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-18 01:21:26 -05:00
Ian Kent c14cc63a63 autofs4 - fix get_next_positive_dentry()
The initialization condition in fs/autofs4/expire.c:get_next_positive_dentry()
appears to be incorrect. If prev == NULL I believe that root should be
returned.

Further down, at the current dentry check for it being simple_positive()
it looks like the d_lock for dentry p should be dropped instead of dentry
ret, otherwise when p is assinged to ret we end up with no lock on p and
a lost lock on ret, which leads to a deadlock.

Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-18 01:21:26 -05:00
Linus Torvalds eee2a817df Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (25 commits)
  Btrfs: forced readonly mounts on errors
  btrfs: Require CAP_SYS_ADMIN for filesystem rebalance
  Btrfs: don't warn if we get ENOSPC in btrfs_block_rsv_check
  btrfs: Fix memory leak in btrfs_read_fs_root_no_radix()
  btrfs: check NULL or not
  btrfs: Don't pass NULL ptr to func that may deref it.
  btrfs: mount failure return value fix
  btrfs: Mem leak in btrfs_get_acl()
  btrfs: fix wrong free space information of btrfs
  btrfs: make the chunk allocator utilize the devices better
  btrfs: restructure find_free_dev_extent()
  btrfs: fix wrong calculation of stripe size
  btrfs: try to reclaim some space when chunk allocation fails
  btrfs: fix wrong data space statistics
  fs/btrfs: Fix build of ctree
  Btrfs: fix off by one while setting block groups readonly
  Btrfs: Add BTRFS_IOC_SUBVOL_GETFLAGS/SETFLAGS ioctls
  Btrfs: Add readonly snapshots support
  Btrfs: Refactor btrfs_ioctl_snap_create()
  btrfs: Extract duplicate decompress code
  ...
2011-01-17 14:43:43 -08:00
Linus Torvalds 9e8a462a01 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6:
  ecryptfs: remove unnecessary decrypt when extending a file
  ecryptfs: Fix ecryptfs_printk() size_t warnings
  fs/ecryptfs: Add printf format/argument verification and fix fallout
  ecryptfs: fixed testing of file descriptor flags
  ecryptfs: test lower_file pointer when lower_file_mutex is locked
  ecryptfs: missing initialization of the superblock 'magic' field
  ecryptfs: moved ECRYPTFS_SUPER_MAGIC definition to linux/magic.h
  ecryptfs: fix truncation error in ecryptfs_read_update_atime
2011-01-17 12:39:57 -08:00
Geert Uytterhoeven cf78859f52 xfs: Do not name variables "panic"
On platforms that call panic() inside their BUG() macro (m68k/sun3, and
all platforms that don't set HAVE_ARCH_BUG), compilation fails with:

| fs/xfs/support/debug.c: In function ‘xfs_cmn_err’:
| fs/xfs/support/debug.c:92: error: called object ‘panic’ is not a function

as the local variable "panic" conflicts with the "panic()" function.
Rename the local variable to resolve this.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-17 12:39:07 -08:00
liubo acce952b02 Btrfs: forced readonly mounts on errors
This patch comes from "Forced readonly mounts on errors" ideas.

As we know, this is the first step in being more fault tolerant of disk
corruptions instead of just using BUG() statements.

The major content:
- add a framework for generating errors that should result in filesystems
  going readonly.
- keep FS state in disk super block.
- make sure that all of resource will be freed and released at umount time.
- make sure that fter FS is forced readonly on error, there will be no more
  disk change before FS is corrected. For this, we should stop write operation.

After this patch is applied, the conversion from BUG() to such a framework can
happen incrementally.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-17 15:13:08 -05:00
Linus Torvalds 1a47f7a84e Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  cifs: add cruid= mount option
  cifs: cFYI the entire error code in map_smb_to_linux_error
2011-01-17 11:17:51 -08:00
Linus Torvalds ab2020f2f1 Merge git://git.infradead.org/mtd-2.6
* git://git.infradead.org/mtd-2.6: (59 commits)
  mtd: mtdpart: disallow reading OOB past the end of the partition
  mtd: pxa3xx_nand: NULL dereference in pxa3xx_nand_probe
  UBI: use mtd->writebufsize to set minimal I/O unit size
  mtd: initialize writebufsize in the MTD object of a partition
  mtd: onenand: add mtd->writebufsize initialization
  mtd: nand: add mtd->writebufsize initialization
  mtd: cfi: add writebufsize initialization
  mtd: add writebufsize field to mtd_info struct
  mtd: OneNAND: OMAP2/3: prevent regulator sleeping while OneNAND is in use
  mtd: OneNAND: add enable / disable methods to onenand_chip
  mtd: m25p80: Fix JEDEC ID for AT26DF321
  mtd: txx9ndfmc: limit transfer bytes to 512 (ECC provides 6 bytes max)
  mtd: cfi_cmdset_0002: add support for Samsung K8D3x16UxC NOR chips
  mtd: cfi_cmdset_0002: add support for Samsung K8D6x16UxM NOR chips
  mtd: nand: ams-delta: drop omap_read/write, use ioremap
  mtd: m25p80: add debugging trace in sst_write
  mtd: nand: ams-delta: select for built-in by default
  mtd: OneNAND: lighten scary initial bad block messages
  mtd: OneNAND: OMAP2/3: add support for command line partitioning
  mtd: nand: rearrange ONFI revision checking, add ONFI 2.3
  ...

Fix up trivial conflict in drivers/mtd/Kconfig as per DavidW.
2011-01-17 11:15:30 -08:00
Frank Swiderski 24562486be ecryptfs: remove unnecessary decrypt when extending a file
Removes an unecessary page decrypt from ecryptfs_begin_write when the
page is beyond the current file size. Previously, the call to
ecryptfs_decrypt_page would result in a read of 0 bytes, but still
attempt to decrypt an entire page. This patch detects that case and
merely zeros the page before marking it up-to-date.

Signed-off-by: Frank Swiderski <fes@chromium.org>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2011-01-17 13:01:25 -06:00
Tyler Hicks f24b38874e ecryptfs: Fix ecryptfs_printk() size_t warnings
Commit cb55d21f6fa19d8c6c2680d90317ce88c1f57269 revealed a number of
missing 'z' length modifiers in calls to ecryptfs_printk() when
printing variables of type size_t. This patch fixes those compiler
warnings.

Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2011-01-17 13:01:24 -06:00
Joe Perches 888d57bbc9 fs/ecryptfs: Add printf format/argument verification and fix fallout
Add __attribute__((format... to __ecryptfs_printk
Make formats and arguments match.
Add casts to (unsigned long long) for %llu.

Signed-off-by: Joe Perches <joe@perches.com>
[tyhicks: 80 columns cleanup and fixed typo]
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2011-01-17 13:01:23 -06:00
Roberto Sassu 0abe116947 ecryptfs: fixed testing of file descriptor flags
This patch replaces the check (lower_file->f_flags & O_RDONLY) with
((lower_file & O_ACCMODE) == O_RDONLY).

Signed-off-by: Roberto Sassu <roberto.sassu@polito.it>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2011-01-17 11:24:43 -06:00
Roberto Sassu 27992890b0 ecryptfs: test lower_file pointer when lower_file_mutex is locked
This patch prevents the lower_file pointer in the 'ecryptfs_inode_info'
structure to be checked when the mutex 'lower_file_mutex' is not locked.

Signed-off-by: Roberto Sassu <roberto.sassu@polito.it>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2011-01-17 11:24:42 -06:00
Roberto Sassu 070baa5128 ecryptfs: missing initialization of the superblock 'magic' field
This patch initializes the 'magic' field of ecryptfs filesystems to
ECRYPTFS_SUPER_MAGIC.

Signed-off-by: Roberto Sassu <roberto.sassu@polito.it>
[tyhicks: merge with 66cb76666d]
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2011-01-17 11:23:01 -06:00
Roberto Sassu 2a8652f4e0 ecryptfs: moved ECRYPTFS_SUPER_MAGIC definition to linux/magic.h
The definition of ECRYPTFS_SUPER_MAGIC has been moved to the include
file 'linux/magic.h' to become available to other kernel subsystems.

Signed-off-by: Roberto Sassu <roberto.sassu@polito.it>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2011-01-17 10:44:31 -06:00
Edward Shishkin 38a708d775 ecryptfs: fix truncation error in ecryptfs_read_update_atime
This is similar to the bug found in direct-io not so long ago.

Fix up truncation (ssize_t->int).  This only matters with >2G
reads/writes, which the kernel doesn't permit.

Signed-off-by: Edward Shishkin <edward.shishkin@gmail.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Eric Sandeen <esandeen@redhat.com>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2011-01-17 10:44:30 -06:00
Namhyung Kim ecf5632dd1 fs: fix address space warnings in ioctl_fiemap()
The fi_extents_start field of struct fiemap_extent_info is a
user pointer but was not marked as __user. This makes sparse
emit following warnings:

  CHECK   fs/ioctl.c
fs/ioctl.c:114:26: warning: incorrect type in argument 1 (different address spaces)
fs/ioctl.c:114:26:    expected void [noderef] <asn:1>*dst
fs/ioctl.c:114:26:    got struct fiemap_extent *[assigned] dest
fs/ioctl.c:202:14: warning: incorrect type in argument 1 (different address spaces)
fs/ioctl.c:202:14:    expected void const volatile [noderef] <asn:1>*<noident>
fs/ioctl.c:202:14:    got struct fiemap_extent *[assigned] fi_extents_start
fs/ioctl.c:212:27: warning: incorrect type in argument 1 (different address spaces)
fs/ioctl.c:212:27:    expected void [noderef] <asn:1>*dst
fs/ioctl.c:212:27:    got char *<noident>

Also add 'ufiemap' variable to eliminate unnecessary casts.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-17 08:21:42 -05:00
Namhyung Kim 27eaa1c90c aio: check return value of create_workqueue()
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-17 05:12:44 -05:00
Dr. David Alan Gilbert 274052ef0b hpfs_setattr error case avoids unlock_kernel
This fixed a case that 'sparse' spotted where hpfs_setattr has an error return
that didn't go through it's path that unlocks.

This is against git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
version 6313e3c217.

Build tested only, I don't have an hpfs file system to test.

Dave

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-17 05:11:37 -05:00
Namhyung Kim e0bb6bda43 compat: copy missing fields in compat_statfs64 to user
f_flags and f_spare fields were not copied to userspace when
compat_sys_[f]statfs64 called.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-17 04:54:38 -05:00
Namhyung Kim 974d879e80 compat: update comment of compat statfs syscalls
The commit 7ed1ee6118 ("Take statfs variants to fs/statfs.c")
separates out statfs syscalls from fs/open.c. Thus the comment
should be changed also.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Jiri Kosina <trivial@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-17 04:54:38 -05:00
Namhyung Kim 6a5640f102 compat: remove unnecessary assignment in compat_rw_copy_check_uvector()
*@ret_pointer is initialized to @fast_pointer thus the assignment is
redundant.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-17 04:54:38 -05:00
Randy Dunlap 16ebe911eb fs: FS_POSIX_ACL does not depend on BLOCK
- Fix a kconfig unmet dependency warning.
- Remove the comment that identifies which filesystems use POSIX ACL
  utility routines.
- Move the FS_POSIX_ACL symbol outside of the BLOCK symbol if/endif block
  because its functions do not depend on BLOCK and some of the filesystems
  that use it do not depend on BLOCK.

warning: (GENERIC_ACL && JFFS2_FS_POSIX_ACL && NFSD_V4 && NFS_ACL_SUPPORT && 9P_FS_POSIX_ACL) selects FS_POSIX_ACL which has unmet direct dependencies (BLOCK)

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-17 03:30:37 -05:00
Steven Rostedt 3bc0ba4305 fs: Remove unlikely() from fget_light()
There's an unlikely() in fget_light() that assumes the file ref count
will be 1. Running the annotate branch profiler on a desktop that is
performing daily tasks (running firefox, evolution, xchat and is also part
of a distcc farm), it shows that the ref count is not 1 that often.

 correct incorrect      %    Function                  File              Line
 ------- ---------      -    --------                  ----              ----
1035099358 6209599193  85    fget_light              file_table.c         315

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-17 03:26:27 -05:00
Christoph Hellwig 2fe17c1075 fallocate should be a file operation
Currently all filesystems except XFS implement fallocate asynchronously,
while XFS forced a commit.  Both of these are suboptimal - in case of O_SYNC
I/O we really want our allocation on disk, especially for the !KEEP_SIZE
case where we actually grow the file with user-visible zeroes.  On the
other hand always commiting the transaction is a bad idea for fast-path
uses of fallocate like for example in recent Samba versions.   Given
that block allocation is a data plane operation anyway change it from
an inode operation to a file operation so that we have the file structure
available that lets us check for O_SYNC.

This also includes moving the code around for a few of the filesystems,
and remove the already unnedded S_ISDIR checks given that we only wire
up fallocate for regular files.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-17 02:25:31 -05:00
Christoph Hellwig 64c23e8687 make the feature checks in ->fallocate future proof
Instead of various home grown checks that might need updates for new
flags just check for any bit outside the mask of the features supported
by the filesystem.  This makes the check future proof for any newly
added flag.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-17 02:25:30 -05:00
Al Viro b1e75df45a tidy up around finish_automount()
do_add_mount() and mnt_clear_expiry() are not needed outside of
namespace.c anymore, now that namei has finish_automount() to
use.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-17 01:47:59 -05:00
Al Viro 15f9a3f3e1 don't drop newmnt on error in do_add_mount()
That gets rid of the kludge in finish_automount() - we need
to keep refcount on the vfsmount as-is until we evict it from
expiry list.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-17 01:41:58 -05:00
Al Viro 19a167af7c Take the completion of automount into new helper
... and shift it from namei.c to namespace.c

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-17 01:35:23 -05:00
Linus Torvalds 8a335bc631 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/scsi-post-merge-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/scsi-post-merge-2.6:
  ocfs2: Make OCFS2_FS depend on CONFIGFS_FS
  dlm: Make DLM depend on CONFIGFS_FS
  net: Make NETCONSOLE_DYNAMIC depend on CONFIGFS_FS
  configfs: change depends -> select SYSFS
  [SCSI] sd,sr: kill compat SDEV_MEDIA_CHANGE event
  [SCSI] sd: implement sd_check_events()
2011-01-16 15:06:43 -08:00
Al Viro 7e3d0eb0b0 VFS: Fix UP compile error in fs/namespace.c
mnt_longterm is there only on SMP

Reported-and-tested-by: Joachim Eastwood <manabian@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-16 14:59:45 -08:00
Nicholas Bellinger 7b1fff7e4f ocfs2: Make OCFS2_FS depend on CONFIGFS_FS
This patch fixes the following kconfig error after changing
CONFIGFS_FS -> select SYSFS:

fs/sysfs/Kconfig:1:error: recursive dependency detected!
fs/sysfs/Kconfig:1:	symbol SYSFS is selected by CONFIGFS_FS
fs/configfs/Kconfig:1:	symbol CONFIGFS_FS is selected by OCFS2_FS
fs/ocfs2/Kconfig:1:	symbol OCFS2_FS depends on SYSFS

Signed-off-by: Nicholas A. Bellinger <nab@linux-iscsi.org>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: James Bottomley <James.Bottomley@suse.de>
2011-01-16 21:22:40 +00:00
Nicholas Bellinger 86c747d2a4 dlm: Make DLM depend on CONFIGFS_FS
This patch fixes the following kconfig error after changing
CONFIGFS_FS -> select SYSFS:

fs/sysfs/Kconfig:1:error: recursive dependency detected!
fs/sysfs/Kconfig:1:	symbol SYSFS is selected by CONFIGFS_FS
fs/configfs/Kconfig:1:	symbol CONFIGFS_FS is selected by DLM
fs/dlm/Kconfig:1:	symbol DLM depends on SYSFS

Signed-off-by: Nicholas A. Bellinger <nab@linux-iscsi.org>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: James Bottomley <James.Bottomley@suse.de>
2011-01-16 21:22:37 +00:00
Nicholas Bellinger e205117285 configfs: change depends -> select SYSFS
This patch changes configfs to select SYSFS to fix the following:

warning: (TARGET_CORE && GFS2_FS) selects CONFIGFS_FS which has unmet direct dependencies (SYSFS)

Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Nicholas A. Bellinger <nab@linux-iscsi.org>
Acked-by: Joel Becker <jlbec@evilplan.org>
2011-01-16 21:22:29 +00:00
Stefan Schmidt f8b18087fd fs/btrfs: Fix build of ctree
Fix the build failure in some configurations:

     CC [M]  fs/btrfs/ctree.o
  In file included from fs/btrfs/ctree.c:21:0:
  fs/btrfs/ctree.h:1003:17: error: field 'super_kobj' has incomplete type
  fs/btrfs/ctree.h:1074:17: error: field 'root_kobj' has incomplete type
  make[2]: *** [fs/btrfs/ctree.o] Error 1
  make[1]: *** [fs/btrfs] Error 2
  make: *** [fs] Error 2

caused by commit 57cc7215b7 ("headers: kobject.h redux")

We need to include kobject.h here.

Reported-by: Jeff Garzik <jeff@garzik.org>
Fix-suggested-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-16 12:59:42 -08:00
Linus Torvalds 5520ebd308 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus
* git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus:
  Squashfs: simplify CONFIG_SQUASHFS_LZO handling
  Squashfs: move squashfs_i() definition from squashfs.h
  Squashfs: get rid of default n in Kconfig
  Squashfs: add missing check in zlib_wrapper
  Squashfs: remove unnecessary variable in zlib_wrapper
  Squashfs: Add XZ compression configuration option
  Squashfs: add XZ compression support
2011-01-16 12:11:13 -08:00
Linus Torvalds f8206b925f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (23 commits)
  sanitize vfsmount refcounting changes
  fix old umount_tree() breakage
  autofs4: Merge the remaining dentry ops tables
  Unexport do_add_mount() and add in follow_automount(), not ->d_automount()
  Allow d_manage() to be used in RCU-walk mode
  Remove a further kludge from __do_follow_link()
  autofs4: Bump version
  autofs4: Add v4 pseudo direct mount support
  autofs4: Fix wait validation
  autofs4: Clean up autofs4_free_ino()
  autofs4: Clean up dentry operations
  autofs4: Clean up inode operations
  autofs4: Remove unused code
  autofs4: Add d_manage() dentry operation
  autofs4: Add d_automount() dentry operation
  Remove the automount through follow_link() kludge code from pathwalk
  CIFS: Use d_automount() rather than abusing follow_link()
  NFS: Use d_automount() rather than abusing follow_link()
  AFS: Use d_automount() rather than abusing follow_link()
  Add an AT_NO_AUTOMOUNT flag to suppress terminal automount
  ...
2011-01-16 11:31:50 -08:00
Al Viro f03c65993b sanitize vfsmount refcounting changes
Instead of splitting refcount between (per-cpu) mnt_count
and (SMP-only) mnt_longrefs, make all references contribute
to mnt_count again and keep track of how many are longterm
ones.

Accounting rules for longterm count:
	* 1 for each fs_struct.root.mnt
	* 1 for each fs_struct.pwd.mnt
	* 1 for having non-NULL ->mnt_ns
	* decrement to 0 happens only under vfsmount lock exclusive

That allows nice common case for mntput() - since we can't drop the
final reference until after mnt_longterm has reached 0 due to the rules
above, mntput() can grab vfsmount lock shared and check mnt_longterm.
If it turns out to be non-zero (which is the common case), we know
that this is not the final mntput() and can just blindly decrement
percpu mnt_count.  Otherwise we grab vfsmount lock exclusive and
do usual decrement-and-check of percpu mnt_count.

For fs_struct.c we have mnt_make_longterm() and mnt_make_shortterm();
namespace.c uses the latter in places where we don't already hold
vfsmount lock exclusive and opencodes a few remaining spots where
we need to manipulate mnt_longterm.

Note that we mostly revert the code outside of fs/namespace.c back
to what we used to have; in particular, normal code doesn't need
to care about two kinds of references, etc.  And we get to keep
the optimization Nick's variant had bought us...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-16 13:47:07 -05:00
Al Viro 7b8a53fd81 fix old umount_tree() breakage
Expiry-related code calls umount_tree() several times with
the same list to collect vfsmounts to.  Which is fine, except
that umount_tree() implicitly assumed that the list would
be empty on each call - it moves the victims over there and
then iterates through the list kicking them out.  It's *almost*
idempotent, so everything nearly worked.  However, mnt->ghosts
handling (and thus expirability checks) had been broken - that
part was not idempotent...

The fix is trivial - use local temporary list, splice it to
the the collector list when we are through.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-16 13:47:01 -05:00
Ben Hutchings 6f88a4403d btrfs: Require CAP_SYS_ADMIN for filesystem rebalance
Filesystem rebalancing (BTRFS_IOC_BALANCE) affects the entire
filesystem and may run uninterruptibly for a long time.  This does not
seem to be something that an unprivileged user should be able to do.

Reported-by: Aron Xu <happyaron.xu@gmail.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-16 11:30:20 -05:00
Josef Bacik f690efb1aa Btrfs: don't warn if we get ENOSPC in btrfs_block_rsv_check
If we run low on space we could get a bunch of warnings out of
btrfs_block_rsv_check, but this is mostly just called via the transaction code
to see if we need to end the transaction, it expects to see failures, so let's
not WARN and freak everybody out for no reason.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-16 11:30:20 -05:00
Tsutomu Itoh 5e540f7715 btrfs: Fix memory leak in btrfs_read_fs_root_no_radix()
In btrfs_read_fs_root_no_radix(), 'root' is not freed if
btrfs_search_slot() returns error.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-16 11:30:20 -05:00
Tsutomu Itoh 91ca338d77 btrfs: check NULL or not
Should check if functions returns NULL or not.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-16 11:30:20 -05:00
Jesper Juhl ff175d57f0 btrfs: Don't pass NULL ptr to func that may deref it.
Hi,

In fs/btrfs/inode.c::fixup_tree_root_location() we have this code:

...
 		if (!path) {
 			err = -ENOMEM;
 			goto out;
 		}
...
 	out:
 		btrfs_free_path(path);
 		return err;

btrfs_free_path() passes its argument on to other functions and some of
them end up dereferencing the pointer.
In the code above that pointer is clearly NULL, so btrfs_free_path() will
eventually cause a NULL dereference.

There are many ways to cut this cake (fix the bug). The one I chose was to
make btrfs_free_path() deal gracefully with NULL pointers. If you
disagree, feel free to come up with an alternative patch.

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-16 11:30:20 -05:00
Dave Young 20b450773d btrfs: mount failure return value fix
I happened to pass swap partition as root partition in cmdline,
then kernel panic and tell me about "Cannot open root device".
It is not correct, in fact it is a fs type mismatch instead of 'no device'.

Eventually I found btrfs mounting failed with -EIO, it should be -EINVAL.
The logic in init/do_mounts.c:
        for (p = fs_names; *p; p += strlen(p)+1) {
                int err = do_mount_root(name, p, flags, root_mount_data);
                switch (err) {
                        case 0:
                                goto out;
                        case -EACCES:
                                flags |= MS_RDONLY;
                                goto retry;
                        case -EINVAL:
                                continue;
                }
		print "Cannot open root device"
		panic
	}
SO fs type after btrfs will have no chance to mount

Here fix the return value as -EINVAL

Signed-off-by: Dave Young <hidave.darkstar@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-16 11:30:19 -05:00
Jesper Juhl 42838bb265 btrfs: Mem leak in btrfs_get_acl()
It seems to me that we leak the memory allocated to 'value' in
btrfs_get_acl() if the call to posix_acl_from_xattr() fails.
Here's a patch that attempts to correct that problem.

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-16 11:30:19 -05:00
Miao Xie 6d07bcec96 btrfs: fix wrong free space information of btrfs
When we store data by raid profile in btrfs with two or more different size
disks, df command shows there is some free space in the filesystem, but the
user can not write any data in fact, df command shows the wrong free space
information of btrfs.

 # mkfs.btrfs -d raid1 /dev/sda9 /dev/sda10
 # btrfs-show
 Label: none  uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64
 	Total devices 2 FS bytes used 28.00KB
 	devid    1 size 5.01GB used 2.03GB path /dev/sda9
 	devid    2 size 10.00GB used 2.01GB path /dev/sda10
 # btrfs device scan /dev/sda9 /dev/sda10
 # mount /dev/sda9 /mnt
 # dd if=/dev/zero of=tmpfile0 bs=4K count=9999999999
   (fill the filesystem)
 # sync
 # df -TH
 Filesystem	Type	Size	Used	Avail	Use%	Mounted on
 /dev/sda9	btrfs	17G	8.6G	5.4G	62%	/mnt
 # btrfs-show
 Label: none  uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64
 	Total devices 2 FS bytes used 3.99GB
 	devid    1 size 5.01GB used 5.01GB path /dev/sda9
 	devid    2 size 10.00GB used 4.99GB path /dev/sda10

It is because btrfs cannot allocate chunks when one of the pairing disks has
no space, the free space on the other disks can not be used for ever, and should
be subtracted from the total space, but btrfs doesn't subtract this space from
the total. It is strange to the user.

This patch fixes it by calcing the free space that can be used to allocate
chunks.

Implementation:
1. get all the devices free space, and align them by stripe length.
2. sort the devices by the free space.
3. check the free space of the devices,
   3.1. if it is not zero, and then check the number of the devices that has
        more free space than this device,
        if the number of the devices is beyond the min stripe number, the free
        space can be used, and add into total free space.
        if the number of the devices is below the min stripe number, we can not
        use the free space, the check ends.
   3.2. if the free space is zero, check the next devices, goto 3.1

This implementation is just likely fake chunk allocation.

After appling this patch, df can show correct space information:
 # df -TH
 Filesystem	Type	Size	Used	Avail	Use%	Mounted on
 /dev/sda9	btrfs	17G	8.6G	0	100%	/mnt

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-16 11:30:19 -05:00
Miao Xie b2117a39fa btrfs: make the chunk allocator utilize the devices better
With this patch, we change the handling method when we can not get enough free
extents with default size.

Implementation:
1. Look up the suitable free extent on each device and keep the search result.
   If not find a suitable free extent, keep the max free extent
2. If we get enough suitable free extents with default size, chunk allocation
   succeeds.
3. If we can not get enough free extents, but the number of the extent with
   default size is >= min_stripes, we just change the mapping information
   (reduce the number of stripes in the extent map), and chunk allocation
   succeeds.
4. If the number of the extent with default size is < min_stripes, sort the
   devices by its max free extent's size descending
5. Use the size of the max free extent on the (num_stripes - 1)th device as the
   stripe size to allocate the device space

By this way, the chunk allocator can allocate chunks as large as possible when
the devices' space is not enough and make full use of the devices.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-16 11:30:19 -05:00
Miao Xie 7bfc837df9 btrfs: restructure find_free_dev_extent()
- make it return the start position and length of the max free space when it can
  not find a suitable free space.
- make it more readability

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-16 11:30:19 -05:00
Miao Xie 1974a3b42d btrfs: fix wrong calculation of stripe size
There are two tiny problem:
- One is When we check the chunk size is greater than the max chunk size or not,
  we should take mirrors into account, but the original code didn't.
- The other is btrfs shouldn't use the size of the residual free space as the
  length of of a dup chunk when doing chunk allocation. It is because the device
  space that a dup chunk needs is twice as large as the chunk size, if we use
  the size of the residual free space as the length of a dup chunk, we can not
  get enough free space. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-16 11:30:19 -05:00
Miao Xie d52a5b5f1f btrfs: try to reclaim some space when chunk allocation fails
We cannot write data into files when when there is tiny space in the filesystem.

Reproduce steps:
 # mkfs.btrfs /dev/sda1
 # mount /dev/sda1 /mnt
 # dd if=/dev/zero of=/mnt/tmpfile0 bs=4K count=1
 # dd if=/dev/zero of=/mnt/tmpfile1 bs=4K count=99999999999999
   (fill the filesystem)
 # umount /mnt
 # mount /dev/sda1 /mnt
 # rm -f /mnt/tmpfile0
 # dd if=/dev/zero of=/mnt/tmpfile0 bs=4K count=1
   (failed with nospec)

But if we do the last step again, we can write data successfully. The reason of
the problem is that btrfs didn't try to commit the current transaction and
reclaim some space when chunk allocation failed.

This patch fixes it by committing the current transaction to reclaim some
space when chunk allocation fails.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-16 11:30:19 -05:00
Miao Xie 299a08b1c3 btrfs: fix wrong data space statistics
Josef has implemented mixed data/metadata chunks, we must add those chunks'
space just like data chunks.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-16 11:30:19 -05:00
Stefan Schmidt f580eb0931 fs/btrfs: Fix build of ctree
CC [M]  fs/btrfs/ctree.o
In file included from fs/btrfs/ctree.c:21:0:
fs/btrfs/ctree.h:1003:17: error: field <91>super_kobj<92> has incomplete type
fs/btrfs/ctree.h:1074:17: error: field <91>root_kobj<92> has incomplete type
make[2]: *** [fs/btrfs/ctree.o] Error 1
make[1]: *** [fs/btrfs] Error 2
make: *** [fs] Error 2

We need to include kobject.h here.

Reported-by: Jeff Garzik <jeff@garzik.org>
Fix-suggested-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-16 11:30:19 -05:00
Chris Mason f892436eb2 Merge branch 'lzo-support' of git://repo.or.cz/linux-btrfs-devel into btrfs-38 2011-01-16 11:25:54 -05:00
Chris Mason 26c79f6ba0 Merge branch 'readonly-snapshots' of git://repo.or.cz/linux-btrfs-devel into btrfs-38 2011-01-16 11:24:45 -05:00
David Howells b650c858c2 autofs4: Merge the remaining dentry ops tables
Merge the remaining autofs4 dentry ops tables.  It doesn't matter if
d_automount and d_manage are present on something that's not mountable or
holdable as these ops are only used if the appropriate flags are set in
dentry->d_flags.

[AV] switch to ->s_d_op, since now _everything_ on autofs4 is using the
same dentry_operations.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-15 20:07:49 -05:00
David Howells ea5b778a8b Unexport do_add_mount() and add in follow_automount(), not ->d_automount()
Unexport do_add_mount() and make ->d_automount() return the vfsmount to be
added rather than calling do_add_mount() itself.  follow_automount() will then
do the addition.

This slightly complicates things as ->d_automount() normally wants to add the
new vfsmount to an expiration list and start an expiration timer.  The problem
with that is that the vfsmount will be deleted if it has a refcount of 1 and
the timer will not repeat if the expiration list is empty.

To this end, we require the vfsmount to be returned from d_automount() with a
refcount of (at least) 2.  One of these refs will be dropped unconditionally.
In addition, follow_automount() must get a 3rd ref around the call to
do_add_mount() lest it eat a ref and return an error, leaving the mount we
have open to being expired as we would otherwise have only 1 ref on it.

d_automount() should also add the the vfsmount to the expiration list (by
calling mnt_set_expiry()) and start the expiration timer before returning, if
this mechanism is to be used.  The vfsmount will be unlinked from the
expiration list by follow_automount() if do_add_mount() fails.

This patch also fixes the call to do_add_mount() for AFS to propagate the mount
flags from the parent vfsmount.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-15 20:07:48 -05:00
David Howells ab90911ff9 Allow d_manage() to be used in RCU-walk mode
Allow d_manage() to be called from pathwalk when it is in RCU-walk mode as well
as when it is in Ref-walk mode.  This permits __follow_mount_rcu() to call
d_manage() directly.  d_manage() needs a parameter to indicate that it is in
RCU-walk mode as it isn't allowed to sleep if in that mode (but should return
-ECHILD instead).

autofs4_d_manage() can then be set to retain RCU-walk mode if the daemon
accesses it and otherwise request dropping back to ref-walk mode.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-15 20:07:47 -05:00
David Howells 87556ef199 Remove a further kludge from __do_follow_link()
Remove a further kludge from __do_follow_link() as it's no longer required with
the automount code.

This reverts the non-helper-function parts of
051d381259, which breaks union mounts.

Reported-by: vaurora@redhat.com
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-15 20:07:46 -05:00
Ian Kent dd89f90d2d autofs4: Add v4 pseudo direct mount support
Version 4 of autofs provides a pseudo direct mount implementation
that relies on directories at the leaves of a directory tree under
an indirect mount to trigger mounts.

This patch adds support for that functionality.

Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-15 20:07:44 -05:00
Ian Kent 9e3fea16ba autofs4: Fix wait validation
It is possible for the check in wait.c:validate_request() to return
an incorrect result if the dentry that was mounted upon has changed
during the callback.

Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-15 20:07:43 -05:00
Ian Kent 6651149371 autofs4: Clean up autofs4_free_ino()
When this function is called the local reference count does't need to
be updated since the dentry is going away and dput definitely must
not be called here.

Also the autofs info struct field inode isn't used so remove it.

Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-15 20:07:42 -05:00
Ian Kent 71e469db24 autofs4: Clean up dentry operations
There are now two distinct dentry operations uses. One for dentrys
that trigger mounts and one for dentrys that do not.

Rationalize the use of these dentry operations and rename them to
reflect their function.

Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-15 20:07:41 -05:00
Ian Kent e61da20a50 autofs4: Clean up inode operations
Since the use of ->follow_link() has been eliminated there is no
need to separate the indirect and direct inode operations.

Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-15 20:07:40 -05:00
Ian Kent 8c13a676d5 autofs4: Remove unused code
Remove code that is not used due to the use of ->d_automount()
and ->d_manage().

Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-15 20:07:39 -05:00
Ian Kent b5b801779d autofs4: Add d_manage() dentry operation
This patch required a previous patch to add the ->d_automount()
dentry operation.

Add a function to use the newly defined ->d_manage() dentry operation
for blocking during mount and expire.

Whether the VFS calls the dentry operations d_automount() and d_manage()
is controled by the DMANAGED_AUTOMOUNT and DMANAGED_TRANSIT flags. autofs
uses the d_automount() operation to callback to user space to request
mount operations and the d_manage() operation to block walks into mounts
that are under construction or destruction.

In order to prevent these functions from being called unnecessarily the
DMANAGED_* flags are cleared for cases which would cause this. In the
common case the DMANAGED_AUTOMOUNT and DMANAGED_TRANSIT flags are both
set for dentrys waiting to be mounted. The DMANAGED_TRANSIT flag is
cleared upon successful mount request completion and set during expire
runs, both during the dentry expire check, and if selected for expire,
is left set until a subsequent successful mount request completes.

The exception to this is the so-called rootless multi-mount which has
no actual mount at its base. In this case the DMANAGED_AUTOMOUNT flag
is cleared upon successful mount request completion as well and set
again after a successful expire.

Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-15 20:07:38 -05:00
Ian Kent 10584211e4 autofs4: Add d_automount() dentry operation
Add a function to use the newly defined ->d_automount() dentry operation
for triggering mounts instead of doing the user space callback in ->lookup()
and ->d_revalidate().

Note, to be useful the subsequent patch to add the ->d_manage() dentry
operation is also needed so the discussion of functionality is deferred to
that patch.

Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-15 20:07:37 -05:00
David Howells db3729153e Remove the automount through follow_link() kludge code from pathwalk
Remove the automount through follow_link() kludge code from pathwalk in favour
of using d_automount().

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-15 20:07:36 -05:00
David Howells 01c64feac4 CIFS: Use d_automount() rather than abusing follow_link()
Make CIFS use the new d_automount() dentry operation rather than abusing
follow_link() on directories.

[NOTE: THIS IS UNTESTED!]

Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Steve French <sfrench@samba.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-15 20:07:35 -05:00
David Howells 36d43a4376 NFS: Use d_automount() rather than abusing follow_link()
Make NFS use the new d_automount() dentry operation rather than abusing
follow_link() on directories.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-15 20:07:34 -05:00
David Howells d18610b0ce AFS: Use d_automount() rather than abusing follow_link()
Make AFS use the new d_automount() dentry operation rather than abusing
follow_link() on directories.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-15 20:07:33 -05:00
David Howells 6f45b65672 Add an AT_NO_AUTOMOUNT flag to suppress terminal automount
Add an AT_NO_AUTOMOUNT flag to suppress terminal automounting of automount
point directories.  This can be used by fstatat() users to permit the
gathering of attributes on an automount point and also prevent
mass-automounting of a directory of automount points by ls.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-15 20:07:33 -05:00
David Howells cc53ce53c8 Add a dentry op to allow processes to be held during pathwalk transit
Add a dentry op (d_manage) to permit a filesystem to hold a process and make it
sleep when it tries to transit away from one of that filesystem's directories
during a pathwalk.  The operation is keyed off a new dentry flag
(DCACHE_MANAGE_TRANSIT).

The filesystem is allowed to be selective about which processes it holds and
which it permits to continue on or prohibits from transiting from each flagged
directory.  This will allow autofs to hold up client processes whilst letting
its userspace daemon through to maintain the directory or the stuff behind it
or mounted upon it.

The ->d_manage() dentry operation:

	int (*d_manage)(struct path *path, bool mounting_here);

takes a pointer to the directory about to be transited away from and a flag
indicating whether the transit is undertaken by do_add_mount() or
do_move_mount() skipping through a pile of filesystems mounted on a mountpoint.

It should return 0 if successful and to let the process continue on its way;
-EISDIR to prohibit the caller from skipping to overmounted filesystems or
automounting, and to use this directory; or some other error code to return to
the user.

->d_manage() is called with namespace_sem writelocked if mounting_here is true
and no other locks held, so it may sleep.  However, if mounting_here is true,
it may not initiate or wait for a mount or unmount upon the parameter
directory, even if the act is actually performed by userspace.

Within fs/namei.c, follow_managed() is extended to check with d_manage() first
on each managed directory, before transiting away from it or attempting to
automount upon it.

follow_down() is renamed follow_down_one() and should only be used where the
filesystem deliberately intends to avoid management steps (e.g. autofs).

A new follow_down() is added that incorporates the loop done by all other
callers of follow_down() (do_add/move_mount(), autofs and NFSD; whilst AFS, NFS
and CIFS do use it, their use is removed by converting them to use
d_automount()).  The new follow_down() calls d_manage() as appropriate.  It
also takes an extra parameter to indicate if it is being called from mount code
(with namespace_sem writelocked) which it passes to d_manage().  follow_down()
ignores automount points so that it can be used to mount on them.

__follow_mount_rcu() is made to abort rcu-walk mode if it hits a directory with
DCACHE_MANAGE_TRANSIT set on the basis that we're probably going to have to
sleep.  It would be possible to enter d_manage() in rcu-walk mode too, and have
that determine whether to abort or not itself.  That would allow the autofs
daemon to continue on in rcu-walk mode.

Note that DCACHE_MANAGE_TRANSIT on a directory should be cleared when it isn't
required as every tranist from that directory will cause d_manage() to be
invoked.  It can always be set again when necessary.

==========================
WHAT THIS MEANS FOR AUTOFS
==========================

Autofs currently uses the lookup() inode op and the d_revalidate() dentry op to
trigger the automounting of indirect mounts, and both of these can be called
with i_mutex held.

autofs knows that the i_mutex will be held by the caller in lookup(), and so
can drop it before invoking the daemon - but this isn't so for d_revalidate(),
since the lock is only held on _some_ of the code paths that call it.  This
means that autofs can't risk dropping i_mutex from its d_revalidate() function
before it calls the daemon.

The bug could manifest itself as, for example, a process that's trying to
validate an automount dentry that gets made to wait because that dentry is
expired and needs cleaning up:

	mkdir         S ffffffff8014e05a     0 32580  24956
	Call Trace:
	 [<ffffffff885371fd>] :autofs4:autofs4_wait+0x674/0x897
	 [<ffffffff80127f7d>] avc_has_perm+0x46/0x58
	 [<ffffffff8009fdcf>] autoremove_wake_function+0x0/0x2e
	 [<ffffffff88537be6>] :autofs4:autofs4_expire_wait+0x41/0x6b
	 [<ffffffff88535cfc>] :autofs4:autofs4_revalidate+0x91/0x149
	 [<ffffffff80036d96>] __lookup_hash+0xa0/0x12f
	 [<ffffffff80057a2f>] lookup_create+0x46/0x80
	 [<ffffffff800e6e31>] sys_mkdirat+0x56/0xe4

versus the automount daemon which wants to remove that dentry, but can't
because the normal process is holding the i_mutex lock:

	automount     D ffffffff8014e05a     0 32581      1              32561
	Call Trace:
	 [<ffffffff80063c3f>] __mutex_lock_slowpath+0x60/0x9b
	 [<ffffffff8000ccf1>] do_path_lookup+0x2ca/0x2f1
	 [<ffffffff80063c89>] .text.lock.mutex+0xf/0x14
	 [<ffffffff800e6d55>] do_rmdir+0x77/0xde
	 [<ffffffff8005d229>] tracesys+0x71/0xe0
	 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

which means that the system is deadlocked.

This patch allows autofs to hold up normal processes whilst the daemon goes
ahead and does things to the dentry tree behind the automouter point without
risking a deadlock as almost no locks are held in d_manage() and none in
d_automount().

Signed-off-by: David Howells <dhowells@redhat.com>
Was-Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-15 20:07:31 -05:00
David Howells 9875cf8064 Add a dentry op to handle automounting rather than abusing follow_link()
Add a dentry op (d_automount) to handle automounting directories rather than
abusing the follow_link() inode operation.  The operation is keyed off a new
dentry flag (DCACHE_NEED_AUTOMOUNT).

This also makes it easier to add an AT_ flag to suppress terminal segment
automount during pathwalk and removes the need for the kludge code in the
pathwalk algorithm to handle directories with follow_link() semantics.

The ->d_automount() dentry operation:

	struct vfsmount *(*d_automount)(struct path *mountpoint);

takes a pointer to the directory to be mounted upon, which is expected to
provide sufficient data to determine what should be mounted.  If successful, it
should return the vfsmount struct it creates (which it should also have added
to the namespace using do_add_mount() or similar).  If there's a collision with
another automount attempt, NULL should be returned.  If the directory specified
by the parameter should be used directly rather than being mounted upon,
-EISDIR should be returned.  In any other case, an error code should be
returned.

The ->d_automount() operation is called with no locks held and may sleep.  At
this point the pathwalk algorithm will be in ref-walk mode.

Within fs/namei.c itself, a new pathwalk subroutine (follow_automount()) is
added to handle mountpoints.  It will return -EREMOTE if the automount flag was
set, but no d_automount() op was supplied, -ELOOP if we've encountered too many
symlinks or mountpoints, -EISDIR if the walk point should be used without
mounting and 0 if successful.  The path will be updated to point to the mounted
filesystem if a successful automount took place.

__follow_mount() is replaced by follow_managed() which is more generic
(especially with the patch that adds ->d_manage()).  This handles transits from
directories during pathwalk, including automounting and skipping over
mountpoints (and holding processes with the next patch).

__follow_mount_rcu() will jump out of RCU-walk mode if it encounters an
automount point with nothing mounted on it.

follow_dotdot*() does not handle automounts as you don't want to trigger them
whilst following "..".

I've also extracted the mount/don't-mount logic from autofs4 and included it
here.  It makes the mount go ahead anyway if someone calls open() or creat(),
tries to traverse the directory, tries to chdir/chroot/etc. into the directory,
or sticks a '/' on the end of the pathname.  If they do a stat(), however,
they'll only trigger the automount if they didn't also say O_NOFOLLOW.

I've also added an inode flag (S_AUTOMOUNT) so that filesystems can mark their
inodes as automount points.  This flag is automatically propagated to the
dentry as DCACHE_NEED_AUTOMOUNT by __d_instantiate().  This saves NFS and could
save AFS a private flag bit apiece, but is not strictly necessary.  It would be
preferable to do the propagation in d_set_d_op(), but that doesn't normally
have access to the inode.

[AV: fixed breakage in case if __follow_mount_rcu() fails and nameidata_drop_rcu()
succeeds in RCU case of do_lookup(); we need to fall through to non-RCU case after
that, rather than just returning with ungrabbed *path]

Signed-off-by: David Howells <dhowells@redhat.com>
Was-Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-15 20:05:03 -05:00
Al Viro 1a8edf40e7 do_lookup() fix
do_lookup() has a path leading from LOOKUP_RCU case to non-RCU
crossing of mountpoints, which breaks things badly.  If we
hit need_revalidate: and do nothing in there, we need to come
back into LOOKUP_RCU half of things, not to done: in non-RCU
one.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-15 20:03:39 -05:00
Linus Torvalds 7cb3920a65 Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: prevent NMI timeouts in cmn_err
  xfs: Add log level to assertion printk
  xfs: fix an assignment within an ASSERT()
  xfs: fix error handling for synchronous writes
  xfs: add FITRIM support
  xfs: ensure log covering transactions are synchronous
  xfs: serialise unaligned direct IOs
  xfs: factor common write setup code
  xfs: split buffered IO write path from xfs_file_aio_write
  xfs: split direct IO write path from xfs_file_aio_write
  xfs: introduce xfs_rw_lock() helpers for locking the inode
  xfs: factor post-write newsize updates
  xfs: factor common post-write isize handling code
  xfs: ensure sync write errors are returned
2011-01-14 15:24:17 -08:00
Linus Torvalds 6ab8219649 Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  block: restore multiple bd_link_disk_holder() support
  block cfq: compensate preempted queue even if it has no slice assigned
  block cfq: make queue preempt work for queues from different workload
2011-01-14 13:32:07 -08:00
Linus Torvalds 6f7f7caab2 Turn d_set_d_op() BUG_ON() into WARN_ON_ONCE()
It's indicative of a real problem, and it actually triggers with
autofs4, but the BUG_ON() is excessive.  The autofs4 case is being fixed
(to only set d_op in the ->lookup method) but not merged yet.  In the
meantime this gets the code limping along.

Reported-by: Alex Elder <aelder@sgi.com>
Cc: Ian Kent <raven@themaw.net>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-14 13:26:18 -08:00
Linus Torvalds 18bce371ae Merge branch 'for-2.6.38' of git://linux-nfs.org/~bfields/linux
* 'for-2.6.38' of git://linux-nfs.org/~bfields/linux: (62 commits)
  nfsd4: fix callback restarting
  nfsd: break lease on unlink, link, and rename
  nfsd4: break lease on nfsd setattr
  nfsd: don't support msnfs export option
  nfsd4: initialize cb_per_client
  nfsd4: allow restarting callbacks
  nfsd4: simplify nfsd4_cb_prepare
  nfsd4: give out delegations more quickly in 4.1 case
  nfsd4: add helper function to run callbacks
  nfsd4: make sure sequence flags are set after destroy_session
  nfsd4: re-probe callback on connection loss
  nfsd4: set sequence flag when backchannel is down
  nfsd4: keep finer-grained callback status
  rpc: allow xprt_class->setup to return a preexisting xprt
  rpc: keep backchannel xprt as long as server connection
  rpc: move sk_bc_xprt to svc_xprt
  nfsd4: allow backchannel recovery
  nfsd4: support BIND_CONN_TO_SESSION
  nfsd4: modify session list under cl_lock
  Documentation: fl_mylease no longer exists
  ...

Fix up conflicts in fs/nfsd/vfs.c with the vfs-scale work.  The
vfs-scale work touched some msnfs cases, and this merge removes support
for that entirely, so the conflict was trivial to resolve.
2011-01-14 13:17:26 -08:00
J. Bruce Fields a8f2800b4f nfsd4: fix callback restarting
Ensure a new callback is added to the client's list of callbacks at most
once.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-01-14 14:51:31 -05:00
Jeff Layton bd76331955 cifs: add cruid= mount option
In commit 3e4b3e1f we separated the "uid" mount option such that it
no longer determined the owner of the credential cache by default. When
we did this, we added a new option to cifs.upcall (--legacy-uid) to
try to make it so that it would behave the same was as it did before.

This ignored a rather important point -- the kernel has no way to know
what options are being passed to cifs.upcall, so it doesn't know what
uid it should use to determine whether to match an existing krb5 session.

The simplest solution is to simply add a new "cruid=" mount option that
only governs the uid owner of the credential cache for the mount.

Unfortunately, this means that the --legacy-uid option in cifs.upcall was
ill-considered and is now useless, but I don't see a better way to deal
with this.

A patch for the mount.cifs manpage will follow once this patch has been
accepted.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-14 18:51:11 +00:00
Jeff Layton 56c24305d1 cifs: cFYI the entire error code in map_smb_to_linux_error
We currently only print the DOS error part.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-14 18:51:11 +00:00
Tejun Heo 49731baa41 block: restore multiple bd_link_disk_holder() support
Commit e09b457b (block: simplify holder symlink handling) incorrectly
assumed that there is only one link at maximum.  dm may use multiple
links and expects block layer to track reference count for each link,
which is different from and unrelated to the exclusive device holder
identified by @holder when the device is opened.

Remove the single holder assumption and automatic removal of the link
and revive the per-link reference count tracking.  The code
essentially behaves the same as before commit e09b457b sans the
unnecessary kobject reference count dancing.

While at it, note that this facility should not be used by anyone else
than the current ones.  Sysfs symlinks shouldn't be abused like this
and the whole thing doesn't belong in the block layer at all.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Milan Broz <mbroz@redhat.com>
Cc: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Neil Brown <neilb@suse.de>
Cc: linux-raid@vger.kernel.org
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-01-14 18:44:22 +01:00
Tejun Heo 0ad53eeefc afs: add afs_wq and use it instead of the system workqueue
flush_scheduled_work() is going away.  afs needs to make sure all the
works it has queued have finished before being unloaded and there can
be arbitrary number of pending works.  Add afs_wq and use it as the
flush domain instead of the system workqueue.

Also, convert cancel_delayed_work() + flush_scheduled_work() to
cancel_delayed_work_sync() in afs_mntpt_kill_timer().

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: linux-afs@lists.infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-14 09:25:11 -08:00
Akshat Aranya ba28b93a52 FS-Cache: Fix operation handling
fscache_submit_exclusive_op() adds an operation to the pending list if
other operations are pending.  Fix the check for pending ops as n_ops
must be greater than 0 at the point it is checked as it is incremented
immediately before under lock.

Signed-off-by: Akshat Aranya <aranya@nec-labs.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-14 09:23:36 -08:00
Linus Torvalds acda4721ae Merge branch 'vfs-scale-working' of git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin
* 'vfs-scale-working' of git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin:
  kernel: fix hlist_bl again
  cgroups: Fix a lockdep warning at cgroup removal
  fs: namei fix ->put_link on wrong inode in do_filp_open
2011-01-14 09:08:29 -08:00
Nick Piggin 7b9337aaf9 fs: namei fix ->put_link on wrong inode in do_filp_open
J. R. Okajima noticed that ->put_link is being attempted on the
wrong inode, and suggested the way to fix it. I changed it a bit
according to Al's suggestion to keep an explicit link path around.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-14 08:42:43 +00:00
Linus Torvalds db9effe99a Merge branch 'vfs-scale-working' of git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin
* 'vfs-scale-working' of git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin:
  fs: fix do_last error case when need_reval_dot
  nfs: add missing rcu-walk check
  fs: hlist UP debug fixup
  fs: fix dropping of rcu-walk from force_reval_path
  fs: force_reval_path drop rcu-walk before d_invalidate
  fs: small rcu-walk documentation fixes

Fixed up trivial conflicts in Documentation/filesystems/porting
2011-01-13 20:14:13 -08:00
J. R. Okajima f20877d94a fs: fix do_last error case when need_reval_dot
When open(2) without O_DIRECTORY opens an existing dir, it should return
EISDIR. In do_last(), the variable 'error' is initialized EISDIR, but it
is changed by d_revalidate() which returns any positive to represent
'the target dir is valid.'

Should we keep and return the initialized 'error' in this case.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-14 03:56:04 +00:00
Nick Piggin 657e94b673 nfs: add missing rcu-walk check
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-14 02:48:39 +00:00
Nick Piggin 90dbb77ba4 fs: fix dropping of rcu-walk from force_reval_path
As J. R. Okajima noted, force_reval_path passes in the same dentry to
d_revalidate as the one in the nameidata structure (other callers pass in a
child), so the locking breaks. This can oops with a chrooted nfs mount, for
example. Similarly there can be other problems with revalidating a dentry
which is already in nameidata of the path walk.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-14 02:36:19 +00:00
Nick Piggin bb20c18db6 fs: force_reval_path drop rcu-walk before d_invalidate
d_revalidate can return in rcu-walk mode even when it returns 0.  We can't just
call any old dcache function on rcu-walk dentry (the dentry is unstable, so
even through d_lock can safely be taken, the result may no longer be what we
expect -- careful re-checks would be required). So just drop rcu in this case.

(I missed this conversion when switching to the rcu-walk convention that Linus
suggested)

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-14 02:35:53 +00:00
J. Bruce Fields 4795bb37ef nfsd: break lease on unlink, link, and rename
Any change to any of the links pointing to an entry should also break
delegations.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-01-13 21:04:09 -05:00
J. Bruce Fields 6a76bebefe nfsd4: break lease on nfsd setattr
Leases (delegations) should really be broken on any metadata change, not
just on size change.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-01-13 21:04:08 -05:00
J. Bruce Fields 9ce137eee4 nfsd: don't support msnfs export option
We've long had these pointless #ifdef MSNFS's sprinkled throughout the
code--pointless because MSNFS is always defined (and we give no config
option to make that easy to change).  So we could just remove the
ifdef's and compile the resulting code unconditionally.

But as long as we're there: why not just rip out this code entirely?
The only purpose is to implement the "msnfs" export option which turns
on Windows-like behavior in some cases, and:

	- the export option isn't documented anywhere;
	- the userland utilities (which would need to be able to parse
	  "msnfs" in an export file) don't support it;
	- I don't know how to maintain this, as I don't know what the
	  proper behavior is; and
	- google shows no evidence that anyone has ever used this.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-01-13 21:04:07 -05:00
J. Bruce Fields 9ee1ba5402 nfsd4: initialize cb_per_client
Otherwise a callback that is aborted before it runs will result in a
list_del on an uninitialized list head.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-01-13 21:04:06 -05:00
Stefan Hajnoczi cb9ef8d5e3 fs/fs-writeback.c: fix sync_inodes_sb() return value kernel-doc
The sync_inodes_sb() function does not have a return value.  Remove the
outdated documentation comment.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:48 -08:00