Commit graph

22540 commits

Author SHA1 Message Date
Dave Chinner
50e86686df xfs: always push the AIL to the target
The recent conversion of the xfsaild functionality to a work queue
introduced a hard-to-hit log space grant hang. One of the problems
discovered is a target mismatch between the item pushing loop and
the target itself.

The push trigger checks for the target increasing (i.e. new target >
current) while the push loop only pushes items that have a LSN <
current. As a result, we can get the situation where the push target
is X, the items at the tail of the AIL have LSN X and they don't get
pushed. The push work then completes thinking it is done, and cannot
be restarted until the push target increases to >= X + 1. If the
push target then never increases (because the tail is not moving),
then we never run the push work again and we stall.

Fix it by making sure log items with a LSN that matches the target
exactly are pushed during the loop.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>

(cherry picked from commit cb64026b6e)
2011-05-09 18:35:03 -05:00
Dave Chinner
9e7004e741 xfs: exit AIL push work correctly when AIL is empty
The recent conversion of the xfsaild functionality to a work queue
introduced a hard-to-hit log space grant hang. The main cause is a
regression where a work exit path fails to clear the PUSHING state
and recheck the target correctly.

Make both exit paths do the same PUSHING bit clearing and target
checking when the "no more work to be done" condition is hit.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>

(cherry picked from commit ea35a20021)
2011-05-09 18:35:03 -05:00
Dave Chinner
228d62dd3f xfs: ensure reclaim cursor is reset correctly at end of AG
On a 32 bit highmem PowerPC machine, the XFS inode cache was growing
without bound and exhausting low memory causing the OOM killer to be
triggered. After some effort, the problem was reproduced on a 32 bit
x86 highmem machine.

The problem is that the per-ag inode reclaim index cursor was not
getting reset to the start of the AG if the radix tree tag lookup
found no more reclaimable inodes. Hence every further reclaim
attempt started at the same index beyond where any reclaimable
inodes lay, and no further background reclaim ever occurred from the
AG.

Without background inode reclaim the VM driven cache shrinker
simply cannot keep up with cache growth, and OOM is the result.

While the change that exposed the problem was the conversion of the
inode reclaim to use work queues for background reclaim, it was not
the cause of the bug. The bug was introduced when the cursor code
was added, just waiting for some weird configuration to strike....

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-By: Christian Kujau <lists@nerdbynature.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>

(cherry picked from commit b223221956)
2011-05-09 18:35:03 -05:00
Mikulas Patocka
a09a79f668 Don't lock guardpage if the stack is growing up
Linux kernel excludes guard page when performing mlock on a VMA with
down-growing stack. However, some architectures have up-growing stack
and locking the guard page should be excluded in this case too.

This patch fixes lvm2 on PA-RISC (and possibly other architectures with
up-growing stack). lvm2 calculates number of used pages when locking and
when unlocking and reports an internal error if the numbers mismatch.

[ Patch changed fairly extensively to also fix /proc/<pid>/maps for the
  grows-up case, and to move things around a bit to clean it all up and
  share the infrstructure with the /proc bits.

  Tested on ia64 that has both grow-up and grow-down segments  - Linus ]

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Tested-by: Tony Luck <tony.luck@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09 16:22:07 -07:00
Linus Torvalds
7f4238a0ef Merge branch 'hpfs'
* hpfs:
  HPFS: Remove unused variable
  HPFS: Move declaration up, so that there are no out-of-scope pointers
  HPFS: Fix some unaligned accesses
  HPFS: Fix endianity. Make hpfs work on big-endian machines
  HPFS: Implement fsync for hpfs
  HPFS: Fix a bug that filesystem was not marked dirty when remounting it
  HPFS: Restrict uid and gid to 16-bit values
  HPFS: When marking or clearing the dirty bit, sync the filesystem
  HPFS: Use types with defined width
  HPFS: Remove mark_inode_dirty
  HPFS: Remove CR/LF conversion option
  HPFS: Remove remaining locks
  HPFS: Introduce a global mutex and lock it on every callback from VFS.
  HPFS: Make HPFS compile on preempt and SMP
2011-05-09 09:07:55 -07:00
Mikulas Patocka
88f4e9e870 HPFS: Remove unused variable
Remove unused variable

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09 09:04:24 -07:00
Mikulas Patocka
c351481744 HPFS: Move declaration up, so that there are no out-of-scope pointers
Move declaration up, so that there are no out-of-scope pointers

Reported-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09 09:04:24 -07:00
Mikulas Patocka
d0969d1949 HPFS: Fix some unaligned accesses
Fix some unaligned accesses

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09 09:04:24 -07:00
Mikulas Patocka
0b69760be6 HPFS: Fix endianity. Make hpfs work on big-endian machines
Fix endianity. Make hpfs work on big-endian machines.

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09 09:04:24 -07:00
Mikulas Patocka
bc8728ee56 HPFS: Implement fsync for hpfs
Implement fsync for hpfs.

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09 09:04:24 -07:00
Mikulas Patocka
dab4c82a6e HPFS: Fix a bug that filesystem was not marked dirty when remounting it
Fix a bug that filesystem was not marked dirty when remounting it

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09 09:04:24 -07:00
Mikulas Patocka
48f10e8ce7 HPFS: Restrict uid and gid to 16-bit values
Restrict uid and gid to 16-bit values.

HPFS stores only 2 bytes in the EAs.

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09 09:04:24 -07:00
Mikulas Patocka
f73976818a HPFS: When marking or clearing the dirty bit, sync the filesystem
When marking or clearing the dirty bit, sync the filesystem

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09 09:04:24 -07:00
Mikulas Patocka
d878597c2c HPFS: Use types with defined width
Use types with defined width

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09 09:04:23 -07:00
Mikulas Patocka
e5d6a7dd5e HPFS: Remove mark_inode_dirty
Remove mark_inode_dirty

HPFS doesn't use kernel's dirty inode indicator anyway because
writing an inode requires directory's mutex.

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09 09:04:23 -07:00
Mikulas Patocka
0fe105aa29 HPFS: Remove CR/LF conversion option
Remove CR/LF conversion option

It is unused anyway. It was used on 2.2 kernels or so.

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09 09:04:23 -07:00
Mikulas Patocka
7d23ce36e3 HPFS: Remove remaining locks
Remove remaining locks

Because of a new global per-fs lock, no other locks are needed

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09 09:04:23 -07:00
Mikulas Patocka
7dd29d8d86 HPFS: Introduce a global mutex and lock it on every callback from VFS.
Introduce a global mutex and lock it on every callback from VFS.

Performance doesn't matter, reviewing the whole code for locking correctness
would be too complicated, so simply lock it all.

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09 09:04:23 -07:00
Mikulas Patocka
637b424bf8 HPFS: Make HPFS compile on preempt and SMP
Make HPFS compile on preempt and SMP

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-09 09:04:23 -07:00
Linus Torvalds
c2bf807eb3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  cifs: handle errors from coalesce_t2
  cifs: refactor mid finding loop in cifs_demultiplex_thread
  cifs: sanitize length checking in coalesce_t2 (try #3)
  cifs: check for bytes_remaining going to zero in CIFS_SessSetup
  cifs: change bleft in decode_unicode_ssetup back to signed type
2011-05-06 15:32:41 -07:00
Timo Warns
fa039d5f6b Validate size of EFI GUID partition entries.
Otherwise corrupted EFI partition tables can cause total confusion.

Signed-off-by: Timo Warns <warns@pre-sense.de>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-06 07:46:37 -07:00
David Sterba
182608c829 btrfs: remove old unused commented out code
Remove code which has been #if0-ed out for a very long time and does not
seem to be related to current codebase anymore.

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-06 12:34:10 +02:00
David Sterba
f2a97a9dbd btrfs: remove all unused functions
Remove static and global declarations and/or definitions. Reduces size
of btrfs.ko by ~3.4kB.

  text    data     bss     dec     hex filename
402081    7464     200  409745   64091 btrfs.ko.base
398620    7144     200  405964   631cc btrfs.ko.remove-all

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-06 12:34:03 +02:00
Linus Torvalds
bd355f8ae6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: do not call __mark_dirty_inode under i_lock
  libceph: fix ceph_osdc_alloc_request error checks
  ceph: handle ceph_osdc_new_request failure in ceph_writepages_start
  libceph: fix ceph_msg_new error path
  ceph: use ihold() when i_lock is held
2011-05-04 14:22:20 -07:00
Sage Weil
fca65b4ad7 ceph: do not call __mark_dirty_inode under i_lock
The __mark_dirty_inode helper now takes i_lock as of 250df6ed.  Fix the
one ceph callers that held i_lock (__ceph_mark_dirty_caps) to return the
flags value so that the callers can do it outside of i_lock.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-04 12:56:45 -07:00
David Sterba
621496f4fd btrfs: remove unused function prototypes
function prototypes without a body

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-04 14:01:26 +02:00
Linus Torvalds
cce2c56e76 logfs: initialize superblock entries earlier
In particular, s_freeing_list needs to be initialized early, since it is
used on some of the error paths when mounts fail.  The mapping inode,
for example, would be initialized and then free'd on an error path
before s_freeing_list was initialized, but the inode drop operation
needs the s_freeing_list to be set up.

Normally you'd never see this, because not only is logfs fairly rare,
but a successful mount will never have any issues.

Reported-by: werner <w.landgraf@ru.ru>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-03 16:10:25 -07:00
Henry C Chang
8c71897be2 ceph: handle ceph_osdc_new_request failure in ceph_writepages_start
We should unlock the page and return -ENOMEM if ceph_osdc_new_request
failed.

Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-03 09:28:12 -07:00
Sage Weil
3772d26d87 ceph: use ihold() when i_lock is held
See 0444d76ae6.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-03 09:28:08 -07:00
Jeff Layton
16541ba11c cifs: handle errors from coalesce_t2
cifs_demultiplex_thread calls coalesce_t2 to try and merge follow-on t2
responses into the original mid buffer. coalesce_t2 however can return
errors, but the caller doesn't handle that situation properly. Fix the
thread to treat such a case as it would a malformed packet. Mark the
mid as being malformed and issue the callback.

Cc: stable@kernel.org
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-03 03:42:15 +00:00
Jeff Layton
146f9f65bd cifs: refactor mid finding loop in cifs_demultiplex_thread
...to reduce the extreme indentation. This should introduce no
behavioral changes.

Cc: stable@kernel.org
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-03 03:42:07 +00:00
Linus Torvalds
adadfe48df Merge branch 'for-linus' of git://git.infradead.org/ubifs-2.6
* 'for-linus' of git://git.infradead.org/ubifs-2.6:
  UBIFS: seek journal heads to the latest bud in replay
  UBIFS: do not free write-buffers when in R/O mode
2011-05-02 12:17:29 -07:00
Artem Bityutskiy
52c6e6f990 UBIFS: seek journal heads to the latest bud in replay
This is the second fix of the following symptom:

UBIFS error (pid 34456): could not find an empty LEB

which sometimes happens after power cuts when we mount the file-system - UBIFS
refuses it with the above error message which comes from the
'ubifs_rcvry_gc_commit()' function. I can reproduce this using the integck test
with the UBIFS power cut emulation enabled.

Analysis of the problem.

Currently UBIFS replay seeks the journal heads to the last _replayed_ bud.
But the buds are replayed out-of-order, so the replay basically seeks journal
heads to the "random" bud belonging to this head, and not to the _last_ one.

The result of this is that the GC head may be seeked to a full LEB with no free
space, or very little free space. And 'ubifs_rcvry_gc_commit()' tries to find a
fully or mostly dirty LEB to match the current GC head (because we need to
garbage-collect that dirty LEB at one go, because we do not have @c->gc_lnum).
So 'ubifs_find_dirty_leb()' fails and we fall back to finding an empty LEB and
also fail. As a result - recovery fails and mounting fails.

This patch teaches the replay to initialize the GC heads exactly to the latest
buds, i.e. the buds which have the largest sequence number in corresponding
log reference nodes.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Cc: stable@kernel.org
2011-05-02 19:23:48 +03:00
Artem Bityutskiy
b50b9f4085 UBIFS: do not free write-buffers when in R/O mode
Currently UBIFS has a small optimization - it frees write-buffers when it is
re-mounted from R/W mode to R/O mode. Of course, when it is mounted R/O, it
does not allocate write-buffers as well.

This optimization is nice but it leads to subtle problems and complications
in recovery, which I can reproduce using the integck test. The symptoms are
that after a power cut the file-system cannot be mounted if we first mount
it R/O, and then re-mount R/W - 'ubifs_rcvry_gc_commit()' prints:

UBIFS error (pid 34456): could not find an empty LEB

Analysis of the  problem.

When mounting R/W, the reply process sets journal heads to buds [1], but
when mounting R/O - it does not do this, because the write-buffers are not
allocated. So 'ubifs_rcvry_gc_commit()' works completely differently for the
same file-system but for the following 2 cases:

1. mounting R/W after a power cut and recover
2. mounting R/O after a power cut, re-mounting R/W and run deferred recovery

In the former case, we have journal heads seeked to the a bud, in the latter
case, they are non-seeked (wbuf->lnum == -1). So in the latter case we do not
try to recover the GC LEB by garbage-collecting to the GC head, but we just
try to find an empty LEB, and there may be no empty LEBs, so we just fail.
On the other hand, in the former case (mount R/W), we are able to make a GC LEB
(@c->gc_lnum) by garbage-collecting.

Thus, let's remove this small nice optimization and always allocate
write-buffers. This should not make too big difference - we have only 3
of them, each of max. write unit size, which is usually 2KiB. So this is
about 6KiB of RAM for the typical case, and only when mounted R/O.

[1]: Note, currently the replay process is setting (seeking) the journal heads
to _some_ buds, not necessarily to the buds which had been the journal heads
before the power cut happened. This will be fixed separately.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Cc: stable@kernel.org
2011-05-02 19:23:36 +03:00
David Sterba
8cc33e5c19 btrfs: Document a mutex lock/unlock sequence 2011-05-02 15:29:25 +02:00
David Sterba
b3b4aa74b5 btrfs: drop unused parameter from btrfs_release_path
parameter tree root it's not used since commit
5f39d397df ("Btrfs: Create extent_buffer
interface for large blocksizes")

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02 13:57:22 +02:00
David Sterba
ba14419264 btrfs: drop gfp parameter from alloc_extent_buffer
pass GFP_NOFS directly to kmem_cache_alloc

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02 13:57:22 +02:00
David Sterba
f09d1f60e6 btrfs: drop gfp parameter from find_extent_buffer
pass GFP_NOFS directly to kmem_cache_alloc

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02 13:57:22 +02:00
David Sterba
172ddd60a6 btrfs: drop gfp parameter from alloc_extent_map
pass GFP_NOFS directly to kmem_cache_alloc

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02 13:57:21 +02:00
David Sterba
a8067e022a btrfs: drop unused parameter from extent_map_tree_init
the GFP flags are not stored anywhere and all allocations are done via
alloc_extent_map(GFP_NOFS).

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02 13:57:21 +02:00
David Sterba
f993c883ad btrfs: drop unused argument from extent_io_tree_init
all callers pass GFP_NOFS, but the GFP mask argument is not used in the
function; GFP_ATOMIC is passed to radix tree initialization and it's the
only correct one, since we're using the preload/insert mechanism of
radix tree.
Let's drop the gfp mask from btrfs function, this will not change
behaviour.

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02 13:57:21 +02:00
David Sterba
62a45b6092 btrfs: make functions static when possible
Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02 13:57:20 +02:00
David Sterba
c704005d88 btrfs: unify checking of IS_ERR and null
use IS_ERR_OR_NULL when possible, done by this coccinelle script:

@ match @
identifier id;
@@
(
- BUG_ON(IS_ERR(id) || !id);
+ BUG_ON(IS_ERR_OR_NULL(id));
|
- IS_ERR(id) || !id
+ IS_ERR_OR_NULL(id)
|
- !id || IS_ERR(id)
+ IS_ERR_OR_NULL(id)
)

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02 13:57:20 +02:00
David Sterba
4891aca2da btrfs: fix dereference before check
The superblock's ->s_fs_info is properly set in btrfs_fill_super, after
a call to open_ctree, which derefereces it before check. Although
tree_root is set via btrfs_set_super, let's be defensive and leave the
check in place.

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02 13:57:20 +02:00
David Sterba
edc95aec57 btrfs: remove nested duplicate variable declarations
Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02 13:57:19 +02:00
David Sterba
306e16ce13 btrfs: rename variables clashing with global function names
reported by gcc -Wshadow:
page_index, page_offset, new_inode, dev_name

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02 13:57:19 +02:00
Tejun Heo
02e352287a block: rescan partitions on invalidated devices on -ENOMEDIA too
__blkdev_get() doesn't rescan partitions if disk->fops->open() fails,
which leads to ghost partition devices lingering after medimum removal
is known to both the kernel and userland.  The behavior also creates a
subtle inconsistency where O_NONBLOCK open, which doesn't fail even if
there's no medium, clears the ghots partitions, which is exploited to
work around the problem from userland.

Fix it by updating __blkdev_get() to issue partition rescan after
-ENOMEDIA too.

This was reported in the following bz.

 https://bugzilla.kernel.org/show_bug.cgi?id=13029

Stable: 2.6.38

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: David Zeuthen <zeuthen@gmail.com>
Reported-by: Martin Pitt <martin.pitt@ubuntu.com>
Reported-by: Kay Sievers <kay.sievers@vrfy.org>
Tested-by: Kay Sievers <kay.sievers@vrfy.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-29 10:17:26 +02:00
Jeff Layton
2a2047bc94 cifs: sanitize length checking in coalesce_t2 (try #3)
There are a couple of places in this code where these values can wrap or
go negative, and that could potentially end up overflowing the buffer.
Ensure that that doesn't happen. Do all of the length calculation and
checks first, and only perform the memcpy after they pass.

Also, increase some stack variables to 32 bits to ensure that they don't
wrap without being detected.

Finally, change the error codes to be a bit more descriptive of any
problems detected. -EINVAL isn't very accurate.

Cc: stable@kernel.org
Reported-and-Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-04-29 05:02:08 +00:00
Jeff Layton
fcda7f4578 cifs: check for bytes_remaining going to zero in CIFS_SessSetup
It's possible that when we go to decode the string area in the
SESSION_SETUP response, that bytes_remaining will be 0. Decrementing it at
that point will mean that it can go "negative" and wrap. Check for a
bytes_remaining value of 0, and don't try to decode the string area if
that's the case.

Cc: stable@kernel.org
Reported-and-Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-04-29 04:57:39 +00:00
Jeff Layton
bfacf2225a cifs: change bleft in decode_unicode_ssetup back to signed type
The buffer length checks in this function depend on this value being a
signed data type, but 690c522fa converted it to an unsigned type.

Also, eliminate a problem with the null termination check in the same
function. cifs_strndup_from_ucs handles that situation correctly
already, and the existing check could potentially lead to a buffer
overrun since it increments bleft without checking to see whether it
falls off the end of the buffer.

Cc: stable@kernel.org
Reported-and-Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-04-29 04:57:35 +00:00
Linus Torvalds
9cab1ba421 Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
  nfs: don't lose MS_SYNCHRONOUS on remount of noac mount
  NFS: Return meaningful status from decode_secinfo()
  NFSv4: Ensure we request the ordinary fileid when doing readdirplus
  NFSv4: Ensure that clientid and session establishment can time out
  SUNRPC: Allow RPC calls to return ETIMEDOUT instead of EIO
  NFSv4.1: Don't loop forever in nfs4_proc_create_session
  NFSv4: Handle NFS4ERR_WRONGSEC outside of nfs4_handle_exception()
  NFSv4.1: Don't update sequence number if rpc_task is not sent
  NFSv4.1: Ensure state manager thread dies on last umount
  SUNRPC: Fix the SUNRPC Kerberos V RPCSEC_GSS module dependencies
  NFS: Use correct variable for page bounds checking
  NFS: don't negotiate when user specifies sec flavor
  NFS: Attempt mount with default sec flavor first
  NFS: flav_array honors NFS_MAX_SECFLAVORS
  NFS: Fix infinite loop in gss_create_upcall()
  Don't mark_inode_dirty_sync() while holding lock
  NFS: Get rid of pointless test in nfs_commit_done
  NFS: Remove unused argument from nfs_find_best_sec()
  NFS: Eliminate duplicate call to nfs_mark_request_dirty
  NFS: Remove dead code from nfs_fs_mount()
2011-04-28 13:13:07 -07:00
Andrew Morton
6d4831c283 vfs: avoid large kmalloc()s for the fdtable
Azurit reports large increases in system time after 2.6.36 when running
Apache.  It was bisected down to a892e2d7dc ("vfs: use kmalloc()
to allocate fdmem if possible").

That patch caused the vfs to use kmalloc() for very large allocations and
this is causing excessive work (and presumably excessive reclaim) within
the page allocator.

Fix it by falling back to vmalloc() earlier - when the allocation attempt
would have been considered "costly" by reclaim.

Reported-by: azurIt <azurit@pobox.sk>
Tested-by: azurIt <azurit@pobox.sk>
Acked-by: Changli Gao <xiaosuo@gmail.com>
Cc: Americo Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-28 11:28:20 -07:00
Jeff Layton
26c4c17073 nfs: don't lose MS_SYNCHRONOUS on remount of noac mount
On a remount, the VFS layer will clear the MS_SYNCHRONOUS bit on the
assumption that the flags on the mount syscall will have it set if the
remounted fs is supposed to keep it.

In the case of "noac" though, MS_SYNCHRONOUS is implied. A remount of
such a mount will lose the MS_SYNCHRONOUS flag since "sync" isn't part
of the mount options.

Reported-by: Max Matveev <makc@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-04-27 16:20:01 -04:00
Bryan Schumaker
613e901e1e NFS: Return meaningful status from decode_secinfo()
When compiling, I was getting this warning:
fs/nfs/nfs4xdr.c: In function ‘decode_secinfo’:
fs/nfs/nfs4xdr.c:4839:6: warning: variable ‘status’ set but not used
[-Wunused-but-set-variable]

We were unconditionally returning 0 as long as there wasn't an error
coming out of xdr_inline_decode().  We probably want to check the error
status coming out of decode_op_hdr() and decode_secinfo_gss(), rather
than assuming that everything is OK all the time.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-04-27 16:17:29 -04:00
Trond Myklebust
28331a46d8 NFSv4: Ensure we request the ordinary fileid when doing readdirplus
When readdir() returns a directory entry for the root of a mounted
filesystem, Linux follows the old convention of returning the inode
number of the covered directory (despite newer versions of POSIX declaring
that this is a bug).
To ensure this continues to work, the NFSv4 readdir implementation requests
the 'mounted-on-fileid' from the server.

However, readdirplus also needs to instantiate an inode for this entry, and
for that, we also need to request the real fileid as per this patch.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-04-27 15:57:16 -04:00
Lucas De Marchi
e9c549998d Revert wrong fixes for common misspellings
These changes were incorrectly fixed by codespell. They were now
manually corrected.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
2011-04-26 23:31:11 -07:00
Linus Torvalds
019793b755 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: cleanup error handling in inode.c
  Btrfs: put the right bio if we have an error
  Btrfs: free bitmaps properly when evicting the cache
  Btrfs: Free free_space item properly in btrfs_trim_block_group()
  btrfs: add missing spin_unlock to a rare exit path
  Btrfs: check return value of kmalloc()
  btrfs: fix wrong allocating flag when reading page
  Btrfs: fix missing mutex_unlock in btrfs_del_dir_entries_in_log()
2011-04-26 08:26:58 -07:00
Linus Torvalds
cb49f57787 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: do some plugging in the submit_bio threads
2011-04-26 08:25:16 -07:00
Linus Torvalds
f727a938ce Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  CIFS: Fix memory over bound bug in cifs_parse_mount_options
2011-04-25 20:38:50 -07:00
Linus Torvalds
cd2e49e90f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6:
  eCryptfs: Flush dirty pages in setattr
  eCryptfs: Handle failed metadata read in lookup
  eCryptfs: Add reference counting to lower files
  eCryptfs: dput dentries returned from dget_parent
  eCryptfs: Remove extra d_delete in ecryptfs_rmdir
2011-04-25 19:01:12 -07:00
Christoph Hellwig
1879fd6a26 add hlist_bl_lock/unlock helpers
Now that the whole dcache_hash_bucket crap is gone, go all the way and
also remove the weird locking layering violations for locking the hash
buckets.  Add hlist_bl_lock/unlock helpers to move the locking into the
list abstraction instead of requiring each caller to open code it.
After all allowing for the bit locks is the whole point of these helpers
over the plain hlist variant.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-25 18:14:10 -07:00
Tyler Hicks
5be79de2e1 eCryptfs: Flush dirty pages in setattr
After 57db4e8d73 changed eCryptfs to
write-back caching, eCryptfs page writeback updates the lower inode
times due to the use of vfs_write() on the lower file.

To preserve inode metadata changes, such as 'cp -p' does with
utimensat(), we need to flush all dirty pages early in
ecryptfs_setattr() so that the user-updated lower inode metadata isn't
clobbered later in writeback.

https://bugzilla.kernel.org/show_bug.cgi?id=33372

Reported-by: Rocko <rockorequin@hotmail.com>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2011-04-25 18:49:46 -05:00
Tyler Hicks
3aeb86ea4c eCryptfs: Handle failed metadata read in lookup
When failing to read the lower file's crypto metadata during a lookup,
eCryptfs must continue on without throwing an error. For example, there
may be a plaintext file in the lower mount point that the user wants to
delete through the eCryptfs mount.

If an error is encountered while reading the metadata in lookup(), the
eCryptfs inode's size could be incorrect. We must be sure to reread the
plaintext inode size from the metadata when performing an open() or
setattr(). The metadata is already being read in those paths, so this
adds minimal performance overhead.

This patch introduces a flag which will track whether or not the
plaintext inode size has been read so that an incorrect i_size can be
fixed in the open() or setattr() paths.

https://bugs.launchpad.net/bugs/509180

Cc: <stable@kernel.org>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2011-04-25 18:45:06 -05:00
Tsutomu Itoh
7cf96da3ec Btrfs: cleanup error handling in inode.c
The error processing of several places is changed like setting the
error number only at the error.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-25 19:43:53 -04:00
Josef Bacik
64728bbbf8 Btrfs: put the right bio if we have an error
In btrfs_submit_direct_hook if the first btrfs_map_block fails we need to put
the orig_bio, not bio.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-25 19:43:52 -04:00
Josef Bacik
a4f0162fd4 Btrfs: free bitmaps properly when evicting the cache
If our space cache is wrong, we do the right thing and free up everything that
we loaded, however we don't reset the total_bitmaps counter or the thresholds or
anything.  So in btrfs_remove_free_space_cache make sure to call free_bitmap()
if it's a bitmap, this will keep us from panicing when we check to make sure we
don't have too many bitmaps.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-25 19:43:52 -04:00
Li Zefan
f789b684bd Btrfs: Free free_space item properly in btrfs_trim_block_group()
Since commit dc89e98244, we've changed
to use a specific slab for alocation of free_space items.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-25 19:43:52 -04:00
David Sterba
cfece4db11 btrfs: add missing spin_unlock to a rare exit path
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-25 19:43:52 -04:00
Tsutomu Itoh
8d413713ca Btrfs: check return value of kmalloc()
The check on the return value of kmalloc() is added to some places.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-25 19:43:52 -04:00
Itaru Kitayama
43e817a1fd btrfs: fix wrong allocating flag when reading page
the space cache use extent_readpages() to read free space information,
so we can not use GFP_KERNEL flag to allocate memory, or it may lead
to deadlock.

Signed-off-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-25 19:43:51 -04:00
Tsutomu Itoh
a62f44a5f4 Btrfs: fix missing mutex_unlock in btrfs_del_dir_entries_in_log()
It is necessary to unlock mutex_lock before it return an error when
btrfs_alloc_path() fails.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-25 19:43:51 -04:00
Tyler Hicks
332ab16f83 eCryptfs: Add reference counting to lower files
For any given lower inode, eCryptfs keeps only one lower file open and
multiplexes all eCryptfs file operations through that lower file. The
lower file was considered "persistent" and stayed open from the first
lookup through the lifetime of the inode.

This patch keeps the notion of a single, per-inode lower file, but adds
reference counting around the lower file so that it is closed when not
currently in use. If the reference count is at 0 when an operation (such
as open, create, etc.) needs to use the lower file, a new lower file is
opened. Since the file is no longer persistent, all references to the
term persistent file are changed to lower file.

Locking is added around the sections of code that opens the lower file
and assign the pointer in the inode info, as well as the code the fputs
the lower file when all eCryptfs users are done with it.

This patch is needed to fix issues, when mounted on top of the NFSv3
client, where the lower file is left silly renamed until the eCryptfs
inode is destroyed.

Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2011-04-25 18:32:37 -05:00
Tyler Hicks
dd55c89852 eCryptfs: dput dentries returned from dget_parent
Call dput on the dentries previously returned by dget_parent() in
ecryptfs_rename(). This is needed for supported eCryptfs mounts on top
of the NFSv3 client.

Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2011-04-25 18:32:36 -05:00
Tyler Hicks
35ffa948b2 eCryptfs: Remove extra d_delete in ecryptfs_rmdir
vfs_rmdir() already calls d_delete() on the lower dentry. That was being
duplicated in ecryptfs_rmdir() and caused a NULL pointer dereference
when NFSv3 was the lower filesystem.

Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2011-04-25 18:32:35 -05:00
Li Zefan
82d5902d9c Btrfs: Support reading/writing on disk free ino cache
This is similar to block group caching.

We dedicate a special inode in fs tree to save free ino cache.

At the very first time we create/delete a file after mount, the free ino
cache will be loaded from disk into memory. When the fs tree is commited,
the cache will be written back to disk.

To keep compatibility, we check the root generation against the generation
of the special inode when loading the cache, so the loading will fail
if the btrfs filesystem was mounted in an older kernel before.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-04-25 16:46:11 +08:00
Li Zefan
33345d0152 Btrfs: Always use 64bit inode number
There's a potential problem in 32bit system when we exhaust 32bit inode
numbers and start to allocate big inode numbers, because btrfs uses
inode->i_ino in many places.

So here we always use BTRFS_I(inode)->location.objectid, which is an
u64 variable.

There are 2 exceptions that BTRFS_I(inode)->location.objectid !=
inode->i_ino: the btree inode (0 vs 1) and empty subvol dirs (256 vs 2),
and inode->i_ino will be used in those cases.

Another reason to make this change is I'm going to use a special inode
to save free ino cache, and the inode number must be > (u64)-256.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-04-25 16:46:09 +08:00
Li Zefan
0414efae79 Btrfs: Make the code for reading/writing free space cache generic
Extract out block group specific code from lookup_free_space_inode(),
create_free_space_inode(), load_free_space_cache() and
btrfs_write_out_cache(), so the code can be used to read/write
free ino cache.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-04-25 16:46:07 +08:00
Li Zefan
581bb05094 Btrfs: Cache free inode numbers in memory
Currently btrfs stores the highest objectid of the fs tree, and it always
returns (highest+1) inode number when we create a file, so inode numbers
won't be reclaimed when we delete files, so we'll run out of inode numbers
as we keep create/delete files in 32bits machines.

This fixes it, and it works similarly to how we cache free space in block
cgroups.

We start a kernel thread to read the file tree. By scanning inode items,
we know which chunks of inode numbers are free, and we cache them in
an rb-tree.

Because we are searching the commit root, we have to carefully handle the
cross-transaction case.

The rb-tree is a hybrid extent+bitmap tree, so if we have too many small
chunks of inode numbers, we'll use bitmaps. Initially we allow 16K ram
of extents, and a bitmap will be used if we exceed this threshold. The
extents threshold is adjusted in runtime.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-04-25 16:46:04 +08:00
Li Zefan
34d52cb6c5 Btrfs: Make free space cache code generic
So we can re-use the code to cache free inode numbers.

The change is quite straightforward. Two new structures are introduced.

- struct btrfs_free_space_ctl

  We move those variables that are used for caching free space from
  struct btrfs_block_group_cache to this new struct.

- struct btrfs_free_space_op

  We do block group specific work (e.g. calculation of extents threshold)
  through functions registered in this struct.

And then we can remove references to struct btrfs_block_group_cache.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-04-25 16:46:03 +08:00
Li Zefan
f38b6e754d Btrfs: Use bitmap_set/clear()
No functional change.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-04-25 16:46:01 +08:00
Li Zefan
92c4231181 Btrfs: Remove unused btrfs_block_group_free_space()
We've already recorded the value in block_group->frees_space.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-04-25 16:45:59 +08:00
Trond Myklebust
1bd714f2a1 NFSv4: Ensure that clientid and session establishment can time out
The following patch ensures that we do not get permanently trapped in
the RPC layer when trying to establish a new client id or session.
This again ensures that the state manager can finish in a timely
fashion when the last filesystem to reference the nfs_client exits.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-04-24 14:29:33 -04:00
Trond Myklebust
fd954ae124 NFSv4.1: Don't loop forever in nfs4_proc_create_session
If a server for some reason keeps sending NFS4ERR_DELAY errors, we can end
up looping forever inside nfs4_proc_create_session, and so the usual
mechanisms for detecting if the nfs_client is dead don't work.

Fix this by ensuring that we loop inside the nfs4_state_manager thread
instead.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-04-24 14:28:18 -04:00
Linus Torvalds
5dd12af05c Merge branch 'dcache-cleanup'
* dcache-cleanup:
  vfs: get rid of insane dentry hashing rules
2011-04-24 08:51:15 -07:00
Linus Torvalds
1f91f48b65 Merge branch 'for-linus' of git://git.infradead.org/ubifs-2.6
* 'for-linus' of git://git.infradead.org/ubifs-2.6:
  UBIFS: fix master node recovery
  UBIFS: fix false assertion warning in case of I/O failures
  UBIFS: fix false space checking failure
2011-04-24 08:42:15 -07:00
Linus Torvalds
dea3667bc3 vfs: get rid of insane dentry hashing rules
The dentry hashing rules have been really quite complicated for a long
while, in odd ways.  That made functions like __d_drop() very fragile
and non-obvious.

In particular, whether a dentry was hashed or not was indicated with an
explicit DCACHE_UNHASHED bit.  That's despite the fact that the hash
abstraction that the dentries use actually have a 'is this entry hashed
or not' model (which is a simple test of the 'pprev' pointer).

The reason that was done is because we used the normal 'is this entry
unhashed' model to mark whether the dentry had _ever_ been hashed in the
dentry hash tables, and that logic goes back many years (commit
b3423415fb: "dcache: avoid RCU for never-hashed dentries").

That, in turn, meant that __d_drop had totally different unhashing logic
for the dentry hash table case and for the anonymous dcache case,
because in order to use the "is this dentry hashed" logic as a flag for
whether it had ever been on the RCU hash table, we had to unhash such a
dentry differently so that we'd never think that it wasn't 'unhashed'
and wouldn't be free'd correctly.

That's just insane.  It made the logic really hard to follow, when there
were two different kinds of "unhashed" states, and one of them (the one
that used "list_bl_unhashed()") really had nothing at all to do with
being unhashed per se, but with a very subtle lifetime rule instead.

So turn all of it around, and make it logical.

Instead of having a DENTRY_UNHASHED bit in d_flags to indicate whether
the dentry is on the hash chains or not, use the hash chain unhashed
logic for that.  Suddenly "d_unhashed()" just uses "list_bl_unhashed()",
and everything makes sense.

And for the lifetime rule, just use an explicit DENTRY_RCUACCEES bit.
If we ever insert the dentry into the dentry hash table so that it is
visible to RCU lookup, we mark it DENTRY_RCUACCESS to show that it now
needs the RCU lifetime rules.  Now suddently that test at dentry free
time makes sense too.

And because unhashing now is sane and doesn't depend on where the dentry
got unhashed from (because the dentry hash chain details doesn't have
some subtle side effects), we can re-unify the __d_drop() logic and use
common code for the unhashing.

Also fix one more open-coded hash chain bit_spin_lock() that I missed in
the previous chain locking cleanup commit.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-24 07:58:46 -07:00
Linus Torvalds
b07ad9967f vfs: get rid of 'struct dcache_hash_bucket' abstraction
It's a useless abstraction for 'hlist_bl_head', and it doesn't actually
help anything - quite the reverse.  All the users end up having to know
about the hlist_bl_head details anyway, using 'struct hlist_bl_node *'
etc. So it just makes the code look confusing.

And the cost of it is extra '&b->head' syntactic noise, but more
importantly it spuriously makes the hash table dentry list look
different from the per-superblock DCACHE_DISCONNECTED dentry list.

As a result, the code ended up using ad-hoc locking for one case and
special helper functions for what is really another totally identical
case in the very same function.

Make it all look and work the same.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-23 22:32:03 -07:00
Pavel Shilovsky
4906e50b37 CIFS: Fix memory over bound bug in cifs_parse_mount_options
While password processing we can get out of options array bound if
the next character after array is delimiter. The patch adds a check
if we reach the end.

Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-04-21 17:22:43 +00:00
Linus Torvalds
37fc67c9f0 Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: fix duplicate message output
2011-04-21 10:01:26 -07:00
Jan Kara
df7e130384 vfs: Pass setxattr(2) flags properly
For some reason generic_setxattr() did not pass flags (XATTR_CREATE,
XATTR_REPLACE) to the filesystem specific helper. This caused that
setxattr(2) syscall just ignored these flags.

Fix the bug by passing flags correctly.

Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-21 07:34:44 -07:00
Artem Bityutskiy
6e0d9fd38b UBIFS: fix master node recovery
This patch fixes the following symptoms:
1. Unmount UBIFS cleanly.
2. Start mounting UBIFS R/W and have a power cut immediately
3. Start mounting UBIFS R/O, this succeeds
4. Try to re-mount UBIFS R/W - this fails immediately or later on,
   because UBIFS will write the master node to the flash area
   which has been written before.

The analysis of the problem:

1. UBIFS is unmounted cleanly, both copies of the master node are clean.
2. UBIFS is being mounter R/W, starts changing master node copy 1, and
   a power cut happens. The copy N1 becomes corrupted.
3. UBIFS is being mounted R/O. It notices the copy N1 is corrupted and
   reads copy N2. Copy N2 is clean.
4. Because of R/O mode, UBIFS cannot recover copy 1.
5. The mount code (ubifs_mount()) sees that the master node is clean,
   so it decides that no recovery is needed.
6. We are re-mounting R/W. UBIFS believes no recovery is needed and
   starts updating the master node, but copy N1 is still corrupted
   and was not recovered!

Fix this problem by marking the master node as dirty every time we
recover it and we are in R/O mode. This forces further recovery and
the UBIFS cleans-up the corruptions and recovers the copy N1 when
re-mounting R/W later.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Cc: stable@kernel.org
2011-04-21 15:27:21 +03:00
Artem Bityutskiy
1a067a22e4 UBIFS: fix false assertion warning in case of I/O failures
When UBIFS switches to R/O mode because it detects I/O failures, then
when we unmount, we still may have allocated budget, and the assertions
which verify that we have not budget will fire. But it is expected to
have the budget in case of I/O failures, so the assertion warnings will
be false. Suppress them for the I/O failure case.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2011-04-21 15:27:12 +03:00
Linus Torvalds
18995ba5ab Merge branch 'for-2.6.39' of git://linux-nfs.org/~bfields/linux
* 'for-2.6.39' of git://linux-nfs.org/~bfields/linux:
  Open with O_CREAT flag set fails to open existing files on non writable directories
  nfsd4: Fix filp leak
  nfsd4: fix struct file leak on delegation
2011-04-20 17:41:06 -07:00
Dave Chinner
3eff126899 xfs: fix duplicate message output
Commit 957935dc ("xfs: fix xfs_debug warnings" broke the logic in
__xfs_printk(). Instead of only printing one of two possible output
strings based on whether the fs has a name or not, it outputs both.
Fix it to only output one message again.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2011-04-20 11:36:49 -05:00
Artem Bityutskiy
8c230d9a5b UBIFS: fix false space checking failure
This patch fixes UBIFS mount failure when the debugging support is enabled,
we are recovering from a power cut, we were first mounter R/O and we are
re-mounting R/W. In this case we should not assume that the amount of free
space before we have re-mounted R/W and after are equivalent, because
when we have mounted R/O the file-system is in a non-committed state so
the amount of free space is slightly smaller, due to the fact that we cannot
predict the amount of free space precisely before we commit.

This patch fixes the issue by skipping the debugging check in case of
recovery. This issue was reported by Caizhiyong <caizhiyong@huawei.com>
here: http://thread.gmane.org/gmane.linux.drivers.mtd/34350/focus=34387

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Reported-by: Caizhiyong <caizhiyong@huawei.com>
Cc: stable@kernel.org [2.6.30+]
2011-04-20 18:16:37 +03:00
Sachin Prabhu
1574dff899 Open with O_CREAT flag set fails to open existing files on non writable directories
An open on a NFS4 share using the O_CREAT flag on an existing file for
which we have permissions to open but contained in a directory with no
write permissions will fail with EACCES.

A tcpdump shows that the client had set the open mode to UNCHECKED which
indicates that the file should be created if it doesn't exist and
encountering an existing flag is not an error. Since in this case the
file exists and can be opened by the user, the NFS server is wrong in
attempting to check create permissions on the parent directory.

The patch adds a conditional statement to check for create permissions
only if the file doesn't exist.

Signed-off-by: Sachin S. Prabhu <sprabhu@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-04-20 11:03:01 -04:00
Chris Mason
211588ad19 Btrfs: do some plugging in the submit_bio threads
The Btrfs submit bio threads have a small number of
threads responsible for pushing down bios we've collected
for a large number of devices.

Since we do all the bios for a single device at once,
we want to make sure we unplug and send down the bios
for each device as we're done processing them.

The new plugging API removed the btrfs code to
unplug while processing bios, this adds it back with
the new API.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-19 20:12:40 -04:00
OGAWA Hirofumi
a96e5b9080 nfsd4: Fix filp leak
23fcf2ec93 (nfsd4: fix oops on lock failure)

The above patch breaks free path for stp->st_file. If stp was inserted
into sop->so_stateids, we have to free stp->st_file refcount. Because
stp->st_file refcount itself is taken whether or not any refcounts are
taken on the stp->st_file->fi_fds[].

Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: stable@kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-04-19 17:31:13 -04:00
Linus Torvalds
f28c6179e5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes:
  GFS2: filesystem hang caused by incorrect lock order
  GFS2: Don't try to deallocate unlinked inodes when mounted ro
  GFS2: directly write blocks past i_size
  GFS2: write_end error path fails to unlock transaction lock
2011-04-19 10:52:51 -07:00
Bryan Schumaker
fb8a5ba811 NFSv4: Handle NFS4ERR_WRONGSEC outside of nfs4_handle_exception()
I only want to try other secflavors during an initial mount if
NFS4ERR_WRONGSEC is returned.  nfs4_handle_exception() could
potentially map other errors to EPERM, so we should handle this
error specially for correctness.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-04-18 17:06:00 -04:00
Bryan Schumaker
468f86134e NFSv4.1: Don't update sequence number if rpc_task is not sent
If we fail to contact the gss upcall program, then no message will
be sent to the server.  The client still updated the sequence number,
however, and this lead to NFS4ERR_SEQ_MISMATCH for the next several
RPC calls.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-04-18 17:05:48 -04:00
Linus Torvalds
adff377bb1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (24 commits)
  Btrfs: fix free space cache leak
  Btrfs: avoid taking the chunk_mutex in do_chunk_alloc
  Btrfs end_bio_extent_readpage should look for locked bits
  Btrfs: don't force chunk allocation in find_free_extent
  Btrfs: Check validity before setting an acl
  Btrfs: Fix incorrect inode nlink in btrfs_link()
  Btrfs: Check if btrfs_next_leaf() returns error in btrfs_real_readdir()
  Btrfs: Check if btrfs_next_leaf() returns error in btrfs_listxattr()
  Btrfs: make uncache_state unconditional
  btrfs: using cached extent_state in set/unlock combinations
  Btrfs: avoid taking the trans_mutex in btrfs_end_transaction
  Btrfs: fix subvolume mount by name problem when default mount subvolume is set
  fix user annotation in ioctl.c
  Btrfs: check for duplicate iov_base's when doing dio reads
  btrfs: properly handle overlapping areas in memmove_extent_buffer
  Btrfs: fix memory leaks in btrfs_new_inode()
  Btrfs: check for duplicate iov_base's when doing dio reads
  Btrfs: reuse the extent_map we found when calling btrfs_get_extent
  Btrfs: do not use async submit for small DIO io's
  Btrfs: don't split dio bios if we don't have to
  ...
2011-04-18 12:24:05 -07:00
Linus Torvalds
d8bdc59f21 proc: do proper range check on readdir offset
Rather than pass in some random truncated offset to the pid-related
functions, check that the offset is in range up-front.

This is just cleanup, the previous commit fixed the real problem.

Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-18 10:36:54 -07:00
J. Bruce Fields
4ee63624fd nfsd4: fix struct file leak on delegation
Introduced by acfdf5c383.

Cc: stable@kernel.org
Reported-by: Gerhard Heift <ml-nfs-linux-20110412-ef47@gheift.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-04-18 13:30:56 -04:00
Bob Peterson
44ad37d69b GFS2: filesystem hang caused by incorrect lock order
This patch fixes a deadlock in GFS2 where two processes are trying
to reclaim an unlinked dinode:
One holds the inode glock and calls gfs2_lookup_by_inum trying to look
up the inode, which it can't, due to I_FREEING.  The other has set
I_FREEING from vfs and is at the beginning of gfs2_delete_inode
waiting for the glock, which is held by the first.  The solution is to
add a new non_block parameter to the gfs2_iget function that causes it
to return -ENOENT if the inode is being freed.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2011-04-18 15:23:50 +01:00
Steven Whitehouse
001e8e8df4 GFS2: Don't try to deallocate unlinked inodes when mounted ro
This adds a couple of missing tests to avoid read-only nodes
from attempting to deallocate unlinked inodes.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Reported-by: Michel Andre de la Porte <madelaporte@ubi.com>
2011-04-18 15:23:12 +01:00
Benjamin Marzinski
0ee532062f GFS2: directly write blocks past i_size
GFS2 was relying on the writepage code to write out the zeroed data for
fallocate.  However, with FALLOC_FL_KEEP_SIZE set, this may be past i_size.
If it is, it will be ignored.  To work around this, gfs2 now calls
write_dirty_buffer directly on the buffer_heads when FALLOC_FL_KEEP_SIZE
is set, and it's writing past i_size.

This version is just a cleanup of my last version

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2011-04-18 15:22:52 +01:00
Bob Peterson
deab72d379 GFS2: write_end error path fails to unlock transaction lock
I did an audit of gfs2's transaction glock for bugzilla bug
658619 and ran across this:

In function gfs2_write_end, in the unlikely event that
gfs2_meta_inode_buffer returns an error, the code may forget
to unlock the transaction lock because the "failed" label
appears after the call to function gfs2_trans_end.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2011-04-18 15:22:35 +01:00
Chris Mason
f65647c29b Btrfs: fix free space cache leak
The free space caching code was recently reworked to
cache all the pages it needed instead of using find_get_page everywhere.

One loop was missed though, so it ended up leaking pages.  This fixes
it to use our page array instead of find_get_page.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-18 08:55:34 -04:00
Milton Miller
fff3e5ade4 fs: synchronize_rcu when unregister_filesystem success not failure
While checking unregister_filesystem for saftey vs extra calls for
"ext4: register ext2 and ext3 alias after ext4" I realized that
the synchronize_rcu() was called on the error path but not on
the success path.

Cc: stable (2.6.38)
Signed-off-by: Milton Miller <miltonm@bga.com>
[ This probably won't really make a difference since commit d863b50ab0
  ("vfs: call rcu_barrier after ->kill_sb()"), but it's the right thing
  to do.  - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-17 10:42:01 -07:00
Josef Bacik
6d74119f1a Btrfs: avoid taking the chunk_mutex in do_chunk_alloc
Everytime we try to allocate disk space we try and see if we can pre-emptively
allocate a chunk, but in the common case we don't allocate anything, so there is
no sense in taking the chunk_mutex at all.  So instead if we are allocating a
chunk, mark it in the space_info so we don't get two people trying to allocate
at the same time.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Reviewed-by: Liu Bo <liubo2009@cn.fujitsu.com>
2011-04-16 07:10:56 -04:00
Chris Mason
0d399205ed Btrfs end_bio_extent_readpage should look for locked bits
A recent commit caches the extent state in end_bio_extent_readpage,
but the search it does should look for locked extents.  This
fixes things to make it more effective.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-16 06:55:39 -04:00
Linus Torvalds
0ebc115da3 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
  net/9p: nwname should be an unsigned int
  9p: Fix sparse error
  fs/9p: Fix error reported by coccicheck
  9p: revert tsyncfs related changes
  fs/9p: Use write_inode for data sync on server
  fs/9p: Fix revalidate to return correct value
2011-04-15 20:31:15 -07:00
Trond Myklebust
47c2199b6e NFSv4.1: Ensure state manager thread dies on last umount
Currently, the state manager may continue to try recovering state forever
even after the last filesystem to reference that nfs_client has umounted.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
2011-04-15 18:28:22 -04:00
Tim Chen
c1530019e3 vfs: Fix absolute RCU path walk failures due to uninitialized seq number
During RCU walk in path_lookupat and path_openat, the rcu lookup
frequently failed if looking up an absolute path, because when root
directory was looked up, seq number was not properly set in nameidata.

We dropped out of RCU walk in nameidata_drop_rcu due to mismatch in
directory entry's seq number.  We reverted to slow path walk that need
to take references.

With the following patch, I saw a 50% increase in an exim mail server
benchmark throughput on a 4-socket Nehalem-EX system.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Cc: stable@kernel.org (v2.6.38)
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-15 15:28:12 -07:00
Aneesh Kumar K.V
936bb2d703 fs/9p: Fix error reported by coccicheck
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2011-04-15 15:26:14 -05:00
Aneesh Kumar K.V
df5d8c80f1 9p: revert tsyncfs related changes
Now that we use write_inode to flush server
cache related to fid, we don't need tsyncfs either fort dotl or dotu
protocols. For dotu this helps to do a more efficient server flush.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2011-04-15 15:26:14 -05:00
Aneesh Kumar K.V
c2ed388021 fs/9p: Use write_inode for data sync on server
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2011-04-15 15:26:14 -05:00
Aneesh Kumar K.V
f2eda2c6cc fs/9p: Fix revalidate to return correct value
revalidate should return > 0 on success. Also return 0 on ENOENT
to force do_revalidate to return NULL dentry;

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2011-04-15 15:26:13 -05:00
Chris Mason
0e4f8f8888 Btrfs: don't force chunk allocation in find_free_extent
find_free_extent likes to allocate in contiguous clusters,
which makes writeback faster, especially on SSD storage.  As
the FS fragments, these clusters become harder to find and we have
to decide between allocating a new chunk to make more clusters
or giving up on the cluster to allocate from the free space
we have.

Right now it creates too many chunks, and you can end up with
a whole FS that is mostly empty metadata chunks.  This commit
changes the allocation code to be more strict and only
allocate new chunks when we've made good use of the chunks we
already have.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-15 16:05:44 -04:00
Linus Torvalds
a970f5d513 Merge branch 'linux-next' of git://git.infradead.org/ubifs-2.6
* 'linux-next' of git://git.infradead.org/ubifs-2.6:
  UBIFS: fix compilation warnings when compiling with gcc 4.5
  UBIFS: fix oops when R/O file-system is fsync'ed
2011-04-15 07:44:22 -07:00
Linus Torvalds
7ebfa57f6d vfs: fix incorrect dentry_update_name_case() BUG_ON() test
The case we should be verifying when updating the dentry name is that
the _parent_ inode (the directory) semaphore is held, not the semaphore
for the dentry itself.  It's the directory locking that rename and
readdir() etc all care about.

The comment just above even says so - but then the BUG_ON() still
checked the dentry inode itself.

Very few people noticed, because this helper function really isn't used
for very much, so you had to be using ncpfs to ever hit it.

I think I should just remove the BUG_ON (the function really has just
one user), but let's run with it fixed for a while before getting rid of
it entirely.

Reported-and-tested-by: Bongani Hlope <bonganih@bankservafrica.com>
Reported-and-tested-by: Bernd Feige <bernd.feige@uniklinik-freiburg.de>
Cc: Petr Vandrovec <petr@vandrovec.name>,
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-15 07:34:26 -07:00
Bob Liu
b836aec53e ramfs: fix memleak on no-mmu arch
On no-mmu arch, there is a memleak during shmem test.  The cause of this
memleak is ramfs_nommu_expand_for_mapping() added page refcount to 2
which makes iput() can't free that pages.

The simple test file is like this:

  int main(void)
  {
	int i;
	key_t k = ftok("/etc", 42);

	for ( i=0; i<100; ++i) {
		int id = shmget(k, 10000, 0644|IPC_CREAT);
		if (id == -1) {
			printf("shmget error\n");
		}
		if(shmctl(id, IPC_RMID, NULL ) == -1) {
			printf("shm  rm error\n");
			return -1;
		}
	}
	printf("run ok...\n");
	return 0;
  }

And the result:

  root:/> free
               total         used         free       shared      buffers
  Mem:         60320        17912        42408            0            0
  -/+ buffers:              17912        42408
  root:/> shmem
  run ok...
  root:/> free
               total         used         free       shared      buffers
  Mem:         60320        19096        41224            0            0
  -/+ buffers:              19096        41224
  root:/> shmem
  run ok...
  root:/> free
               total         used         free       shared      buffers
  Mem:         60320        20296        40024            0            0
  -/+ buffers:              20296        40024
  ...

After this patch the test result is:(no memleak anymore)

  root:/> free
               total         used         free       shared      buffers
  Mem:         60320        16668        43652            0            0
  -/+ buffers:              16668        43652
  root:/> shmem
  run ok...
  root:/> free
               total         used         free       shared      buffers
  Mem:         60320        16668        43652            0            0
  -/+ buffers:              16668        43652

Signed-off-by: Bob Liu <lliubbo@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-14 16:06:56 -07:00
Jeff Mahoney
ed5afeaf42 fs/fhandle.c: add <linux/personality.h> for ia64
force_o_largefile() on ia64 is defined in <asm/fcntl.h> and requires
<linux/personality.h>.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-14 16:06:56 -07:00
Jiri Kosina
4471a675df brk: COMPAT_BRK: fix detection of randomized brk
5520e89 ("brk: fix min_brk lower bound computation for COMPAT_BRK")
tried to get the whole logic of brk randomization for legacy
(libc5-based) applications finally right.

It turns out that the way to detect whether brk has actually been
randomized in the end or not introduced by that patch still doesn't work
for those binaries, as reported by Geert:

: /sbin/init from my old m68k ramdisk exists prematurely.
:
: Before the patch:
:
: | brk(0x80005c8e)                         = 0x80006000
:
: After the patch:
:
: | brk(0x80005c8e)                         = 0x80005c8e
:
: Old libc5 considers brk() to have failed if the return value is not
: identical to the requested value.

I don't like it, but currently see no better option than a bit flag in
task_struct to catch the CONFIG_COMPAT_BRK && randomize_va_space == 2
case.

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-14 16:06:55 -07:00
Timo Warns
c340b1d640 fs/partitions/ldm.c: fix oops caused by corrupted partition table
The kernel automatically evaluates partition tables of storage devices.
The code for evaluating LDM partitions (in fs/partitions/ldm.c) contains
a bug that causes a kernel oops on certain corrupted LDM partitions.
A kernel subsystem seems to crash, because, after the oops, the kernel no
longer recognizes newly connected storage devices.

The patch validates the value of vblk_size.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Timo Warns <warns@pre-sense.de>
Cc: Eugene Teo <eugeneteo@kernel.sg>
Cc: Harvey Harrison <harvey.harrison@gmail.com>
Cc: Richard Russon <rich@flatcap.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-14 16:06:54 -07:00
Bryan Schumaker
c3dfc2808a NFS: Use correct variable for page bounds checking
While decoding a secinfo reply, I store the list of supported sec
flavors on a page accessible through res->flavors.  Before reading
each new flavor, I do some math to determine if there is enough
space left on this page, and I break out of my read look if there
isn't.  In order to perform this check correctly, I need to use the
address of res->flavors, rather than the address of res.

When this loop was broken early I lied to the caller and told them
that the entire list had been decoded.  This could lead to problems
if the caller tries to use any the garbage data claiming to be a
valid sec flavor.  I fixed this by using res->flavors->num_flavors
as a counter, incrementing it every time a sec flavor is
successfully decoded.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-04-13 15:12:23 -04:00
Bryan Schumaker
9b7160c55a NFS: don't negotiate when user specifies sec flavor
We were always attempting sec flavor negotiation, even if the user
told us a specific sec flavor to use.  If that sec flavor fails,
we should return an error rather than continuing with sec flavor
negotiation.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-04-13 15:12:23 -04:00
Bryan Schumaker
801a16dc7b NFS: Attempt mount with default sec flavor first
nfs4_lookup_root() is already configured to use either RPC_AUTH_UNIX
or a user specified flavor (through -o sec=<whatever>).  We should
use this flavor first, and only attempt negotiation if it fails
with -EPERM.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-04-13 15:12:23 -04:00
Bryan Schumaker
0fabee243a NFS: flav_array honors NFS_MAX_SECFLAVORS
NFS_MAX_SECFLAVORS should already take into account RPC_AUTH_UNIX
and RPC_AUTH_NULL, so we don't need to set aside extra slots
for them.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-04-13 15:12:22 -04:00
Bryan Schumaker
d1a8016a2d NFS: Fix infinite loop in gss_create_upcall()
There can be an infinite loop if gss_create_upcall() is called without
the userspace program running.  To prevent this, we return -EACCES if
we notice that pipe_version hasn't changed (indicating that the pipe
has not been opened).

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-04-13 15:12:22 -04:00
Weston Andros Adamson
79a48a1f5d Don't mark_inode_dirty_sync() while holding lock
mark_inode_dirty_sync() grabs the same inode lock!

race conditions between holding the lock in pnfs_set_layoutcommit() and in
mark_inode_dirty_sync() can result in a second call to pnfs_layoutcommit_inode(), but
this will be a noop as NFS_INO_LAYOUTCOMMIT won't be set in the second call

Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-04-13 13:15:51 -04:00
Maksim Rayskiy
1dcffad741 UBIFS: fix compilation warnings when compiling with gcc 4.5
When compiling UBIFS with CONFIG_UBIFS_FS_DEBUG not set,
gcc-4.5.2 generates a slew of "warning: statement with no effect"
on references to non-void functions defined as 0.
To avoid these warnings, replace #defines with dummy inline functions.

Artem: massage the patch a bit, also remove the duplicate
       'dbg_check_lprops()' prototype.

Signed-off-by: Maksim Rayskiy <maksim.rayskiy@gmail.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2011-04-13 11:59:09 +03:00
Artem Bityutskiy
78530bf7f2 UBIFS: fix oops when R/O file-system is fsync'ed
This patch fixes severe UBIFS bug: UBIFS oopses when we 'fsync()' an
file on R/O-mounter file-system. We (the UBIFS authors) incorrectly
thought that VFS would not propagate 'fsync()' down to the file-system
if it is read-only, but this is not the case.

It is easy to exploit this bug using the following simple perl script:

use strict;
use File::Sync qw(fsync sync);

die "File path is not specified" if not defined $ARGV[0];
my $path = $ARGV[0];

open FILE, "<", "$path" or die "Cannot open $path: $!";
fsync(\*FILE) or die "cannot fsync $path: $!";
close FILE or die "Cannot close $path: $!";

Thanks to Reuben Dowle <Reuben.Dowle@navico.com> for reporting about this
issue.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Reported-by: Reuben Dowle <Reuben.Dowle@navico.com>
Cc: stable@kernel.org
2011-04-13 10:43:32 +03:00
Miao Xie
329c5056be Btrfs: Check validity before setting an acl
Call posix_acl_valid() to check if an acl is valid or not.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-04-13 14:25:35 +08:00
Miao Xie
3153495d8e Btrfs: Fix incorrect inode nlink in btrfs_link()
Link count of the inode is not decreased if btrfs_set_inode_index()
fails.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Singed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-04-13 14:25:32 +08:00
Li Zefan
b9e03af0bc Btrfs: Check if btrfs_next_leaf() returns error in btrfs_real_readdir()
btrfs_next_leaf() can return -errno, and we should propagate
it to userspace.

This also simplifies how we walk the btree path.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-04-13 14:25:31 +08:00
Li Zefan
2e6a00356a Btrfs: Check if btrfs_next_leaf() returns error in btrfs_listxattr()
btrfs_next_leaf() can return -errno, and we should propagate
it to userspace.

This also simplifies how we walk the btree path.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-04-13 14:25:28 +08:00
Chris Mason
109b36a2bb Btrfs: make uncache_state unconditional
The extent_io code can take cached pointers into the extent state trees,
and these can make lookups much faster in common operations.  The
caching only happens when specific bits are set that prevent merging
and splitting of the extent state.

A help function was added to uncache the state, and it was testing
the same set of conditionals.  This can leak in very strange corner
cases where the lock bit goes away unexpectedly.

The uncaching should be unconditional.  Once we have a ref on the
extent we should always give it up.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-12 20:51:26 -04:00
Trond Myklebust
c0d0e96b84 NFS: Get rid of pointless test in nfs_commit_done
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-04-12 19:34:23 -04:00
Bryan Schumaker
561f0b0ad0 NFS: Remove unused argument from nfs_find_best_sec()
The inode was used in an earlier version of the code, but it isn't
used anymore.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-04-12 19:34:23 -04:00
Trond Myklebust
4b38a6db01 NFS: Eliminate duplicate call to nfs_mark_request_dirty
We only need to call nfs_mark_request_dirty() once in nfs_writepage_setup().

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-04-12 19:34:22 -04:00
Jesper Juhl
160bc1604f NFS: Remove dead code from nfs_fs_mount()
In fs/nfs/super.c::nfs_fs_mount() we test for a NULL 'data':

...
 		if (data == NULL || mntfh == NULL)
 			goto out_free_fh;
...

and then further down in the function we test 'data' again:

...
 			nfs_fscache_get_super_cookie(
 				s, data ? data->fscache_uniq : NULL, NULL);
...

this second check is just dead code since there is no way 'data' could
possibly be NULL here.
We also rely on a non-NULL 'data' in more than one location between these
two tests, further proving the point that the second test is bogus.

This patch removes the dead code.

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-04-12 19:34:21 -04:00
Linus Torvalds
6faf9a5415 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  cifs: don't allow mmap'ed pages to be dirtied while under writeback (try #3)
  [CIFS] Warn on requesting default security (ntlm) on mount
  [CIFS] cifs: clarify the meaning of tcpStatus == CifsGood
  cifs: wrap received signature check in srv_mutex
  cifs: clean up various nits in unicode routines (try #2)
  cifs: clean up length checks in check2ndT2
  cifs: set ra_pages in backing_dev_info
  cifs: fix broken BCC check in is_valid_oplock_break
  cifs: always do is_path_accessible check in cifs_mount
  various endian fixes to cifs
  Elminate sparse __CHECK_ENDIAN__ warnings on port conversion
  Max share size is too small
  Allow user names longer than 32 bytes
  cifs: replace /proc/fs/cifs/Experimental with a module parm
  cifs: check for private_data before trying to put it
2011-04-12 15:24:42 -07:00
Dave Chinner
0d88f6e804 nfs: don't call __mark_inode_dirty while holding i_lock
nfs_scan_commit() is called with the inode->i_lock held, but it then
calls __mark_inode_dirty() while still holding the lock. This causes
a deadlock.

Push the inode->i_lock into nfs_scan_commit() so it can protect only
the parts of the code it needs to and can be dropped before the call
to __mark_inode_dirty() to avoid the deadlock.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Will Simoneau <simoneau@ele.uri.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-12 14:17:24 -07:00
Linus Torvalds
be85bccaa5 Revert "vfs: Export file system uuid via /proc/<pid>/mountinfo"
This reverts commit 93f1c20bc8.

It turns out that libmount misparses it because it adds a '-' character
in the uuid string, which libmount then incorrectly confuses with the
separator string (" - ") at the end of all the optional arguments.

Upstream libmount (in the util-linux tree) has been fixed, but until
that fix actually percolates up to users, we'd better not expose this
change in the kernel.

Let's revisit this later (possibly by exposing the UUID without any '-'
characters in it, avoiding the user-space bug).

Reported-by: Dave Jones <davej@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Karel Zak <kzak@redhat.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Miklos Szeredi <mszeredi@suse.cz>
Cc: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-12 13:35:56 -07:00
Jeff Layton
ca83ce3d5b cifs: don't allow mmap'ed pages to be dirtied while under writeback (try #3)
This is more or less the same patch as before, but with some merge
conflicts fixed up.

If a process has a dirty page mapped into its page tables, then it has
the ability to change it while the client is trying to write the data
out to the server. If that happens after the signature has been
calculated then that signature will then be wrong, and the server will
likely reset the TCP connection.

This patch adds a page_mkwrite handler for CIFS that simply takes the
page lock. Because the page lock is held over the life of writepage and
writepages, this prevents the page from becoming writeable until
the write call has completed.

With this, we can also remove the "sign_zero_copy" module option and
always inline the pages when writing.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-04-12 14:19:55 +00:00
Steve French
d9b9420137 [CIFS] Warn on requesting default security (ntlm) on mount
Warn once if default security (ntlm) requested. We will
update the default to the stronger security mechanism
(ntlmv2) in 2.6.41.  Kerberos is also stronger than
ntlm, but more servers support ntlmv2 and ntlmv2
does not require an upcall, so ntlmv2 is a better
default.

Reviewed-by: Jeff Layton <jlayton@redhat.com>
CC: Suresh Jayaraman <sjayaraman@suse.de>
Reviewed-by: Shirish Pargaonkar <shirishp@us.ibm.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-04-12 01:27:45 +00:00
Steve French
fd88ce9313 [CIFS] cifs: clarify the meaning of tcpStatus == CifsGood
When the TCP_Server_Info is first allocated and connected, tcpStatus ==
CifsGood means that the NEGOTIATE_PROTOCOL request has completed and the
socket is ready for other calls. cifs_reconnect however sets tcpStatus
to CifsGood as soon as the socket is reconnected and the optional
RFC1001 session setup is done. We have no clear way to tell the
difference between these two states, and we need to know this in order
to know whether we can send an echo or not.

Resolve this by adding a new statusEnum value -- CifsNeedNegotiate. When
the socket has been connected but has not yet had a NEGOTIATE_PROTOCOL
request done, set it to this value. Once the NEGOTIATE is done,
cifs_negotiate_protocol will set tcpStatus to CifsGood.

This also fixes and cleans the logic in cifs_reconnect and
cifs_reconnect_tcon. The old code checked for specific states when what
it really wants to know is whether the state has actually changed from
CifsNeedReconnect.

Reported-and-Tested-by: JG <jg@cms.ac>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-04-12 01:01:14 +00:00
Jeff Layton
157c249114 cifs: wrap received signature check in srv_mutex
While testing my patchset to fix asynchronous writes, I hit a bunch
of signature problems when testing with signing on. The problem seems
to be that signature checks on receive can be running at the same
time as a process that is sending, or even that multiple receives can
be checking signatures at the same time, clobbering the same data
structures.

While we're at it, clean up the comments over cifs_calculate_signature
and add a note that the srv_mutex should be held when calling this
function.

This patch seems to fix the problems for me, but I'm not clear on
whether it's the best approach. If it is, then this should probably
go to stable too.

Cc: stable@kernel.org
Cc: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-04-12 00:58:28 +00:00
Jeff Layton
581ade4d1c cifs: clean up various nits in unicode routines (try #2)
Minor revision to the original patch. Don't abuse the __le16 variable
on the stack by casting it to wchar_t and handing it off to char2uni.
Declare an actual wchar_t on the stack instead. This fixes a valid
sparse warning.

Fix the spelling of UNI_ASTERISK. Eliminate the unneeded len_remaining
variable in cifsConvertToUCS.

Also, as David Howells points out. We were better off making
cifsConvertToUCS *not* use put_unaligned_le16 since it means that we
can't optimize the mapped characters at compile time. Switch them
instead to use cpu_to_le16, and simply use put_unaligned to set them
in the string.

Reported-and-acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-04-12 00:57:12 +00:00
Jeff Layton
c0c7b905e9 cifs: clean up length checks in check2ndT2
Thus spake David Howells:

The code that follows this:

  	remaining = total_data_size - data_in_this_rsp;
	if (remaining == 0)
		return 0;
	else if (remaining < 0) {

generates better code if you drop the 'remaining' variable and compare
the values directly.

Clean it up per his recommendation...

Reported-and-acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-04-12 00:56:46 +00:00
Jeff Layton
2b6c26a0a6 cifs: set ra_pages in backing_dev_info
Commit 522440ed made cifs set backing_dev_info on the mapping attached
to new inodes. This change caused a fairly significant read performance
regression, as cifs started doing page-sized reads exclusively.

By virtue of the fact that they're allocated as part of cifs_sb_info by
kzalloc, the ra_pages on cifs BDIs get set to 0, which prevents any
readahead. This forces the normal read codepaths to use readpage instead
of readpages causing a four-fold increase in the number of read calls
with the default rsize.

Fix it by setting ra_pages in the BDI to the same value as that in the
default_backing_dev_info.

Fixes https://bugzilla.kernel.org/show_bug.cgi?id=31662

Cc: stable@kernel.org
Reported-and-Tested-by: Till <till2.schaefer@uni-dortmund.de>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-04-12 00:56:00 +00:00
Jeff Layton
8679b0dba7 cifs: fix broken BCC check in is_valid_oplock_break
The BCC is still __le16 at this point, and in any case we need to
use the get_bcc_le macro to make sure we don't hit alignment
problems.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-04-12 00:54:30 +00:00
Jeff Layton
7094564372 cifs: always do is_path_accessible check in cifs_mount
Currently, we skip doing the is_path_accessible check in cifs_mount if
there is no prefixpath. I have a report of at least one server however
that allows a TREE_CONNECT to a share that has a DFS referral at its
root. The reporter in this case was using a UNC that had no prefixpath,
so the is_path_accessible check was not triggered and the box later hit
a BUG() because we were chasing a DFS referral on the root dentry for
the mount.

This patch fixes this by removing the check for a zero-length
prefixpath.  That should make the is_path_accessible check be done in
this situation and should allow the client to chase the DFS referral at
mount time instead.

Cc: stable@kernel.org
Reported-and-Tested-by: Yogesh Sharma <ysharma@cymer.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-04-12 00:52:08 +00:00
Steve French
5443d130aa various endian fixes to cifs
make modules C=2 M=fs/cifs CF=-D__CHECK_ENDIAN__

Found for example:

 CHECK   fs/cifs/cifssmb.c
fs/cifs/cifssmb.c:728:22: warning: incorrect type in assignment (different base types)
fs/cifs/cifssmb.c:728:22:    expected unsigned short [unsigned] [usertype] Tid
fs/cifs/cifssmb.c:728:22:    got restricted __le16 [usertype] <noident>
fs/cifs/cifssmb.c:1883:45: warning: incorrect type in assignment (different base types)
fs/cifs/cifssmb.c:1883:45:    expected long long [signed] [usertype] fl_start
fs/cifs/cifssmb.c:1883:45:    got restricted __le64 [usertype] start
fs/cifs/cifssmb.c:1884:54: warning: restricted __le64 degrades to integer
fs/cifs/cifssmb.c:1885:58: warning: restricted __le64 degrades to integer
fs/cifs/cifssmb.c:1886:43: warning: incorrect type in assignment (different base types)
fs/cifs/cifssmb.c:1886:43:    expected unsigned int [unsigned] fl_pid
fs/cifs/cifssmb.c:1886:43:    got restricted __le32 [usertype] pid

In checking new smb2 code for missing endian conversions, I noticed
some endian errors had crept in over the last few releases into the
cifs code (symlink, ntlmssp, posix lock, and also a less problematic warning
in fscache).  A followon patch will address a few smb2 endian
problems.

Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-04-12 00:51:35 +00:00
Steve French
6da9791061 Elminate sparse __CHECK_ENDIAN__ warnings on port conversion
Ports are __be16 not unsigned short int

Eliminates the remaining fixable endian warnings:

~/cifs-2.6$ make modules C=1 M=fs/cifs CF=-D__CHECK_ENDIAN__
  CHECK   fs/cifs/connect.c
fs/cifs/connect.c:2408:23: warning: incorrect type in assignment (different base types)
fs/cifs/connect.c:2408:23:    expected unsigned short *sport
fs/cifs/connect.c:2408:23:    got restricted __be16 *<noident>
fs/cifs/connect.c:2410:23: warning: incorrect type in assignment (different base types)
fs/cifs/connect.c:2410:23:    expected unsigned short *sport
fs/cifs/connect.c:2410:23:    got restricted __be16 *<noident>
fs/cifs/connect.c:2416:24: warning: incorrect type in assignment (different base types)
fs/cifs/connect.c:2416:24:    expected unsigned short [unsigned] [short] <noident>
fs/cifs/connect.c:2416:24:    got restricted __be16 [usertype] <noident>
fs/cifs/connect.c:2423:24: warning: incorrect type in assignment (different base types)
fs/cifs/connect.c:2423:24:    expected unsigned short [unsigned] [short] <noident>
fs/cifs/connect.c:2423:24:    got restricted __be16 [usertype] <noident>
fs/cifs/connect.c:2326:23: warning: incorrect type in assignment (different base types)
fs/cifs/connect.c:2326:23:    expected unsigned short [unsigned] sport
fs/cifs/connect.c:2326:23:    got restricted __be16 [usertype] sin6_port
fs/cifs/connect.c:2330:23: warning: incorrect type in assignment (different base types)
fs/cifs/connect.c:2330:23:    expected unsigned short [unsigned] sport
fs/cifs/connect.c:2330:23:    got restricted __be16 [usertype] sin_port
fs/cifs/connect.c:2394:22: warning: restricted __be16 degrades to integer

Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-04-12 00:49:08 +00:00
Steve French
2e325d5973 Max share size is too small
Max share name was set to 64, and (at least for Windows)
can be 80.

Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-04-12 00:47:14 +00:00
Chris Mason
874d0d2633 Merge branch 'for-chris' of git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-work into for-linus 2011-04-11 20:46:03 -04:00
Arne Jansen
507903b818 btrfs: using cached extent_state in set/unlock combinations
In several places the sequence (set_extent_uptodate, unlock_extent) is used.
This leads to a duplicate lookup of the extent state. This patch lets
set_extent_uptodate return a cached extent_state which can be passed to
unlock_extent_cached.
The occurences of the above sequences are updated to use the cache. Only
end_bio_extent_readpage is updated that it first gets a cached state to
pass it to the readpage_end_io_hook as the prototype requested and is later
on being used for set/unlock.

Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-11 20:45:36 -04:00
Josef Bacik
13c5a93e70 Btrfs: avoid taking the trans_mutex in btrfs_end_transaction
I've been working on making our O_DIRECT latency not suck and I noticed we were
taking the trans_mutex in btrfs_end_transaction.  So to do this we convert
num_writers and use_count to atomic_t's and just decrement them in
btrfs_end_transaction.  Instead of deleting the transaction from the trans list
in put_transaction we do that in btrfs_commit_transaction() since that's the
only time it actually needs to be removed from the list.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-04-11 20:43:52 -04:00
Steve French
8727c8a85f Allow user names longer than 32 bytes
We artificially limited the user name to 32 bytes, but modern servers handle
larger.  Set the maximum length to a reasonable 256, and make the user name
string dynamically allocated rather than a fixed size in session structure.
Also clean up old checkpatch warning.

Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-04-12 00:42:06 +00:00
Jeff Layton
bdf1b03e09 cifs: replace /proc/fs/cifs/Experimental with a module parm
This flag currently only affects whether we allow "zero-copy" writes
with signing enabled. Typically we map pages in the pagecache directly
into the write request. If signing is enabled however and the contents
of the page change after the signature is calculated but before the
write is sent then the signature will be wrong. Servers typically
respond to this by closing down the socket.

Still, this can provide a performance benefit so the "Experimental" flag
was overloaded to allow this. That's really not a good place for this
option however since it's not clear what that flag does.

Move that flag instead to a new module parameter that better describes
its purpose. That's also better since it can be set at module insertion
time by configuring modprobe.d.

Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-04-12 00:40:43 +00:00
Jeff Layton
7797069305 cifs: check for private_data before trying to put it
cifs_close doesn't check that the filp->private_data is non-NULL before
trying to put it. That can cause an oops in certain error conditions
that can occur on open or lookup before the private_data is set.

Reported-by: Ben Greear <greearb@candelatech.com>
CC: Stable <stable@kernel.org>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-04-12 00:39:05 +00:00
Xin Zhong
e15d054242 Btrfs: fix subvolume mount by name problem when default mount subvolume is set
We create two subvolumes (meego_root and meego_home) in
btrfs root directory. And set meego_root as default mount
subvolume. After we remount btrfs, meego_root is mounted
to top directory by default. Then when we try to mount
meego_home (subvol=meego_home) to a subdirectory, it failed.
The problem is when default mount subvolume is set to
meego_root, we search meego_home in meego_root but can not find
it. So the solution is to add a new mount option (subvolrootid)
to specify subvol id of root and search subvol name in it. For
our case, now we can use "-o subvolrootid=0,subvol=meego_home)
to mount meego_home.

Detail information can be found in meego bugzilla:
https://bugs.meego.com/show_bug.cgi?id=15055

Signed-off-by: Zhong, Xin <xin.zhong@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-11 20:26:50 -04:00
Daniel J Blueman
13f2696f1d fix user annotation in ioctl.c
Fix address space annotation correct in ioctl.c.

Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com>

 		       BTRFS_BLOCK_GROUP_SYSTEM,
@@ -2387,7 +2387,7 @@ long btrfs_ioctl_space_info(struct btrfs_root
*root, void __user *arg)
 		up_read(&info->groups_sem);
 	}

-	user_dest = (struct btrfs_ioctl_space_info *)
+	user_dest = (struct btrfs_ioctl_space_info __user *)
 		(arg + sizeof(struct btrfs_ioctl_space_args));

 	if (copy_to_user(user_dest, dest_orig, alloc_size))
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-11 20:25:46 -04:00
Josef Bacik
a1b75f7d96 Btrfs: check for duplicate iov_base's when doing dio reads
Apparently it is ok to submit a read to an IDE device with the same target page
for different offsets.  This is what Windows does under qemu.  The problem is
under DIO we expect them to be different buffers for checksumming reasons, and
so this sort of thing will result in checksum errors, when in reality the file
is fine.  So when reading, check to make sure that all iov bases are different,
and if they aren't fall back to buffered mode, since that will work out right.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-11 20:25:06 -04:00
Sergei Trofimovich
3387206f26 btrfs: properly handle overlapping areas in memmove_extent_buffer
Fix data corruption caused by memcpy() usage on overlapping data.
I've observed it first when found out usermode linux crash on btrfs.

?all chain is the following:
------------[ cut here ]------------
WARNING: at /home/slyfox/linux-2.6/fs/btrfs/extent_io.c:3900 memcpy_extent_buffer+0x1a5/0x219()
Call Trace:
6fa39a58:  [<601b495e>] _raw_spin_unlock_irqrestore+0x18/0x1c
6fa39a68:  [<60029ad9>] warn_slowpath_common+0x59/0x70
6fa39aa8:  [<60029b05>] warn_slowpath_null+0x15/0x17
6fa39ab8:  [<600efc97>] memcpy_extent_buffer+0x1a5/0x219
6fa39b48:  [<600efd9f>] memmove_extent_buffer+0x94/0x208
6fa39bc8:  [<600becbf>] btrfs_del_items+0x214/0x473
6fa39c78:  [<600ce1b0>] btrfs_delete_one_dir_name+0x7c/0xda
6fa39cc8:  [<600dad6b>] __btrfs_unlink_inode+0xad/0x25d
6fa39d08:  [<600d7864>] btrfs_start_transaction+0xe/0x10
6fa39d48:  [<600dc9ff>] btrfs_unlink_inode+0x1b/0x3b
6fa39d78:  [<600e04bc>] btrfs_unlink+0x70/0xef
6fa39dc8:  [<6007f0d0>] vfs_unlink+0x58/0xa3
6fa39df8:  [<60080278>] do_unlinkat+0xd4/0x162
6fa39e48:  [<600517db>] call_rcu_sched+0xe/0x10
6fa39e58:  [<600452a8>] __put_cred+0x58/0x5a
6fa39e78:  [<6007446c>] sys_faccessat+0x154/0x166
6fa39ed8:  [<60080317>] sys_unlink+0x11/0x13
6fa39ee8:  [<60016b80>] handle_syscall+0x58/0x70
6fa39f08:  [<60021377>] userspace+0x2d4/0x381
6fa39fc8:  [<60014507>] fork_handler+0x62/0x69
---[ end trace 70b0ca2ef0266b93 ]---

http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg09302.html

Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-11 20:25:06 -04:00
Yoshinori Sano
8fb27640d0 Btrfs: fix memory leaks in btrfs_new_inode()
This patch fixes memory leaks in btrfs_new_inode().

Signed-off-by: Yoshinori Sano <yoshinori.sano@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-11 20:25:06 -04:00
Linus Torvalds
1e05ff020f Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: use proper interfaces for on-stack plugging
  xfs: fix xfs_debug warnings
  xfs: fix variable set but not used warnings
  xfs: convert log tail checking to a warning
  xfs: catch bad block numbers freeing extents.
  xfs: push the AIL from memory reclaim and periodic sync
  xfs: clean up code layout in xfs_trans_ail.c
  xfs: convert the xfsaild threads to a workqueue
  xfs: introduce background inode reclaim work
  xfs: convert ENOSPC inode flushing to use new syncd workqueue
  xfs: introduce a xfssyncd workqueue
  xfs: fix extent format buffer allocation size
  xfs: fix unreferenced var error in xfs_buf.c

Also, applied patch from Tony Luck that fixes ia64:
  xfs_destroy_workqueues() should not be tagged with__exit
in the branch before merging.
2011-04-11 15:48:57 -07:00
Luck, Tony
39411f81ee xfs_destroy_workqueues() should not be tagged with__exit
ia64 throws away .exit sections for the built-in CONFIG case, so routines
that are used in other circumstances should not be tagged as __exit.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-11 15:47:20 -07:00
Linus Torvalds
a97b52022a Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix data corruption regression by reverting commit 6de9843dab
  ext4: Allow indirect-block file to grow the file size to max file size
  ext4: allow an active handle to be started when freezing
  ext4: sync the directory inode in ext4_sync_parent()
  ext4: init timer earlier to avoid a kernel panic in __save_error_info
  jbd2: fix potential memory leak on transaction commit
  ext4: fix a double free in ext4_register_li_request
  ext4: fix credits computing for indirect mapped files
  ext4: remove unnecessary [cm]time update of quota file
  jbd2: move bdget out of critical section
2011-04-11 15:45:47 -07:00
Linus Torvalds
18770c7c3a Merge branch 'for-2.6.39' of git://linux-nfs.org/~bfields/linux
* 'for-2.6.39' of git://linux-nfs.org/~bfields/linux:
  nfsd4: fix oops on lock failure
  nfsd: fix auth_domain reference leak on nlm operations
2011-04-11 15:45:17 -07:00
Theodore Ts'o
c820563602 ext4: fix data corruption regression by reverting commit 6de9843dab
Revert commit 6de9843dab, since it
caused a data corruption regression with BitTorrent downloads.  Thanks
to Damien for discovering and bisecting to find the problem commit.

https://bugzilla.kernel.org/show_bug.cgi?id=32972

Reported-by: Damien Grassart <damien@grassart.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-04-10 22:30:07 -04:00
Kazuya Mio
f80da1e70f ext4: Allow indirect-block file to grow the file size to max file size
We can create 4402345721856 byte file with indirect block mapping.
However, if we grow an indirect-block file to the size with ftruncate(),
we can see an ext4 warning. The following patch fixes this problem.

How to reproduce:
# dd if=/dev/zero of=/mnt/mp1/hoge bs=1 count=0 seek=4402345721856
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.000221428 s, 0.0 kB/s
# tail -n 1 /var/log/messages
Nov 25 15:10:27 test kernel: EXT4-fs warning (device sda8): ext4_block_to_path:345: block 1074791436 > max in inode 12

Signed-off-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-04-10 22:06:36 -04:00
Yongqiang Yang
be4f27d324 ext4: allow an active handle to be started when freezing
ext4_journal_start_sb() should not prevent an active handle from being
started due to s_frozen.  Otherwise, deadlock is easy to happen, below
is a situation.

================================================
     freeze         |       truncate
================================================
                    |  ext4_ext_truncate()
    freeze_super()  |   starts a handle
    sets s_frozen   |
                    |  ext4_ext_truncate()
                    |  holds i_data_sem
  ext4_freeze()     |
  waits for updates |
                    |  ext4_free_blocks()
                    |  calls dquot_free_block()
                    |
                    |  dquot_free_blocks()
                    |  calls ext4_dirty_inode()
                    |
                    |  ext4_dirty_inode()
                    |  trys to start an active
                    |  handle
                    |
                    |  block due to s_frozen
================================================

Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reported-by: Amir Goldstein <amir73il@users.sf.net>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2011-04-10 22:06:07 -04:00
Curt Wohlgemuth
0893ed458b ext4: sync the directory inode in ext4_sync_parent()
ext4 has taken the stance that, in the absence of a journal,
when an fsync/fdatasync of an inode is done, the parent
directory should be sync'ed if this inode entry is new.
ext4_sync_parent(), which implements this, does indeed sync
the dirent pages for parent directories, but it does not
sync the directory *inode*.  This patch fixes this.

Also now return error status from ext4_sync_parent().

I tested this using a power fail test, which panics a
machine running a file server getting requests from a
client.  Without this patch, on about every other test run,
the server is missing many, many files that had been synced.
With this patch, on > 6 runs, I see zero files being lost.

Google-Bug-Id: 4179519
Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-04-10 22:05:31 -04:00
J. Bruce Fields
23fcf2ec93 nfsd4: fix oops on lock failure
Lock stateid's can have access_bmap 0 if they were only partially
initialized (due to a failed lock request); handle that case in
free_generic_stateid.

------------[ cut here ]------------
kernel BUG at fs/nfsd/nfs4state.c:380!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/kernel/mm/ksm/run
Modules linked in: nfs fscache md4 nls_utf8 cifs ip6table_filter ip6_tables ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat bridge stp llc nfsd lockd nfs_acl auth_rpcgss sunrpc ipv6 ppdev parport_pc parport pcnet32 mii pcspkr microcode i2c_piix4 BusLogic floppy [last unloaded: mperf]

Pid: 1468, comm: nfsd Not tainted 2.6.38+ #120 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
EIP: 0060:[<e24f180d>] EFLAGS: 00010297 CPU: 0
EIP is at nfs4_access_to_omode+0x1c/0x29 [nfsd]
EAX: ffffffff EBX: dd758120 ECX: 00000000 EDX: 00000004
ESI: dd758120 EDI: ddfe657c EBP: dd54dde0 ESP: dd54dde0
 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Process nfsd (pid: 1468, ti=dd54c000 task=ddc92580 task.ti=dd54c000)
Stack:
 dd54ddf0 e24f19ca 00000000 ddfe6560 dd54de08 e24f1a5d dd758130 deee3a20
 ddfe6560 31270000 dd54df1c e24f52fd 0000000f dd758090 e2505dd0 0be304cf
 dbb51d68 0000000e ddfe657c ddcd8020 dd758130 dd758128 dd7580d8 dd54de68
Call Trace:
 [<e24f19ca>] free_generic_stateid+0x1c/0x3e [nfsd]
 [<e24f1a5d>] release_lockowner+0x71/0x8a [nfsd]
 [<e24f52fd>] nfsd4_lock+0x617/0x66c [nfsd]
 [<e24e57b6>] ? nfsd_setuser+0x199/0x1bb [nfsd]
 [<e24e056c>] ? nfsd_setuser_and_check_port+0x65/0x81 [nfsd]
 [<c07a0052>] ? _cond_resched+0x8/0x1c
 [<c04ca61f>] ? slab_pre_alloc_hook.clone.33+0x23/0x27
 [<c04cac01>] ? kmem_cache_alloc+0x1a/0xd2
 [<c04835a0>] ? __call_rcu+0xd7/0xdd
 [<e24e0dfb>] ? fh_verify+0x401/0x452 [nfsd]
 [<e24f0b61>] ? nfsd4_encode_operation+0x52/0x117 [nfsd]
 [<e24ea0d7>] ? nfsd4_putfh+0x33/0x3b [nfsd]
 [<e24f4ce6>] ? nfsd4_delegreturn+0xd4/0xd4 [nfsd]
 [<e24ea2c9>] nfsd4_proc_compound+0x1ea/0x33e [nfsd]
 [<e24de6ee>] nfsd_dispatch+0xd1/0x1a5 [nfsd]
 [<e1d6e1c7>] svc_process_common+0x282/0x46f [sunrpc]
 [<e1d6e578>] svc_process+0xdc/0xfa [sunrpc]
 [<e24de0fa>] nfsd+0xd6/0x115 [nfsd]
 [<e24de024>] ? nfsd_shutdown+0x24/0x24 [nfsd]
 [<c0454322>] kthread+0x62/0x67
 [<c04542c0>] ? kthread_worker_fn+0x114/0x114
 [<c07a6ebe>] kernel_thread_helper+0x6/0x10
Code: eb 05 b8 00 00 27 4f 8d 65 f4 5b 5e 5f 5d c3 83 e0 03 55 83 f8 02 89 e5 74 17 83 f8 03 74 05 48 75 09 eb 09 b8 02 00 00 00 eb 0b <0f> 0b 31 c0 eb 05 b8 01 00 00 00 5d c3 55 89 e5 57 56 89 d6 8d
EIP: [<e24f180d>] nfs4_access_to_omode+0x1c/0x29 [nfsd] SS:ESP 0068:dd54dde0
---[ end trace 2b0bf6c6557cb284 ]---

The trace route is:

 -> nfsd4_lock()
   -> if (lock->lk_is_new) {
     -> alloc_init_lock_stateid()

        3739: stp->st_access_bmap = 0;

   ->if (status && lock->lk_is_new && lock_sop)
     -> release_lockowner()
      -> free_generic_stateid()
       -> nfs4_access_bmap_to_omode()
          -> nfs4_access_to_omode()

        380: BUG();   *****

This problem was introduced by 0997b17360.

Reported-by: Mi Jinlong <mijinlong@cn.fujitsu.com>
Tested-by: Mi Jinlong <mijinlong@cn.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-04-10 12:21:27 -04:00
Linus Torvalds
94c8a984ae Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
  NFS: Change initial mount authflavor only when server returns NFS4ERR_WRONGSEC
  NFS: Fix a signed vs. unsigned secinfo bug
  Revert "net/sunrpc: Use static const char arrays"
2011-04-08 11:47:35 -07:00
Josef Bacik
93a54bc4c2 Btrfs: check for duplicate iov_base's when doing dio reads
Apparently it is ok to submit a read to an IDE device with the same target page
for different offsets.  This is what Windows does under qemu.  The problem is
under DIO we expect them to be different buffers for checksumming reasons, and
so this sort of thing will result in checksum errors, when in reality the file
is fine.  So when reading, check to make sure that all iov bases are different,
and if they aren't fall back to buffered mode, since that will work out right.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-04-08 13:00:43 -04:00
Josef Bacik
16d299ac74 Btrfs: reuse the extent_map we found when calling btrfs_get_extent
In btrfs_get_block_direct we call btrfs_get_extent to lookup the extent for the
range that we are looking for.  If we don't find an extent, btrfs_get_extent
will insert a extent_map for that area and mark it as a hole.  So it does the
job of allocating a new extent map and inserting it into the io tree.  But if
we're creating a new extent we free it up and redo all of that work.  So instead
pass the em to btrfs_new_extent_direct(), and if it will work just allocate the
disk space and set it up properly and bypass the freeing/allocating of a new
extent map and the expensive operation of inserting the thing into the io_tree.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-04-08 13:00:41 -04:00
Josef Bacik
1ae3993825 Btrfs: do not use async submit for small DIO io's
When looking at our DIO performance Chris said that for small IO's doing the
async submit stuff tends to be more overhead than it's worth.  With this on top
of my other fixes I get about a 17-20% speedup doing a sequential dd with 4k
IO's.  Basically if we don't have to split the bio for the map length it's small
enough to be directly submitted, otherwise go back to the async submit.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-04-08 13:00:39 -04:00
Josef Bacik
02f57c7aed Btrfs: don't split dio bios if we don't have to
We have been unconditionally allocating a new bio and re-adding all pages from
our original bio to the new bio.  This is needed if our original bio is larger
than our stripe size, but if it is smaller than the stripe size then there is no
need to do this.  So check the map length and if we are under that then go ahead
and submit the original bio.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-04-08 13:00:37 -04:00
Josef Bacik
1ef30be142 Btrfs: do not call btrfs_update_inode in endio if nothing changed
In the DIO code we often don't update the i_disk_size because the i_size isn't
updated until after the DIO is completed, so basically we are allocating a path,
doing a search, and updating the inode item for no reason since nothing changed.
btrfs_ordered_update_i_size will return 1 if it didn't update i_disk_size, so
only run btrfs_update_inode if btrfs_ordered_update_i_size returns 0.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-04-08 13:00:36 -04:00
Josef Bacik
12ddb96cb6 Btrfs: map the inode item when doing fill_inode_item
Instead of calling kmap_atomic for every thing we set in the inode item, map the
entire inode item at the start and unmap it at the end.  This makes a sequential
dd of 400mb O_DIRECT something like 1% faster.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-04-08 13:00:34 -04:00
Josef Bacik
06d5a5899d Btrfs: only retry transaction reservation once
I saw a lockup where we kept getting into this start transaction->commit
transaction loop because of enospce.  The fact is if we fail to make our
reservation, we've tried _everything_ several times, so we only need to try and
commit the transaction once, and if that doesn't work then we really are out of
space and need to just exit.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-04-08 13:00:32 -04:00
Josef Bacik
be1a12a0df Btrfs: deal with the case that we run out of space in the cache
Currently we don't handle running out of space in the cache, so to fix this we
keep track of how far in the cache we are.  Then we only dirty the pages if we
successfully modify all of them, otherwise if we have an error or run out of
space we can just drop them and not worry about the vm writing them out.
Thanks,

Tested-by Johannes Hirte <johannes.hirte@fem.tu-ilmenau.de>
Signed-off-by: Josef Bacik <josef@redhat.com>
2011-04-08 13:00:27 -04:00
Linus Torvalds
3d762ca1cd Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
  quota: Don't write quota info in dquot_commit()
  ext3: Fix writepage credits computation for ordered mode
2011-04-08 07:35:17 -07:00
Christoph Hellwig
a1b7ea5d58 xfs: use proper interfaces for on-stack plugging
Add proper blk_start_plug/blk_finish_plug pairs for the two places where
we issue buffer I/O, and remove the blk_flush_plug in xfs_buf_lock and
xfs_buf_iowait, given that context switches already flush the per-process
plugging lists.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2011-04-08 08:09:28 -05:00
Christoph Hellwig
957935dcd8 xfs: fix xfs_debug warnings
For a CONFIG_XFS_DEBUG=n build gcc complains about statements with no
effect in xfs_debug:

fs/xfs/quota/xfs_qm_syscalls.c: In function 'xfs_qm_scall_trunc_qfiles':
fs/xfs/quota/xfs_qm_syscalls.c:291:3: warning: statement with no effect

The reason for that is that the various new xfs message functions have a
return value which is never used, and in case of the non-debug build
xfs_debug the macro evaluates to a plain 0 which produces the above
warnings.  This can be fixed by turning xfs_debug into an inline function
instead of a macro, but in addition to that I've also changed all the
message helpers to return void as we never use their return values.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2011-04-08 08:09:24 -05:00
Christoph Hellwig
ecb697c16c xfs: fix variable set but not used warnings
GCC 4.6 now warnings about variables set but not used.  Fix the trivially
fixable warnings of this sort.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2011-04-08 08:09:12 -05:00
Dave Chinner
da8a1a4a4d xfs: convert log tail checking to a warning
On the Power platform, the log tail debug checks fire excessively
causing the system to panic early in testing. The debug checks are
known to be racy, though on x86_64 there is no evidence that they
trigger at all.

We want to keep the checks active on debug systems to alert us to
problems with log space accounting, but we need to reduce the impact
of a racy check on testing on the Power platform.

As a result, convert the ASSERT conditions to warnings, and
allow them to fire only once per filesystem mount. This will prevent
false positives from interfering with testing, whilst still
providing us with the indication that they may be a problem with log
space accounting should that occur.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
2011-04-08 12:45:07 +10:00
Dave Chinner
be65b18a10 xfs: catch bad block numbers freeing extents.
A fuzzed filesystem crashed a kernel when freeing an extent with a
block number beyond the end of the filesystem. Convert all the debug
asserts in xfs_free_extent() to active checks so that we catch bad
extents and return that the filesytsem is corrupted rather than
crashing.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
2011-04-08 12:45:07 +10:00
Dave Chinner
fd074841cf xfs: push the AIL from memory reclaim and periodic sync
When we are short on memory, we want to expedite the cleaning of
dirty objects.  Hence when we run short on memory, we need to kick
the AIL flushing into action to clean as many dirty objects as
quickly as possible.  To implement this, sample the lsn of the log
item at the head of the AIL and use that as the push target for the
AIL flush.

Further, we keep items in the AIL that are dirty that are not
tracked any other way, so we can get objects sitting in the AIL that
don't get written back until the AIL is pushed. Hence to get the
filesystem to the idle state, we might need to push the AIL to flush
out any remaining dirty objects sitting in the AIL. This requires
the same push mechanism as the reclaim push.

This patch also renames xfs_trans_ail_tail() to xfs_ail_min_lsn() to
match the new xfs_ail_max_lsn() function introduced in this patch.
Similarly for xfs_trans_ail_push -> xfs_ail_push.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
2011-04-08 12:45:07 +10:00
Dave Chinner
cd4a3c503c xfs: clean up code layout in xfs_trans_ail.c
This patch rearranges the location of functions in xfs_trans_ail.c
to remove the need for forward declarations of those functions in
preparation for adding new functions without the need for forward
declarations.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
2011-04-08 12:45:07 +10:00
Dave Chinner
0bf6a5bd4b xfs: convert the xfsaild threads to a workqueue
Similar to the xfssyncd, the per-filesystem xfsaild threads can be
converted to a global workqueue and run periodically by delayed
works. This makes sense for the AIL pushing because it uses
variable timeouts depending on the work that needs to be done.

By removing the xfsaild, we simplify the AIL pushing code and
remove the need to spread the code to implement the threading
and pushing across multiple files.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
2011-04-08 12:45:07 +10:00
Dave Chinner
a7b339f1b8 xfs: introduce background inode reclaim work
Background inode reclaim needs to run more frequently that the XFS
syncd work is run as 30s is too long between optimal reclaim runs.
Add a new periodic work item to the xfs syncd workqueue to run a
fast, non-blocking inode reclaim scan.

Background inode reclaim is kicked by the act of marking inodes for
reclaim.  When an AG is first marked as having reclaimable inodes,
the background reclaim work is kicked. It will continue to run
periodically untill it detects that there are no more reclaimable
inodes. It will be kicked again when the first inode is queued for
reclaim.

To ensure shrinker based inode reclaim throttles to the inode
cleaning and reclaim rate but still reclaim inodes efficiently, make it kick the
background inode reclaim so that when we are low on memory we are
trying to reclaim inodes as efficiently as possible. This kick shoul
d not be necessary, but it will protect against failures to kick the
background reclaim when inodes are first dirtied.

To provide the rate throttling, make the shrinker pass do
synchronous inode reclaim so that it blocks on inodes under IO. This
means that the shrinker will reclaim inodes rather than just
skipping over them, but it does not adversely affect the rate of
reclaim because most dirty inodes are already under IO due to the
background reclaim work the shrinker kicked.

These two modifications solve one of the two OOM killer invocations
Chris Mason reported recently when running a stress testing script.
The particular workload trigger for the OOM killer invocation is
where there are more threads than CPUs all unlinking files in an
extremely memory constrained environment. Unlike other solutions,
this one does not have a performance impact on performance when
memory is not constrained or the number of concurrent threads
operating is <= to the number of CPUs.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
2011-04-08 12:45:07 +10:00
Dave Chinner
89e4cb550a xfs: convert ENOSPC inode flushing to use new syncd workqueue
On of the problems with the current inode flush at ENOSPC is that we
queue a flush per ENOSPC event, regardless of how many are already
queued. Thi can result in    hundreds of queued flushes, most of
which simply burn CPU scanned and do no real work. This simply slows
down allocation at ENOSPC.

We really only need one active flush at a time, and we can easily
implement that via the new xfs_syncd_wq. All we need to do is queue
a flush if one is not already active, then block waiting for the
currently active flush to complete. The result is that we only ever
have a single ENOSPC inode flush active at a time and this greatly
reduces the overhead of ENOSPC processing.

On my 2p test machine, this results in tests exercising ENOSPC
conditions running significantly faster - 042 halves execution time,
083 drops from 60s to 5s, etc - while not introducing test
regressions.

This allows us to remove the old xfssyncd threads and infrastructure
as they are no longer used.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
2011-04-08 12:45:07 +10:00
Dave Chinner
c6d09b666d xfs: introduce a xfssyncd workqueue
All of the work xfssyncd does is background functionality. There is
no need for a thread per filesystem to do this work - it can al be
managed by a global workqueue now they manage concurrency
effectively.

Introduce a new gglobal xfssyncd workqueue, and convert the periodic
work to use this new functionality. To do this, use a delayed work
construct to schedule the next running of the periodic sync work
for the filesystem. When the sync work is complete, queue a new
delayed work for the next running of the sync work.

For laptop mode, we wait on completion for the sync works, so ensure
that the sync work queuing interface can flush and wait for work to
complete to enable the work queue infrastructure to replace the
current sequence number and wakeup that is used.

Because the sync work does non-trivial amounts of work, mark the
new work queue as CPU intensive.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
2011-04-08 12:45:07 +10:00
Dave Chinner
e828776a8a xfs: fix extent format buffer allocation size
When formatting an inode item, we have to allocate a separate buffer
to hold extents when there are delayed allocation extents on the
inode and it is in extent format. The allocation size is derived
from the in-core data fork representation, which accounts for
delayed allocation extents, while the on-disk representation does
not contain any delalloc extents.

As a result of this mismatch, the allocated buffer can be far larger
than needed to hold the real extent list which, due to the fact the
inode is in extent format, is limited to the size of the literal
area of the inode. However, we can have thousands of delalloc
extents, resulting in an allocation size orders of magnitude larger
than is needed to hold all the real extents.

Fix this by limiting the size of the buffer being allocated to the
size of the literal area of the inodes in the filesystem (i.e. the
maximum size an inode fork can grow to).

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
2011-04-08 12:45:07 +10:00