Commit Graph

20667 Commits (255614c45943d43a3778a04b214692346b9d5049)

Author SHA1 Message Date
Nick Piggin 31e6b01f41 fs: rcu-walk for path lookup
Perform common cases of path lookups without any stores or locking in the
ancestor dentry elements. This is called rcu-walk, as opposed to the current
algorithm which is a refcount based walk, or ref-walk.

This results in far fewer atomic operations on every path element,
significantly improving path lookup performance. It also avoids cacheline
bouncing on common dentries, significantly improving scalability.

The overall design is like this:
* LOOKUP_RCU is set in nd->flags, which distinguishes rcu-walk from ref-walk.
* Take the RCU lock for the entire path walk, starting with the acquiring
  of the starting path (eg. root/cwd/fd-path). So now dentry refcounts are
  not required for dentry persistence.
* synchronize_rcu is called when unregistering a filesystem, so we can
  access d_ops and i_ops during rcu-walk.
* Similarly take the vfsmount lock for the entire path walk. So now mnt
  refcounts are not required for persistence. Also we are free to perform mount
  lookups, and to assume dentry mount points and mount roots are stable up and
  down the path.
* Have a per-dentry seqlock to protect the dentry name, parent, and inode,
  so we can load this tuple atomically, and also check whether any of its
  members have changed.
* Dentry lookups (based on parent, candidate string tuple) recheck the parent
  sequence after the child is found in case anything changed in the parent
  during the path walk.
* inode is also RCU protected so we can load d_inode and use the inode for
  limited things.
* i_mode, i_uid, i_gid can be tested for exec permissions during path walk.
* i_op can be loaded.

When we reach the destination dentry, we lock it, recheck lookup sequence,
and increment its refcount and mountpoint refcount. RCU and vfsmount locks
are dropped. This is termed "dropping rcu-walk". If the dentry refcount does
not match, we can not drop rcu-walk gracefully at the current point in the
lokup, so instead return -ECHILD (for want of a better errno). This signals the
path walking code to re-do the entire lookup with a ref-walk.

Aside from the final dentry, there are other situations that may be encounted
where we cannot continue rcu-walk. In that case, we drop rcu-walk (ie. take
a reference on the last good dentry) and continue with a ref-walk. Again, if
we can drop rcu-walk gracefully, we return -ECHILD and do the whole lookup
using ref-walk. But it is very important that we can continue with ref-walk
for most cases, particularly to avoid the overhead of double lookups, and to
gain the scalability advantages on common path elements (like cwd and root).

The cases where rcu-walk cannot continue are:
* NULL dentry (ie. any uncached path element)
* parent with d_inode->i_op->permission or ACLs
* dentries with d_revalidate
* Following links

In future patches, permission checks and d_revalidate become rcu-walk aware. It
may be possible eventually to make following links rcu-walk aware.

Uncached path elements will always require dropping to ref-walk mode, at the
very least because i_mutex needs to be grabbed, and objects allocated.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:27 +11:00
Nick Piggin ff0c7d15f9 fs: avoid inode RCU freeing for pseudo fs
Pseudo filesystems that don't put inode on RCU list or reachable by
rcu-walk dentries do not need to RCU free their inodes.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:26 +11:00
Nick Piggin fa0d7e3de6 fs: icache RCU free inodes
RCU free the struct inode. This will allow:

- Subsequent store-free path walking patch. The inode must be consulted for
  permissions when walking, so an RCU inode reference is a must.
- sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
  to take i_lock no longer need to take sb_inode_list_lock to walk the list in
  the first place. This will simplify and optimize locking.
- Could remove some nested trylock loops in dcache code
- Could potentially simplify things a bit in VM land. Do not need to take the
  page lock to follow page->mapping.

The downsides of this is the performance cost of using RCU. In a simple
creat/unlink microbenchmark, performance drops by about 10% due to inability to
reuse cache-hot slab objects. As iterations increase and RCU freeing starts
kicking over, this increases to about 20%.

In cases where inode lifetimes are longer (ie. many inodes may be allocated
during the average life span of a single inode), a lot of this cache reuse is
not applicable, so the regression caused by this patch is smaller.

The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
however this adds some complexity to list walking and store-free path walking,
so I prefer to implement this at a later date, if it is shown to be a win in
real situations. I haven't found a regression in any non-micro benchmark so I
doubt it will be a problem.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:26 +11:00
Nick Piggin 77812a1ef1 fs: consolidate dentry kill sequence
The tricky locking for disposing of a dentry is duplicated 3 times in the
dcache (dput, pruning a dentry from the LRU, and pruning its ancestors).
Consolidate them all into a single function dentry_kill.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:25 +11:00
Nick Piggin ec33679d78 fs: use RCU in shrink_dentry_list to reduce lock nesting
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:25 +11:00
Nick Piggin be182bff72 fs: reduce dcache_inode_lock width in lru scanning
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:25 +11:00
Nick Piggin 89e6054836 fs: dcache reduce prune_one_dentry locking
prune_one_dentry can avoid quite a bit of locking in the common case where
ancestors have an elevated refcount. Alternatively, we could have gone the
other way and made fewer trylocks in the case where d_count goes to zero, but
is probably less common.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:25 +11:00
Nick Piggin a734eb458a fs: dcache reduce d_parent locking
Use RCU to simplify locking in dget_parent.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:24 +11:00
Nick Piggin dc0474be3e fs: dcache rationalise dget variants
dget_locked was a shortcut to avoid the lazy lru manipulation when we already
held dcache_lock (lru manipulation was relatively cheap at that point).
However, how that the lru lock is an innermost one, we never hold it at any
caller, so the lock cost can now be avoided. We already have well working lazy
dcache LRU, so it should be fine to defer LRU manipulations to scan time.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:24 +11:00
Nick Piggin 357f8e658b fs: dcache reduce dcache_inode_lock
dcache_inode_lock can be avoided in d_delete() and d_materialise_unique()
in cases where it is not required.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:24 +11:00
Nick Piggin 89ad485f01 fs: dcache reduce locking in d_alloc
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:24 +11:00
Nick Piggin 61f3dee4af fs: dcache reduce dput locking
It is possible to run dput without taking data structure locks up-front. In
many cases where we don't kill the dentry anyway, these locks are not required.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:23 +11:00
Nick Piggin 58db63d086 fs: dcache avoid starvation in dcache multi-step operations
Long lived dcache "multi-step" operations which retry on rename seq can
be starved with a lot of rename activity. If they fail after the 1st pass,
take the rename_lock for writing to avoid further starvation.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:23 +11:00
Nick Piggin b5c84bf6f6 fs: dcache remove dcache_lock
dcache_lock no longer protects anything. remove it.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:23 +11:00
Nick Piggin 949854d024 fs: Use rename lock and RCU for multi-step operations
The remaining usages for dcache_lock is to allow atomic, multi-step read-side
operations over the directory tree by excluding modifications to the tree.
Also, to walk in the leaf->root direction in the tree where we don't have
a natural d_lock ordering.

This could be accomplished by taking every d_lock, but this would mean a
huge number of locks and actually gets very tricky.

Solve this instead by using the rename seqlock for multi-step read-side
operations, retry in case of a rename so we don't walk up the wrong parent.
Concurrent dentry insertions are not serialised against.  Concurrent deletes
are tricky when walking up the directory: our parent might have been deleted
when dropping locks so also need to check and retry for that.

We can also use the rename lock in cases where livelock is a worry (and it
is introduced in subsequent patch).

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:22 +11:00
Nick Piggin 9abca36087 fs: increase d_name lock coverage
Cover d_name with d_lock in more cases, where there may be concurrent
modification to it.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:22 +11:00
Nick Piggin b23fb0a603 fs: scale inode alias list
Add a new lock, dcache_inode_lock, to protect the inode's i_dentry list
from concurrent modification. d_alias is also protected by d_lock.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:22 +11:00
Nick Piggin 2fd6b7f507 fs: dcache scale subdirs
Protect d_subdirs and d_child with d_lock, except in filesystems that aren't
using dcache_lock for these anyway (eg. using i_mutex).

Note: if we change the locking rule in future so that ->d_child protection is
provided only with ->d_parent->d_lock, it may allow us to reduce some locking.
But it would be an exception to an otherwise regular locking scheme, so we'd
have to see some good results. Probably not worthwhile.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:21 +11:00
Nick Piggin da5029563a fs: dcache scale d_unhashed
Protect d_unhashed(dentry) condition with d_lock. This means keeping
DCACHE_UNHASHED bit in synch with hash manipulations.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:21 +11:00
Nick Piggin b7ab39f631 fs: dcache scale dentry refcount
Make d_count non-atomic and protect it with d_lock. This allows us to ensure a
0 refcount dentry remains 0 without dcache_lock. It is also fairly natural when
we start protecting many other dentry members with d_lock.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:21 +11:00
Nick Piggin 2304450783 fs: dcache scale lru
Add a new lock, dcache_lru_lock, to protect the dcache LRU list from concurrent
modification. d_lru is also protected by d_lock, which allows LRU lists to be
accessed without the lru lock, using RCU in future patches.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:20 +11:00
Nick Piggin 789680d1ee fs: dcache scale hash
Add a new lock, dcache_hash_lock, to protect the dcache hash table from
concurrent modification. d_hash is also protected by d_lock.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:20 +11:00
Nick Piggin ec2447c278 hostfs: simplify locking
Remove dcache_lock locking from hostfs filesystem, and move it into dcache
helpers. All that is required is a coherent path name. Protection from
concurrent modification of the namespace after path name generation is not
provided in current code, because dcache_lock is dropped before the path is
used.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:20 +11:00
Nick Piggin b1e6a015a5 fs: change d_hash for rcu-walk
Change d_hash so it may be called from lock-free RCU lookups. See similar
patch for d_compare for details.

For in-tree filesystems, this is just a mechanical change.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:20 +11:00
Nick Piggin 621e155a35 fs: change d_compare for rcu-walk
Change d_compare so it may be called from lock-free RCU lookups. This
does put significant restrictions on what may be done from the callback,
however there don't seem to have been any problems with in-tree fses.
If some strange use case pops up that _really_ cannot cope with the
rcu-walk rules, we can just add new rcu-unaware callbacks, which would
cause name lookup to drop out of rcu-walk mode.

For in-tree filesystems, this is just a mechanical change.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:19 +11:00
Nick Piggin fb2d5b86af fs: name case update method
smpfs and ncpfs want to update a live dentry name in-place. Rather than
have them open code the locking, provide a documented dcache API.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:19 +11:00
Nick Piggin 2bc334dcc7 jfs: dont overwrite dentry name in d_revalidate
Use vfat's method for dealing with negative dentries to preserve case,
rather than overwrite dentry name in d_revalidate, which is a bit ugly
and also gets in the way of doing lock-free path walking.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:18 +11:00
Nick Piggin 79eb4dde74 cifs: dont overwrite dentry name in d_revalidate
Use vfat's method for dealing with negative dentries to preserve case,
rather than overwrite dentry name in d_revalidate, which is a bit ugly
and also gets in the way of doing lock-free path walking.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:18 +11:00
Nick Piggin fe15ce446b fs: change d_delete semantics
Change d_delete from a dentry deletion notification to a dentry caching
advise, more like ->drop_inode. Require it to be constant and idempotent,
and not take d_lock. This is how all existing filesystems use the callback
anyway.

This makes fine grained dentry locking of dput and dentry lru scanning
much simpler.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:18 +11:00
Nick Piggin fbc8d4c046 config fs: avoid switching ->d_op on live dentry
Switching d_op on a live dentry is racy in general, so avoid it. In this case
it is a negative dentry, which is safer, but there are still concurrent ops
which may be called on d_op in that case (eg. d_revalidate). So in general
a filesystem may not do this. Fix configfs so as not to do this.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:17 +11:00
Nick Piggin 3e880fb5e4 fs: use fast counters for vfs caches
percpu_counter library generates quite nasty code, so unless you need
to dynamically allocate counters or take fast approximate value, a
simple per cpu set of counters is much better.

The percpu_counter can never be made to work as well, because it has an
indirection from pointer to percpu memory, and it can't use direct
this_cpu_inc interfaces because it doesn't use static PER_CPU data, so
code will always be worse.

In the fastpath, it is the difference between this:

        incl %gs:nr_dentry      # nr_dentry

and this:

        movl    percpu_counter_batch(%rip), %edx        # percpu_counter_batch,
        movl    $1, %esi        #,
        movq    $nr_dentry, %rdi        #,
        call    __percpu_counter_add    # (plus I clobber registers)

__percpu_counter_add:
        pushq   %rbp    #
        movq    %rsp, %rbp      #,
        subq    $32, %rsp       #,
        movq    %rbx, -24(%rbp) #,
        movq    %r12, -16(%rbp) #,
        movq    %r13, -8(%rbp)  #,
        movq    %rdi, %rbx      # fbc, fbc
#APP
# 216 "/home/npiggin/usr/src/linux-2.6/arch/x86/include/asm/thread_info.h" 1
        movq %gs:kernel_stack,%rax      #, pfo_ret__
# 0 "" 2
#NO_APP
        incl    -8124(%rax)     # <variable>.preempt_count
        movq    32(%rdi), %r12  # <variable>.counters, tcp_ptr__
#APP
# 78 "lib/percpu_counter.c" 1
        add %gs:this_cpu_off, %r12      # this_cpu_off, tcp_ptr__
# 0 "" 2
#NO_APP
        movslq  (%r12),%r13     #* tcp_ptr__, tmp73
        movslq  %edx,%rax       # batch, batch
        addq    %rsi, %r13      # amount, count
        cmpq    %rax, %r13      # batch, count
        jge     .L27    #,
        negl    %edx    # tmp76
        movslq  %edx,%rdx       # tmp76, tmp77
        cmpq    %rdx, %r13      # tmp77, count
        jg      .L28    #,
.L27:
        movq    %rbx, %rdi      # fbc,
        call    _raw_spin_lock  #
        addq    %r13, 8(%rbx)   # count, <variable>.count
        movq    %rbx, %rdi      # fbc,
        movl    $0, (%r12)      #,* tcp_ptr__
        call    _raw_spin_unlock        #
.L29:
#APP
# 216 "/home/npiggin/usr/src/linux-2.6/arch/x86/include/asm/thread_info.h" 1
        movq %gs:kernel_stack,%rax      #, pfo_ret__
# 0 "" 2
#NO_APP
        decl    -8124(%rax)     # <variable>.preempt_count
        movq    -8136(%rax), %rax       #, D.14625
        testb   $8, %al #, D.14625
        jne     .L32    #,
.L31:
        movq    -24(%rbp), %rbx #,
        movq    -16(%rbp), %r12 #,
        movq    -8(%rbp), %r13  #,
        leave
        ret
        .p2align 4,,10
        .p2align 3
.L28:
        movl    %r13d, (%r12)   # count,*
        jmp     .L29    #
.L32:
        call    preempt_schedule        #
        .p2align 4,,6
        jmp     .L31    #
        .size   __percpu_counter_add, .-__percpu_counter_add
        .p2align 4,,15

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:17 +11:00
Nick Piggin 86c8749ede vfs: revert per-cpu nr_unused counters for dentry and inodes
The nr_unused counters count the number of objects on an LRU, and as such they
are synchronized with LRU object insertion and removal and scanning, and
protected under the LRU lock.

Making it per-cpu does not actually get any concurrency improvements because of
this lock, and summing the counter is much slower, and
incrementing/decrementing it costs more code size and is slower too.

These counters should stay per-LRU, which currently means global.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:17 +11:00
Nick Piggin 786a5e15b6 fs: d_validate fixes
d_validate has been broken for a long time.

kmem_ptr_validate does not guarantee that a pointer can be dereferenced
if it can go away at any time. Even rcu_read_lock doesn't help, because
the pointer might be queued in RCU callbacks but not executed yet.

So the parent cannot be checked, nor the name hashed. The dentry pointer
can not be touched until it can be verified under lock. Hashing simply
cannot be used.

Instead, verify the parent/child relationship by traversing parent's
d_child list. It's slow, but only ncpfs and the destaged smbfs care
about it, at this point.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:16 +11:00
Linus Torvalds 9e9bc97367 Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-2.6
* 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-2.6: (255 commits)
  [media] radio-aimslab.c: Fix gcc 4.5+ bug
  [media] cx25821: Fix compilation breakage due to BKL dependency
  [media] v4l2-compat-ioctl32: fix compile warning
  [media] zoran: fix compiler warning
  [media] tda18218: fix compile warning
  [media] ngene: fix compile warning
  [media] DVB: IR support for TechnoTrend CT-3650
  [media] cx23885, cimax2.c: Fix case of two CAM insertion irq
  [media] ir-nec-decoder: fix repeat key issue
  [media] staging: se401 depends on USB
  [media] staging: usbvideo/vicam depends on USB
  [media] soc_camera: Add the ability to bind regulators to soc_camedra devices
  [media] V4L2: Add a v4l2-subdev (soc-camera) driver for OmniVision OV2640 sensor
  [media] v4l: soc-camera: switch to .unlocked_ioctl
  [media] v4l: ov772x: simplify pointer dereference
  [media] ov9640: fix OmniVision OV9640 sensor driver's priv data retrieving
  [media] ov9640: use macro to request OmniVision OV9640 sensor private data
  [media] ivtv-i2c: Fix two warnings
  [media] staging/lirc: Update lirc TODO files
  [media] cx88: Remove the obsolete i2c_adapter.id field
  ...
2011-01-06 18:32:12 -08:00
Linus Torvalds 3c0cb7c31c Merge branch 'devel' of master.kernel.org:/home/rmk/linux-2.6-arm
* 'devel' of master.kernel.org:/home/rmk/linux-2.6-arm: (416 commits)
  ARM: DMA: add support for DMA debugging
  ARM: PL011: add DMA burst threshold support for ST variants
  ARM: PL011: Add support for transmit DMA
  ARM: PL011: Ensure IRQs are disabled in UART interrupt handler
  ARM: PL011: Separate hardware FIFO size from TTY FIFO size
  ARM: PL011: Allow better handling of vendor data
  ARM: PL011: Ensure error flags are clear at startup
  ARM: PL011: include revision number in boot-time port printk
  ARM: vexpress: add sched_clock() for Versatile Express
  ARM i.MX53: Make MX53 EVK bootable
  ARM i.MX53: Some bug fix about MX53 MSL code
  ARM: 6607/1: sa1100: Update platform device registration
  ARM: 6606/1: sa1100: Fix platform device registration
  ARM i.MX51: rename IPU irqs
  ARM i.MX51: Add ipu clock support
  ARM: imx/mx27_3ds: Add PMIC support
  ARM: DMA: Replace page_to_dma()/dma_to_page() with pfn_to_dma()/dma_to_pfn()
  mx51: fix usb clock support
  MX51: Add support for usb host 2
  arch/arm/plat-mxc/ehci.c: fix errors/typos
  ...
2011-01-06 16:50:35 -08:00
Pavel Shilovsky 7e12eddb73 CIFS: Simplify cifs_open code
Make the code more general for use in posix and non-posix open.

Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-06 19:07:54 +00:00
Pavel Shilovsky eeb910a6d4 CIFS: Simplify non-posix open stuff (try #2)
Delete cifs_open_inode_helper and move non-posix open related things
to cifs_nt_open function.

Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-06 19:07:53 +00:00
Pavel Shilovsky 4b886136df CIFS: Add match_port check during looking for an existing connection (try #4)
If we have a share mounted by non-standard port and try to mount another share
on the same host with standard port, we connect to the first share again -
that's wrong. This patch fixes this bug.

Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-06 19:07:53 +00:00
Pavel Shilovsky a9f1b85e5b CIFS: Simplify ipv*_connect functions into one (try #4)
Make connect logic more ip-protocol independent and move RFC1001 stuff into
a separate function. Also replace union addr in TCP_Server_Info structure
with sockaddr_storage.

Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-and-Tested-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-06 19:07:53 +00:00
Shirish Pargaonkar df8fbc241a cifs: Support NTLM2 session security during NTLMSSP authentication [try #5]
Indicate to the server a capability of NTLM2 session security (NTLM2 Key)
during ntlmssp protocol exchange in one of the bits of the flags field.
If server supports this capability, send NTLM2 key even if signing is not
required on the server.

If the server requires signing, the session keys exchanged for NTLMv2
and NTLM2 session security in auth packet of the nlmssp exchange are same.

Send the same flags in authenticate message (type 3) that client sent in
negotiate message (type 1).

Remove function setup_ntlmssp_neg_req

Make sure ntlmssp negotiate and authenticate messages are zero'ed
before they are built.

Reported-and-Tested-by: Robbert Kouprie <robbert@exx.nl>
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-06 19:07:52 +00:00
Nick Piggin 262f86adcc cifs: don't overwrite dentry name in d_revalidate
Instead, use fatfs's method for dealing with negative dentries to
preserve case, rather than overwrite dentry name in d_revalidate, which
is a bit ugly and also gets in the way of doing lock-free path walking.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-01-06 19:07:52 +00:00
Linus Torvalds 65b2074f84 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (30 commits)
  sched: Change wait_for_completion_*_timeout() to return a signed long
  sched, autogroup: Fix reference leak
  sched, autogroup: Fix potential access to freed memory
  sched: Remove redundant CONFIG_CGROUP_SCHED ifdef
  sched: Fix interactivity bug by charging unaccounted run-time on entity re-weight
  sched: Move periodic share updates to entity_tick()
  printk: Use this_cpu_{read|write} api on printk_pending
  sched: Make pushable_tasks CONFIG_SMP dependant
  sched: Add 'autogroup' scheduling feature: automated per session task groups
  sched: Fix unregister_fair_sched_group()
  sched: Remove unused argument dest_cpu to migrate_task()
  mutexes, sched: Introduce arch_mutex_cpu_relax()
  sched: Add some clock info to sched_debug
  cpu: Remove incorrect BUG_ON
  cpu: Remove unused variable
  sched: Fix UP build breakage
  sched: Make task dump print all 15 chars of proc comm
  sched: Update tg->shares after cpu.shares write
  sched: Allow update_cfs_load() to update global load
  sched: Implement demand based update_cfs_load()
  ...
2011-01-06 10:23:33 -08:00
Linus Torvalds b08b272133 Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw:
  GFS2: Don't flush delete workqueue when releasing the transaction lock
  GFS2: fsck.gfs2 reported statfs error after gfs2_grow
  GFS2: Merge glock state fields into a bitfield
  GFS2: Fix uninitialised error value in previous patch
  GFS2: fix recursive locking during rindex truncates
  GFS2: reread rindex when necessary to grow rindex
  GFS2: Remove duplicate #defines from glock.h
  GFS2: Clean up of gdlm_lock function
  GFS2: Allow gfs2 to update quota usage values through the quotactl interface
  GFS2: fs/gfs2/glock.h: Add __attribute__((format(printf,2,3)) to gfs2_print_dbg
  GFS2: fs/gfs2/glock.c: Use printf extension %pV
  GFS2: Clean up duplicated setattr code
  GFS2: Remove unreachable calls to vmtruncate
  GFS2: fs/gfs2/glock.c: Convert sprintf_symbol to %pS
  GFS2: Change two WQ_RESCUERs into WQ_MEM_RECLAIM
2011-01-06 10:01:23 -08:00
Russell King 31edf274f9 Merge branches 'ftrace', 'gic', 'io', 'kexec', 'mod', 'sa11x0', 'sh' and 'versatile' into devel 2011-01-05 18:08:10 +00:00
Ingo Molnar 27066fd484 Merge commit 'v2.6.37' into sched/core
Merge reason: Merge the final .37 tree.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-01-05 14:14:46 +01:00
Nick Piggin d3a23e1678 Revert "fs: use RCU read side protection in d_validate"
This reverts commit 3825bdb7ed.

You cannot dget() a dentry without having a reference, or holding
a lock that guarantees it remains valid.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-05 20:01:21 +11:00
Mauro Carvalho Chehab 88ae7624a6 [media] V4L1 removal: Remove linux/videodev.h
There's no sense on keeping it on 2.6.38, as nobody is using it
anymore, at the kernel tree, and installing it at the userspace
API.

As two deprecated drivers still need it, move it to their internal
directories.

Reviewed-by: Hans Verkuil <hverkuil@xs4all.nl>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-12-29 08:17:11 -02:00
Tejun Heo 5d8e4bddc6 ncpfs: don't use flush_scheduled_work()
flush_scheduled_work() is deprecated and scheduled to be removed.
Directly flush the used works on stop instead.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Petr Vandrovec <petr@vandrovec.name>
2010-12-24 15:59:06 +01:00
Tejun Heo 9b00a81829 ocfs2: don't use flush_scheduled_work()
flush_scheduled_work() is deprecated and scheduled to be removed.

* cancel_delayed_work() + flush_schedule_work() ->
  cancel_delayed_work_sync().

* flush qs->qs_work directly on exit instead.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Joel Becker <joel.becker@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
2010-12-24 15:59:06 +01:00
Linus Torvalds eda4b716ea Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
  ocfs2: Fix system inodes cache overflow.
  ocfs2: Hold ip_lock when set/clear flags for indexed dir.
  ocfs2: Adjust masklog flag values
  Ocfs2: Teach 'coherency=full' O_DIRECT writes to correctly up_read i_alloc_sem.
  ocfs2/dlm: Migrate lockres with no locks if it has a reference
2010-12-23 16:36:48 -08:00
Linus Torvalds 55fb78a3a8 Merge branch 'linus-hot-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'linus-hot-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix on-line resizing regression
2010-12-23 16:25:31 -08:00
Theodore Ts'o 8a7411a243 ext4: fix on-line resizing regression
https://bugzilla.kernel.org/show_bug.cgi?id=25352

This regression was caused by commit a31437b85: "ext4: use
sb_issue_zeroout in setup_new_group_blocks", by accidentally dropping
the code which reserved the block group descriptor and inode table
blocks.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-23 15:00:54 -05:00
Prasad Joshi f06328d772 logfs: fix "Kernel BUG at readwrite.c:1193"
This happens when __logfs_create() tries to write a new inode to the disk
which is full.

__logfs_create() associates the transaction pointer with inode.  During
the logfs_write_inode() function call chain this transaction pointer is
moved from inode to page->private using function move_inode_to_page
(do_write_inode() -> inode_to_page() -> move_inode_to_page)

When the write inode fails, the transaction is aborted and iput is called
on the failed inode.  During delete_inode the same transaction pointer
associated with the page is getting used.  Thus causing kernel BUG.

The patch checks for error in write_inode() and restores the page->private
to NULL.

Addresses https://bugzilla.kernel.org/show_bug.cgi?id=20162

Signed-off-by: Prasad Joshi <prasadjoshi124@gmail.com>
Cc: Joern Engel <joern@logfs.org>
Cc: Florian Mickler <florian@mickler.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-12-22 19:43:33 -08:00
Prasad Joshi eabb26cacd logfs: fix deadlock in logfs_get_wblocks, hold and wait on super->s_write_mutex
do_logfs_journal_wl_pass() should use GFP_NOFS for memory allocation GC
code calls btree_insert32 with GFP_KERNEL while holding a mutex
super->s_write_mutex.

The same mutex is used in address_space_operations->writepage(), and a
call to writepage() could be triggered as a result of memory allocation
in btree_insert32, causing a deadlock.

Addresses https://bugzilla.kernel.org/show_bug.cgi?id=20342

Signed-off-by: Prasad Joshi <prasadjoshi124@gmail.com>
Cc: Joern Engel <joern@logfs.org>
Cc: Florian Mickler <florian@mickler.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-12-22 19:43:33 -08:00
Tao Ma 7d8f98769e ocfs2: Fix system inodes cache overflow.
When we store system inodes cache in ocfs2_super,
we use a array for global system inodes. But unfortunately,
the range is calculated wrongly which makes it overflow and
pollute ocfs2_super->local_system_inodes.
This patch fix it by setting the range properly.

The corresponding bug is ossbug1303.
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1303

Cc: stable@kernel.org
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-12-22 02:35:36 -08:00
Linus Torvalds 9d5004fcf6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: handle partial result from get_user_pages
  ceph: mark user pages dirty on direct-io reads
  ceph: fix null pointer dereference in ceph_init_dentry for nfs reexport
  ceph: fix direct-io on non-page-aligned buffers
  ceph: fix msgr_init error path
2010-12-20 21:32:20 -08:00
Al Viro 3cb50ddf97 Fix btrfs b0rkage
Buggered-in: 76dda93c6a ("Btrfs: add snapshot/subvolume destroy
ioctl")

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-12-20 09:09:57 -08:00
Ingo Molnar ca680888d5 Merge commit 'v2.6.37-rc6' into sched/core
Merge reason: Update to the latest -rc.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-12-19 16:35:14 +01:00
Henry C Chang b6aa5901c7 ceph: mark user pages dirty on direct-io reads
For read operation, we have to set the argument _write_ of get_user_pages
to 1 since we will write data to pages. Also, we need to SetPageDirty before
releasing these pages.

Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-17 09:54:40 -08:00
Sage Weil 92cf765237 ceph: fix null pointer dereference in ceph_init_dentry for nfs reexport
The fh_to_dentry etc. methods use ceph_init_dentry(), which assumes that
d_parent is defined.  It isn't for those callers, so check!

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-17 09:53:48 -08:00
Christoph Lameter ee1be86263 fs: Use this_cpu_inc_return in buffer.c
__this_cpu_inc can create a single instruction with the same effect
as the _get_cpu_var(..)++ construct in buffer.c.

Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2010-12-17 15:18:05 +01:00
Tejun Heo 275c8b9328 Merge branch 'this_cpu_ops' into for-2.6.38 2010-12-17 15:16:46 +01:00
Christoph Lameter c7b92516a9 fs: Use this_cpu_xx operations in buffer.c
Optimize various per cpu area operations through these new percpu
operations.  These operations avoid address calculations through the
use of segment prefixes and multiple memory references through RMW
instructions etc.

Reduces code size:

Before:

christoph@linux-2.6$ size fs/buffer.o
   text	   data	    bss	    dec	    hex	filename
  19169	     80	     28	  19277	   4b4d	fs/buffer.o

After:

christoph@linux-2.6$ size fs/buffer.o
   text	   data	    bss	    dec	    hex	filename
  19138	     80	     28	  19246	   4b2e	fs/buffer.o

V3->V4:
	- Move the use of this_cpu_inc_return into a later patch so that
	  this one can go in without percpu infrastructure changes.

Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2010-12-17 15:07:19 +01:00
Werner Fink b7b8de0873 TTY: Add tty ioctl to figure device node of the system console.
This has been in the SuSE kernels for a very long time.

Signed-off-by: Werner Fink <werner@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-12-16 16:18:28 -08:00
Linus Torvalds a3383e8372 Merge branch 'for-linus' of git://git.infradead.org/users/eparis/notify
* 'for-linus' of git://git.infradead.org/users/eparis/notify:
  fanotify: fill in the metadata_len field on struct fanotify_event_metadata
  fanotify: split version into version and metadata_len
  fanotify: Dont try to open a file descriptor for the overflow event
  fanotify: Introduce FAN_NOFD
  fanotify: do not leak user reference on allocation failure
  inotify: stop kernel memory leak on file creation failure
  fanotify: on group destroy allow all waiters to bypass permission check
  fanotify: Dont allow a mask of 0 if setting or removing a mark
  fanotify: correct broken ref counting in case adding a mark failed
  fanotify: if set by user unset FMODE_NONOTIFY before fsnotify_perm() is called
  fanotify: remove packed from access response message
  fanotify: deny permissions when no event was sent
2010-12-16 15:45:49 -08:00
Anton Salikhmetov b2837fcf49 hfsplus: %L-to-%ll, macro correction, and remove unneeded braces
Clean-up based on checkpatch.pl report against unnecessary braces
(`{' and `}'), non-standard format option %Lu (%llu recommended)
as well as one trailing statement in a macro definition which
should have been on the next line.

Signed-off-by: Anton Salikhmetov <alexo@tuxera.com>
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
2010-12-16 18:08:46 +01:00
Anton Salikhmetov 20b7643d8e hfsplus: spaces/indentation clean-up
Fix incorrect spaces and indentation reported by checkpatch.pl.

Signed-off-by: Anton Salikhmetov <alexo@tuxera.com>
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
2010-12-16 18:08:46 +01:00
Anton Salikhmetov 21f2296a59 hfsplus: C99 comments clean-up
Match coding style restriction against C99 comments where
checkpatch.pl reported errors about their usage.

Signed-off-by: Anton Salikhmetov <alexo@tuxera.com>
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
2010-12-16 18:08:46 +01:00
Anton Salikhmetov 2753cc281c hfsplus: over 80 character lines clean-up
Match coding style line length limitation where checkpatch.pl
reported over-80-character-line warnings.

Signed-off-by: Anton Salikhmetov <alexo@tuxera.com>
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
2010-12-16 18:08:45 +01:00
Anton Salikhmetov 596276c357 hfsplus: fix an artifact in ioctl flag checking
Fix a flag checking artifact in hfsplus_ioctl_getflags() routine
found while doing clean-up against assignments inside `if's.

Signed-off-by: Anton Salikhmetov <alexo@tuxera.com>
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
2010-12-16 18:08:43 +01:00
Steven Whitehouse 846f404552 GFS2: Don't flush delete workqueue when releasing the transaction lock
There is no requirement to flush the delete workqueue before a
gfs2 filesystem is suspended. The workqueue's work will just
be suspended along with the rest of the tasks on the filesystem.

The resolves a deadlock situation where the transaction lock's
demotion code was trying to flush the delete workqueue while at
the same time, the workqueue was waiting for the transaction
lock.

The delete workqueue is flushed by gfs2_make_fs_ro() already, so
that umount/remount are correctly protected anyway.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-12-16 15:18:48 +00:00
Tao Ma 8ac33dc86d ocfs2: Hold ip_lock when set/clear flags for indexed dir.
When we set/clear the dyn_features for an inode we hold the ip_lock.
So do it when we set/clear OCFS2_INDEXED_DIR_FL also.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-12-16 00:36:15 -08:00
Sunil Mushran 41b41a26d4 ocfs2: Adjust masklog flag values
Two masklogs had the same flag value.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-12-16 00:36:11 -08:00
Ryusuke Konishi 947b10ae0a nilfs2: fix regression of garbage collection ioctl
On 2.6.37-rc1, garbage collection ioctl of nilfs was broken due to the
commit 263d90cefc ("nilfs2: remove own inode hash used for GC"),
and leading to filesystem corruption.

The patch doesn't queue gc-inodes for log writer if they are reused
through the vfs inode cache.  Here, gc-inode is the inode which
buffers blocks to be relocated on GC.  That patch queues gc-inodes in
nilfs_init_gcinode() function, but this function is not called when
they don't have I_NEW flag.  Thus, some of live blocks are wrongly
overrode without being moved to new logs.

This resolves the problem by moving the gc-inode queueing to an outer
function to ensure it's done right.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-12-16 14:35:18 +09:00
Henry C Chang ab226e21ad ceph: fix direct-io on non-page-aligned buffers
The user buffer may be 512-byte aligned, not page-aligned.  We were
assuming the buffer was page-aligned and only accounting for
non-page-aligned io offsets.

Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-15 20:46:16 -08:00
Linus Torvalds a4851d8f7d Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix typo which broke '..' detection in ext4_find_entry()
  ext4: Turn off multiple page-io submission by default
2010-12-15 12:41:17 -08:00
Tavis Ormandy 462e635e5b install_special_mapping skips security_file_mmap check.
The install_special_mapping routine (used, for example, to setup the
vdso) skips the security check before insert_vm_struct, allowing a local
attacker to bypass the mmap_min_addr security restriction by limiting
the available pages for special mappings.

bprm_mm_init() also skips the check, and although I don't think this can
be used to bypass any restrictions, I don't see any reason not to have
the security check.

  $ uname -m
  x86_64
  $ cat /proc/sys/vm/mmap_min_addr
  65536
  $ cat install_special_mapping.s
  section .bss
      resb BSS_SIZE
  section .text
      global _start
      _start:
          mov     eax, __NR_pause
          int     0x80
  $ nasm -D__NR_pause=29 -DBSS_SIZE=0xfffed000 -f elf -o install_special_mapping.o install_special_mapping.s
  $ ld -m elf_i386 -Ttext=0x10000 -Tbss=0x11000 -o install_special_mapping install_special_mapping.o
  $ ./install_special_mapping &
  [1] 14303
  $ cat /proc/14303/maps
  0000f000-00010000 r-xp 00000000 00:00 0                                  [vdso]
  00010000-00011000 r-xp 00001000 00:19 2453665                            /home/taviso/install_special_mapping
  00011000-ffffe000 rwxp 00000000 00:00 0                                  [stack]

It's worth noting that Red Hat are shipping with mmap_min_addr set to
4096.

Signed-off-by: Tavis Ormandy <taviso@google.com>
Acked-by: Kees Cook <kees@ubuntu.com>
Acked-by: Robert Swiecki <swiecki@google.com>
[ Changed to not drop the error code - akpm ]
Reviewed-by: James Morris <jmorris@namei.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-12-15 12:30:36 -08:00
Eric Paris 7d13162332 fanotify: fill in the metadata_len field on struct fanotify_event_metadata
The fanotify_event_metadata now has a field which is supposed to
indicate the length of the metadata portion of the event.  Fill in that
field as well.

Based-in-part-on-patch-by: Alexey Zaytsev <alexey.zaytsev@gmail.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
2010-12-15 13:58:18 -05:00
Tejun Heo afe2c511fb workqueue: convert cancel_rearming_delayed_work[queue]() users to cancel_delayed_work_sync()
cancel_rearming_delayed_work[queue]() has been superceded by
cancel_delayed_work_sync() quite some time ago.  Convert all the
in-kernel users.  The conversions are completely equivalent and
trivial.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: "David S. Miller" <davem@davemloft.net>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Acked-by: Evgeniy Polyakov <zbr@ioremap.net>
Cc: Jeff Garzik <jgarzik@pobox.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Cc: netdev@vger.kernel.org
Cc: Anton Vorontsov <cbou@mail.ru>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Alex Elder <aelder@sgi.com>
Cc: xfs-masters@oss.sgi.com
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: netfilter-devel@vger.kernel.org
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: linux-nfs@vger.kernel.org
2010-12-15 10:56:11 +01:00
Aaro Koskinen 6d5c3aa84b ext4: fix typo which broke '..' detection in ext4_find_entry()
There should be a check for the NUL character instead of '0'.

Fortunately the only thing that cares about this is NFS serving, which
is why we didn't notice this in the merge window testing.

Reported-by: Phil Carmody <ext-phil.2.carmody@nokia.com>
Signed-off-by: Aaro Koskinen <aaro.koskinen@nokia.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-14 21:45:31 -05:00
Theodore Ts'o 1449032be1 ext4: Turn off multiple page-io submission by default
Jon Nelson has found a test case which causes postgresql to fail with
the error:

psql:t.sql:4: ERROR: invalid page header in block 38269 of relation base/16384/16581

Under memory pressure, it looks like part of a file can end up getting
replaced by zero's.  Until we can figure out the cause, we'll roll
back the change and use block_write_full_page() instead of
ext4_bio_write_page().  The new, more efficient writing function can
be used via the mount option mblk_io_submit, so we can test and fix
the new page I/O code.

To reproduce the problem, install postgres 8.4 or 9.0, and pin enough
memory such that the system just at the end of triggering writeback
before running the following sql script:

begin;
create temporary table foo as select x as a, ARRAY[x] as b FROM
generate_series(1, 10000000 ) AS x;
create index foo_a_idx on foo (a);
create index foo_b_idx on foo USING GIN (b);
rollback;

If the temporary table is created on a hard drive partition which is
encrypted using dm_crypt, then under memory pressure, approximately
30-40% of the time, pgsql will issue the above failure.

This patch should fix this problem, and the problem will come back if
the file system is mounted with the mblk_io_submit mount option.

Reported-by: Jon Nelson <jnelson@jamponi.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-14 15:27:50 -05:00
Linus Torvalds 5111711d3e Merge branch 'for-2.6.37' of git://linux-nfs.org/~bfields/linux
* 'for-2.6.37' of git://linux-nfs.org/~bfields/linux:
  nfsd: Fix possible BUG_ON firing in set_change_info
  sunrpc: prevent use-after-free on clearing XPT_BUSY
2010-12-14 11:09:05 -08:00
Linus Torvalds e13cf63f2b Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: prevent RAID level downgrades when space is low
  Btrfs: account for missing devices in RAID allocation profiles
  Btrfs: EIO when we fail to read tree roots
  Btrfs: fix compiler warnings
  Btrfs: Make async snapshot ioctl more generic
  Btrfs: pwrite blocked when writing from the mmaped buffer of the same page
  Btrfs: Fix a crash when mounting a subvolume
  Btrfs: fix sync subvol/snapshot creation
  Btrfs: Fix page leak in compressed writeback path
  Btrfs: do not BUG if we fail to remove the orphan item for dead snapshots
  Btrfs: fixup return code for btrfs_del_orphan_item
  Btrfs: do not do fast caching if we are allocating blocks for tree_root
  Btrfs: deal with space cache errors better
  Btrfs: fix use after free in O_DIRECT
2010-12-14 11:08:13 -08:00
Linus Torvalds 073f21ae13 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: verify ioctl retries
  fuse: fix ioctl when server is 32bit
2010-12-14 11:07:39 -08:00
Linus Torvalds 497b5b13c9 Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: log timestamp changes to the source inode in rename
2010-12-14 11:06:17 -08:00
Linus Torvalds e97b71ded9 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: fix ioctl magic
  ceph: Behave better when handling file lock replies.
  ceph: pass lock information by struct file_lock instead of as individual params.
  ceph: Handle file locks in replies from the MDS.
  ceph: avoid possible null deref in readdir after dir llseek
2010-12-14 11:02:15 -08:00
Linus Torvalds 38971ce2fa Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
  NFS: Fix panic after nfs_umount()
  nfs: remove extraneous and problematic calls to nfs_clear_request
  nfs: kernel should return EPROTONOSUPPORT when not support NFSv4
  NFS: Fix fcntl F_GETLK not reporting some conflicts
  nfs: Discard ACL cache on mode update
  NFS: Readdir cleanups
  NFS: nfs_readdir_search_for_cookie() don't mark as eof if cookie not found
  NFS: Fix a memory leak in nfs_readdir
  Call the filesystem back whenever a page is removed from the page cache
  NFS: Ensure we use the correct cookie in nfs_readdir_xdr_filler
2010-12-14 08:51:12 -08:00
Linus Torvalds caa4a59574 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  cifs: remove bogus remapping of error in cifs_filldir()
  cifs: allow calling cifs_build_path_to_root on incomplete cifs_sb
  cifs: fix check of error return from is_path_accessable
  cifs: remove Local_System_Name
  cifs: fix use of CONFIG_CIFS_ACL
  cifs: add attribute cache timeout (actimeo) tunable
2010-12-14 08:49:15 -08:00
Chris Mason 83a50de97f Btrfs: prevent RAID level downgrades when space is low
The extent allocator has code that allows us to fill
allocations from any available block group, even if it doesn't
match the raid level we've requested.

This was put in because adding a new drive to a filesystem
made with the default mkfs options actually upgrades the metadata from
single spindle dup to full RAID1.

But, the code also allows us to allocate from a raid0 chunk when we
really want a raid1 or raid10 chunk.  This can cause big trouble because
mkfs creates a small (4MB) raid0 chunk for data and metadata which then
goes unused for raid1/raid10 installs.

The allocator will happily wander in and allocate from that chunk when
things get tight, which is not correct.

The fix here is to make sure that we provide duplication when the
caller has asked for it.  It does all the dups to be any raid level,
which preserves the dup->raid1 upgrade abilities.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-12-13 20:07:01 -05:00
Chris Mason cd02dca564 Btrfs: account for missing devices in RAID allocation profiles
When we mount in RAID degraded mode without adding a new device to
replace the failed one, we can end up using the wrong RAID flags for
allocations.

This results in strange combinations of block groups (raid1 in a raid10
filesystem) and corruptions when we try to allocate blocks from single
spindle chunks on drives that are actually missing.

The first device has two small 4MB chunks in it that mkfs creates and
these are usually unused in a raid1 or raid10 setup.  But, in -o degraded,
the allocator will fall back to these because the mask of desired raid groups
isn't correct.

The fix here is to count the missing devices as we build up the list
of devices in the system.  This count is used when picking the
raid level to make sure we continue using the same levels that were
in place before we lost a drive.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-12-13 20:06:52 -05:00
Chris Mason 68433b73b1 Btrfs: EIO when we fail to read tree roots
If we just get a plain IO error when we read tree roots, the code
wasn't properly sending that error up the chain.  This allowed mounts to
continue when they should failed, and allowed operations
on partially setup root structs.  The end result was usually oopsen
on spinlocks that hadn't been spun up correctly.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-12-13 14:47:58 -05:00
Namhyung Kim b9d4105279 dlm: sanitize work_start() in lowcomms.c
The create_workqueue() returns NULL if failed rather than ERR_PTR().
Fix error checking and remove unnecessary variable 'error'.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: David Teigland <teigland@redhat.com>
2010-12-13 13:42:24 -06:00
Jan Beulich 3dd1462e82 Btrfs: fix compiler warnings
... regarding an unused function when !MIGRATION, and regarding a
printk() format string vs argument mismatch.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-12-10 16:29:11 -05:00
Li Zefan fdfb1e4f6c Btrfs: Make async snapshot ioctl more generic
If we had reserved some bytes in struct btrfs_ioctl_vol_args, we
wouldn't have to create a new structure for async snapshot creation.

Here we convert async snapshot ioctl to use a more generic ABI, as
we'll add more ioctls for snapshots/subvolumes in the future, readonly
snapshots for example.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-12-10 16:29:11 -05:00
Xin Zhong 914ee295af Btrfs: pwrite blocked when writing from the mmaped buffer of the same page
This problem is found in meego testing:
http://bugs.meego.com/show_bug.cgi?id=6672
A file in btrfs is mmaped and the mmaped buffer is passed to pwrite to write to the same page
of the same file. In btrfs_file_aio_write(), the pages is locked by prepare_pages(). So when
btrfs_copy_from_user() is called, page fault happens and the same page needs to be locked again
in filemap_fault(). The fix is to move iov_iter_fault_in_readable() before prepage_pages() to make page
fault happen before pages are locked. And also disable page fault in critical region in
btrfs_copy_from_user().

Reviewed-by: Yan, Zheng<zheng.z.yan@intel.com>
Signed-off-by: Zhong, Xin <xin.zhong@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-12-10 16:29:10 -05:00
Li Zefan f106e82caa Btrfs: Fix a crash when mounting a subvolume
We should drop dentry before deactivating the superblock, otherwise
we can hit this bug:

BUG: Dentry f349a690{i=100,n=/} still in use (1) [unmount of btrfs loop1]
...

Steps to reproduce the bug:

  # mount /dev/loop1 /mnt
  # mkdir save
  # btrfs subvolume snapshot /mnt save/snap1
  # umount /mnt
  # mount -o subvol=save/snap1 /dev/loop1 /mnt
  (crash)

Reported-by: Michael Niederle <mniederle@gmx.at>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-12-10 16:29:10 -05:00
Sage Weil 75eaa0e22c Btrfs: fix sync subvol/snapshot creation
We were incorrectly taking the async path even for the sync ioctls by
passing in &transid unconditionally.

There's ample room for further cleanup here, but this keeps the fix simple.

Signed-off-by: Sage Weil <sage@newdream.net>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-12-10 16:29:10 -05:00
Yan, Zheng 24ae63656a Btrfs: Fix page leak in compressed writeback path
"start + num_bytes >= actual_end" can happen when compressed page writeback races
with file truncation. In that case we need unlock and release pages past the end
of file.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-12-10 16:29:09 -05:00
Josef Bacik 84cd948cb1 Btrfs: do not BUG if we fail to remove the orphan item for dead snapshots
Not being able to delete an orphan item isn't a horrible thing.  The worst that
happens is the next time around we try and do the orphan cleanup and we can't
find the referenced object and just delete the item and move on.

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-12-10 16:29:04 -05:00
Chuck Lever 5b362ac379 NFS: Fix panic after nfs_umount()
After a few unsuccessful NFS mount attempts in which the client and
server cannot agree on an authentication flavor both support, the
client panics.  nfs_umount() is invoked in the kernel in this case.

Turns out nfs_umount()'s UMNT RPC invocation causes the RPC client to
write off the end of the rpc_clnt's iostat array.  This is because the
mount client's nrprocs field is initialized with the count of defined
procedures (two: MNT and UMNT), rather than the size of the client's
proc array (four).

The fix is to use the same initialization technique used by most other
upper layer clients in the kernel.

Introduced by commit 0b524123, which failed to update nrprocs when
support was added for UMNT in the kernel.

BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=24302
BugLink: http://bugs.launchpad.net/bugs/683938

Reported-by: Stefan Bader <stefan.bader@canonical.com>
Tested-by: Stefan Bader <stefan.bader@canonical.com>
Cc: stable@kernel.org # >= 2.6.32
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-10 13:01:50 -05:00