Commit graph

20566 commits

Author SHA1 Message Date
Linus Torvalds
eda4b716ea Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
  ocfs2: Fix system inodes cache overflow.
  ocfs2: Hold ip_lock when set/clear flags for indexed dir.
  ocfs2: Adjust masklog flag values
  Ocfs2: Teach 'coherency=full' O_DIRECT writes to correctly up_read i_alloc_sem.
  ocfs2/dlm: Migrate lockres with no locks if it has a reference
2010-12-23 16:36:48 -08:00
Linus Torvalds
55fb78a3a8 Merge branch 'linus-hot-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'linus-hot-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix on-line resizing regression
2010-12-23 16:25:31 -08:00
Theodore Ts'o
8a7411a243 ext4: fix on-line resizing regression
https://bugzilla.kernel.org/show_bug.cgi?id=25352

This regression was caused by commit a31437b85: "ext4: use
sb_issue_zeroout in setup_new_group_blocks", by accidentally dropping
the code which reserved the block group descriptor and inode table
blocks.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-23 15:00:54 -05:00
Prasad Joshi
f06328d772 logfs: fix "Kernel BUG at readwrite.c:1193"
This happens when __logfs_create() tries to write a new inode to the disk
which is full.

__logfs_create() associates the transaction pointer with inode.  During
the logfs_write_inode() function call chain this transaction pointer is
moved from inode to page->private using function move_inode_to_page
(do_write_inode() -> inode_to_page() -> move_inode_to_page)

When the write inode fails, the transaction is aborted and iput is called
on the failed inode.  During delete_inode the same transaction pointer
associated with the page is getting used.  Thus causing kernel BUG.

The patch checks for error in write_inode() and restores the page->private
to NULL.

Addresses https://bugzilla.kernel.org/show_bug.cgi?id=20162

Signed-off-by: Prasad Joshi <prasadjoshi124@gmail.com>
Cc: Joern Engel <joern@logfs.org>
Cc: Florian Mickler <florian@mickler.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-12-22 19:43:33 -08:00
Prasad Joshi
eabb26cacd logfs: fix deadlock in logfs_get_wblocks, hold and wait on super->s_write_mutex
do_logfs_journal_wl_pass() should use GFP_NOFS for memory allocation GC
code calls btree_insert32 with GFP_KERNEL while holding a mutex
super->s_write_mutex.

The same mutex is used in address_space_operations->writepage(), and a
call to writepage() could be triggered as a result of memory allocation
in btree_insert32, causing a deadlock.

Addresses https://bugzilla.kernel.org/show_bug.cgi?id=20342

Signed-off-by: Prasad Joshi <prasadjoshi124@gmail.com>
Cc: Joern Engel <joern@logfs.org>
Cc: Florian Mickler <florian@mickler.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-12-22 19:43:33 -08:00
Tao Ma
7d8f98769e ocfs2: Fix system inodes cache overflow.
When we store system inodes cache in ocfs2_super,
we use a array for global system inodes. But unfortunately,
the range is calculated wrongly which makes it overflow and
pollute ocfs2_super->local_system_inodes.
This patch fix it by setting the range properly.

The corresponding bug is ossbug1303.
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1303

Cc: stable@kernel.org
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-12-22 02:35:36 -08:00
Linus Torvalds
9d5004fcf6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: handle partial result from get_user_pages
  ceph: mark user pages dirty on direct-io reads
  ceph: fix null pointer dereference in ceph_init_dentry for nfs reexport
  ceph: fix direct-io on non-page-aligned buffers
  ceph: fix msgr_init error path
2010-12-20 21:32:20 -08:00
Al Viro
3cb50ddf97 Fix btrfs b0rkage
Buggered-in: 76dda93c6a ("Btrfs: add snapshot/subvolume destroy
ioctl")

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-12-20 09:09:57 -08:00
Henry C Chang
b6aa5901c7 ceph: mark user pages dirty on direct-io reads
For read operation, we have to set the argument _write_ of get_user_pages
to 1 since we will write data to pages. Also, we need to SetPageDirty before
releasing these pages.

Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-17 09:54:40 -08:00
Sage Weil
92cf765237 ceph: fix null pointer dereference in ceph_init_dentry for nfs reexport
The fh_to_dentry etc. methods use ceph_init_dentry(), which assumes that
d_parent is defined.  It isn't for those callers, so check!

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-17 09:53:48 -08:00
Linus Torvalds
a3383e8372 Merge branch 'for-linus' of git://git.infradead.org/users/eparis/notify
* 'for-linus' of git://git.infradead.org/users/eparis/notify:
  fanotify: fill in the metadata_len field on struct fanotify_event_metadata
  fanotify: split version into version and metadata_len
  fanotify: Dont try to open a file descriptor for the overflow event
  fanotify: Introduce FAN_NOFD
  fanotify: do not leak user reference on allocation failure
  inotify: stop kernel memory leak on file creation failure
  fanotify: on group destroy allow all waiters to bypass permission check
  fanotify: Dont allow a mask of 0 if setting or removing a mark
  fanotify: correct broken ref counting in case adding a mark failed
  fanotify: if set by user unset FMODE_NONOTIFY before fsnotify_perm() is called
  fanotify: remove packed from access response message
  fanotify: deny permissions when no event was sent
2010-12-16 15:45:49 -08:00
Tao Ma
8ac33dc86d ocfs2: Hold ip_lock when set/clear flags for indexed dir.
When we set/clear the dyn_features for an inode we hold the ip_lock.
So do it when we set/clear OCFS2_INDEXED_DIR_FL also.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-12-16 00:36:15 -08:00
Sunil Mushran
41b41a26d4 ocfs2: Adjust masklog flag values
Two masklogs had the same flag value.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-12-16 00:36:11 -08:00
Ryusuke Konishi
947b10ae0a nilfs2: fix regression of garbage collection ioctl
On 2.6.37-rc1, garbage collection ioctl of nilfs was broken due to the
commit 263d90cefc ("nilfs2: remove own inode hash used for GC"),
and leading to filesystem corruption.

The patch doesn't queue gc-inodes for log writer if they are reused
through the vfs inode cache.  Here, gc-inode is the inode which
buffers blocks to be relocated on GC.  That patch queues gc-inodes in
nilfs_init_gcinode() function, but this function is not called when
they don't have I_NEW flag.  Thus, some of live blocks are wrongly
overrode without being moved to new logs.

This resolves the problem by moving the gc-inode queueing to an outer
function to ensure it's done right.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-12-16 14:35:18 +09:00
Henry C Chang
ab226e21ad ceph: fix direct-io on non-page-aligned buffers
The user buffer may be 512-byte aligned, not page-aligned.  We were
assuming the buffer was page-aligned and only accounting for
non-page-aligned io offsets.

Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-15 20:46:16 -08:00
Linus Torvalds
a4851d8f7d Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix typo which broke '..' detection in ext4_find_entry()
  ext4: Turn off multiple page-io submission by default
2010-12-15 12:41:17 -08:00
Tavis Ormandy
462e635e5b install_special_mapping skips security_file_mmap check.
The install_special_mapping routine (used, for example, to setup the
vdso) skips the security check before insert_vm_struct, allowing a local
attacker to bypass the mmap_min_addr security restriction by limiting
the available pages for special mappings.

bprm_mm_init() also skips the check, and although I don't think this can
be used to bypass any restrictions, I don't see any reason not to have
the security check.

  $ uname -m
  x86_64
  $ cat /proc/sys/vm/mmap_min_addr
  65536
  $ cat install_special_mapping.s
  section .bss
      resb BSS_SIZE
  section .text
      global _start
      _start:
          mov     eax, __NR_pause
          int     0x80
  $ nasm -D__NR_pause=29 -DBSS_SIZE=0xfffed000 -f elf -o install_special_mapping.o install_special_mapping.s
  $ ld -m elf_i386 -Ttext=0x10000 -Tbss=0x11000 -o install_special_mapping install_special_mapping.o
  $ ./install_special_mapping &
  [1] 14303
  $ cat /proc/14303/maps
  0000f000-00010000 r-xp 00000000 00:00 0                                  [vdso]
  00010000-00011000 r-xp 00001000 00:19 2453665                            /home/taviso/install_special_mapping
  00011000-ffffe000 rwxp 00000000 00:00 0                                  [stack]

It's worth noting that Red Hat are shipping with mmap_min_addr set to
4096.

Signed-off-by: Tavis Ormandy <taviso@google.com>
Acked-by: Kees Cook <kees@ubuntu.com>
Acked-by: Robert Swiecki <swiecki@google.com>
[ Changed to not drop the error code - akpm ]
Reviewed-by: James Morris <jmorris@namei.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-12-15 12:30:36 -08:00
Eric Paris
7d13162332 fanotify: fill in the metadata_len field on struct fanotify_event_metadata
The fanotify_event_metadata now has a field which is supposed to
indicate the length of the metadata portion of the event.  Fill in that
field as well.

Based-in-part-on-patch-by: Alexey Zaytsev <alexey.zaytsev@gmail.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
2010-12-15 13:58:18 -05:00
Aaro Koskinen
6d5c3aa84b ext4: fix typo which broke '..' detection in ext4_find_entry()
There should be a check for the NUL character instead of '0'.

Fortunately the only thing that cares about this is NFS serving, which
is why we didn't notice this in the merge window testing.

Reported-by: Phil Carmody <ext-phil.2.carmody@nokia.com>
Signed-off-by: Aaro Koskinen <aaro.koskinen@nokia.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-14 21:45:31 -05:00
Theodore Ts'o
1449032be1 ext4: Turn off multiple page-io submission by default
Jon Nelson has found a test case which causes postgresql to fail with
the error:

psql:t.sql:4: ERROR: invalid page header in block 38269 of relation base/16384/16581

Under memory pressure, it looks like part of a file can end up getting
replaced by zero's.  Until we can figure out the cause, we'll roll
back the change and use block_write_full_page() instead of
ext4_bio_write_page().  The new, more efficient writing function can
be used via the mount option mblk_io_submit, so we can test and fix
the new page I/O code.

To reproduce the problem, install postgres 8.4 or 9.0, and pin enough
memory such that the system just at the end of triggering writeback
before running the following sql script:

begin;
create temporary table foo as select x as a, ARRAY[x] as b FROM
generate_series(1, 10000000 ) AS x;
create index foo_a_idx on foo (a);
create index foo_b_idx on foo USING GIN (b);
rollback;

If the temporary table is created on a hard drive partition which is
encrypted using dm_crypt, then under memory pressure, approximately
30-40% of the time, pgsql will issue the above failure.

This patch should fix this problem, and the problem will come back if
the file system is mounted with the mblk_io_submit mount option.

Reported-by: Jon Nelson <jnelson@jamponi.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-14 15:27:50 -05:00
Linus Torvalds
5111711d3e Merge branch 'for-2.6.37' of git://linux-nfs.org/~bfields/linux
* 'for-2.6.37' of git://linux-nfs.org/~bfields/linux:
  nfsd: Fix possible BUG_ON firing in set_change_info
  sunrpc: prevent use-after-free on clearing XPT_BUSY
2010-12-14 11:09:05 -08:00
Linus Torvalds
e13cf63f2b Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: prevent RAID level downgrades when space is low
  Btrfs: account for missing devices in RAID allocation profiles
  Btrfs: EIO when we fail to read tree roots
  Btrfs: fix compiler warnings
  Btrfs: Make async snapshot ioctl more generic
  Btrfs: pwrite blocked when writing from the mmaped buffer of the same page
  Btrfs: Fix a crash when mounting a subvolume
  Btrfs: fix sync subvol/snapshot creation
  Btrfs: Fix page leak in compressed writeback path
  Btrfs: do not BUG if we fail to remove the orphan item for dead snapshots
  Btrfs: fixup return code for btrfs_del_orphan_item
  Btrfs: do not do fast caching if we are allocating blocks for tree_root
  Btrfs: deal with space cache errors better
  Btrfs: fix use after free in O_DIRECT
2010-12-14 11:08:13 -08:00
Linus Torvalds
073f21ae13 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: verify ioctl retries
  fuse: fix ioctl when server is 32bit
2010-12-14 11:07:39 -08:00
Linus Torvalds
497b5b13c9 Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: log timestamp changes to the source inode in rename
2010-12-14 11:06:17 -08:00
Linus Torvalds
e97b71ded9 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: fix ioctl magic
  ceph: Behave better when handling file lock replies.
  ceph: pass lock information by struct file_lock instead of as individual params.
  ceph: Handle file locks in replies from the MDS.
  ceph: avoid possible null deref in readdir after dir llseek
2010-12-14 11:02:15 -08:00
Linus Torvalds
38971ce2fa Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
  NFS: Fix panic after nfs_umount()
  nfs: remove extraneous and problematic calls to nfs_clear_request
  nfs: kernel should return EPROTONOSUPPORT when not support NFSv4
  NFS: Fix fcntl F_GETLK not reporting some conflicts
  nfs: Discard ACL cache on mode update
  NFS: Readdir cleanups
  NFS: nfs_readdir_search_for_cookie() don't mark as eof if cookie not found
  NFS: Fix a memory leak in nfs_readdir
  Call the filesystem back whenever a page is removed from the page cache
  NFS: Ensure we use the correct cookie in nfs_readdir_xdr_filler
2010-12-14 08:51:12 -08:00
Linus Torvalds
caa4a59574 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  cifs: remove bogus remapping of error in cifs_filldir()
  cifs: allow calling cifs_build_path_to_root on incomplete cifs_sb
  cifs: fix check of error return from is_path_accessable
  cifs: remove Local_System_Name
  cifs: fix use of CONFIG_CIFS_ACL
  cifs: add attribute cache timeout (actimeo) tunable
2010-12-14 08:49:15 -08:00
Chris Mason
83a50de97f Btrfs: prevent RAID level downgrades when space is low
The extent allocator has code that allows us to fill
allocations from any available block group, even if it doesn't
match the raid level we've requested.

This was put in because adding a new drive to a filesystem
made with the default mkfs options actually upgrades the metadata from
single spindle dup to full RAID1.

But, the code also allows us to allocate from a raid0 chunk when we
really want a raid1 or raid10 chunk.  This can cause big trouble because
mkfs creates a small (4MB) raid0 chunk for data and metadata which then
goes unused for raid1/raid10 installs.

The allocator will happily wander in and allocate from that chunk when
things get tight, which is not correct.

The fix here is to make sure that we provide duplication when the
caller has asked for it.  It does all the dups to be any raid level,
which preserves the dup->raid1 upgrade abilities.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-12-13 20:07:01 -05:00
Chris Mason
cd02dca564 Btrfs: account for missing devices in RAID allocation profiles
When we mount in RAID degraded mode without adding a new device to
replace the failed one, we can end up using the wrong RAID flags for
allocations.

This results in strange combinations of block groups (raid1 in a raid10
filesystem) and corruptions when we try to allocate blocks from single
spindle chunks on drives that are actually missing.

The first device has two small 4MB chunks in it that mkfs creates and
these are usually unused in a raid1 or raid10 setup.  But, in -o degraded,
the allocator will fall back to these because the mask of desired raid groups
isn't correct.

The fix here is to count the missing devices as we build up the list
of devices in the system.  This count is used when picking the
raid level to make sure we continue using the same levels that were
in place before we lost a drive.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-12-13 20:06:52 -05:00
Chris Mason
68433b73b1 Btrfs: EIO when we fail to read tree roots
If we just get a plain IO error when we read tree roots, the code
wasn't properly sending that error up the chain.  This allowed mounts to
continue when they should failed, and allowed operations
on partially setup root structs.  The end result was usually oopsen
on spinlocks that hadn't been spun up correctly.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-12-13 14:47:58 -05:00
Jan Beulich
3dd1462e82 Btrfs: fix compiler warnings
... regarding an unused function when !MIGRATION, and regarding a
printk() format string vs argument mismatch.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-12-10 16:29:11 -05:00
Li Zefan
fdfb1e4f6c Btrfs: Make async snapshot ioctl more generic
If we had reserved some bytes in struct btrfs_ioctl_vol_args, we
wouldn't have to create a new structure for async snapshot creation.

Here we convert async snapshot ioctl to use a more generic ABI, as
we'll add more ioctls for snapshots/subvolumes in the future, readonly
snapshots for example.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-12-10 16:29:11 -05:00
Xin Zhong
914ee295af Btrfs: pwrite blocked when writing from the mmaped buffer of the same page
This problem is found in meego testing:
http://bugs.meego.com/show_bug.cgi?id=6672
A file in btrfs is mmaped and the mmaped buffer is passed to pwrite to write to the same page
of the same file. In btrfs_file_aio_write(), the pages is locked by prepare_pages(). So when
btrfs_copy_from_user() is called, page fault happens and the same page needs to be locked again
in filemap_fault(). The fix is to move iov_iter_fault_in_readable() before prepage_pages() to make page
fault happen before pages are locked. And also disable page fault in critical region in
btrfs_copy_from_user().

Reviewed-by: Yan, Zheng<zheng.z.yan@intel.com>
Signed-off-by: Zhong, Xin <xin.zhong@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-12-10 16:29:10 -05:00
Li Zefan
f106e82caa Btrfs: Fix a crash when mounting a subvolume
We should drop dentry before deactivating the superblock, otherwise
we can hit this bug:

BUG: Dentry f349a690{i=100,n=/} still in use (1) [unmount of btrfs loop1]
...

Steps to reproduce the bug:

  # mount /dev/loop1 /mnt
  # mkdir save
  # btrfs subvolume snapshot /mnt save/snap1
  # umount /mnt
  # mount -o subvol=save/snap1 /dev/loop1 /mnt
  (crash)

Reported-by: Michael Niederle <mniederle@gmx.at>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-12-10 16:29:10 -05:00
Sage Weil
75eaa0e22c Btrfs: fix sync subvol/snapshot creation
We were incorrectly taking the async path even for the sync ioctls by
passing in &transid unconditionally.

There's ample room for further cleanup here, but this keeps the fix simple.

Signed-off-by: Sage Weil <sage@newdream.net>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-12-10 16:29:10 -05:00
Yan, Zheng
24ae63656a Btrfs: Fix page leak in compressed writeback path
"start + num_bytes >= actual_end" can happen when compressed page writeback races
with file truncation. In that case we need unlock and release pages past the end
of file.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-12-10 16:29:09 -05:00
Josef Bacik
84cd948cb1 Btrfs: do not BUG if we fail to remove the orphan item for dead snapshots
Not being able to delete an orphan item isn't a horrible thing.  The worst that
happens is the next time around we try and do the orphan cleanup and we can't
find the referenced object and just delete the item and move on.

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-12-10 16:29:04 -05:00
Chuck Lever
5b362ac379 NFS: Fix panic after nfs_umount()
After a few unsuccessful NFS mount attempts in which the client and
server cannot agree on an authentication flavor both support, the
client panics.  nfs_umount() is invoked in the kernel in this case.

Turns out nfs_umount()'s UMNT RPC invocation causes the RPC client to
write off the end of the rpc_clnt's iostat array.  This is because the
mount client's nrprocs field is initialized with the count of defined
procedures (two: MNT and UMNT), rather than the size of the client's
proc array (four).

The fix is to use the same initialization technique used by most other
upper layer clients in the kernel.

Introduced by commit 0b524123, which failed to update nrprocs when
support was added for UMNT in the kernel.

BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=24302
BugLink: http://bugs.launchpad.net/bugs/683938

Reported-by: Stefan Bader <stefan.bader@canonical.com>
Tested-by: Stefan Bader <stefan.bader@canonical.com>
Cc: stable@kernel.org # >= 2.6.32
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-10 13:01:50 -05:00
Tristan Ye
39c99f12f1 Ocfs2: Teach 'coherency=full' O_DIRECT writes to correctly up_read i_alloc_sem.
Due to newly-introduced 'coherency=full' O_DIRECT writes also takes the EX
rw_lock like buffered writes did(rw_level == 1), it turns out messing the
usage of 'level' in ocfs2_dio_end_io() up, which caused i_alloc_sem being
failed to get up_read'd correctly.

This patch tries to teach ocfs2_dio_end_io to understand well on all locking
stuffs by explicitly introducing a new bit for i_alloc_sem in iocb's private
data, just like what we did for rw_lock.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-12-09 15:36:48 -08:00
Sunil Mushran
388c4bcb4e ocfs2/dlm: Migrate lockres with no locks if it has a reference
o2dlm was not migrating resources with zero locks because it assumed that that
resource would get purged by dlm_thread. However, some usage patterns involve
creating and dropping locks at a high rate leading to the migrate thread seeing
zero locks but the purge thread seeing an active reference. When this happens,
the dlm_thread cannot purge the resource and the migrate thread sees no reason
to migrate that resource. The spell is broken when the migrate thread catches
the resource with a lock.

The fix is to make the migrate thread also consider the reference map.

This usage pattern can be triggered by userspace on userdlm locks and flocks.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-12-09 15:36:00 -08:00
Christoph Hellwig
05340d4ab2 xfs: log timestamp changes to the source inode in rename
Now that we don't mark VFS inodes dirty anymore for internal
timestamp changes, but rely on the transaction subsystem to push
them out, we need to explicitly log the source inode in rename after
updating it's timestamps to make sure the changes actually get
forced out by sync/fsync or an AIL push.

We already account for the fourth inode in the log reservation, as a
rename of directories needs to update the nlink field, so just
adding the xfs_trans_log_inode call is enough.

This fixes the xfsqa 065 regression introduced by:

	"xfs: don't use vfs writeback for pure metadata modifications"

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-12-09 17:07:02 -06:00
Josef Bacik
7e1fea731d Btrfs: fixup return code for btrfs_del_orphan_item
If the orphan item doesn't exist, we return 1, which doesn't make any sense to
the callers.  Instead return -ENOENT if we didn't find the item.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-12-09 13:57:15 -05:00
Josef Bacik
b8399dee47 Btrfs: do not do fast caching if we are allocating blocks for tree_root
Since the fast caching uses normal tree locking, we can possibly deadlock if we
get to the caching via a btrfs_search_slot() on the tree_root.  So just check to
see if the root we are on is the tree root, and just don't do the fast caching.

Reported-by: Sage Weil <sage@newdream.net>
Signed-off-by: Josef Bacik <josef@redhat.com>
2010-12-09 13:57:13 -05:00
Josef Bacik
2b20982e31 Btrfs: deal with space cache errors better
Currently if the space cache inode generation number doesn't match the
generation number in the space cache header we will just fail to load the space
cache, but we won't mark the space cache as an error, so we'll keep getting that
error each time somebody tries to cache that block group until we actually clear
the thing.  Fix this by marking the space cache as having an error so we only
get the message once.  This patch also makes it so that we don't try and setup
space cache for a block group that isn't cached, since we won't be able to write
it out anyway.  None of these problems are actual problems, they are just
annoying and sub-optimal.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-12-09 13:57:12 -05:00
Josef Bacik
955256f2c3 Btrfs: fix use after free in O_DIRECT
This fixes a bug where we use dip after we have freed it.  Instead just use the
file_offset that was passed to the function.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-12-09 13:57:10 -05:00
Suresh Jayaraman
545c988b20 cifs: remove bogus remapping of error in cifs_filldir()
As the FIXME points out correctly, now filldir() itself returns -EOVERFLOW if
it not possible to represent the inode number supplied by the filesystem in
the field provided by userspace.

Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-12-08 18:47:54 +00:00
Neil Brown
c1ac3ffcd0 nfsd: Fix possible BUG_ON firing in set_change_info
If vfs_getattr in fill_post_wcc returns an error, we don't
set fh_post_change.
For NFSv4, this can result in set_change_info triggering a BUG_ON.
i.e. fh_post_saved being zero isn't really a bug.

So:
 - instead of BUGging when fh_post_saved is zero, just clear ->atomic.
 - if vfs_getattr fails in fill_post_wcc, take a copy of i_ctime anyway.
   This will be used i seg_change_info, but not overly trusted.
 - While we are there, remove the pointless 'if' statements in set_change_info.
   There is no harm setting all the values.

Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2010-12-08 11:44:04 -05:00
Trond Myklebust
2df485a774 nfs: remove extraneous and problematic calls to nfs_clear_request
When a nfs_page is freed, nfs_free_request is called which also calls
nfs_clear_request to clean out the lock and open contexts and free the
pagecache page.

However, a couple of places in the nfs code call nfs_clear_request
themselves. What happens here if the refcount on the request is still high?
We'll be releasing contexts and freeing pointers while the request is
possibly still in use.

Remove those bare calls to nfs_clear_context. That should only be done when
the request is being freed.

Note that when doing this, we need to watch out for tests of req->wb_page.
Previously, nfs_set_page_tag_locked() and nfs_clear_page_tag_locked()
would check the value of req->wb_page to figure out if the page is mapped
into the nfsi->nfs_page_tree. We now indicate the page is mapped using
the new bit PG_MAPPED in req->wb_flags .

Reported-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-07 23:02:44 -05:00
Mi Jinlong
0de1b7e800 nfs: kernel should return EPROTONOSUPPORT when not support NFSv4
When nfs client(kernel) don't support NFSv4, maybe user build
  kernel without NFSv4, there is a problem.

  Using command "mount SERVER-IP:/nfsv3 /mnt/" to mount NFSv3
  filesystem, mount should should success, but fail and get error:

    "mount.nfs: an incorrect mount option was specified"

  System call mount "nfs"(not "nfs4") with "vers=4",
  if CONFIG_NFS_V4 is not defined, the "vers=4" will be parsed
  as invalid argument and kernel return EINVAL to nfs-utils.

  About that, we really want get EPROTONOSUPPORT rather than
  EINVAL. This path make sure kernel parses argument success,
  and return EPROTONOSUPPORT at nfs_validate_mount_data().

Signed-off-by: Mi Jinlong <mijinlong@cn.fujitsu.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-07 19:30:44 -05:00
Sergey Vlasov
21ac19d484 NFS: Fix fcntl F_GETLK not reporting some conflicts
The commit 129a84de23 (locks: fix F_GETLK
regression (failure to find conflicts)) fixed the posix_test_lock()
function by itself, however, its usage in NFS changed by the commit
9d6a8c5c21 (locks: give posix_test_lock
same interface as ->lock) remained broken - subsequent NFS-specific
locking code received F_UNLCK instead of the user-specified lock type.
To fix the problem, fl->fl_type needs to be saved before the
posix_test_lock() call and restored if no local conflicts were reported.

Reference: https://bugzilla.kernel.org/show_bug.cgi?id=23892
Tested-by: Alexander Morozov <amorozov@etersoft.ru>
Signed-off-by: Sergey Vlasov <vsu@altlinux.ru>
Cc: <stable@kernel.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-07 19:30:43 -05:00
Aneesh Kumar K.V
08a22b392a nfs: Discard ACL cache on mode update
An update of mode bits can result in ACL value being changed. We need
to mark the acl cache invalid when we update mode. Similarly we need
to update file attribute when we change ACL value

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-07 19:30:42 -05:00
Lino Sanfilippo
fdbf3ceeb6 fanotify: Dont try to open a file descriptor for the overflow event
We should not try to open a file descriptor for the overflow event since this
will always fail.

Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
2010-12-07 16:14:24 -05:00
Eric Paris
2637919893 fanotify: do not leak user reference on allocation failure
If fanotify_init is unable to allocate a new fsnotify group it will
return but will not drop its reference on the associated user struct.
Drop that reference on error.

Reported-by: Vegard Nossum <vegard.nossum@gmail.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
2010-12-07 16:14:23 -05:00
Eric Paris
a2ae4cc9a1 inotify: stop kernel memory leak on file creation failure
If inotify_init is unable to allocate a new file for the new inotify
group we leak the new group.  This patch drops the reference on the
group on file allocation failure.

Reported-by: Vegard Nossum <vegard.nossum@gmail.com>
cc: stable@kernel.org
Signed-off-by: Eric Paris <eparis@redhat.com>
2010-12-07 16:14:22 -05:00
Lino Sanfilippo
09e5f14e57 fanotify: on group destroy allow all waiters to bypass permission check
When fanotify_release() is called, there may still be processes waiting for
access permission. Currently only processes for which an event has already been
queued into the groups access list will be woken up.  Processes for which no
event has been queued will continue to sleep and thus cause a deadlock when
fsnotify_put_group() is called.
Furthermore there is a race allowing further processes to be waiting on the
access wait queue after wake_up (if they arrive before clear_marks_by_group()
is called).
This patch corrects this by setting a flag to inform processes that the group
is about to be destroyed and thus not to wait for access permission.

[additional changelog from eparis]
Lets think about the 4 relevant code paths from the PoV of the
'operator' 'listener' 'responder' and 'closer'.  Where operator is the
process doing an action (like open/read) which could require permission.
Listener is the task (or in this case thread) slated with reading from
the fanotify file descriptor.  The 'responder' is the thread responsible
for responding to access requests.  'Closer' is the thread attempting to
close the fanotify file descriptor.

The 'operator' is going to end up in:
fanotify_handle_event()
  get_response_from_access()
    (THIS BLOCKS WAITING ON USERSPACE)

The 'listener' interesting code path
fanotify_read()
  copy_event_to_user()
    prepare_for_access_response()
      (THIS CREATES AN fanotify_response_event)

The 'responder' code path:
fanotify_write()
  process_access_response()
    (REMOVE A fanotify_response_event, SET RESPONSE, WAKE UP 'operator')

The 'closer':
fanotify_release()
  (SUPPOSED TO CLEAN UP THE REST OF THIS MESS)

What we have today is that in the closer we remove all of the
fanotify_response_events and set a bit so no more response events are
ever created in prepare_for_access_response().

The bug is that we never wake all of the operators up and tell them to
move along.  You fix that in fanotify_get_response_from_access().  You
also fix other operators which haven't gotten there yet.  So I agree
that's a good fix.
[/additional changelog from eparis]

[remove additional changes to minimize patch size]
[move initialization so it was inside CONFIG_FANOTIFY_PERMISSION]

Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
2010-12-07 16:14:22 -05:00
Lino Sanfilippo
1734dee4e3 fanotify: Dont allow a mask of 0 if setting or removing a mark
In mark_remove_from_mask() we destroy marks that have their event mask cleared.
Thus we should not allow the creation of those marks in the first place.
With this patch we check if the mask given from user is 0 in case of FAN_MARK_ADD.
If so we return an error. Same for FAN_MARK_REMOVE since this does not have any
effect.

Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
2010-12-07 16:14:21 -05:00
Lino Sanfilippo
fa218ab98c fanotify: correct broken ref counting in case adding a mark failed
If adding a mount or inode mark failed fanotify_free_mark() is called explicitly.
But at this time the mark has already been put into the destroy list of the
fsnotify_mark kernel thread. If the thread is too slow it will try to decrease
the reference of a mark, that has already been freed by fanotify_free_mark().
(If its fast enough it will only decrease the marks ref counter from 2 to 1 - note
that the counter has been increased to 2 in add_mark() - which has practically no
effect.)

This patch fixes the ref counting by not calling free_mark() explicitly, but
decreasing the ref counter and rely on the fsnotify_mark thread to cleanup in
case adding the mark has failed.

Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
2010-12-07 16:14:21 -05:00
Lino Sanfilippo
b1085ba80c fanotify: if set by user unset FMODE_NONOTIFY before fsnotify_perm() is called
Unsetting FMODE_NONOTIFY in fsnotify_open() is too late, since fsnotify_perm()
is called before. If FMODE_NONOTIFY is set fsnotify_perm() will skip permission
checks, so a user can still disable permission checks by setting this flag
in an open() call.
This patch corrects this by unsetting the flag before fsnotify_perm is called.

Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
2010-12-07 16:14:21 -05:00
Eric Paris
ecf6f5e7d6 fanotify: deny permissions when no event was sent
If no event was sent to userspace we cannot expect userspace to respond to
permissions requests.  Today such requests just hang forever. This patch will
deny any permissions event which was unable to be sent to userspace.

Reported-by: Tvrtko Ursulin <tvrtko.ursulin@sophos.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
2010-12-07 16:14:17 -05:00
Jeff Layton
7d161b7f41 cifs: allow calling cifs_build_path_to_root on incomplete cifs_sb
It's possible that cifs_mount will call cifs_build_path_to_root on a
newly instantiated cifs_sb. In that case, it's likely that the
master_tlink pointer has not yet been instantiated.

Fix this by having cifs_build_path_to_root take a cifsTconInfo pointer
as well, and have the caller pass that in.

Reported-and-Tested-by: Robbert Kouprie <robbert@exx.nl>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-12-07 19:25:37 +00:00
Jeff Layton
03ceace5c6 cifs: fix check of error return from is_path_accessable
This function will return 0 if everything went ok. Commit 9d002df4
however added a block of code after the following check for
rc == -EREMOTE. With that change and when rc == 0, doing the
"goto mount_fail_check" here skips that code, leaving the tlink_tree
and master_tlink pointer unpopulated. That causes an oops later
in cifs_root_iget.

Reported-and-Tested-by: Robbert Kouprie <robbert@exx.nl>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-12-07 19:17:59 +00:00
Trond Myklebust
47c716cbf6 NFS: Readdir cleanups
No functional changes, but clarify the code.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-07 14:09:02 -05:00
Trond Myklebust
18fb5fe40c NFS: nfs_readdir_search_for_cookie() don't mark as eof if cookie not found
If we're searching for a specific cookie, and it isn't found in the page
cache, we should try an uncached_readdir(). To do so, we return EBADCOOKIE,
but we don't set desc->eof.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-07 12:41:58 -05:00
Ian Kent
de47de7404 autofs4 - remove ioctl mutex (bz23142)
With the recent changes to remove the BKL a mutex was added to the
ioctl entry point for calls to the old ioctl interface. This mutex
needs to be removed because of the need for the expire ioctl to call
back to the daemon to perform a umount and receive a completion
status (via another ioctl).

This should be fine as the new ioctl interface uses much of the same
code and it has been used without a mutex for around a year without
issue, as was the original intention.

Ref: Bugzilla bug 23142

Signed-off-by: Ian Kent <raven@themaw.net>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-12-07 07:45:44 -08:00
Linus Torvalds
086b17046c Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
  ocfs2_connection_find() returns pointer to bad structure
  ocfs2: char is not always signed
  Ocfs2: Stop tracking a negative dentry after dentry_iput().
  ocfs2: fix memory leak
  fs/ocfs2/dlm: Use GFP_ATOMIC under spin_lock
2010-12-06 20:08:25 -08:00
Jeff Layton
8846399968 cifs: remove Local_System_Name
...this string is zeroed out and nothing ever changes it.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-12-06 22:45:19 +00:00
Jeff Layton
79df1baeec cifs: fix use of CONFIG_CIFS_ACL
Some of the code under CONFIG_CIFS_ACL is dependent upon code under
CONFIG_CIFS_EXPERIMENTAL, but the Kconfig options don't reflect that
dependency. Move more of the ACL code out from under
CONFIG_CIFS_EXPERIMENTAL and under CONFIG_CIFS_ACL.

Also move find_readable_file out from other any sort of Kconfig
option and make it a function normally compiled in.

Reported-and-Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-12-06 20:22:39 +00:00
Sage Weil
1cd275f609 ceph: fix ioctl magic
The ioctl magic was inadvertently changed in 571dba52.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-06 09:45:22 -08:00
Eric W. Biederman
7b2a69ba70 Revert "vfs: show unreachable paths in getcwd and proc"
Because it caused a chroot ttyname regression in 2.6.36.

As of 2.6.36 ttyname does not work in a chroot.  It has already been
reported that screen breaks, and for me this breaks an automated
distribution testsuite, that I need to preserve the ability to run the
existing binaries on for several more years.  glibc 2.11.3 which has a
fix for this is not an option.

The root cause of this breakage is:

    commit 8df9d1a414
    Author: Miklos Szeredi <mszeredi@suse.cz>
    Date:   Tue Aug 10 11:41:41 2010 +0200

    vfs: show unreachable paths in getcwd and proc

    Prepend "(unreachable)" to path strings if the path is not reachable
    from the current root.

    Two places updated are
     - the return string from getcwd()
     - and symlinks under /proc/$PID.

    Other uses of d_path() are left unchanged (we know that some old
    software crashes if /proc/mounts is changed).

    Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

So remove the nice sounding, but ultimately ill advised change to how
/proc/fd symlinks work.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-12-05 16:39:45 -08:00
Steve French
ebb27386ff Merge branch 'master' of /pub/scm/linux/kernel/git/torvalds/linux-2.6 2010-12-03 03:52:43 +00:00
Frederic Weisbecker
238af8751f reiserfs: don't acquire lock recursively in reiserfs_acl_chmod
reiserfs_acl_chmod() can be called by reiserfs_set_attr() and then take
the reiserfs lock a second time.  Thereafter it may call journal_begin()
that definitely requires the lock not to be nested in order to release
it before taking the journal mutex because the reiserfs lock depends on
the journal mutex already.

So, aviod nesting the lock in reiserfs_acl_chmod().

Reported-by: Pawel Zawora <pzawora@gmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Tested-by: Pawel Zawora <pzawora@gmail.com>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: <stable@kernel.org>		[2.6.32.x+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-12-02 14:51:15 -08:00
Suresh Jayaraman
6d20e8406f cifs: add attribute cache timeout (actimeo) tunable
Currently, the attribute cache timeout for CIFS is hardcoded to 1 second. This
means that the client might have to issue a QPATHINFO/QFILEINFO call every 1
second to verify if something has changes, which seems too expensive. On the
other hand, if the timeout is hardcoded to a higher value, workloads that
expect strict cache coherency might see unexpected results.

Making attribute cache timeout as a tunable will allow us to make a tradeoff
between performance and cache metadata correctness depending on the
application/workload needs.

Add 'actimeo' tunable that can be used to tune the attribute cache timeout.
The default timeout is set to 1 second. Also, display actimeo option value in
/proc/mounts.

It appears to me that 'actimeo' and the proposed (but not yet merged)
'strictcache' option cannot coexist, so care must be taken that we reset the
other option if one of them is set.

Changes since last post:
   - fix option parsing and handle possible values correcly

Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-12-02 19:32:11 +00:00
Linus Torvalds
8cb280c90f Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: only run xfs_error_test if error injection is active
  xfs: avoid moving stale inodes in the AIL
  xfs: delayed alloc blocks beyond EOF are valid after writeback
  xfs: push stale, pinned buffers on trylock failures
  xfs: fix failed write truncation handling.
2010-12-02 09:13:36 -08:00
Linus Torvalds
8520eeaa12 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  cifs: fix parsing of hostname in dfs referrals
  cifs: display fsc in /proc/mounts
  cifs: enable fscache iff fsc mount option is used explicitly
  cifs: allow fsc mount option only if CONFIG_CIFS_FSCACHE is set
  cifs: Handle extended attribute name cifs_acl to generate cifs acl blob (try #4)
  cifs: Misc. cleanup in cifsacl handling [try #4]
  cifs: trivial comment fix for cifs_invalidate_mapping
  [CIFS] fs/cifs/Kconfig: CIFS depends on CRYPTO_HMAC
  cifs: don't take extra tlink reference in initiate_cifs_search
  cifs: Percolate error up to the caller during get/set acls [try #4]
  cifs: fix another memleak, in cifs_root_iget
  cifs: fix potential use-after-free in cifs_oplock_break_put
2010-12-02 08:04:21 -08:00
Trond Myklebust
11de3b11e0 NFS: Fix a memory leak in nfs_readdir
We need to ensure that the entries in the nfs_cache_array get cleared
when the page is removed from the page cache. To do so, we use the
freepage address_space operation.

Change nfs_readdir_clear_array to use kmap_atomic(), so that the
function can be safely called from all contexts.

Finally, modify the cache_page_release helper to call
nfs_readdir_clear_array directly, when dealing with an anonymous
page from 'uncached_readdir'.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-02 09:58:00 -05:00
Herb Shiu
a5b10629ed ceph: Behave better when handling file lock replies.
Fill in the local lock with response data if appropriate,
and don't call posix_lock_file when reading locks.

Signed-off-by: Herb Shiu <herb_shiu@tcloudcomputing.com>
Acked-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-01 14:22:34 -08:00
Herb Shiu
637ae8d547 ceph: pass lock information by struct file_lock instead of as individual params.
Signed-off-by: Herb Shiu <herb_shiu@tcloudcomputing.com>
Acked-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-01 14:22:34 -08:00
Herb Shiu
25933abdd8 ceph: Handle file locks in replies from the MDS.
Previously the kernel client incorrectly assumed everything was a directory.

Signed-off-by: Herb Shiu <herb_shiu@tcloudcomputing.com>
Acked-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-01 14:22:27 -08:00
Sage Weil
884ea89276 ceph: avoid possible null deref in readdir after dir llseek
last may be NULL, but we dereference it in the else branch without
checking.  Normally it doesn't trigger because last == NULL when fpos == 2,
but it could happen on a newly opened dir if the user seeks forward.

Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-01 14:15:31 -08:00
Dave Chinner
c76febef57 xfs: only run xfs_error_test if error injection is active
Recent tests writing lots of small files showed the flusher thread
being CPU bound and taking a long time to do allocations on a debug
kernel. perf showed this as the prime reason:

             samples  pcnt function                    DSO
             _______ _____ ___________________________ _________________

           224648.00 36.8% xfs_error_test              [kernel.kallsyms]
            86045.00 14.1% xfs_btree_check_sblock      [kernel.kallsyms]
            39778.00  6.5% prandom32                   [kernel.kallsyms]
            37436.00  6.1% xfs_btree_increment         [kernel.kallsyms]
            29278.00  4.8% xfs_btree_get_rec           [kernel.kallsyms]
            27717.00  4.5% random32                    [kernel.kallsyms]

Walking btree blocks during allocation checking them requires each
block (a cache hit, so no I/O) call xfs_error_test(), which then
does a random32() call as the first operation.  IOWs, ~50% of the
CPU is being consumed just testing whether we need to inject an
error, even though error injection is not active.

Kill this overhead when error injection is not active by adding a
global counter of active error traps and only calling into
xfs_error_test when fault injection is active.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-12-01 07:40:20 -06:00
Dave Chinner
de25c1818c xfs: avoid moving stale inodes in the AIL
When an inode has been marked stale because the cluster is being
freed, we don't want to (re-)insert this inode into the AIL. There
is a race condition where the cluster buffer may be unpinned before
the inode is inserted into the AIL during transaction committed
processing. If the buffer is unpinned before the inode item has been
committed and inserted, then it is possible for the buffer to be
released and hence processthe stale inode callbacks before the inode
is inserted into the AIL.

In this case, we then insert a clean, stale inode into the AIL which
will never get removed by an IO completion. It will, however, get
reclaimed and that triggers an assert in xfs_inode_free()
complaining about freeing an inode still in the AIL.

This race can be avoided by not moving stale inodes forward in the AIL
during transaction commit completion processing. This closes the
race condition by ensuring we never insert clean stale inodes into
the AIL. It is safe to do this because a dirty stale inode, by
definition, must already be in the AIL.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-12-01 07:40:20 -06:00
Dave Chinner
309c848002 xfs: delayed alloc blocks beyond EOF are valid after writeback
There is an assumption in the parts of XFS that flushing a dirty
file will make all the delayed allocation blocks disappear from an
inode. That is, that after calling xfs_flush_pages() then
ip->i_delayed_blks will be zero.

This is an invalid assumption as we may have specualtive
preallocation beyond EOF and they are recorded in
ip->i_delayed_blks. A flush of the dirty pages of an inode will not
change the state of these blocks beyond EOF, so a non-zero
deeelalloc block count after a flush is valid.

The bmap code has an invalid ASSERT() that needs to be removed, and
the swapext code has a bug in that while it swaps the data forks
around, it fails to swap the i_delayed_blks counter associated with
the fork and hence can get the block accounting wrong.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-12-01 07:40:20 -06:00
Dave Chinner
90810b9e82 xfs: push stale, pinned buffers on trylock failures
As reported by Nick Piggin, XFS is suffering from long pauses under
highly concurrent workloads when hosted on ramdisks. The problem is
that an inode buffer is stuck in the pinned state in memory and as a
result either the inode buffer or one of the inodes within the
buffer is stopping the tail of the log from being moved forward.

The system remains in this state until a periodic log force issued
by xfssyncd causes the buffer to be unpinned. The main problem is
that these are stale buffers, and are hence held locked until the
transaction/checkpoint that marked them state has been committed to
disk. When the filesystem gets into this state, only the xfssyncd
can cause the async transactions to be committed to disk and hence
unpin the inode buffer.

This problem was encountered when scaling the busy extent list, but
only the blocking lock interface was fixed to solve the problem.
Extend the same fix to the buffer trylock operations - if we fail to
lock a pinned, stale buffer, then force the log immediately so that
when the next attempt to lock it comes around, it will have been
unpinned.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-12-01 07:40:20 -06:00
Dave Chinner
c726de4409 xfs: fix failed write truncation handling.
Since the move to the new truncate sequence we call xfs_setattr to
truncate down excessively instanciated blocks.  As shown by the testcase
in kernel.org BZ #22452 that doesn't work too well.  Due to the confusion
of the internal inode size, and the VFS inode i_size it zeroes data that
it shouldn't.

But full blown truncate seems like overkill here.  We only instanciate
delayed allocations in the write path, and given that we never released
the iolock we can't have converted them to real allocations yet either.

The only nasty case is pre-existing preallocation which we need to skip.
We already do this for page discard during writeback, so make the delayed
allocation block punching a generic function and call it from the failed
write path as well as xfs_aops_discard_page. The callers are
responsible for ensuring that partial blocks are not truncated away,
and that they hold the ilock.

Based on a fix originally from Christoph Hellwig. This version used
filesystem blocks as the range unit.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-12-01 07:40:19 -06:00
Trond Myklebust
0aded708d1 NFS: Ensure we use the correct cookie in nfs_readdir_xdr_filler
We need to use the cookie from the previous array entry, not the
actual cookie that we are searching for (except for the case of
uncached_readdir).

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-12-01 08:16:16 -05:00
Oleg Nesterov
114279be21 exec: copy-and-paste the fixes into compat_do_execve() paths
Note: this patch targets 2.6.37 and tries to be as simple as possible.
That is why it adds more copy-and-paste horror into fs/compat.c and
uglifies fs/exec.c, this will be cleanuped later.

compat_copy_strings() plays with bprm->vma/mm directly and thus has
two problems: it lacks the RLIMIT_STACK check and argv/envp memory
is not visible to oom killer.

Export acct_arg_size() and get_arg_page(), change compat_copy_strings()
to use get_arg_page(), change compat_do_execve() to do acct_arg_size(0)
as do_execve() does.

Add the fatal_signal_pending/cond_resched checks into compat_count() and
compat_copy_strings(), this matches the code in fs/exec.c and certainly
makes sense.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-11-30 17:56:38 -08:00
Oleg Nesterov
3c77f84572 exec: make argv/envp memory visible to oom-killer
Brad Spengler published a local memory-allocation DoS that
evades the OOM-killer (though not the virtual memory RLIMIT):
http://www.grsecurity.net/~spender/64bit_dos.c

execve()->copy_strings() can allocate a lot of memory, but
this is not visible to oom-killer, nobody can see the nascent
bprm->mm and take it into account.

With this patch get_arg_page() increments current's MM_ANONPAGES
counter every time we allocate the new page for argv/envp. When
do_execve() succeds or fails, we change this counter back.

Technically this is not 100% correct, we can't know if the new
page is swapped out and turn MM_ANONPAGES into MM_SWAPENTS, but
I don't think this really matters and everything becomes correct
once exec changes ->mm or fails.

Reported-by: Brad Spengler <spender@grsecurity.net>
Reviewed-and-discussed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-11-30 17:56:37 -08:00
Jeff Layton
ba03864872 cifs: fix parsing of hostname in dfs referrals
The DFS referral parsing code does a memchr() call to find the '\\'
delimiter that separates the hostname in the referral UNC from the
sharename. It then uses that value to set the length of the hostname via
pointer subtraction.  Instead of subtracting the start of the hostname
however, it subtracts the start of the UNC, which causes the code to
pass in a hostname length that is 2 bytes too long.

Regression introduced in commit 1a4240f4.

Reported-and-Tested-by: Robbert Kouprie <robbert@exx.nl>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Cc: Wang Lei <wang840925@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-11-30 20:44:05 +00:00
Trond Myklebust
37a09f0745 NFS: Fix a readdirplus bug
When comparing filehandles in the helper nfs_same_file(), we should not be
using 'strncmp()': filehandles are not null terminated strings.

Instead, we should just use the existing helper nfs_compare_fh().

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-11-30 10:18:49 -08:00
Miklos Szeredi
7572777eef fuse: verify ioctl retries
Verify that the total length of the iovec returned in FUSE_IOCTL_RETRY
doesn't overflow iov_length().

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: Tejun Heo <tj@kernel.org>
CC: <stable@kernel.org>         [2.6.31+]
2010-11-30 16:39:27 +01:00
Miklos Szeredi
d9d318d39d fuse: fix ioctl when server is 32bit
If a 32bit CUSE server is run on 64bit this results in EIO being
returned to the caller.

The reason is that FUSE_IOCTL_RETRY reply was defined to use 'struct
iovec', which is different on 32bit and 64bit archs.

Work around this by looking at the size of the reply to determine
which struct was used.  This is only needed if CONFIG_COMPAT is
defined.

A more permanent fix for the interface will be to use the same struct
on both 32bit and 64bit.

Reported-by: "ccmail111" <ccmail111@yahoo.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: Tejun Heo <tj@kernel.org>
CC: <stable@kernel.org>         [2.6.31+]
2010-11-30 16:39:27 +01:00
Suresh Jayaraman
476428f8c3 cifs: display fsc in /proc/mounts
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-11-30 05:51:49 +00:00
Suresh Jayaraman
b81209de24 cifs: enable fscache iff fsc mount option is used explicitly
Currently, if CONFIG_CIFS_FSCACHE is set, fscache is enabled on files opened
as read-only irrespective of the 'fsc' mount option. Fix this by enabling
fscache only if 'fsc' mount option is specified explicitly.

Remove an extraneous cFYI debug message and fix a typo while at it.

Reported-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-11-30 05:49:32 +00:00
Suresh Jayaraman
607a569da4 cifs: allow fsc mount option only if CONFIG_CIFS_FSCACHE is set
Currently, it is possible to specify 'fsc' mount option even if
CONFIG_CIFS_FSCACHE has not been set. The option is being ignored silently
while the user fscache functionality to work. Fix this by raising error when
the CONFIG option is not set.

Reported-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-11-30 05:49:28 +00:00
Shirish Pargaonkar
fbeba8bb16 cifs: Handle extended attribute name cifs_acl to generate cifs acl blob (try #4)
Add extended attribute name system.cifs_acl

Get/generate cifs/ntfs acl blob and hand over to the invoker however
it wants to parse/process it under experimental configurable option CIFS_ACL.

Do not get CIFS/NTFS ACL for xattr for attribute system.posix_acl_access

Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-11-30 05:49:24 +00:00
Shirish Pargaonkar
78415d2d30 cifs: Misc. cleanup in cifsacl handling [try #4]
Change the name of function mode_to_acl to mode_to_cifs_acl.

Handle return code in functions mode_to_cifs_acl and
cifs_acl_to_fattr.

Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-11-30 05:49:17 +00:00
Linus Torvalds
aa3fc52546 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (24 commits)
  Btrfs: don't use migrate page without CONFIG_MIGRATION
  Btrfs: deal with DIO bios that span more than one ordered extent
  Btrfs: setup blank root and fs_info for mount time
  Btrfs: fix fiemap
  Btrfs - fix race between btrfs_get_sb() and umount
  Btrfs: update inode ctime when using links
  Btrfs: make sure new inode size is ok in fallocate
  Btrfs: fix typo in fallocate to make it honor actual size
  Btrfs: avoid NULL pointer deref in try_release_extent_buffer
  Btrfs: make btrfs_add_nondir take parent inode as an argument
  Btrfs: hold i_mutex when calling btrfs_log_dentry_safe
  Btrfs: use dget_parent where we can UPDATED
  Btrfs: fix more ESTALE problems with NFS
  Btrfs: handle NFS lookups properly
  btrfs: make 1-bit signed fileds unsigned
  btrfs: Show device attr correctly for symlinks
  btrfs: Set file size correctly in file clone
  btrfs: Check if dest_offset is block-size aligned before cloning file
  Btrfs: handle the space_cache option properly
  btrfs: Fix early enospc because 'unused' calculated with wrong sign.
  ...
2010-11-29 14:11:08 -08:00
Linus Torvalds
1bfe4eefe5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes:
  GFS2: Userland expects quota limit/warn/usage in 512b blocks
2010-11-29 14:10:22 -08:00
Suresh Jayaraman
523fb8c867 cifs: trivial comment fix for cifs_invalidate_mapping
Only the callers check whether the invalid_mapping flag is set and not
cifs_invalidate_mapping().

Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-11-29 17:48:16 +00:00
Chris Mason
5a92bc88ce Btrfs: don't use migrate page without CONFIG_MIGRATION
Fixes compile error

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-11-29 09:49:11 -05:00