Commit Graph

17466 Commits (72fd0718332e6514fb9db325e89ffc694bb31f6e)

Author SHA1 Message Date
Joern Engel 81def6b986 Simplify and fix pad_wbuf
A comment in the old code read:
        /* The math in this function can surely use some love */

And indeed it did.  In the case that area->a_used_bytes is exactly
4096 bytes below segment size it fell apart.  pad_wbuf is now split
into two helpers that are significantly less complicated.

Signed-off-by: Joern Engel <joern@logfs.org>
2010-03-28 13:00:08 +02:00
Joern Engel 1932191726 Prevent data corruption in logfs_rewrite_block()
The comment was correct, so make the code match the comment.  As the
new comment indicates, we might be able to do a little less work.  But
for the current -rc series let's keep it simple and just fix the bug.

Signed-off-by: Joern Engel <joern@logfs.org>
2010-03-28 12:40:42 +02:00
Joern Engel 6f2e9e6a95 Use deactivate_locked_super
Found by Al Viro.

Signed-off-by: Joern Engel <joern@logfs.org>
2010-03-27 11:19:16 +01:00
Joern Engel 7db8064c17 Fix logfs_get_sb_final error path
rootdir was already allocated, so we must iput it again.
Found by Al Viro.

Signed-off-by: Joern Engel <joern@logfs.org>
2010-03-27 11:19:16 +01:00
Joern Engel faaa27ab91 Write out both superblocks on mismatch
If the first superblock is wrong and the second gets written, there
will still be a mismatch on next mount.  Write both to make sure they
match.

Signed-off-by: Joern Engel <joern@logfs.org>
2010-03-27 11:19:15 +01:00
Joern Engel e326068806 Prevent schedule while atomic in __logfs_readdir
Apparently filldir can sleep, which forbids kmap_atomic.

Signed-off-by: Joern Engel <joern@logfs.org>
2010-03-27 11:19:15 +01:00
Joern Engel e07bf553f3 Plug memory leak in writeseg_end_io
Signed-off-by: Joern Engel <joern@logfs.org>
2010-03-27 11:19:14 +01:00
Joern Engel 59fe27c0a8 Limit max_pages for insane devices
Intel SSDs have a limit of 0xffff as queue_max_hw_sectors(q).  Such a
limit may make sense from a hardware pov, but it causes bio_alloc() to
return NULL.

Signed-off-by: Joern Engel <joern@logfs.org>
2010-03-27 11:19:14 +01:00
Joern Engel 49137f2efb Open segment file before using it
logfs_recover_sb() needs it open.

Signed-off-by: Joern Engel <joern@logfs.org>
2010-03-27 11:19:13 +01:00
Pavel Shilovsky 810627a013 [CIFS] Add mmap for direct, nobrl cifs mount types
without mmap functions in file_ops OpenOffice can't save changes in
existing document. The same situation you can see with gedit. Also, a.out
format of files can't be executed without mmap.

Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-03-27 02:00:49 +00:00
Linus Torvalds e4d50423d7 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
  nilfs2: fix imperfect completion wait in nilfs_wait_on_logs
  nilfs2: fix hang-up of cleaner after log writer returned with error
  nilfs2: fix duplicate call to nilfs_segctor_cancel_freev
2010-03-26 15:14:29 -07:00
Linus Torvalds e0df9c0b42 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
  Restore LOOKUP_DIRECTORY hint handling in final lookup on open()
2010-03-26 15:06:02 -07:00
Al Viro 3e297b6134 Restore LOOKUP_DIRECTORY hint handling in final lookup on open()
Lose want_dir argument, while we are at it - since now
nd->flags & LOOKUP_DIRECTORY is equivalent to it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-26 12:41:05 -04:00
Linus Torvalds 39f1cd635c Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: Fixed inode allocator to correctly track a flex_bg's used_dirs
  ext4: Don't use delayed allocation by default when used instead of ext3
  ext4: Fix spelling of CONTIG_FS_EXT3 to CONFIG_FS_EXT3
  ext4: Fix estimate of # of blocks needed to write indirect-mapped files
2010-03-25 14:10:53 -07:00
Linus Torvalds 6c75969e22 Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
  NFS: don't try to decode GETATTR if DELEGRETURN returned error
  sunrpc: handle allocation errors from __rpc_lookup_create()
  SUNRPC: Fix the return value of rpc_run_bc_task()
  SUNRPC: Fix a use after free bug with the NFSv4.1 backchannel
  SUNRPC: Fix a potential memory leak in auth_gss
  NFS: Prevent another deadlock in nfs_release_page()
2010-03-24 16:50:46 -07:00
Dan Carpenter 1147d0f915 fscache: add missing unlock
Sparse complained about this missing spin_unlock()

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-24 16:49:21 -07:00
David Howells 61964eba5c do_sync_read/write() should set kiocb.ki_nbytes to be consistent
do_sync_read/write() should set kiocb.ki_nbytes to be consistent with
do_sync_readv_writev().

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-24 16:43:29 -07:00
David Howells 47568d4c56 FDPIC: For-loop in elf_core_vma_data_size() is incorrect
Fix an incorrect for-loop in elf_core_vma_data_size().  The advance-pointer
statement lacks an assignment:

	  CC      fs/binfmt_elf_fdpic.o
	fs/binfmt_elf_fdpic.c: In function 'elf_core_vma_data_size':
	fs/binfmt_elf_fdpic.c:1593: warning: statement with no effect

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-24 16:43:29 -07:00
OGAWA Hirofumi 8e0cc811e0 fs/partition/msdos: fix unusable extended partition for > 512B sector
Smaller size than a minimum blocksize can't be used, after all it's
handled like 0 size.

For extended partition itself, this makes sure to use bigger size than one
logical sector size at least.

Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Daniel Taylor <Daniel.Taylor@wdc.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-24 16:31:22 -07:00
Daniel Taylor 3fbf586cf7 fs/partitions/msdos: add support for large disks
In order to use disks larger than 2TiB on Windows XP, it is necessary to
use 4096-byte logical sectors in an MBR.

Although the kernel storage and functions called from msdos.c used
"sector_t" internally, msdos.c still used u32 variables, which results in
the ability to handle XP-compatible large disks.

This patch changes the internal variables to "sector_t".

Daniel said: "In the near future, WD will be releasing products that need
this patch".

[hirofumi@mail.parknet.co.jp: tweaks and fix]
Signed-off-by: Daniel Taylor <daniel.taylor@wdc.com>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-24 16:31:22 -07:00
Dan Carpenter 4fd2c20d96 kcore: fix test for end of list
"m" is never NULL here.  We need a different test for the end of list
condition.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-24 16:31:22 -07:00
Jeff Mahoney 3f8b5ee332 reiserfs: properly honor read-only devices
The reiserfs journal behaves inconsistently when determining whether to
allow a mount of a read-only device.

This is due to the use of the continue_replay variable to short circuit
the journal scanning.  If it's set, it's assumed that there are
transactions to replay, but there may not be.  If it's unset, it's assumed
that there aren't any, and that may not be the case either.

I've observed two failure cases:
1) Where a clean file system on a read-only device refuses to mount
2) Where a clean file system on a read-only device passes the
   optimization and then tries writing the journal header to update
   the latest mount id.

The former is easily observable by using a freshly created file system on
a read-only loopback device.

This patch moves the check into journal_read_transaction, where it can
bail out before it's about to replay a transaction.  That way it can go
through and skip transactions where appropriate, yet still refuse to mount
a file system with outstanding transactions.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-24 16:31:21 -07:00
Jeff Mahoney 6cb4aff0a7 reiserfs: fix oops while creating privroot with selinux enabled
Commit 57fe60df ("reiserfs: add atomic addition of selinux attributes
during inode creation") contains a bug that will cause it to oops when
mounting a file system that didn't previously contain extended attributes
on a system using security.* xattrs.

The issue is that while creating the privroot during mount
reiserfs_security_init calls reiserfs_xattr_jcreate_nblocks which
dereferences the xattr root.  The xattr root doesn't exist, so we get an
oops.

Addresses http://bugzilla.kernel.org/show_bug.cgi?id=15309

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-24 16:31:21 -07:00
Borislav Petkov 7731d9a5d4 fs/binfmt_aout.c: fix pointer warnings
fs/binfmt_aout.c: In function `aout_core_dump':
fs/binfmt_aout.c:125: warning: passing argument 2 of `dump_write' makes pointer from integer without a cast
include/linux/coredump.h:12: note: expected `const void *' but argument is of type `long unsigned int'
fs/binfmt_aout.c:132: warning: passing argument 2 of `dump_write' makes pointer from integer without a cast
include/linux/coredump.h:12: note: expected `const void *' but argument is of type `long unsigned int'

due to dump_write() expecting a user void *.  Fold casts into the
START_DATA/START_STACK macros and shut up the warnings.

Signed-off-by: Borislav Petkov <petkovbb@gmail.com>
Cc: Daisuke HATAYAMA <d.hatayama@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-24 16:31:19 -07:00
Eric Sandeen c4caae2518 ext4: Fixed inode allocator to correctly track a flex_bg's used_dirs
When used_dirs was introduced for the flex_groups struct, it looks
like the accounting was not put into place properly, in some places
manipulating free_inodes rather than used_dirs.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-23 21:32:00 -04:00
Jan Kara ba69f9ab7d ext4: Don't use delayed allocation by default when used instead of ext3
When ext4 driver is used to mount a filesystem instead of the ext3 file
system driver (through CONFIG_EXT4_USE_FOR_EXT23), do not enable delayed
allocation by default since some ext3 users and application writers have
developed unfortunate expectations about the safety of writing files on
systems subject to sudden and violent death without using fsync().

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-24 20:18:37 -04:00
Theodore Ts'o 37f328eb60 ext4: Fix spelling of CONTIG_FS_EXT3 to CONFIG_FS_EXT3
Oops.  (Blush.)

Thanks to Sedat Dilek for pointing this out.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-24 20:06:41 -04:00
Srinivas Eeda 14741472a0 ocfs2: Fix a race in o2dlm lockres mastery
In o2dlm, the master of a lock resource keeps a map of all interested
nodes.  This prevents the master from purging the resource before an
interested node can create a lock.

A race between the mastery thread and the mastery handler allowed an
interested node to discover who the master is without informing the
master directly.  This is easily fixed by holding the dlm spinlock a
little longer in the mastery handler.

Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-03-23 18:22:59 -07:00
Tristan Ye b54c2ca475 Ocfs2: Handle deletion of reflinked oprhan inodes correctly.
The rule is that all inodes in the orphan dir have ORPHANED_FL,
otherwise we treated it as an ERROR.  This rule works well except
for some rare cases of reflink operation:

http://oss.oracle.com/bugzilla/show_bug.cgi?id=1215

The problem is caused by how reflink and our orphan_scan thread
interact.

 * The orphan scan pulls the orphans into a queue first, then runs the
   queue at a later time.  We only hold the orphan_dir's lock
   during scanning.

 * Reflink create a oprhaned target in orphan_dir as its first step.
   It removes the target and clears the flag as the final step.
   These two steps take the orphan_dir's lock, but it is not held for
   the duration.

Based on the above semantics, a reflink inode can be moved out of the
orphan dir and have its ORPHANED_FL cleared before the queue of orphans
is run.  This leads to a ERROR in ocfs2_query_wipde_inode().

This patch teaches ocfs2_query_wipe_inode() to detect previously
orphaned reflink targets.  If a reflink fails or a crash occurs during
the relfink operation, the inode will retain ORPHANED_FL and will be
properly wiped.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-03-23 18:22:55 -07:00
Tristan Ye 3939fda4b3 Ocfs2: Journaling i_flags and i_orphaned_slot when adding inode to orphan dir.
Currently, some callers were missing to journal the dirty inode after
adding it to orphan dir.

Now we're going to journal such modifications within the ocfs2_orphan_add()
itself, It's safe to do so, though some existing caller may duplicate this,
and it makes the logic look more straightforward anyway.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-03-23 18:22:51 -07:00
Mark Fasheh b4414eea0e ocfs2: Clear undo bits when local alloc is freed
When the local alloc file changes windows, unused bits are freed back to the
global bitmap. By defnition, those bits can not be in use by any file. Also,
the local alloc will never have been able to allocate those bits if they
were part of a previous truncate. Therefore it makes sense that we should
clear unused local alloc bits in the undo buffer so that they can be used
immediatly.

[ Modified to call it ocfs2_release_clusters() -- Joel ]

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-03-23 18:22:40 -07:00
Ryusuke Konishi d067633b44 nilfs2: fix imperfect completion wait in nilfs_wait_on_logs
nilfs_wait_on_logs has a potential to slip out before completion of
all bio requests when it met an error.  This synchronization fault may
cause unexpected results, for instance, violative access to freed
segment buffers from an end-bio callback routine.

This fixes the issue by ensuring that nilfs_wait_on_logs waits all
given logs.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-03-24 01:17:20 +09:00
Ryusuke Konishi 110d735a0a nilfs2: fix hang-up of cleaner after log writer returned with error
According to the report from Andreas Beckmann (Message-ID:
<4BA54677.3090902@abeckmann.de>), nilfs in 2.6.33 kernel got stuck
after a disk full error.

This turned out to be a regression by log writer updates merged at
kernel 2.6.33.  nilfs_segctor_abort_construction, which is a cleanup
function for erroneous cases, was skipping writeback completion for
some logs.

This fixes the bug and would resolve the hang issue.

Reported-by: Andreas Beckmann <debian@abeckmann.de>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Tested-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: stable <stable@kernel.org>                     [2.6.33.x]
2010-03-24 00:03:06 +09:00
Sage Weil 393f662096 ceph: fix possible double-free of mds request reference
Clear pointer to mds request after dropping the reference to
ensure we don't drop it again, as there is at least one error
path through this function that does not reset fi->last_readdir
to a new value.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:47:06 -07:00
Sage Weil d96d60498f ceph: fix session check on mds reply
Fix a broken check that a reply came back from the same MDS we sent the
request to.  I don't think a case that actually triggers this would ever
come up in practice, but it's clearly wrong and easy to fix.

Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:47:05 -07:00
Dan Carpenter 4736b009b8 ceph: handle kmalloc() failure
Return ERR_PTR(-ENOMEM) if kmalloc() fails.  We handle allocation
failures the same way later in the function.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:47:04 -07:00
Sage Weil 9c423956b8 ceph: propagate mds session allocation failures to caller
Return error to original caller if register_session() fails.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:47:04 -07:00
Sage Weil 8f883c24de ceph: make write_begin wait propagate ERESTARTSYS
Currently, if the wait_event_interruptible is interrupted, we
return EAGAIN unconditionally and loop, such that we aren't, in
fact, interruptible.  So, propagate ERESTARTSYS if we get it.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:47:03 -07:00
Sage Weil ec4318bcb4 ceph: fix snap rebuild condition
We were rebuilding the snap context when it was not necessary
(i.e. when the realm seq hadn't changed _and_ the parent seq
was still older), which caused page snapc pointers to not match
the realm's snapc pointer (even though the snap context itself
was identical).  This confused begin_write and put it into an
endless loop.

The correct logic is: rebuild snapc if _my_ realm seq changed, or
if my parent realm's seq is newer than mine (and thus mine needs
to be rebuilt too).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:47:02 -07:00
Sage Weil 87b315a5b5 ceph: avoid reopening osd connections when address hasn't changed
We get a fault callback on _every_ tcp connection fault.  Normally, we
want to reopen the connection when that happens.  If the address we have
is bad, however, and connection attempts always result in a connection
refused or similar error, explicitly closing and reopening the msgr
connection just prevents the messenger's backoff logic from kicking in.
The result can be a console full of

[ 3974.417106] ceph: osd11 10.3.14.138:6800 connection failed
[ 3974.423295] ceph: osd11 10.3.14.138:6800 connection failed
[ 3974.429709] ceph: osd11 10.3.14.138:6800 connection failed

Instead, if we get a fault, and have outstanding requests, but the osd
address hasn't changed and the connection never successfully connected in
the first place, do nothing to the osd connection.  The messenger layer
will back off and retry periodically, because we never connected and thus
the lossy bit is not set.

Instead, touch each request's r_stamp so that handle_timeout can tell the
request is still alive and kicking.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:47:01 -07:00
Sage Weil 3dd72fc0e6 ceph: rename r_sent_stamp r_stamp
Make variable name slightly more generic, since it will (soon)
reflect either the time the request was sent OR the time it was
last determined to be still retrying.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:47:00 -07:00
Sage Weil 3c3f2e32ef ceph: fix connection fault con_work reentrancy problem
The messenger fault was clearing the BUSY bit, for reasons unclear.  This
made it possible for the con->ops->fault function to reopen the connection,
and requeue work in the workqueue--even though the current thread was
already in con_work.

This avoids a problem where the client busy loops with connection failures
on an unreachable OSD, but doesn't address the root cause of that problem.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:46:59 -07:00
Sage Weil e4cb4cb8a0 ceph: prevent dup stale messages to console for restarting mds
Prevent duplicate 'mds0 caps stale' message from spamming the console every
few seconds while the MDS restarts.  Set s_renew_requested earlier, so that
we only print the message once, even if we don't send an actual request.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:46:58 -07:00
Sage Weil efd7576b23 ceph: fix pg pool decoding from incremental osdmap update
The incremental map decoding of pg pool updates wasn't skipping
the snaps and removed_snaps vectors.  This caused osd requests
to stall when pool snapshots were created or fs snapshots were
deleted.  Use a common helper for full and incremental map
decoders that decodes pools properly.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:46:57 -07:00
Sage Weil 80fc7314a7 ceph: fix mds sync() race with completing requests
The wait_unsafe_requests() helper dropped the mdsc mutex to wait
for each request to complete, and then examined r_node to get the
next request after retaking the lock.  But the request completion
removes the request from the tree, so r_node was always undefined
at this point.  Since it's a small race, it usually led to a
valid request, but not always.  The result was an occasional
crash in rb_next() while dereferencing node->rb_left.

Fix this by clearing the rb_node when removing the request from
the request tree, and not walking off into the weeds when we
are done waiting for a request.  Since the request we waited on
will _always_ be out of the request tree, take a ref on the next
request, in the hopes that it won't be.  But if it is, it's ok:
we can start over from the beginning (and traverse over older read
requests again).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:46:56 -07:00
Sage Weil 916623da10 ceph: only release unused caps with mds requests
We were releasing used caps (e.g. FILE_CACHE) from encode_inode_release
with MDS requests (e.g. setattr).  We don't carry refs on most caps, so
this code worked most of the time, but for setattr (utimes) we try to
drop Fscr.

This causes cap state to get slightly out of sync with reality, and may
result in subsequent mds revoke messages getting ignored.

Fix by only releasing unused caps.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:46:55 -07:00
Sage Weil 15637c8b12 ceph: clean up handle_cap_grant, handle_caps wrt session mutex
Drop session mutex unconditionally in handle_cap_grant, and do the
check_caps from the handle_cap_grant helper.  This avoids using a magic
return value.

Also avoid using a flag variable in the IMPORT case and call
check_caps at the appropriate point.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:46:54 -07:00
Sage Weil cdc2ce056a ceph: fix session locking in handle_caps, ceph_check_caps
Passing a session pointer to ceph_check_caps() used to mean it would leave
the session mutex locked.  That wasn't always possible if it wasn't passed
CHECK_CAPS_AUTHONLY.   If could unlock the passed session and lock a
differet session mutex, which was clearly wrong, and also emitted a
warning when it a racing CPU retook it and we did an unlock from the wrong
context.

This was only a problem when there was more than one MDS.

First, make ceph_check_caps unconditionally drop the session mutex, so that
it is free to lock other sessions as needed.  Then adjust the one caller
that passes in a session (handle_cap_grant) accordingly.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:46:53 -07:00
Sage Weil 4ea0043a29 ceph: drop unnecessary WARN_ON in caps migration
If we don't have the exported cap it's because we already released it. No
need to WARN.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:46:52 -07:00
Sage Weil 12eadc1900 ceph: fix null pointer deref of r_osd in debug output
This causes an oops when debug output is enabled and we kick
an osd request with no current r_osd (sometime after an osd
failure).  Check the pointer before dereferencing.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:46:51 -07:00
Sage Weil 0a990e7093 ceph: clean up service ticket decoding
Previously we would decode state directly into our current ticket_handler.
This is problematic if for some reason we fail to decode, because we end
up with half new state and half old state.

We are probably already in bad shape if we get an update we can't decode,
but we may as well be tidy anyway.  Decode into new_* temporaries and
update the ticket_handler only on success.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:46:47 -07:00
Dan Carpenter 99b437a925 AFS: Potential null dereference
It seems clear from the surrounding code that xpermits is allowed to be
NULL here.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-22 09:57:19 -07:00
Jeff Layton 556ae3bb32 NFS: don't try to decode GETATTR if DELEGRETURN returned error
The reply parsing code attempts to decode the GETATTR response even if
the DELEGRETURN portion of the compound returned an error. The GETATTR
response won't actually exist if that's the case and we're asking the
parser to read past the end of the response.

This bug is fairly benign. The parser catches this without reading past
the end of the response and decode_getfattr returns -EIO. Earlier
kernels however had decode_op_hdr using the READ_BUF macro, and this
bug would make this printk pop any time the client got an error from
a delegreturn:

kernel: decode_op_hdr: reply buffer overflowed in line XXXX

More recent kernels seem to have replaced this printk with a dprintk.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-03-22 05:34:13 -04:00
Ryusuke Konishi 2d8428acae nilfs2: fix duplicate call to nilfs_segctor_cancel_freev
Andreas Beckmann gave me a report that nilfs logged the following
warnings when it got a disk full:

  nilfs_sufile_do_cancel_free: segment 0 must be clean
  nilfs_sufile_do_cancel_free: segment 1 must be clean

These arise from a duplicate call to nilfs_segctor_cancel_freev in an
error path of log writer.  This will fix the issue.

Reported-by: Andreas Beckmann <debian@abeckmann.de>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-03-22 14:41:07 +09:00
Sage Weil 5b3dbb44ab ceph: release old ticket_blob buffer
Release the old ticket_blob buffer when we get an updated service ticket
from the monitor.  Previously these were getting leaked.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-20 21:33:11 -07:00
Sage Weil 807c86e2ce ceph: fix authenticator buffer size calculation
The buffer size was incorrectly calculated for the ceph_x_encrypt()
encapsulated ticket blob.  Use a helper (with correct arithmetic) and
BUG out if we were wrong.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-20 21:33:10 -07:00
Sage Weil 63733a0fc5 ceph: fix authenticator timeout
We were failing to reconnect to services due to an old authenticator, even
though we had the new ticket, because we weren't properly retrying the
connect handshake, because we were calling an old/incorrect helper that
left in_base_pos incorrect.  The result was a failure to reconnect to the
OSD or MDS (with an authentication error) if the MDS restarted after the
service had been up a few hours (long enough for the original authenticator
to be invalid).  This was only a problem if the AUTH_X authentication was
enabled.

Now that the 'negotiate' and 'connect' stages are fully separated, we
should use the prepare_read_connect() helper instead, and remove the
obsolete one.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-20 21:33:09 -07:00
Sage Weil 8b218b8a4a ceph: fix inode removal from snap realm when racing with migration
When an inode was dropped while being migrated between two MDSs,
i_cap_exporting_issued was non-zero such that issue caps were non-zero and
__ceph_is_any_caps(ci) was true.  This prevented the inode from being
removed from the snap realm, even as it was dropped from the cache.

Fix this by dropping any residual i_snap_realm ref in destroy_inode.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-20 21:33:08 -07:00
Sage Weil 052bb34af3 ceph: add missing locking to protect i_snap_realm_item during split
All ci->i_snap_realm_item/realm->inodes_with_caps manipulation should be
protected by realm->inodes_with_caps_lock.  This bug would have only bit
us in a rare race with a realm split (during some snap creations).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-20 21:33:07 -07:00
Sage Weil 978097c907 ceph: implemented caps should always be superset of issued caps
Added assertion, and cleared one case where the implemented caps were
not following the issued caps.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-20 21:33:06 -07:00
Tao Ma b23179681c ocfs2: Init meta_ac properly in ocfs2_create_empty_xattr_block.
You can't store a pointer that you haven't filled in yet and expect it
to work.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-03-19 14:53:52 -07:00
Tao Ma dfe4d3d6a6 ocfs2: Fix the update of name_offset when removing xattrs
When replacing a xattr's value, in some case we wipe its name/value
first and then re-add it. The wipe is done by
ocfs2_xa_block_wipe_namevalue() when the xattr is in the inode or
block. We currently adjust name_offset for all the entries which have
(offset < name_offset). This does not adjust the entrie we're replacing.
Since we are replacing the entry, we don't adjust the total entry count.
When we calculate a new namevalue location, we trust the entries
now-wrong offset in ocfs2_xa_get_free_start().  The solution is to
also adjust the name_offset for the replaced entry, allowing
ocfs2_xa_get_free_start() to calculate the new namevalue location
correctly.

The following script can trigger a kernel panic easily.

echo 'y'|mkfs.ocfs2 --fs-features=local,xattr -b 4K $DEVICE
mount -t ocfs2 $DEVICE $MNT_DIR
FILE=$MNT_DIR/$RANDOM
for((i=0;i<76;i++))
do
string_76="a$string_76"
done
string_78="aa$string_76"
string_82="aaaa$string_78"

touch $FILE
setfattr -n 'user.test1234567890' -v $string_76 $FILE
setfattr -n 'user.test1234567890' -v $string_78 $FILE
setfattr -n 'user.test1234567890' -v $string_82 $FILE

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-03-19 14:53:51 -07:00
Trond Myklebust d812e57582 NFS: Prevent another deadlock in nfs_release_page()
We should not attempt to free the page if __GFP_FS is not set. Otherwise we
can deadlock as per

  http://bugzilla.kernel.org/show_bug.cgi?id=15578

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
2010-03-19 13:55:17 -04:00
Linus Torvalds fc7f99cf36 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (205 commits)
  ceph: update for write_inode API change
  ceph: reset osd after relevant messages timed out
  ceph: fix flush_dirty_caps race with caps migration
  ceph: include migrating caps in issued set
  ceph: fix osdmap decoding when pools include (removed) snaps
  ceph: return EBADF if waiting for caps on closed file
  ceph: set osd request message front length correctly
  ceph: reset front len on return to msgpool; BUG on mismatched front iov
  ceph: fix snaptrace decoding on cap migration between mds
  ceph: use single osd op reply msg
  ceph: reset bits on connection close
  ceph: remove bogus mds forward warning
  ceph: remove fragile __map_osds optimization
  ceph: fix connection fault STANDBY check
  ceph: invalidate_authorizer without con->mutex held
  ceph: don't clobber write return value when using O_SYNC
  ceph: fix client_request_forward decoding
  ceph: drop messages on unregistered mds sessions; cleanup
  ceph: fix comments, locking in destroy_inode
  ceph: move dereference after NULL test
  ...

Fix trivial conflicts in Documentation/ioctl/ioctl-number.txt
2010-03-19 09:43:06 -07:00
Linus Torvalds 0a492fdef8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  cifs: trivial white space
  [CIFS] checkpatch cleanup
  cifs: add cifs_revalidate_file
  cifs: add a CIFSSMBUnixQFileInfo function
  cifs: add a CIFSSMBQFileInfo function
  cifs: overhaul cifs_revalidate and rename to cifs_revalidate_dentry
2010-03-19 09:36:18 -07:00
Jens Axboe b4b7a4ef09 Merge branch 'master' into for-linus
Conflicts:
	block/Kconfig

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-03-19 08:05:10 +01:00
Linus Torvalds 441f4058a0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (30 commits)
  Btrfs: fix the inode ref searches done by btrfs_search_path_in_tree
  Btrfs: allow treeid==0 in the inode lookup ioctl
  Btrfs: return keys for large items to the search ioctl
  Btrfs: fix key checks and advance in the search ioctl
  Btrfs: buffer results in the space_info ioctl
  Btrfs: use __u64 types in ioctl.h
  Btrfs: fix search_ioctl key advance
  Btrfs: fix gfp flags masking in the compression code
  Btrfs: don't look at bio flags after submit_bio
  btrfs: using btrfs_stack_device_id() get devid
  btrfs: use memparse
  Btrfs: add a "df" ioctl for btrfs
  Btrfs: cache the extent state everywhere we possibly can V2
  Btrfs: cache ordered extent when completing io
  Btrfs: cache extent state in find_delalloc_range
  Btrfs: change the ordered tree to use a spinlock instead of a mutex
  Btrfs: finish read pages in the order they are submitted
  btrfs: fix btrfs_mkdir goto for no free objectids
  Btrfs: flush data on snapshot creation
  Btrfs: make df be a little bit more understandable
  ...
2010-03-18 16:50:55 -07:00
Linus Torvalds 7c34691abe Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
  NFS: ensure bdi_unregister is called on mount failure.
  NFS: Avoid a deadlock in nfs_release_page
  NFSv4: Don't ignore the NFS_INO_REVAL_FORCED flag in nfs_revalidate_inode()
  nfs4: Make the v4 callback service hidden
  nfs: fix unlikely memory leak
  rpc client can not deal with ENOSOCK, so translate it into ENOCONN
2010-03-18 16:50:09 -07:00
Linus Torvalds 01d61d0d64 Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: don't warn about page discards on shutdown
  xfs: use scalable vmap API
  xfs: remove old vmap cache
2010-03-18 16:46:05 -07:00
Mark Fasheh b22b63ebaf ocfs2: Always try for maximum bits with new local alloc windows
What we were doing before was to ask for the current window size as the
maximum allocation. This had the effect of limiting the amount of allocation
we could get for the local alloc during times when the window size was
shrunk due to fragmentation. In some cases, that could actually *increase*
fragmentation by artificially limiting the number of bits we can accept. So
while we still want to ask for a minimum number of bits equal to window
size, there is no reason why we should limit the number of bits the local
alloc should accept. Hence always allow the maximum number of local alloc
bits.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-03-18 13:22:42 -07:00
Chris Mason 8ad6fcab56 Btrfs: fix the inode ref searches done by btrfs_search_path_in_tree
This is used by the inode lookup ioctl to follow all the backrefs up
to the subvol root.  But the search being done would sometimes land one
past the last item in the leaf instead of finding the backref.

This changes the search to look for the highest possible backref and hop
back one item.  It also fixes a leaked path on failure to find the root.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-18 12:23:10 -04:00
Chris Mason 1b53ac4d1b Btrfs: allow treeid==0 in the inode lookup ioctl
When a root id of 0 is sent to the inode lookup ioctl, it will
use the root of the file we're ioctling and pass the root id
back to userland along with the results.

This allows userland to do searches based on that root later on.


Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-18 12:17:05 -04:00
Chris Mason 90fdde147f Btrfs: return keys for large items to the search ioctl
The search ioctl was skipping large items entirely (ones that are too
big for the results buffer).  This changes things to at least copy
the item header so that we can send information about the item back to
userland.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-18 12:14:54 -04:00
Chris Mason abc6e1341b Btrfs: fix key checks and advance in the search ioctl
The search ioctl was working well for finding tree roots, but using it for
generic searches requires a few changes to how the keys are advanced.
This treats the search control min fields for objectid, type and offset
more like a key, where we drop the offset to zero once we bump the type,
etc.

The downside of this is that we are changing the min_type and min_offset
fields during the search, and so the ioctl caller needs extra checks to make sure
the keys in the result are the ones it wanted.

This also changes key_in_sk to use btrfs_comp_cpu_keys, just to make
things more readable.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-18 12:10:08 -04:00
Akinobu Mita c4af96449e ntfs: use bitmap_weight
Use bitmap_weight() instead of doing hweight32() for each u32 element in
the page.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Anton Altaparmakov <aia21@cantab.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-17 18:43:47 -07:00
Venkatesh Pallipadi bcc54e2a6d jffs2: fix up rb_root initializations to use RB_ROOT
jffs2 uses rb_node = NULL; to zero rb_root.

The problem with this is that 17d9ddc72f ("rbtree: Add
support for augmented rbtrees") in the linux-next tree adds a new field
to that struct which needs to be NULL as well.  This patch uses RB_ROOT
as the intializer so all of the relevant fields will be NULL'd.

Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Eric Paris <eparis@redhat.com>
Acked-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-17 18:43:47 -07:00
Mark Fasheh fcefd25ac8 ocfs2: set i_mode on disk during acl operations
ocfs2_set_acl() and ocfs2_init_acl() were setting i_mode on the in-memory
inode, but never setting it on the disk copy. Thus, acls were some times not
getting propagated between nodes. This patch fixes the issue by adding a
helper function ocfs2_acl_set_mode() which does this the right way.
ocfs2_set_acl() and ocfs2_init_acl() are then updated to call
ocfs2_acl_set_mode().

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-03-17 12:28:22 -07:00
Tao Ma 6527f8f848 ocfs2: Update i_blocks in reflink operations.
In reflink, we need to upate i_blocks for the target inode.

Reported-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-03-17 12:28:00 -07:00
Tao Ma 78c37eb0d5 ocfs2: Change bg_chain check for ocfs2_validate_gd_parent.
In ocfs2_validate_gd_parent, we check bg_chain against the
cl_next_free_rec of the dinode. Actually in resize, we have
the chance of bg_chain == cl_next_free_rec. So add some
additional condition check for it.

I also rename paramter "clean_error" to "resize", since the
old one is not clearly enough to indicate that we should only
meet with this case in resize.

btw, the correpsonding bug is
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1230.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-03-17 12:07:21 -07:00
Sachin Prabhu ee860b6a65 [PATCH] Skip check for mandatory locks when unlocking
ocfs2_lock() will skip locks on file which has mode set to 02666. This
is a problem in cases where the mode of the file is changed after a
process has obtained a lock on the file.

ocfs2_lock() should skip the check for mandatory locks when unlocking a
file.

Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-03-17 12:07:16 -07:00
Dave Chinner e8c3753ce4 xfs: don't warn about page discards on shutdown
If we are doing a forced shutdown, we can get lots of noise about
delalloc pages being discarded. This is happens by design during a
forced shutdown, so don't spam the logs with these messages.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-16 15:40:53 -05:00
Alex Elder 8a262e573d xfs: use scalable vmap API
Re-apply a commit that had been reverted due to regressions
that have since been fixed.

    From 95f8e302c0 Mon Sep 17 00:00:00 2001
    From: Nick Piggin <npiggin@suse.de>
    Date: Tue, 6 Jan 2009 14:43:09 +1100

    Implement XFS's large buffer support with the new vmap APIs. See the vmap
    rewrite (db64fe02) for some numbers. The biggest improvement that comes from
    using the new APIs is avoiding the global KVA allocation lock on every call.

    Signed-off-by: Nick Piggin <npiggin@suse.de>
    Reviewed-by: Christoph Hellwig <hch@infradead.org>
    Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>

Only modifications here were a minor reformat, plus making the patch
apply given the new use of xfs_buf_is_vmapped().

Modified-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-16 15:40:36 -05:00
Alex Elder cd9640a70d xfs: remove old vmap cache
Re-apply a commit that had been reverted due to regressions
that have since been fixed.

    Original commit: d2859751cd
    Author: Nick Piggin <npiggin@suse.de>
    Date: Tue, 6 Jan 2009 14:40:44 +1100

    XFS's vmap batching simply defers a number (up to 64) of vunmaps,
    and keeps track of them in a list. To purge the batch, it just goes
    through the list and calls vunamp on each one. This is pretty poor:
    a global TLB flush is generally still performed on each vunmap, with
    the most expensive parts of the operation being the broadcast IPIs
    and locking involved in the SMP callouts, and the locking involved
    in the vmap management -- none of these are avoided by just batching
    up the calls. I'm actually surprised it ever made much difference.
    (Now that the lazy vmap allocator is upstream, this description is
    not quite right, but the vunmap batching still doesn't seem to do
    much).

    Rip all this logic out of XFS completely. I will improve vmap
    performance and scalability directly in subsequent patch.

    Signed-off-by: Nick Piggin <npiggin@suse.de>
    Reviewed-by: Christoph Hellwig <hch@infradead.org>
    Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>

The only change I made was to use the "new" xfs_buf_is_vmapped()
function in a place it had been open-coded in the original.

Modified-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-16 15:40:19 -05:00
Chris Mason 7fde62bffb Btrfs: buffer results in the space_info ioctl
The space_info ioctl was using copy_to_user inside rcu_read_lock.  This
commit changes things to copy into a buffer first and then dump the
result down to userland.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-16 15:40:10 -04:00
Sage Weil ce769a2904 Btrfs: use __u64 types in ioctl.h
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-16 14:24:27 -04:00
Sage Weil 854d2c3531 Btrfs: fix search_ioctl key advance
key->type is u8, not u64.

fs/btrfs/ioctl.c: In function 'copy_to_sk':
fs/btrfs/ioctl.c:1024: warning: comparison is always true due to limited range of data type

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-16 14:24:27 -04:00
NeilBrown cfbc0683af NFS: ensure bdi_unregister is called on mount failure.
bdi_unregister is called by nfs_put_super which is only called by
generic_shutdown_super if ->s_root is not NULL.  So if we error out
in a circumstance where we called nfs_bdi_register (i.e. server !=
NULL) but have not set s_root, then we need to call bdi_unregister
explicitly in nfs_get_sb and various other *_get_sb() functions.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-03-15 15:37:45 -04:00
Dan Carpenter 8212cf7583 cifs: trivial white space
I fixed the indent level.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-03-15 15:19:47 +00:00
Nick Piggin ef5780c018 Btrfs: fix gfp flags masking in the compression code
GFP_FS must be masked out, NOFS can't be or'd in.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-15 11:05:57 -04:00
Chris Mason 5ff7ba3a79 Btrfs: don't look at bio flags after submit_bio
After callling submit_bio, the bio can be freed at any time.  The
btrfs submission thread helper was checking the bio flags too late,
which might not give the correct answer.

When CONFIG_DEBUG_PAGE_ALLOC is turned on, it can lead to oopsen.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-15 11:00:15 -04:00
Xiao Guangrong a343832f1a btrfs: using btrfs_stack_device_id() get devid
We can use btrfs_stack_device_id() to get dev_item->devid

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-15 11:00:14 -04:00
Akinobu Mita 91748467a5 btrfs: use memparse
Use memparse() instead of its own private implementation.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: linux-btrfs@vger.kernel.org
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-15 11:00:14 -04:00
Josef Bacik 1406e4327b Btrfs: add a "df" ioctl for btrfs
df is a very loaded question in btrfs.  This gives us a way to get the per-space
usage information so we can tell exactly what is in use where.  This will help
us figure out ENOSPC problems, and help users better understand where their disk
space is going.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-15 11:00:14 -04:00
Josef Bacik 2ac55d41b5 Btrfs: cache the extent state everywhere we possibly can V2
This patch just goes through and fixes everybody that does

lock_extent()
blah
unlock_extent()

to use

lock_extent_bits()
blah
unlock_extent_cached()

and pass around a extent_state so we only have to do the searches once per
function.  This gives me about a 3 mb/s boots on my random write test.  I have
not converted some things, like the relocation and ioctl's, since they aren't
heavily used and the relocation stuff is in the middle of being re-written.  I
also changed the clear_extent_bit() to only unset the cached state if we are
clearing EXTENT_LOCKED and related stuff, so we can do things like this

lock_extent_bits()
clear delalloc bits
unlock_extent_cached()

without losing our cached state.  I tested this thoroughly and turned on
LEAK_DEBUG to make sure we weren't leaking extent states, everything worked out
fine.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-15 11:00:13 -04:00
Josef Bacik 5a1a3df1f6 Btrfs: cache ordered extent when completing io
When finishing io we run btrfs_dec_test_ordered_pending, and then immediately
run btrfs_lookup_ordered_extent, but btrfs_dec_test_ordered_pending does that
already, so we're searching twice when we don't have to.  This patch lets us
pass a btrfs_ordered_extent in to btrfs_dec_test_ordered_pending so if we do
complete io on that ordered extent we can just use the one we found then instead
of having to do another btrfs_lookup_ordered_extent.  This made my fio job with
the other patch go from 24 mb/s to 29 mb/s.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-15 11:00:13 -04:00
Josef Bacik c2a128d28a Btrfs: cache extent state in find_delalloc_range
This patch makes us cache the extent state we find in find_delalloc_range since
we'll have to lock the extent later on in the function.  This will keep us from
re-searching for the rang when we try to lock the extent.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-15 11:00:13 -04:00
Josef Bacik 49958fd7db Btrfs: change the ordered tree to use a spinlock instead of a mutex
The ordered tree used to need a mutex, but currently all we use it for is to
protect the rb_tree, and a spin_lock is just fine for that.  Using a spin_lock
instead makes dbench run a little faster, 58 mb/s instead of 51 mb/s, and have
less latency, 3445.138 ms instead of 3820.633 ms.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-15 11:00:12 -04:00
Chris Mason 4125bf761c Btrfs: finish read pages in the order they are submitted
The endio is done at reverse order of bio vectors.

That means for a sequential read, the page first submitted will finish
last in a bio. Considering we will do checksum (making cache hot) for
every page, this does introduce delay (and chance to squeeze cache used
soon) for pages submitted at the begining.

I don't observe obvious performance difference with below patch at my
simple test, but seems more natural to finish read in the order they are
submitted.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-15 11:00:12 -04:00
Miao Xie 0be2e98173 btrfs: fix btrfs_mkdir goto for no free objectids
btrfs_mkdir() must jump to the place of ending transaction after
btrfs_find_free_objectid() failed. Or this transaction can't end.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-15 11:00:11 -04:00
Sage Weil 0bdb1db297 Btrfs: flush data on snapshot creation
Flush any delalloc extents when we create a snapshot, so that recently
written file data is always included in the snapshot.

A later commit will add the ability to snapshot without the flush, but
most people expect flushing.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-15 11:00:11 -04:00