Commit Graph

17363 Commits (ec4318bcb4c59d8b8bf7037c9f444a9887ccb265)

Author SHA1 Message Date
Sage Weil ec4318bcb4 ceph: fix snap rebuild condition
We were rebuilding the snap context when it was not necessary
(i.e. when the realm seq hadn't changed _and_ the parent seq
was still older), which caused page snapc pointers to not match
the realm's snapc pointer (even though the snap context itself
was identical).  This confused begin_write and put it into an
endless loop.

The correct logic is: rebuild snapc if _my_ realm seq changed, or
if my parent realm's seq is newer than mine (and thus mine needs
to be rebuilt too).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:47:02 -07:00
Sage Weil 87b315a5b5 ceph: avoid reopening osd connections when address hasn't changed
We get a fault callback on _every_ tcp connection fault.  Normally, we
want to reopen the connection when that happens.  If the address we have
is bad, however, and connection attempts always result in a connection
refused or similar error, explicitly closing and reopening the msgr
connection just prevents the messenger's backoff logic from kicking in.
The result can be a console full of

[ 3974.417106] ceph: osd11 10.3.14.138:6800 connection failed
[ 3974.423295] ceph: osd11 10.3.14.138:6800 connection failed
[ 3974.429709] ceph: osd11 10.3.14.138:6800 connection failed

Instead, if we get a fault, and have outstanding requests, but the osd
address hasn't changed and the connection never successfully connected in
the first place, do nothing to the osd connection.  The messenger layer
will back off and retry periodically, because we never connected and thus
the lossy bit is not set.

Instead, touch each request's r_stamp so that handle_timeout can tell the
request is still alive and kicking.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:47:01 -07:00
Sage Weil 3dd72fc0e6 ceph: rename r_sent_stamp r_stamp
Make variable name slightly more generic, since it will (soon)
reflect either the time the request was sent OR the time it was
last determined to be still retrying.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:47:00 -07:00
Sage Weil 3c3f2e32ef ceph: fix connection fault con_work reentrancy problem
The messenger fault was clearing the BUSY bit, for reasons unclear.  This
made it possible for the con->ops->fault function to reopen the connection,
and requeue work in the workqueue--even though the current thread was
already in con_work.

This avoids a problem where the client busy loops with connection failures
on an unreachable OSD, but doesn't address the root cause of that problem.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:46:59 -07:00
Sage Weil e4cb4cb8a0 ceph: prevent dup stale messages to console for restarting mds
Prevent duplicate 'mds0 caps stale' message from spamming the console every
few seconds while the MDS restarts.  Set s_renew_requested earlier, so that
we only print the message once, even if we don't send an actual request.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:46:58 -07:00
Sage Weil efd7576b23 ceph: fix pg pool decoding from incremental osdmap update
The incremental map decoding of pg pool updates wasn't skipping
the snaps and removed_snaps vectors.  This caused osd requests
to stall when pool snapshots were created or fs snapshots were
deleted.  Use a common helper for full and incremental map
decoders that decodes pools properly.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:46:57 -07:00
Sage Weil 80fc7314a7 ceph: fix mds sync() race with completing requests
The wait_unsafe_requests() helper dropped the mdsc mutex to wait
for each request to complete, and then examined r_node to get the
next request after retaking the lock.  But the request completion
removes the request from the tree, so r_node was always undefined
at this point.  Since it's a small race, it usually led to a
valid request, but not always.  The result was an occasional
crash in rb_next() while dereferencing node->rb_left.

Fix this by clearing the rb_node when removing the request from
the request tree, and not walking off into the weeds when we
are done waiting for a request.  Since the request we waited on
will _always_ be out of the request tree, take a ref on the next
request, in the hopes that it won't be.  But if it is, it's ok:
we can start over from the beginning (and traverse over older read
requests again).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:46:56 -07:00
Sage Weil 916623da10 ceph: only release unused caps with mds requests
We were releasing used caps (e.g. FILE_CACHE) from encode_inode_release
with MDS requests (e.g. setattr).  We don't carry refs on most caps, so
this code worked most of the time, but for setattr (utimes) we try to
drop Fscr.

This causes cap state to get slightly out of sync with reality, and may
result in subsequent mds revoke messages getting ignored.

Fix by only releasing unused caps.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:46:55 -07:00
Sage Weil 15637c8b12 ceph: clean up handle_cap_grant, handle_caps wrt session mutex
Drop session mutex unconditionally in handle_cap_grant, and do the
check_caps from the handle_cap_grant helper.  This avoids using a magic
return value.

Also avoid using a flag variable in the IMPORT case and call
check_caps at the appropriate point.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:46:54 -07:00
Sage Weil cdc2ce056a ceph: fix session locking in handle_caps, ceph_check_caps
Passing a session pointer to ceph_check_caps() used to mean it would leave
the session mutex locked.  That wasn't always possible if it wasn't passed
CHECK_CAPS_AUTHONLY.   If could unlock the passed session and lock a
differet session mutex, which was clearly wrong, and also emitted a
warning when it a racing CPU retook it and we did an unlock from the wrong
context.

This was only a problem when there was more than one MDS.

First, make ceph_check_caps unconditionally drop the session mutex, so that
it is free to lock other sessions as needed.  Then adjust the one caller
that passes in a session (handle_cap_grant) accordingly.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:46:53 -07:00
Sage Weil 4ea0043a29 ceph: drop unnecessary WARN_ON in caps migration
If we don't have the exported cap it's because we already released it. No
need to WARN.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:46:52 -07:00
Sage Weil 12eadc1900 ceph: fix null pointer deref of r_osd in debug output
This causes an oops when debug output is enabled and we kick
an osd request with no current r_osd (sometime after an osd
failure).  Check the pointer before dereferencing.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:46:51 -07:00
Sage Weil 0a990e7093 ceph: clean up service ticket decoding
Previously we would decode state directly into our current ticket_handler.
This is problematic if for some reason we fail to decode, because we end
up with half new state and half old state.

We are probably already in bad shape if we get an update we can't decode,
but we may as well be tidy anyway.  Decode into new_* temporaries and
update the ticket_handler only on success.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-23 07:46:47 -07:00
Sage Weil 5b3dbb44ab ceph: release old ticket_blob buffer
Release the old ticket_blob buffer when we get an updated service ticket
from the monitor.  Previously these were getting leaked.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-20 21:33:11 -07:00
Sage Weil 807c86e2ce ceph: fix authenticator buffer size calculation
The buffer size was incorrectly calculated for the ceph_x_encrypt()
encapsulated ticket blob.  Use a helper (with correct arithmetic) and
BUG out if we were wrong.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-20 21:33:10 -07:00
Sage Weil 63733a0fc5 ceph: fix authenticator timeout
We were failing to reconnect to services due to an old authenticator, even
though we had the new ticket, because we weren't properly retrying the
connect handshake, because we were calling an old/incorrect helper that
left in_base_pos incorrect.  The result was a failure to reconnect to the
OSD or MDS (with an authentication error) if the MDS restarted after the
service had been up a few hours (long enough for the original authenticator
to be invalid).  This was only a problem if the AUTH_X authentication was
enabled.

Now that the 'negotiate' and 'connect' stages are fully separated, we
should use the prepare_read_connect() helper instead, and remove the
obsolete one.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-20 21:33:09 -07:00
Sage Weil 8b218b8a4a ceph: fix inode removal from snap realm when racing with migration
When an inode was dropped while being migrated between two MDSs,
i_cap_exporting_issued was non-zero such that issue caps were non-zero and
__ceph_is_any_caps(ci) was true.  This prevented the inode from being
removed from the snap realm, even as it was dropped from the cache.

Fix this by dropping any residual i_snap_realm ref in destroy_inode.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-20 21:33:08 -07:00
Sage Weil 052bb34af3 ceph: add missing locking to protect i_snap_realm_item during split
All ci->i_snap_realm_item/realm->inodes_with_caps manipulation should be
protected by realm->inodes_with_caps_lock.  This bug would have only bit
us in a rare race with a realm split (during some snap creations).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-20 21:33:07 -07:00
Sage Weil 978097c907 ceph: implemented caps should always be superset of issued caps
Added assertion, and cleared one case where the implemented caps were
not following the issued caps.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-03-20 21:33:06 -07:00
Linus Torvalds fc7f99cf36 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (205 commits)
  ceph: update for write_inode API change
  ceph: reset osd after relevant messages timed out
  ceph: fix flush_dirty_caps race with caps migration
  ceph: include migrating caps in issued set
  ceph: fix osdmap decoding when pools include (removed) snaps
  ceph: return EBADF if waiting for caps on closed file
  ceph: set osd request message front length correctly
  ceph: reset front len on return to msgpool; BUG on mismatched front iov
  ceph: fix snaptrace decoding on cap migration between mds
  ceph: use single osd op reply msg
  ceph: reset bits on connection close
  ceph: remove bogus mds forward warning
  ceph: remove fragile __map_osds optimization
  ceph: fix connection fault STANDBY check
  ceph: invalidate_authorizer without con->mutex held
  ceph: don't clobber write return value when using O_SYNC
  ceph: fix client_request_forward decoding
  ceph: drop messages on unregistered mds sessions; cleanup
  ceph: fix comments, locking in destroy_inode
  ceph: move dereference after NULL test
  ...

Fix trivial conflicts in Documentation/ioctl/ioctl-number.txt
2010-03-19 09:43:06 -07:00
Linus Torvalds 0a492fdef8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  cifs: trivial white space
  [CIFS] checkpatch cleanup
  cifs: add cifs_revalidate_file
  cifs: add a CIFSSMBUnixQFileInfo function
  cifs: add a CIFSSMBQFileInfo function
  cifs: overhaul cifs_revalidate and rename to cifs_revalidate_dentry
2010-03-19 09:36:18 -07:00
Linus Torvalds 441f4058a0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (30 commits)
  Btrfs: fix the inode ref searches done by btrfs_search_path_in_tree
  Btrfs: allow treeid==0 in the inode lookup ioctl
  Btrfs: return keys for large items to the search ioctl
  Btrfs: fix key checks and advance in the search ioctl
  Btrfs: buffer results in the space_info ioctl
  Btrfs: use __u64 types in ioctl.h
  Btrfs: fix search_ioctl key advance
  Btrfs: fix gfp flags masking in the compression code
  Btrfs: don't look at bio flags after submit_bio
  btrfs: using btrfs_stack_device_id() get devid
  btrfs: use memparse
  Btrfs: add a "df" ioctl for btrfs
  Btrfs: cache the extent state everywhere we possibly can V2
  Btrfs: cache ordered extent when completing io
  Btrfs: cache extent state in find_delalloc_range
  Btrfs: change the ordered tree to use a spinlock instead of a mutex
  Btrfs: finish read pages in the order they are submitted
  btrfs: fix btrfs_mkdir goto for no free objectids
  Btrfs: flush data on snapshot creation
  Btrfs: make df be a little bit more understandable
  ...
2010-03-18 16:50:55 -07:00
Linus Torvalds 7c34691abe Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
  NFS: ensure bdi_unregister is called on mount failure.
  NFS: Avoid a deadlock in nfs_release_page
  NFSv4: Don't ignore the NFS_INO_REVAL_FORCED flag in nfs_revalidate_inode()
  nfs4: Make the v4 callback service hidden
  nfs: fix unlikely memory leak
  rpc client can not deal with ENOSOCK, so translate it into ENOCONN
2010-03-18 16:50:09 -07:00
Linus Torvalds 01d61d0d64 Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: don't warn about page discards on shutdown
  xfs: use scalable vmap API
  xfs: remove old vmap cache
2010-03-18 16:46:05 -07:00
Chris Mason 8ad6fcab56 Btrfs: fix the inode ref searches done by btrfs_search_path_in_tree
This is used by the inode lookup ioctl to follow all the backrefs up
to the subvol root.  But the search being done would sometimes land one
past the last item in the leaf instead of finding the backref.

This changes the search to look for the highest possible backref and hop
back one item.  It also fixes a leaked path on failure to find the root.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-18 12:23:10 -04:00
Chris Mason 1b53ac4d1b Btrfs: allow treeid==0 in the inode lookup ioctl
When a root id of 0 is sent to the inode lookup ioctl, it will
use the root of the file we're ioctling and pass the root id
back to userland along with the results.

This allows userland to do searches based on that root later on.


Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-18 12:17:05 -04:00
Chris Mason 90fdde147f Btrfs: return keys for large items to the search ioctl
The search ioctl was skipping large items entirely (ones that are too
big for the results buffer).  This changes things to at least copy
the item header so that we can send information about the item back to
userland.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-18 12:14:54 -04:00
Chris Mason abc6e1341b Btrfs: fix key checks and advance in the search ioctl
The search ioctl was working well for finding tree roots, but using it for
generic searches requires a few changes to how the keys are advanced.
This treats the search control min fields for objectid, type and offset
more like a key, where we drop the offset to zero once we bump the type,
etc.

The downside of this is that we are changing the min_type and min_offset
fields during the search, and so the ioctl caller needs extra checks to make sure
the keys in the result are the ones it wanted.

This also changes key_in_sk to use btrfs_comp_cpu_keys, just to make
things more readable.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-18 12:10:08 -04:00
Akinobu Mita c4af96449e ntfs: use bitmap_weight
Use bitmap_weight() instead of doing hweight32() for each u32 element in
the page.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Anton Altaparmakov <aia21@cantab.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-17 18:43:47 -07:00
Venkatesh Pallipadi bcc54e2a6d jffs2: fix up rb_root initializations to use RB_ROOT
jffs2 uses rb_node = NULL; to zero rb_root.

The problem with this is that 17d9ddc72f ("rbtree: Add
support for augmented rbtrees") in the linux-next tree adds a new field
to that struct which needs to be NULL as well.  This patch uses RB_ROOT
as the intializer so all of the relevant fields will be NULL'd.

Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Eric Paris <eparis@redhat.com>
Acked-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-17 18:43:47 -07:00
Dave Chinner e8c3753ce4 xfs: don't warn about page discards on shutdown
If we are doing a forced shutdown, we can get lots of noise about
delalloc pages being discarded. This is happens by design during a
forced shutdown, so don't spam the logs with these messages.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-16 15:40:53 -05:00
Alex Elder 8a262e573d xfs: use scalable vmap API
Re-apply a commit that had been reverted due to regressions
that have since been fixed.

    From 95f8e302c0 Mon Sep 17 00:00:00 2001
    From: Nick Piggin <npiggin@suse.de>
    Date: Tue, 6 Jan 2009 14:43:09 +1100

    Implement XFS's large buffer support with the new vmap APIs. See the vmap
    rewrite (db64fe02) for some numbers. The biggest improvement that comes from
    using the new APIs is avoiding the global KVA allocation lock on every call.

    Signed-off-by: Nick Piggin <npiggin@suse.de>
    Reviewed-by: Christoph Hellwig <hch@infradead.org>
    Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>

Only modifications here were a minor reformat, plus making the patch
apply given the new use of xfs_buf_is_vmapped().

Modified-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-16 15:40:36 -05:00
Alex Elder cd9640a70d xfs: remove old vmap cache
Re-apply a commit that had been reverted due to regressions
that have since been fixed.

    Original commit: d2859751cd
    Author: Nick Piggin <npiggin@suse.de>
    Date: Tue, 6 Jan 2009 14:40:44 +1100

    XFS's vmap batching simply defers a number (up to 64) of vunmaps,
    and keeps track of them in a list. To purge the batch, it just goes
    through the list and calls vunamp on each one. This is pretty poor:
    a global TLB flush is generally still performed on each vunmap, with
    the most expensive parts of the operation being the broadcast IPIs
    and locking involved in the SMP callouts, and the locking involved
    in the vmap management -- none of these are avoided by just batching
    up the calls. I'm actually surprised it ever made much difference.
    (Now that the lazy vmap allocator is upstream, this description is
    not quite right, but the vunmap batching still doesn't seem to do
    much).

    Rip all this logic out of XFS completely. I will improve vmap
    performance and scalability directly in subsequent patch.

    Signed-off-by: Nick Piggin <npiggin@suse.de>
    Reviewed-by: Christoph Hellwig <hch@infradead.org>
    Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>

The only change I made was to use the "new" xfs_buf_is_vmapped()
function in a place it had been open-coded in the original.

Modified-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-03-16 15:40:19 -05:00
Chris Mason 7fde62bffb Btrfs: buffer results in the space_info ioctl
The space_info ioctl was using copy_to_user inside rcu_read_lock.  This
commit changes things to copy into a buffer first and then dump the
result down to userland.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-16 15:40:10 -04:00
Sage Weil ce769a2904 Btrfs: use __u64 types in ioctl.h
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-16 14:24:27 -04:00
Sage Weil 854d2c3531 Btrfs: fix search_ioctl key advance
key->type is u8, not u64.

fs/btrfs/ioctl.c: In function 'copy_to_sk':
fs/btrfs/ioctl.c:1024: warning: comparison is always true due to limited range of data type

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-16 14:24:27 -04:00
NeilBrown cfbc0683af NFS: ensure bdi_unregister is called on mount failure.
bdi_unregister is called by nfs_put_super which is only called by
generic_shutdown_super if ->s_root is not NULL.  So if we error out
in a circumstance where we called nfs_bdi_register (i.e. server !=
NULL) but have not set s_root, then we need to call bdi_unregister
explicitly in nfs_get_sb and various other *_get_sb() functions.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-03-15 15:37:45 -04:00
Dan Carpenter 8212cf7583 cifs: trivial white space
I fixed the indent level.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-03-15 15:19:47 +00:00
Nick Piggin ef5780c018 Btrfs: fix gfp flags masking in the compression code
GFP_FS must be masked out, NOFS can't be or'd in.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-15 11:05:57 -04:00
Chris Mason 5ff7ba3a79 Btrfs: don't look at bio flags after submit_bio
After callling submit_bio, the bio can be freed at any time.  The
btrfs submission thread helper was checking the bio flags too late,
which might not give the correct answer.

When CONFIG_DEBUG_PAGE_ALLOC is turned on, it can lead to oopsen.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-15 11:00:15 -04:00
Xiao Guangrong a343832f1a btrfs: using btrfs_stack_device_id() get devid
We can use btrfs_stack_device_id() to get dev_item->devid

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-15 11:00:14 -04:00
Akinobu Mita 91748467a5 btrfs: use memparse
Use memparse() instead of its own private implementation.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: linux-btrfs@vger.kernel.org
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-15 11:00:14 -04:00
Josef Bacik 1406e4327b Btrfs: add a "df" ioctl for btrfs
df is a very loaded question in btrfs.  This gives us a way to get the per-space
usage information so we can tell exactly what is in use where.  This will help
us figure out ENOSPC problems, and help users better understand where their disk
space is going.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-15 11:00:14 -04:00
Josef Bacik 2ac55d41b5 Btrfs: cache the extent state everywhere we possibly can V2
This patch just goes through and fixes everybody that does

lock_extent()
blah
unlock_extent()

to use

lock_extent_bits()
blah
unlock_extent_cached()

and pass around a extent_state so we only have to do the searches once per
function.  This gives me about a 3 mb/s boots on my random write test.  I have
not converted some things, like the relocation and ioctl's, since they aren't
heavily used and the relocation stuff is in the middle of being re-written.  I
also changed the clear_extent_bit() to only unset the cached state if we are
clearing EXTENT_LOCKED and related stuff, so we can do things like this

lock_extent_bits()
clear delalloc bits
unlock_extent_cached()

without losing our cached state.  I tested this thoroughly and turned on
LEAK_DEBUG to make sure we weren't leaking extent states, everything worked out
fine.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-15 11:00:13 -04:00
Josef Bacik 5a1a3df1f6 Btrfs: cache ordered extent when completing io
When finishing io we run btrfs_dec_test_ordered_pending, and then immediately
run btrfs_lookup_ordered_extent, but btrfs_dec_test_ordered_pending does that
already, so we're searching twice when we don't have to.  This patch lets us
pass a btrfs_ordered_extent in to btrfs_dec_test_ordered_pending so if we do
complete io on that ordered extent we can just use the one we found then instead
of having to do another btrfs_lookup_ordered_extent.  This made my fio job with
the other patch go from 24 mb/s to 29 mb/s.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-15 11:00:13 -04:00
Josef Bacik c2a128d28a Btrfs: cache extent state in find_delalloc_range
This patch makes us cache the extent state we find in find_delalloc_range since
we'll have to lock the extent later on in the function.  This will keep us from
re-searching for the rang when we try to lock the extent.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-15 11:00:13 -04:00
Josef Bacik 49958fd7db Btrfs: change the ordered tree to use a spinlock instead of a mutex
The ordered tree used to need a mutex, but currently all we use it for is to
protect the rb_tree, and a spin_lock is just fine for that.  Using a spin_lock
instead makes dbench run a little faster, 58 mb/s instead of 51 mb/s, and have
less latency, 3445.138 ms instead of 3820.633 ms.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-15 11:00:12 -04:00
Chris Mason 4125bf761c Btrfs: finish read pages in the order they are submitted
The endio is done at reverse order of bio vectors.

That means for a sequential read, the page first submitted will finish
last in a bio. Considering we will do checksum (making cache hot) for
every page, this does introduce delay (and chance to squeeze cache used
soon) for pages submitted at the begining.

I don't observe obvious performance difference with below patch at my
simple test, but seems more natural to finish read in the order they are
submitted.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-15 11:00:12 -04:00
Miao Xie 0be2e98173 btrfs: fix btrfs_mkdir goto for no free objectids
btrfs_mkdir() must jump to the place of ending transaction after
btrfs_find_free_objectid() failed. Or this transaction can't end.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-15 11:00:11 -04:00
Sage Weil 0bdb1db297 Btrfs: flush data on snapshot creation
Flush any delalloc extents when we create a snapshot, so that recently
written file data is always included in the snapshot.

A later commit will add the ability to snapshot without the flush, but
most people expect flushing.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-15 11:00:11 -04:00