Commit Graph

1126 Commits (ff782e0a131c7f669445c07fe5c7ba91e043b7ed)

Author SHA1 Message Date
Chris Mason 1edbb734b4 Btrfs: reduce CPU usage in the extent_state tree
Btrfs is currently mirroring some of the page state bits into
its extent state tree.  The goal behind this was to use it in supporting
blocksizes other than the page size.

But, we don't currently support that, and we're using quite a lot of CPU
on the rb tree and its spin lock.  This commit starts a series of
cleanups to reduce the amount of work done in the extent state tree as
part of each IO.

This commit:

* Adds the ability to lock an extent in the state tree and also set
other bits.  The idea is to do locking and delalloc in one call

* Removes the EXTENT_WRITEBACK and EXTENT_DIRTY bits.  Btrfs is using
a combination of the page bits and the ordered write code for this
instead.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-11 13:31:06 -04:00
Chris Mason e48c465bb3 Btrfs: Fix new state initialization order
As the extent state tree is manipulated, there are call backs
that are used to take extra actions when different state bits are set
or cleared.  One example of this is a counter for the total number
of delayed allocation bytes in a single inode and in the whole FS.

When new states are inserted, this callback is being done before we
properly setup the new state.  This hasn't caused problems before
because the lock bit was always done first, and the existing call backs
don't care about the lock bit.

This patch makes sure the state is properly setup before using the
callback, which is important for later optimizations that do more work
without using the lock bit.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-11 13:31:05 -04:00
Chris Mason 890871be85 Btrfs: switch extent_map to a rw lock
There are two main users of the extent_map tree.  The
first is regular file inodes, where it is evenly spread
between readers and writers.

The second is the chunk allocation tree, which maps blocks from
logical addresses to phyiscal ones, and it is 99.99% reads.

The mapping tree is a point of lock contention during heavy IO
workloads, so this commit switches things to a rw lock.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-11 13:31:05 -04:00
Chris Mason 57fd5a5ff8 Btrfs: tweak congestion backoff
The btrfs io submission thread tries to back off congested devices in
favor of rotating off to another disk.

But, it tries to make sure it submits at least some IO before rotating
on (the others may be congested too), and so it has a magic number of
requests it tries to write before it hops.

This makes the magic number smaller.  Testing shows that we're spending
too much time on congested devices and leaving the other devices idle.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-11 13:31:05 -04:00
Chris Mason a97adc9fff Btrfs: use larger nr_to_write for larger extents
When btrfs fills a large delayed allocation extent, it is a good idea
to try and convince the write_cache_pages caller to go ahead and
write a good chunk of that extent.  The extra IO is basically free
because we know it is contiguous.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-11 13:31:04 -04:00
Chris Mason 4f878e8475 Btrfs: reduce worker thread spin_lock_irq hold times
This changes the btrfs worker threads to batch work items
into a local list.  It allows us to pull work items in
large chunks and significantly reduces the number of times we
need to take the worker thread spinlock.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-11 13:31:04 -04:00
Chris Mason 4e3f9c5042 Btrfs: keep irqs on more often in the worker threads
The btrfs worker thread spinlock was being used both for the
queueing of IO and for the processing of ordered events.

The ordered events never happen from end_io handlers, and so they
don't need to use the _irq version of spinlocks.  This adds a
dedicated lock to the ordered lists so they don't have to run
with irqs off.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-11 13:31:04 -04:00
Chris Mason 40431d6c12 Btrfs: optimize set extent bit
The Btrfs set_extent_bit call currently searches the rbtree
every time it needs to find more extent_state objects to fill
the requested operation.

This adds a simple test with rb_next to see if the next object
in the tree was adjacent to the one we just found.  If so,
we skip the search and just use the next object.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-11 13:31:03 -04:00
Chris Mason 9042846bc7 Btrfs: Allow worker threads to exit when idle
The Btrfs worker threads don't currently die off after they have
been idle for a while, leading to a lot of threads sitting around
doing nothing for each mount.

Also, they are unable to start atomically (from end_io hanlders).

This commit reworks the worker threads so they can be started
from end_io handlers (just setting a flag that asks for a thread
to be added at a later date) and so they can exit if they
have been idle for a long time.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-11 13:30:56 -04:00
From: Nick Piggin 03e860bd9f btrfs: fix inode rbtree corruption
Node may not be inserted over existing node. This causes inode tree
corruption and I was seeing crashes in inode_tree_del which I can not
reproduce after this patch.

The other way to fix this would be to tie inode lifetime in the rbtree
with inode while not in freeing state. I had a look at this but it is
not so trivial at this point. At least this patch gets things working again.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Chris Mason <chris.mason@oracle.com>
Acked-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-08-21 10:09:44 +02:00
Linus Torvalds d6a0967c90 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: fix balancing oops when invalidate_inode_pages2 returns EBUSY
  Btrfs: correct error-handling zlib error handling
  Btrfs: remove superfluous NULL pointer check in btrfs_rename()
  Btrfs: make sure the async caching thread advances the key
  Btrfs: fix btrfs_remove_from_free_space corner case
2009-08-07 19:03:09 -07:00
Yan Zheng ceab36edd3 Btrfs: fix balancing oops when invalidate_inode_pages2 returns EBUSY
invalidate_inode_pages2_range may return -EBUSY occasionally
which results Oops. This patch fixes the issue by moving
invalidate_inode_pages2_range into a loop and keeping calling
it until the return value is not -EBUSY.

The EBUSY return is temporary, and can happen when the btrfs release page
function is unable to release a page because the EXTENT_LOCK
bit is set.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-08-07 13:51:33 -04:00
Julia Lawall 60f2e8f8a0 Btrfs: correct error-handling zlib error handling
find_zlib_workspace returns an ERR_PTR value in an error case instead of NULL.

A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)

// <smpl>
@match exists@
expression x, E;
statement S1, S2;
@@

x = find_zlib_workspace(...)
... when != x = E
(
*  if (x == NULL || ...) S1 else S2
|
*  if (x == NULL && ...) S1 else S2
)
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-08-07 13:51:33 -04:00
Bartlomiej Zolnierkiewicz 4baf8c9201 Btrfs: remove superfluous NULL pointer check in btrfs_rename()
This takes care of the following entry from Dan's list:

fs/btrfs/inode.c +4788 btrfs_rename(36) warning: variable derefenced before check 'old_inode'

Reported-by: Dan Carpenter <error27@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Eugene Teo <eteo@redhat.com>
Cc: Julia Lawall <julia@diku.dk>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-08-07 13:47:08 -04:00
Chris Mason 013f1b12f4 Btrfs: make sure the async caching thread advances the key
The async caching thread can end up looping forever if a given
search puts it at the last key in a leaf.  It will end up calling
btrfs_next_leaf and then checking if it needs to politely drop
the read semaphore.

Most of the time this looping isn't noticed because it is able to
make progress the next time around.  But, during log replay,
we wait on the async caching thread to finish, and the async thread
is waiting on the commit, and no progress is really made.

The fix used here is to copy the key out of the next leaf,
that way our search lands there properly.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-31 14:57:55 -04:00
Josef Bacik 6606bb97e1 Btrfs: fix btrfs_remove_from_free_space corner case
Yan Zheng hit a problem where we tried to remove some free space but failed
because we couldn't find the free space entry.  This is because the free space
was held within a bitmap that had a starting offset well before the actual
offset of the free space, and there were free space extents that were in the
same range as that offset, so tree_search_offset returned with NULL because we
couldn't find a free space extent that had that offset.  This is fixed by
making sure that if we fail to find the entry, we re-search again with
bitmap_only set to 1 and do an offset_to_bitmap so we can get the appropriate
bitmap.  A similar problem happens in btrfs_alloc_from_bitmap for the
clustering code, but that is not as bad since we will just go and redo our
cluster allocation.

Also this adds some debugging checks to make sure that the free space we are
trying to remove from the bitmap is in fact there.  This can probably go away
after a while, but since this code is only used by the tree-logging stuff it
would be nice to run with it for a while to make sure there are no problems.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-31 11:03:58 -04:00
Linus Torvalds ec6a8679fa Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: be more polite in the async caching threads
  Btrfs: preserve commit_root for async caching
2009-07-30 16:46:48 -07:00
Chris Mason f36f3042ea Btrfs: be more polite in the async caching threads
The semaphore used by the async caching threads can prevent a
transaction commit, which can make the FS appear to stall.  This
releases the semaphore more often when a transaction commit is
in progress.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-30 10:14:46 -04:00
Yan Zheng 276e680d19 Btrfs: preserve commit_root for async caching
The async block group caching code uses the commit_root pointer
to get a stable version of the extent allocation tree for scanning.
This copy of the tree root isn't going to change and it significantly
reduces the complexity of the scanning code.

During a commit, we have a loop where we update the extent allocation
tree root.  We need to loop because updating the root pointer in
the tree of tree roots may allocate blocks which may change the
extent allocation tree.

Right now the commit_root pointer is changed inside this loop.  It
is more correct to change the commit_root pointer only after all the
looping is done.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-30 09:40:40 -04:00
Linus Torvalds 655c5d8fc1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (22 commits)
  Btrfs: Fix async caching interaction with unmount
  Btrfs: change how we unpin extents
  Btrfs: Correct redundant test in add_inode_ref
  Btrfs: find smallest available device extent during chunk allocation
  Btrfs: clear all space_info->full after removing a block group
  Btrfs: make flushoncommit mount option correctly wait on ordered_extents
  Btrfs: Avoid delayed reference update looping
  Btrfs: Fix ordering of key field checks in btrfs_previous_item
  Btrfs: find_free_dev_extent doesn't handle holes at the start of the device
  Btrfs: Remove code duplication in comp_keys
  Btrfs: async block group caching
  Btrfs: use hybrid extents+bitmap rb tree for free space
  Btrfs: Fix crash on read failures at mount
  Btrfs: remove of redundant btrfs_header_level
  Btrfs: adjust NULL test
  Btrfs: Remove broken sanity check from btrfs_rmap_block()
  Btrfs: convert nested spin_lock_irqsave to spin_lock
  Btrfs: make sure all dirty blocks are written at commit time
  Btrfs: fix locking issue in btrfs_find_next_key
  Btrfs: fix double increment of path->slots[0] in btrfs_next_leaf
  ...
2009-07-28 14:27:06 -07:00
Yan Zheng f25784b35f Btrfs: Fix async caching interaction with unmount
- don't stop the caching thread until btrfs_commit_super return.

- if caching is interrupted by umount, set last to (u64)-1.
  otherwise the un-scanned range of block group will be considered
  as free extent.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-28 08:41:57 -04:00
Josef Bacik 68b38550dd Btrfs: change how we unpin extents
We are racy with async block caching and unpinning extents.  This patch makes
things much less complicated by only unpinning the extent if the block group is
cached.  We check the block_group->cached var under the block_group->lock spin
lock.  If it is set to BTRFS_CACHE_FINISHED then we update the pinned counters,
and unpin the extent and add the free space back.  If it is not set to this, we
start the caching of the block group so the next time we unpin extents we can
unpin the extent.  This keeps us from racing with the async caching threads,
lets us kill the fs wide async thread counter, and keeps us from having to set
DELALLOC bits for every extent we hit if there are caching kthreads going.

One thing that needed to be changed was btrfs_free_super_mirror_extents.  Now
instead of just looking for LOCKED extents, we also look for DIRTY extents,
since we could have left some extents pinned in the previous transaction that
will never get freed now that we are unmounting, which would cause us to leak
memory.  So btrfs_free_super_mirror_extents has been changed to
btrfs_free_pinned_extents, and it will clear the extents locked for the super
mirror, and any remaining pinned extents that may be present.  Thank you,

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-27 13:57:01 -04:00
Julia Lawall 631c07c8d1 Btrfs: Correct redundant test in add_inode_ref
dir has already been tested.  It seems that this test should be on the
recently returned value inode.

A simplified version of the semantic match that finds this problem is as
follows: (http://www.emn.fr/x-info/coccinelle/)

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-27 13:57:00 -04:00
Chris Mason 9779b72f05 Btrfs: find smallest available device extent during chunk allocation
Allocating new block group is easy when the disk has plenty of space.
But things get difficult as the disk fills up, especially if
the FS has been run through btrfs-vol -b.  The balance operation
is likely to make the total bytes available on the device greater
than the largest extent we'll actually be able to allocate.

But the device extent allocation code incorrectly assumes that a device
with 5G free will be able to allocate a 5G extent.  It isn't normally a
problem because device extents don't get freed unless btrfs-vol -b
is run.

This fixes the device extent allocator to remember the largest free
extent it can find, and then uses that value as a fallback.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-24 16:41:41 -04:00
Chris Mason 283bb1979f Btrfs: clear all space_info->full after removing a block group
Btrfs allocates individual extents from block groups, and each
block group has a specific type.  It may hold metadata, data
mirrored or striped etc.

When we balance space (btrfs-vol -b) or remove a drive (btrfs-vol -r)
we free block groups.  Once a block group is freed, the space it was
using on the device may be available for use by new block groups.

btrfs_remove_block_group was clearing the flag that said
'our devices are full, don't even try to allocate new block groups',
but it was only clearing that flag for a specific type of block group.

This commit clears the full flag for all of the types of block groups,
making it much more likely that we'll be able to balance space when
the drive is close to full.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-24 16:30:55 -04:00
Sage Weil ebecd3d9d2 Btrfs: make flushoncommit mount option correctly wait on ordered_extents
The commit_transaction call to wait_ordered_extents when snap_pending
passes nocow_only=1 to process only NOCOW or PREALLOC extents.  This isn't
correct for the 'flushoncommit' mode, as it skips extents we just started
IO on in start_delalloc_inodes.

So, in the flushoncommit case, wait on all ordered extents.  Otherwise,
only pass the nocow_only flag to wait_ordered_extents if snap_pending.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-24 13:17:44 -04:00
Yan Zheng d717aa1d31 Btrfs: Avoid delayed reference update looping
btrfs_split_leaf and btrfs_del_items can end up in a loop
where one is constantly spliting a given leaf and the other
is constantly merging it back with the adjacent nodes.

There is a better fix for this, but in the interest of something
small, this patch just changes btrfs_del_items back to balancing less
often.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-24 12:42:46 -04:00
Yan Zheng 0a4eefbb74 Btrfs: Fix ordering of key field checks in btrfs_previous_item
Check objectid of item before checking the item type, otherwise we may return
zero for a key that is actually too low.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-24 11:22:47 -04:00
Yan Zheng 1fcbac581b Btrfs: find_free_dev_extent doesn't handle holes at the start of the device
find_free_dev_extent does not properly handle the case where
the device is not complete free, and there is a free extent
at the beginning of the device.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-24 11:22:47 -04:00
Diego Calleja 20736abaa3 Btrfs: Remove code duplication in comp_keys
comp_keys is duplicating what is done in btrfs_comp_cpu_keys, so just
call it.

Signed-off-by: Diego Calleja <diegocg@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-24 11:22:46 -04:00
Josef Bacik 817d52f8db Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner.  Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation.  If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching.  This is how I tested
the speedup from this

mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo

Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.

Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group.  This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.

I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached.  This drastically reduces the amount of time it takes to
cache the block groups.  Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.

This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents.  Thank you,

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-24 09:23:39 -04:00
Josef Bacik 9630308170 Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space.  As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area.  Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.

This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache.  Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram.  The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.

Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space.  This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap.  This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.

I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree.  I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree.  This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.

I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff.  I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues.  I've not seen
any performance regressions in any of my tests.

Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems.  Thanks,

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-24 09:23:30 -04:00
David Woodhouse 83121942b2 Btrfs: Fix crash on read failures at mount
If the tree roots hit read errors during mount, btrfs is not properly
erroring out.  We need to check the uptodate bits after
reading in the tree root node.

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-22 16:52:13 -04:00
Daniel Cadete c271b49241 Btrfs: remove of redundant btrfs_header_level
This removes the continues call's of btrfs_header_level. One call of
btrfs_header_level(c) its enough.

Signed-off-by Daniel Cadete <danielncadete10@gmail.com>

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-22 16:52:13 -04:00
Julia Lawall 33c17ad571 Btrfs: adjust NULL test
Move the call to BUG_ON to before the dereference of the tested value.

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-22 16:49:01 -04:00
David Woodhouse 3acada49c2 Btrfs: Remove broken sanity check from btrfs_rmap_block()
It was never actually doing anything anyway (see the loop condition),
and it would be difficult to make it work for RAID[56].

Even if it was actually working, it's checking for the wrong thing
anyway. Instead of checking whether we list a block which _doesn't_ land
at the relevant physical location, it should be checking that we _have_
listed all the logical blocks which refer to the required physical
location on all devices.

This function is only called from remove_sb_from_cache() to ensure that
we reserve the logical blocks which would reside at the same physical
location as the superblock copies. So listing more blocks than we need
is actually OK.

With RAID[56] we're going to throw away an entire stripe for each block
we have to ignore, so we _are_ going to list blocks other than the
ones which actually contain the superblock.

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-22 16:49:01 -04:00
Julia Lawall 29c5e8ce01 Btrfs: convert nested spin_lock_irqsave to spin_lock
If spin_lock_irqsave is called twice in a row with the same second
argument, the interrupt state at the point of the second call overwrites
the value saved by the first call.  Indeed, the second call does not need
to save the interrupt state, so it is changed to a simple spin_lock.

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-22 16:49:00 -04:00
Yan Zheng 4a8c9a62d7 Btrfs: make sure all dirty blocks are written at commit time
Write dirty block groups may allocate new block, and so may add new delayed
back ref. btrfs_run_delayed_refs may make some block groups dirty.

commit_cowonly_roots does not handle the recursion properly, and some dirty
blocks can be left unwritten at commit time. This patch moves
btrfs_run_delayed_refs into the loop that writes dirty block groups, and makes
the code not break out of the loop until there are no dirty block groups or
delayed back refs.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-22 10:07:05 -04:00
Yan Zheng 33c66f430b Btrfs: fix locking issue in btrfs_find_next_key
When walking up the tree, btrfs_find_next_key assumes the upper level tree
block is properly locked. This isn't always true even path->keep_locks is 1.
This is because btrfs_find_next_key may advance path->slots[] several times
instead of only once.

When 'path->slots[level] >= btrfs_header_nritems(path->nodes[level])' is found,
we can't guarantee the original value of 'path->slots[level]' is
'btrfs_header_nritems(path->nodes[level]) - 1'. If it's not, the tree block at
'level + 1' isn't locked.

This patch fixes the issue by explicitly checking the locking state,
re-searching the tree if it's not locked.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-22 09:59:00 -04:00
Yan Zheng e457afec60 Btrfs: fix double increment of path->slots[0] in btrfs_next_leaf
if 1 is returned by btrfs_search_slot, the path already points to the
first item with 'key > searching key'. So increasing path->slots[0] by
one is superfluous in that case.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-22 09:59:00 -04:00
Yan Zheng bf1fb512a5 Btrfs: properly update space information after shrinking device.
Change 'goto done' to 'break' for the case of all device extents have
been freed, so that the code updates space information will be execute.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-22 09:59:00 -04:00
Yan Zheng 1bec1aed1e Btrfs: fix definition of struct btrfs_extent_inline_ref
use __le64 instead of u64 in on-disk structure definition.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-22 09:59:00 -04:00
Alexey Dobriyan 405f55712d headers: smp_lock.h redux
* Remove smp_lock.h from files which don't need it (including some headers!)
* Add smp_lock.h to files which do need it
* Make smp_lock.h include conditional in hardirq.h
  It's needed only for one kernel_locked() usage which is under CONFIG_PREEMPT

  This will make hardirq.h inclusion cheaper for every PREEMPT=n config
  (which includes allmodconfig/allyesconfig, BTW)

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-07-12 12:22:34 -07:00
Linus Torvalds 5291a12f05 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: fix error message formatting
  Btrfs: fix use after free in btrfs_start_workers fail path
  Btrfs: honor nodatacow/sum mount options for new files
  Btrfs: update backrefs while dropping snapshot
  Btrfs: account for space we may use in fallocate
  Btrfs: fix the file clone ioctl for preallocated extents
  Btrfs: don't log the inode in file_write while growing the file
2009-07-02 16:52:38 -07:00
Hu Tao 68f5a38c3e Btrfs: fix error message formatting
Make an error msg look nicer by inserting a space between number and word.

Signed-off-by: Hu Tao <hu.taoo@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-02 13:55:45 -04:00
Jiri Slaby 9b627e9bf4 Btrfs: fix use after free in btrfs_start_workers fail path
worker memory is already freed on one fail path in btrfs_start_workers,
but is still dereferenced. Switch the dereference and kfree.

Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-02 13:50:58 -04:00
Chris Mason 9427216476 Btrfs: honor nodatacow/sum mount options for new files
The btrfs attr patches unconditionally inherited the inode flags field
without honoring nodatacow and nodatasum.  This fix makes sure
we properly record the nodatacow/sum mount options in new inodes.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-02 13:41:17 -04:00
Yan Zheng 2c47e605a9 Btrfs: update backrefs while dropping snapshot
The new backref format has restriction on type of backref item.  If a tree
block isn't referenced by its owner tree, full backrefs must be used for the
pointers in it. When a tree block loses its owner tree's reference, backrefs
for the pointers in it should be updated to full backrefs. Current
btrfs_drop_snapshot misses the code that updates backrefs, so it's unsafe for
general use.

This patch adds backrefs update code to btrfs_drop_snapshot.  It isn't a
problem in the restricted form btrfs_drop_snapshot is used today, but for
general snapshot deletion this update is required.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-02 13:41:17 -04:00
Josef Bacik a970b0a16c Btrfs: account for space we may use in fallocate
Using Eric Sandeen's xfstest for fallocate, you can easily trigger a ENOSPC
panic on btrfs.  This is because we do not account for data we may use when
doing the fallocate.  This patch fixes the problem by properly reserving space,
and then just freeing it when we are done.  The reservation stuff was made with
delalloc in mind, so its a little crude for this case, but it keeps the box
from panicing.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-02 13:41:16 -04:00
Chris Mason c8a894d77d Btrfs: fix the file clone ioctl for preallocated extents 2009-07-02 13:41:16 -04:00
Chris Mason f597bb19cc Btrfs: don't log the inode in file_write while growing the file 2009-07-02 13:41:16 -04:00
Al Viro 72c04902d1 Get "no acls for this inode" right, fix shmem breakage
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-24 16:58:48 -04:00
Al Viro 073aaa1b14 helpers for acl caching + switch to those
helpers: get_cached_acl(inode, type), set_cached_acl(inode, type, acl),
forget_cached_acl(inode, type).

ubifs/xattr.c needed includes reordered, the rest is a plain switchover.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-24 08:17:07 -04:00
Al Viro 5affd88a10 switch btrfs to inode->i_acl
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-24 08:17:05 -04:00
Linus Torvalds d613839ef9 Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  block: remove some includings of blktrace_api.h
  mg_disk: seperate mg_disk.h again
  block: Introduce helper to reset queue limits to default values
  cfq: remove extraneous '\n' in blktrace output
  ubifs: register backing_dev_info
  btrfs: properly register fs backing device
  block: don't overwrite bdi->state after bdi_init() has been run
  cfq: cleanup for last_end_request in cfq_data
2009-06-16 11:46:45 -07:00
Linus Torvalds 69257cae20 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: always update root items for fs trees at commit time
2009-06-16 11:30:16 -07:00
Jens Axboe ad081f1430 btrfs: properly register fs backing device
btrfs assigns this bdi to all inodes on that file system, so make
sure it's registered. This isn't really important now, but will be
when we put dirty inodes there. Even now, we miss the stats when the
bdi isn't visible.

Also fixes failure to check bdi_init() return value, and bad inherit of
->capabilities flags from the default bdi.

Acked-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-06-16 08:21:03 +02:00
Yan Zheng 978d910d31 Btrfs: always update root items for fs trees at commit time
commit_fs_roots skips updating root items for fs trees that aren't modified.
This is unsafe now that relocation code modifies root item's last_snapshot
field without modifying corresponding fs tree.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-15 20:01:02 -04:00
Christoph Hellwig 5af7926ff3 enforce ->sync_fs is only called for rw superblock
Make sure a superblock really is writeable by checking MS_RDONLY
under s_umount.  sync_filesystems needed some re-arragement for
that, but all but one sync_filesystem caller had the correct locking
already so that we could add that check there.  cachefiles grew
s_umount locking.

I've also added a WARN_ON to sync_filesystem to assert this for
future callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:06 -04:00
Christoph Hellwig 59d697b702 btrfs: remove ->write_super and stop maintaining ->s_dirt
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:05 -04:00
Chris Mason b263c2c8bf Btrfs: fix extent_buffer leak during tree log replay
During tree log replay, we read in the tree log roots,
process them and then free them.  A recent change
takes an extra reference on the root node of the tree
when the root is read in, and stores that reference
in root->commit_root.

This reference was not being freed, leaving us with
one buffer pinned in ram for each subvol with
a tree log root after a crash.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-11 11:24:47 -04:00
Chris Mason 0b4dcea579 Btrfs: fix oops when btrfs_inherit_iflags called with a NULL dir
This happens during subvol creation.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-11 11:13:35 -04:00
Chris Mason 067c28adc5 Btrfs: fix -o nodatasum printk spelling
It was printing nodatacsum, which was not the correct option name.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-11 09:30:13 -04:00
Yan Zheng 85d4198e40 Btrfs: check duplicate backrefs for both data and metadata
lookup_inline_extent_backref only checks for duplicate backref for data
extents. It assumes backrefs for tree block never conflict.

This patch makes lookup_inline_extent_backref check for duplicate backrefs
for both data and tree block, so that we can detect potential bug earlier.
This is a safety check, strictly speaking it is not required.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-11 08:51:34 -04:00
Shin Hong fd0fb038d5 Btrfs: init worker struct fields before kthread-run
This patch fixes a bug which may result race condition
between btrfs_start_workers() and worker_loop().

btrfs_start_workers() executed in a parent thread writes
on workers->worker and worker_loop() in a child thread
reads workers->worker. However, there is no synchronization
enforcing the order of two operations.

This patch makes btrfs_start_workers() fill workers->worker
before it starts a child thread with worker_loop()

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 20:11:29 -04:00
Hisashi Hifumi 4eedeb75e7 Btrfs: pin buffers during write_dev_supers
write_dev_supers is called in sequence.  First is it called with wait == 0,
which starts IO on all of the super blocks for a given device.  Then it is
called with wait == 1 to make sure they all reach the disk.

It doesn't currently pin the buffers between the two calls, and it also
assumes the buffers won't go away between the two calls, leading to
an oops if the VM manages to free the buffers in the middle of the sync.

This fixes that assumption and updates the code to return an error if things
are not up to date when the wait == 1 run is done.

Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 16:49:25 -04:00
Chris Mason e5e9a5206a Btrfs: avoid races between super writeout and device list updates
On multi-device filesystems, btrfs writes supers to all of the devices
before considering a sync complete.  There wasn't any additional
locking between super writeout and the device list management code
because device management was done inside a transaction and
super writeout only happened  with no transation writers running.

With the btrfs fsync log and other async transaction updates, this
has been racey for some time.  This adds a mutex to protect
the device list.  The existing volume mutex could not be reused due to
transaction lock ordering requirements.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 15:17:02 -04:00
Al Viro 7df336ec12 Fix btrfs when ACLs are configured out
... otherwise generic_permission() will allow *anything* for all
files you don't own and that have some group permissions.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:36:43 -04:00
Hisashi Hifumi 524724ed1f Btrfs: fdatasync should skip metadata writeout
In btrfs, fdatasync and fsync are identical, but
fdatasync should skip committing transaction when
inode->i_state is set just I_DIRTY_SYNC and this indicates
only atime or/and mtime updates.
Following patch improves fdatasync throughput.

--file-block-size=4K --file-total-size=16G --file-test-mode=rndwr
--file-fsync-mode=fdatasync run

Results:
-2.6.30-rc8
Test execution summary:
    total time:                          1980.6540s
    total number of events:              10001
    total time taken by event execution: 1192.9804
    per-request statistics:
         min:                            0.0000s
         avg:                            0.1193s
         max:                            15.3720s
         approx.  95 percentile:         0.7257s

Threads fairness:
    events (avg/stddev):           625.0625/151.32
    execution time (avg/stddev):   74.5613/9.46

-2.6.30-rc8-patched
Test execution summary:
    total time:                          1695.9118s
    total number of events:              10000
    total time taken by event execution: 871.3214
    per-request statistics:
         min:                            0.0000s
         avg:                            0.0871s
         max:                            10.4644s
         approx.  95 percentile:         0.4787s

Threads fairness:
    events (avg/stddev):           625.0000/131.86
    execution time (avg/stddev):   54.4576/8.98

Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:53 -04:00
David Woodhouse 163e783e6a Btrfs: remove crc32c.h and use libcrc32c directly.
There's no need to preserve this abstraction; it used to let us use
hardware crc32c support directly, but libcrc32c is already doing that for us
through the crypto API -- so we're already using the Intel crc32c
acceleration where appropriate.

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:53 -04:00
Christoph Hellwig 6cbff00f46 Btrfs: implement FS_IOC_GETFLAGS/SETFLAGS/GETVERSION
Add support for the standard attributes set via chattr and read via
lsattr.  Currently we store the attributes in the flags value in
the btrfs inode, but I wonder whether we should split it into two so
that we don't have to keep converting between the two formats.

Remove the btrfs_clear_flag/btrfs_set_flag/btrfs_test_flag macros
as they were confusing the existing code and got in the way of the
new additions.

Also add the FS_IOC_GETVERSION ioctl for getting i_generation as it's
trivial.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:52 -04:00
Chris Mason c289811cc0 Btrfs: autodetect SSD devices
During mount, btrfs will check the queue nonrot flag
for all the devices found in the FS.  If they are all
non-rotating, SSD mode is enabled by default.

If the FS was mounted with -o nossd, the non-rotating
flag is ignored.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:52 -04:00
Chris Mason 451d7585a8 Btrfs: add mount -o ssd_spread to spread allocations out
Some SSDs perform best when reusing block numbers often, while
others perform much better when clustering strictly allocates
big chunks of unused space.

The default mount -o ssd will find rough groupings of blocks
where there are a bunch of free blocks that might have some
allocated blocks mixed in.

mount -o ssd_spread will make sure there are no allocated blocks
mixed in.  It should perform better on lower end SSDs.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:52 -04:00
Chris Mason c604480171 Btrfs: avoid allocation clusters that are too spread out
In SSD mode for data, and all the time for metadata the allocator
will try to find a cluster of nearby blocks for allocations.  This
commit adds extra checks to make sure that each free block in the
cluster is close to the last one.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:51 -04:00
Chris Mason 3b30c22f64 Btrfs: Add mount -o nossd
This allows you to turn off the ssd mode via remount.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:50 -04:00
Chris Mason d644d8a1e3 Btrfs: avoid IO stalls behind congested devices in a multi-device FS
The btrfs IO submission threads try to service a bunch of devices with a small
number of threads.  They do a congestion check to try and avoid waiting
on requests for a busy device.

The checks make sure we've sent a few requests down to a given device just so
that we aren't bouncing between busy devices without actually sending down
any IO.  The counter used to decide if we can switch to the next device
is somewhat overloaded.  It is also being used to decide if we've done
a good batch of requests between the WRITE_SYNC or regular priority lists.
It may get reset to zero often, leaving us hammering on a busy device
instead of moving on to another disk.

This commit adds a new counter for the number of bios sent while
servicing a device.  It doesn't get reset or fiddled with.  On
multi-device filesystems, this fixes IO stalls in streaming
write workloads.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:49 -04:00
Chris Mason d84275c938 Btrfs: don't allow WRITE_SYNC bios to starve out regular writes
Btrfs uses dedicated threads to submit bios when checksumming is on,
which allows us to make sure the threads dedicated to checksumming don't get
stuck waiting for requests.  For each btrfs device, there are
two lists of bios.  One list is for WRITE_SYNC bios and the other
is for regular priority bios.

The IO submission threads used to process all of the WRITE_SYNC bios first and
then switch to the regular bios.  This commit makes sure we don't completely
starve the regular bios by rotating between the two lists.

WRITE_SYNC bios are still favored 2:1 over the regular bios, and this tries
to run in batches to avoid seeking.  Benchmarking shows this eliminates
stalls during streaming buffered writes on both multi-device and
single device filesystems.

If the regular bios starve, the system can end up with a large amount of ram
pinned down in writeback pages.  If we are a little more fair between the two
classes, we're able to keep throughput up and make progress on the bulk of
our dirty ram.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:49 -04:00
Chris Mason 585ad2c379 Btrfs: fix metadata dirty throttling limits
Once a metadata block has been written, it must be recowed, so the
btrfs dirty balancing call has a check to make sure a fair amount of metadata
was actually dirty before it started writing it back to disk.

A previous commit had changed the dirty tracking for metadata without
updating the btrfs dirty balancing checks.  This commit switches it
to use the correct counter.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:48 -04:00
Chris Mason 2c943de6ad Btrfs: reduce mount -o ssd CPU usage
The block allocator in SSD mode will try to find groups of free blocks
that are close together.  This commit makes it loop less on a given
group size before bumping it.

The end result is that we are less likely to fill small holes in the
available free space, but we don't waste as much CPU building the
large cluster used by ssd mode.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:48 -04:00
Chris Mason cfbb930846 Btrfs: balance btree more often
With the new back reference code, the cost of a balance has gone down
in terms of the number of back reference updates done.  This commit
makes us more aggressively balance leaves and nodes as they become
less full.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:47 -04:00
Chris Mason b361242102 Btrfs: stop avoiding balancing at the end of the transaction.
When the delayed reference code was added, some checks were added
to avoid extra balancing while the delayed references were being flushed.
This made for less efficient btrees, but it reduced the chances of
loops where no forward progress was made because the balances made
more delayed ref updates.

With the new dead root removal code and the mixed back references,
the extent allocation tree is no longer using precise back refs, and
the delayed reference updates don't carry the risk of looping forever
anymore.  So, the balance avoidance is no longer required.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:47 -04:00
Yan Zheng 5d4f98a28c Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.

When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one.  At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.

The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root.  This commit reduces the
transaction overhead by avoiding the need for dead root records.

When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.

This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.

We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.

This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.

This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.

This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.

The improved balancing code scales significantly better with a large
number of snapshots.

This is a very large commit and was written in a number of
pieces.  But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:46 -04:00
Yan Zheng 5c939df56c btrfs: Fix set/clear_extent_bit for 'end == (u64)-1'
There are some 'start = state->end + 1;' like code in set_extent_bit
and clear_extent_bit. They overflow when end == (u64)-1.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:46 -04:00
Linus Torvalds 064e38aade Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: Fix oops and use after free during space balancing
  Btrfs: set device->total_disk_bytes when adding new device
2009-06-05 11:54:28 -07:00
Chris Mason 44fb551163 Btrfs: Fix oops and use after free during space balancing
The btrfs allocator uses list_for_each to walk the available block
groups when searching for free blocks.  It starts off with a hint
to help find the best block group for a given allocation.

The hint is resolved into a block group, but we don't properly check
to make sure the block group we find isn't in the middle of being
freed due to filesystem shrinking or balancing.  If it is being
freed, the list pointers in it are bogus and can't be trusted.  But,
the code happily goes along and uses them in the list_for_each loop,
leading to all kinds of fun.

The fix used here is to check to make sure the block group we find really
is on the list before we use it.  list_del_init is used when removing
it from the list, so we can do a proper check.

The allocation clustering code has a similar bug where it will trust
the block group in the current free space cluster.  If our allocation
flags have changed (going from single spindle dup to raid1 for example)
because the drives in the FS have changed, we're not allowed to use
the old block group any more.

The fix used here is to check the current cluster against the
current allocation flags.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-04 15:41:27 -04:00
Yan Zheng 2cc3c559fb Btrfs: set device->total_disk_bytes when adding new device
It was not being properly initialized, and so the size saved to
disk was not correct.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-04 09:23:57 -04:00
Linus Torvalds 5732c46849 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: Spelling fix in btrfs_lookup_first_block_group comments
  Btrfs: make show_options result match actual option names
  Btrfs: remove outdated comment in btrfs_ioctl_resize()
  Btrfs: remove some WARN_ONs in the IO failure path
  Btrfs: Don't loop forever on metadata IO failures
  Btrfs: init inode ordered_data_close flag properly
2009-05-14 19:18:44 -07:00
Sankar P 9f55684c2d Btrfs: Spelling fix in btrfs_lookup_first_block_group comments
Signed-off-by: Sankar P <sankar.curiosity@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-05-14 14:00:34 -04:00
Sage Weil 6b65c5c61b Btrfs: make show_options result match actual option names
The notreelog and flushoncommit mount options were being printed slightly
differently.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-05-14 14:00:34 -04:00
Li Hong 5d847a8ed9 Btrfs: remove outdated comment in btrfs_ioctl_resize()
In Li Zefan's commit dae7b665cf,
a combination call of kmalloc() and copy_from_user() is replaced by
memdup_user(). So btrfs_ioctl_resize() doesn't use GFP_NOFS any more.

Signed-off-by: Li Hong <lihong.hi@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-05-14 14:00:33 -04:00
Chris Mason cc7b0c9b70 Btrfs: remove some WARN_ONs in the IO failure path
These debugging WARN_ONs make too much console noise during regular
IO failures.  An IO failure will still generate a number of messages
as we verify checksums etc, but these two are not needed.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-05-14 14:00:33 -04:00
Chris Mason 76a05b35a3 Btrfs: Don't loop forever on metadata IO failures
When a btrfs metadata read fails, the first thing we try to do is find
a good copy on another mirror of the block.  If this fails, read_tree_block()
ends up returning a buffer that isn't up to date.

The btrfs btree reading code was reworked to drop locks and repeat
the search when IO was done, but the changes didn't add a check for failed
reads.  The end result was looping forever on buffers that were never
going to become up to date.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-05-14 14:00:32 -04:00
Chris Mason 2757495c90 Btrfs: init inode ordered_data_close flag properly
This flag is used to decide when we need to send a given file through
the ordered code to make sure it is fully written before a transaction
commits.  It was not being properly set to zero when the inode was
being setup.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-05-14 14:00:31 -04:00
Al Viro 6f5bbff9a1 Convert obvious places to deactivate_locked_super()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-05-09 10:49:40 -04:00
Linus Torvalds 4ebf662337 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: look for acls during btrfs_read_locked_inode
  Btrfs: fix acl caching
  Btrfs: Fix a bunch of printk() warnings.
  Btrfs: Fix a trivial warning using max() of u64 vs ULL.
  Btrfs: remove unused btrfs_bit_radix slab
  Btrfs: ratelimit IO error printks
  Btrfs: remove #if 0 code
  Btrfs: When shrinking, only update disk size on success
  Btrfs: fix deadlocks and stalls on dead root removal
  Btrfs: fix fallocate deadlock on inode extent lock
  Btrfs: kill btrfs_cache_create
  Btrfs: don't export symbols
  Btrfs: simplify makefile
  Btrfs: try to keep a healthy ratio of metadata vs data block groups
2009-04-27 11:16:33 -07:00
Chris Mason 46a53cca82 Btrfs: look for acls during btrfs_read_locked_inode
This changes btrfs_read_locked_inode() to peek ahead in the btree for acl items.
If it is certain a given inode has no acls, it will set the in memory acl
fields to null to avoid acl lookups completely.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-27 13:18:35 -04:00
Chris Mason 7b1a14bbb0 Btrfs: fix acl caching
Linus noticed the btrfs code to cache acls wasn't properly caching
a NULL acl when the inode didn't have any acls.  This meant the common
case of no acls resulted in expensive btree searches every time the
kernel checked permissions (which is quite often).

This is a modified version of Linus' original patch:

Properly set initial acl fields to BTRFS_ACL_NOT_CACHED in the inode.
This forces an acl lookup when permission checks are done.

Fix btrfs_get_acl to avoid lookups and locking when the inode acls fields
are set to null.

Fix btrfs_get_acl to use the right return value from __btrfs_getxattr
when deciding to cache a NULL acl.  It was storing a NULL acl when
__btrfs_getxattr return -ENOENT, but __btrfs_getxattr was actually returning
-ENODATA for this case.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-27 13:18:26 -04:00
Joel Becker 21380931eb Btrfs: Fix a bunch of printk() warnings.
Just happened to notice a bunch of %llu vs u64 warnings.  Here's a patch
to cast them all.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-27 08:37:49 -04:00
Joel Becker e63b6a6c0f Btrfs: Fix a trivial warning using max() of u64 vs ULL.
A small warning popped up on ia64 because inode-map.c was comparing a
u64 object id with the ULL FIRST_FREE_OBJECTID.  My first thought was
that all the OBJECTID constants should contain the u64 cast because
btrfs code deals entirely in u64s.  But then I saw how large that was,
and figured I'd just fix the max() call.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-27 08:37:49 -04:00
Chris Mason 45c06543af Btrfs: remove unused btrfs_bit_radix slab
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-27 08:37:48 -04:00