Commit Graph

97 Commits (0b778e76c1e7ccf49f8980b594e72f984095fd26)

Author SHA1 Message Date
Yan, Zheng d68fc57b7e Btrfs: Metadata reservation for orphan inodes
reserve metadata space for handling orphan inodes

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 10:34:52 -04:00
Yan, Zheng 8929ecfa50 Btrfs: Introduce global metadata reservation
Reserve metadata space for extent tree, checksum tree and root tree

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 10:34:52 -04:00
Yan, Zheng 0ca1f7ceb1 Btrfs: Update metadata reservation for delayed allocation
Introduce metadata reservation context for delayed allocation
and update various related functions.

This patch also introduces EXTENT_FIRST_DELALLOC control bit for
set/clear_extent_bit. It tells set/clear_bit_hook whether they
are processing the first extent_state with EXTENT_DELALLOC bit
set. This change is important if set/clear_extent_bit involves
multiple extent_state.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 10:34:51 -04:00
Yan, Zheng a22285a6a3 Btrfs: Integrate metadata reservation with start_transaction
Besides simplify the code, this change makes sure all metadata
reservation for normal metadata operations are released after
committing transaction.

Changes since V1:

Add code that check if unlink and rmdir will free space.

Add ENOSPC handling for clone ioctl.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 10:34:50 -04:00
Linus Torvalds 18e41da89d Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: check for read permission on src file in the clone ioctl
2010-05-15 12:55:31 -07:00
Dan Rosenberg 5dc6416414 Btrfs: check for read permission on src file in the clone ioctl
The existing code would have allowed you to clone a file that was
only open for writing

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-15 12:05:50 -04:00
Linus Torvalds 795d580bae Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: add check for changed leaves in setup_leaf_for_split
  Btrfs: create snapshot references in same commit as snapshot
  Btrfs: fix small race with delalloc flushing waitqueue's
  Btrfs: use add_to_page_cache_lru, use __page_cache_alloc
  Btrfs: fix chunk allocate size calculation
  Btrfs: kill max_extent mount option
  Btrfs: fail to mount if we have problems reading the block groups
  Btrfs: check btrfs_get_extent return for IS_ERR()
  Btrfs: handle kmalloc() failure in inode lookup ioctl
  Btrfs: dereferencing freed memory
  Btrfs: Simplify num_stripes's calculation logical for __btrfs_alloc_chunk()
  Btrfs: Add error handle for btrfs_search_slot() in btrfs_read_chunk_tree()
  Btrfs: Remove unnecessary finish_wait() in wait_current_trans()
  Btrfs: add NULL check for do_walk_down()
  Btrfs: remove duplicate include in ioctl.c

Fix trivial conflict in fs/btrfs/compression.c due to slab.h include
cleanups.
2010-04-05 13:21:15 -07:00
Dan Carpenter 6cf8bfbf5e Btrfs: check btrfs_get_extent return for IS_ERR()
btrfs_get_extent() never returns NULL, only a valid pointer or ERR_PTR()

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-30 21:19:09 -04:00
Dan Carpenter c2b96929e2 Btrfs: handle kmalloc() failure in inode lookup ioctl
Return -ENOMEM if kmalloc() fails.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-30 21:19:09 -04:00
Dan Carpenter 683be16eb6 Btrfs: dereferencing freed memory
The original code dereferenced range on the next line.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-30 21:19:09 -04:00
Andrea Gelmini 2f3014fc2a Btrfs: remove duplicate include in ioctl.c
fs/btrfs/ioctl.c: ctree.h is included more than once.

Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-30 21:19:08 -04:00
Tejun Heo 5a0e3ad6af include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files.  percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed.  Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability.  As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

  http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
  only the necessary includes are there.  ie. if only gfp is used,
  gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
  blocks and try to put the new include such that its order conforms
  to its surrounding.  It's put in the include block which contains
  core kernel includes, in the same order that the rest are ordered -
  alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
  doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
  because the file doesn't have fitting include block), it prints out
  an error message indicating which .h file needs to be added to the
  file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
   over 4000 files, deleting around 700 includes and adding ~480 gfp.h
   and ~3000 slab.h inclusions.  The script emitted errors for ~400
   files.

2. Each error was manually checked.  Some didn't need the inclusion,
   some needed manual addition while adding it to implementation .h or
   embedding .c file was more appropriate for others.  This step added
   inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
   from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
   e.g. lib/decompress_*.c used malloc/free() wrappers around slab
   APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
   editing them as sprinkling gfp.h and slab.h inclusions around .h
   files could easily lead to inclusion dependency hell.  Most gfp.h
   inclusion directives were ignored as stuff from gfp.h was usually
   wildly available and often used in preprocessor macros.  Each
   slab.h inclusion directive was examined and added manually as
   necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
   were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
   distributed build env didn't work with gcov compiles) and a few
   more options had to be turned off depending on archs to make things
   build (like ipr on powerpc/64 which failed due to missing writeq).

   * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
   * powerpc and powerpc64 SMP allmodconfig
   * sparc and sparc64 SMP allmodconfig
   * ia64 SMP allmodconfig
   * s390 SMP allmodconfig
   * alpha SMP allmodconfig
   * um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
   a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-30 22:02:32 +09:00
Chris Mason 8ad6fcab56 Btrfs: fix the inode ref searches done by btrfs_search_path_in_tree
This is used by the inode lookup ioctl to follow all the backrefs up
to the subvol root.  But the search being done would sometimes land one
past the last item in the leaf instead of finding the backref.

This changes the search to look for the highest possible backref and hop
back one item.  It also fixes a leaked path on failure to find the root.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-18 12:23:10 -04:00
Chris Mason 1b53ac4d1b Btrfs: allow treeid==0 in the inode lookup ioctl
When a root id of 0 is sent to the inode lookup ioctl, it will
use the root of the file we're ioctling and pass the root id
back to userland along with the results.

This allows userland to do searches based on that root later on.


Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-18 12:17:05 -04:00
Chris Mason 90fdde147f Btrfs: return keys for large items to the search ioctl
The search ioctl was skipping large items entirely (ones that are too
big for the results buffer).  This changes things to at least copy
the item header so that we can send information about the item back to
userland.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-18 12:14:54 -04:00
Chris Mason abc6e1341b Btrfs: fix key checks and advance in the search ioctl
The search ioctl was working well for finding tree roots, but using it for
generic searches requires a few changes to how the keys are advanced.
This treats the search control min fields for objectid, type and offset
more like a key, where we drop the offset to zero once we bump the type,
etc.

The downside of this is that we are changing the min_type and min_offset
fields during the search, and so the ioctl caller needs extra checks to make sure
the keys in the result are the ones it wanted.

This also changes key_in_sk to use btrfs_comp_cpu_keys, just to make
things more readable.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-18 12:10:08 -04:00
Chris Mason 7fde62bffb Btrfs: buffer results in the space_info ioctl
The space_info ioctl was using copy_to_user inside rcu_read_lock.  This
commit changes things to copy into a buffer first and then dump the
result down to userland.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-16 15:40:10 -04:00
Sage Weil 854d2c3531 Btrfs: fix search_ioctl key advance
key->type is u8, not u64.

fs/btrfs/ioctl.c: In function 'copy_to_sk':
fs/btrfs/ioctl.c:1024: warning: comparison is always true due to limited range of data type

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-16 14:24:27 -04:00
Akinobu Mita 91748467a5 btrfs: use memparse
Use memparse() instead of its own private implementation.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: linux-btrfs@vger.kernel.org
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-15 11:00:14 -04:00
Josef Bacik 1406e4327b Btrfs: add a "df" ioctl for btrfs
df is a very loaded question in btrfs.  This gives us a way to get the per-space
usage information so we can tell exactly what is in use where.  This will help
us figure out ENOSPC problems, and help users better understand where their disk
space is going.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-15 11:00:14 -04:00
Josef Bacik 2ac55d41b5 Btrfs: cache the extent state everywhere we possibly can V2
This patch just goes through and fixes everybody that does

lock_extent()
blah
unlock_extent()

to use

lock_extent_bits()
blah
unlock_extent_cached()

and pass around a extent_state so we only have to do the searches once per
function.  This gives me about a 3 mb/s boots on my random write test.  I have
not converted some things, like the relocation and ioctl's, since they aren't
heavily used and the relocation stuff is in the middle of being re-written.  I
also changed the clear_extent_bit() to only unset the cached state if we are
clearing EXTENT_LOCKED and related stuff, so we can do things like this

lock_extent_bits()
clear delalloc bits
unlock_extent_cached()

without losing our cached state.  I tested this thoroughly and turned on
LEAK_DEBUG to make sure we weren't leaking extent states, everything worked out
fine.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-15 11:00:13 -04:00
Chris Mason 1e701a3292 Btrfs: add new defrag-range ioctl.
The btrfs defrag ioctl was limited to doing the entire file.  This
commit adds a new interface that can defrag a specific range inside
the file.

It can also force compression on the file, allowing you to selectively
compress individual files after they were created, even when mount -o
compress isn't turned on.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-15 11:00:10 -04:00
Chris Mason 940100a4a7 Btrfs: be more selective in the defrag ioctl
The btrfs defrag ioctl had some bugs around delalloc accounting, and it
wasn't properly skipping pages that were not in the mapping.

It wasn't properly clearing the page checked flag, which could make the
writeback code ignore the page forever while pinning it as dirty.

This commit fixes those problems and makes defrag a little smarter.  It
skips holes and it doesn't waste time defragging large extents.  If a
tiny extent comes before a very large extent, it will defrag both of
them to make sure the tiny extent ends up next to something big.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-15 11:00:10 -04:00
Josef Bacik 6ef5ed0d38 Btrfs: add ioctl and incompat flag to set the default mount subvol
This patch needs to go along with my previous patch.  This lets us set the
default dir item's location to whatever root we want to use as our default
mounting subvol.  With this we don't have to use mount -o subvol=<tree id>
anymore to mount a different subvol, we can just set the new one and it will
just magically work.  I've done some moderate testing with this, mostly just
switching the default mount around, mounting subvols and the default mount at
the same time and such, everything seems to work.  Thanks,

Older kernels would generally be able to still mount the filesystem with the
default subvolume set, but it would result in a different volume being mounted,
which could be an even more unpleasant suprise for users.  So if you set your
default subvolume, you can't go back to older kernels.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-15 11:00:08 -04:00
Chris Mason ac8e9819d7 Btrfs: add search and inode lookup ioctls
The search ioctl is a generic tool for doing btree searches from
userland applications.  The first user of the search ioctl is a
subvolume listing feature, but we'll also use it to find new
files in a subvolume.

The search ioctl allows you to specify min and max keys to search for,
along with min and max transid.  It returns the items along with a
header that includes the item key.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-15 10:55:10 -04:00
TARUISI Hiroaki 98d377a089 Btrfs: add a function to lookup a directory path by following backrefs
This will be used by the inode lookup ioctl.

Signed-off-by: TARUISI Hiroaki <taruishi.hiroak@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-03-15 10:55:09 -04:00
Yan, Zheng 86b9f2eca5 Btrfs: Fix per root used space accounting
The bytes_used field in root item was originally planned to
trace the amount of used data and tree blocks. But it never
worked right since we can't trace freeing of data accurately.
This patch changes it to only trace the amount of tree blocks.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-12-17 12:33:35 -05:00
Yan, Zheng 2e4bfab970 Btrfs: Avoid orphan inodes cleanup during committing transaction
btrfs_lookup_dentry may trigger orphan cleanup, so it's not good
to call it while committing a transaction.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-12-17 12:33:33 -05:00
Yan, Zheng 920bbbfb05 Btrfs: Rewrite btrfs_drop_extents
Rewrite btrfs_drop_extents by using btrfs_duplicate_item, so we can
avoid calling lock_extent within transaction.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-12-15 21:24:52 -05:00
Chris Mason ac6889cbb2 Btrfs: fix file clone ioctl for bookend extents
The file clone ioctl was incorrectly taking the offset into the
extent on disk into account when calculating the length of the
cloned extent.

The length never changes based on the offset into the physical extent.

Test case:

fallocate -l 1g image
mke2fs image
bcp image image2
e2fsck -f image2

(errors on image2)

The math bug ends up wrapping the length of the extent, and things
go wrong from there.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-10-09 11:29:53 -04:00
Yan, Zheng efefb1438b Btrfs: remove negative dentry when deleting subvolumne
The use of btrfs_dentry_delete is removing dentries from the
dcache when deleting subvolumne. btrfs_dentry_delete ignores
negative dentries. This is incorrect since if we don't remove
the negative dentry, its parent dentry can't be removed.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-10-09 09:25:16 -04:00
Sage Weil 1ab86aedbc Btrfs: fix error cases for ioctl transactions
Fix leak of vfsmount write reference and open_ioctl_trans reference on
ENOMEM.  Clean up the error paths while we're at it.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-29 18:38:44 -04:00
Josef Bacik 9ed74f2dba Btrfs: proper -ENOSPC handling
At the start of a transaction we do a btrfs_reserve_metadata_space() and
specify how many items we plan on modifying.  Then once we've done our
modifications and such, just call btrfs_unreserve_metadata_space() for
the same number of items we reserved.

For keeping track of metadata needed for data I've had to add an extent_io op
for when we merge extents.  This lets us track space properly when we are doing
sequential writes, so we don't end up reserving way more metadata space than
what we need.

The only place where the metadata space accounting is not done is in the
relocation code.  This is because Yan is going to be reworking that code in the
near future, so running btrfs-vol -b could still possibly result in a ENOSPC
related panic.  This patch also turns off the metadata_ratio stuff in order to
allow users to more efficiently use their disk space.

This patch makes it so we track how much metadata we need for an inode's
delayed allocation extents by tracking how many extents are currently
waiting for allocation.  It introduces two new callbacks for the
extent_io tree's, merge_extent_hook and split_extent_hook.  These help
us keep track of when we merge delalloc extents together and split them
up.  Reservations are handled prior to any actually dirty'ing occurs,
and then we unreserve after we dirty.

btrfs_unreserve_metadata_for_delalloc() will make the appropriate
unreservations as needed based on the number of reservations we
currently have and the number of extents we currently have.  Doing the
reservation outside of doing any of the actual dirty'ing lets us do
things like filemap_flush() the inode to try and force delalloc to
happen, or as a last resort actually start allocation on all delalloc
inodes in the fs.  This has survived dbench, fs_mark and an fsx torture
test.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-28 16:29:42 -04:00
Sage Weil 1fb58a6051 Btrfs: fix arithmetic error in clone ioctl
Fix an arithmetic error that was breaking extents cloned via the clone
ioctl starting in the second half of a file.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-21 16:00:27 -04:00
Yan, Zheng 76dda93c6a Btrfs: add snapshot/subvolume destroy ioctl
This patch adds snapshot/subvolume destroy ioctl.  A subvolume that isn't being
used and doesn't contains links to other subvolumes can be destroyed.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-21 16:00:26 -04:00
Yan, Zheng 4df27c4d5c Btrfs: change how subvolumes are organized
btrfs allows subvolumes and snapshots anywhere in the directory tree.
If we snapshot a subvolume that contains a link to other subvolume
called subvolA, subvolA can be accessed through both the original
subvolume and the snapshot. This is similar to creating hard link to
directory, and has the very similar problems.

The aim of this patch is enforcing there is only one access point to
each subvolume. Only the first directory entry (the one added when
the subvolume/snapshot was created) is treated as valid access point.
The first directory entry is distinguished by checking root forward
reference. If the corresponding root forward reference is missing,
we know the entry is not the first one.

This patch also adds snapshot/subvolume rename support, the code
allows rename subvolume link across subvolumes.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-21 15:56:00 -04:00
Chris Mason 83ebade34b Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable 2009-09-11 19:07:25 -04:00
Chris Mason a1ed835e1a Btrfs: Fix extent replacment race
Data COW means that whenever we write to a file, we replace any old
extent pointers with new ones.  There was a window where a readpage
might find the old extent pointers on disk and cache them in the
extent_map tree in ram in the middle of a given write replacing them.

Even though both the readpage and the write had their respective bytes
in the file locked, the extent readpage inserts may cover more bytes than
it had locked down.

This commit closes the race by keeping the new extent pinned in the extent
map tree until after the on-disk btree is properly setup with the new
extent pointers.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-11 13:31:07 -04:00
Alexey Dobriyan 405f55712d headers: smp_lock.h redux
* Remove smp_lock.h from files which don't need it (including some headers!)
* Add smp_lock.h to files which do need it
* Make smp_lock.h include conditional in hardirq.h
  It's needed only for one kernel_locked() usage which is under CONFIG_PREEMPT

  This will make hardirq.h inclusion cheaper for every PREEMPT=n config
  (which includes allmodconfig/allyesconfig, BTW)

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-07-12 12:22:34 -07:00
Chris Mason c8a894d77d Btrfs: fix the file clone ioctl for preallocated extents 2009-07-02 13:41:16 -04:00
Chris Mason 0b4dcea579 Btrfs: fix oops when btrfs_inherit_iflags called with a NULL dir
This happens during subvol creation.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-11 11:13:35 -04:00
Christoph Hellwig 6cbff00f46 Btrfs: implement FS_IOC_GETFLAGS/SETFLAGS/GETVERSION
Add support for the standard attributes set via chattr and read via
lsattr.  Currently we store the attributes in the flags value in
the btrfs inode, but I wonder whether we should split it into two so
that we don't have to keep converting between the two formats.

Remove the btrfs_clear_flag/btrfs_set_flag/btrfs_test_flag macros
as they were confusing the existing code and got in the way of the
new additions.

Also add the FS_IOC_GETVERSION ioctl for getting i_generation as it's
trivial.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:52 -04:00
Yan Zheng 5d4f98a28c Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.

When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one.  At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.

The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root.  This commit reduces the
transaction overhead by avoiding the need for dead root records.

When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.

This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.

We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.

This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.

This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.

This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.

The improved balancing code scales significantly better with a large
number of snapshots.

This is a very large commit and was written in a number of
pieces.  But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:46 -04:00
Linus Torvalds 5732c46849 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: Spelling fix in btrfs_lookup_first_block_group comments
  Btrfs: make show_options result match actual option names
  Btrfs: remove outdated comment in btrfs_ioctl_resize()
  Btrfs: remove some WARN_ONs in the IO failure path
  Btrfs: Don't loop forever on metadata IO failures
  Btrfs: init inode ordered_data_close flag properly
2009-05-14 19:18:44 -07:00
Li Hong 5d847a8ed9 Btrfs: remove outdated comment in btrfs_ioctl_resize()
In Li Zefan's commit dae7b665cf,
a combination call of kmalloc() and copy_from_user() is replaced by
memdup_user(). So btrfs_ioctl_resize() doesn't use GFP_NOFS any more.

Signed-off-by: Li Hong <lihong.hi@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-05-14 14:00:33 -04:00
Linus Torvalds 4ebf662337 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: look for acls during btrfs_read_locked_inode
  Btrfs: fix acl caching
  Btrfs: Fix a bunch of printk() warnings.
  Btrfs: Fix a trivial warning using max() of u64 vs ULL.
  Btrfs: remove unused btrfs_bit_radix slab
  Btrfs: ratelimit IO error printks
  Btrfs: remove #if 0 code
  Btrfs: When shrinking, only update disk size on success
  Btrfs: fix deadlocks and stalls on dead root removal
  Btrfs: fix fallocate deadlock on inode extent lock
  Btrfs: kill btrfs_cache_create
  Btrfs: don't export symbols
  Btrfs: simplify makefile
  Btrfs: try to keep a healthy ratio of metadata vs data block groups
2009-04-27 11:16:33 -07:00
Joel Becker 21380931eb Btrfs: Fix a bunch of printk() warnings.
Just happened to notice a bunch of %llu vs u64 warnings.  Here's a patch
to cast them all.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-27 08:37:49 -04:00
Chris Mason e980b50cda Btrfs: fix fallocate deadlock on inode extent lock
The btrfs fallocate call takes an extent lock on the entire range
being fallocated, and then runs through insert_reserved_extent on each
extent as they are allocated.

The problem with this is that btrfs_drop_extents may decide to try
and take the same extent lock fallocate was already holding.  The solution
used here is to push down knowledge of the range that is already locked
going into btrfs_drop_extents.

It turns out that at least one other caller had the same bug.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-24 15:46:05 -04:00
Li Zefan dae7b665cf btrfs: use memdup_user()
Remove open-coded memdup_user().

Note this changes some GFP_NOFS to GFP_KERNEL, since copy_from_user() may
cause pagefault, it's pointless to pass GFP_NOFS to kmalloc().

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-04-20 23:02:50 -04:00
Al Viro ce3b0f8d5c New helper - current_umask()
current->fs->umask is what most of fs_struct users are doing.
Put that into a helper function.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-03-31 23:00:26 -04:00