My overcommit stuff can be a little racy when we're filling up the disk with
fs_mark and we overcommit into things that quickly get used up for data. So use
num_bytes to see if we have enough available space so we're less likely to
overcommit ourselves out of the ability to make reservations. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We need to check the return value of filemap_write_and_wait in the space cache
writeout code. Also don't set the inode's generation until we're sure nothing
else is going to fail. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
In writing and reading the space cache we have one big loop that keeps track of
which page we are on and then a bunch of sizeable loops underneath this big loop
to try and read/write out properly. Especially in the write case this makes
things hugely complicated and hard to follow, and makes our error checking and
recovery equally complex. So add an io_ctl struct with a bunch of helpers to
keep track of the pages we have, where we are, if we have enough space etc.
This unifies how we deal with the pages we're writing and keeps all the messy
tracking internal. This allows us to kill the big loops in both the read and
write case and makes reviewing and changing the write and read paths much
simpler. I've run xfstests and stress.sh on this code and it survives. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
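A minimal sketch of the kind of page cursor the io_ctl change above describes.
The struct layout, the helper names and the fixed 4096 byte page size are
assumptions made for illustration, they are not taken from the patch itself.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size, illustration only */

/* Hypothetical cursor: which page we are on and how far into it we are. */
struct io_ctl_sketch {
	size_t num_pages;	/* pages available for the cache */
	size_t index;		/* page currently being filled */
	size_t offset;		/* bytes already used in that page */
};

static void io_ctl_sketch_init(struct io_ctl_sketch *ctl, size_t num_pages)
{
	ctl->num_pages = num_pages;
	ctl->index = 0;
	ctl->offset = 0;
}

/*
 * Account for one entry of len bytes.  Entries never straddle pages, so if
 * the current page cannot hold the entry we move on to the next page.
 * Returns 0 on success and -1 if the entry cannot fit or we are out of pages.
 */
static int io_ctl_sketch_add(struct io_ctl_sketch *ctl, size_t len)
{
	if (len > SKETCH_PAGE_SIZE || ctl->index >= ctl->num_pages)
		return -1;
	if (ctl->offset + len > SKETCH_PAGE_SIZE) {
		if (ctl->index + 1 >= ctl->num_pages)
			return -1;
		ctl->index++;
		ctl->offset = 0;
	}
	ctl->offset += len;
	return 0;
}
```

With the tracking hidden behind helpers like these, a caller only has to check
one return value per entry instead of carrying page and offset state through
nested loops.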
I noticed a slight bug where we will not bother writing out the block group
cache's space cache if its space tree is empty. Since it could have a cluster
or pinned extents that need to be written out this is just not a valid test.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Some users have requested this and I've found I needed a way to disable cache
loading without actually clearing the cache, so introduce the no_space_cache
option. Before, we checked the super block's cache generation field and if it was
populated we always turned space caching on. Now we check this and set the
space cache option on, and then parse the mount options so that if we want it
off it gets turned off. Then we check the mount option in all the places we do
the caching work instead of checking the super's cache generation. This makes
things more consistent and lets us turn space caching off. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
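The ordering described above (default from the super block first, mount options
second) can be sketched roughly like this. The function name and the way the
options are passed in are hypothetical, and treating a non-zero cache generation
as "populated" is an assumption. The option strings are the ones named in the
text above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Decide whether space caching is on.  The default comes from the super
 * block's cache generation, and the mount options are parsed afterwards so
 * they get the final say.
 */
static bool space_cache_enabled(unsigned long long cache_generation,
				const char *const *opts, size_t nr_opts)
{
	/* Assumption: "populated" means a non-zero generation. */
	bool enabled = cache_generation != 0;
	size_t i;

	for (i = 0; i < nr_opts; i++) {
		if (strcmp(opts[i], "space_cache") == 0)
			enabled = true;
		else if (strcmp(opts[i], "no_space_cache") == 0)
			enabled = false;
	}
	return enabled;
}
```

Everything that does caching work would then consult the one resulting flag
rather than looking at the super block again.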
Xfstests 79 was failing because we were inheriting the S_APPEND flag when we
weren't supposed to. There isn't any specific documentation on this so I'm
taking the test as the standard of how things work, and having S_APPEND set on a
directory doesn't mean that S_APPEND gets inherited by its children according to
this test. So only inherit btrfs specific things. This will let us set
compress/nocompress on specific directories and everything in the directories
will inherit this flag, same with nodatacow. With this patch test 79 passes.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
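A rough sketch of "only inherit btrfs specific things" as a mask over the
parent's flags. The flag names and bit values below are made up for the
example, the real inode flag definitions are not reproduced here.

```c
/* Hypothetical flag bits, for illustration only. */
#define SK_FLAG_APPEND		(1u << 0)	/* generic flag, not inherited */
#define SK_FLAG_COMPRESS	(1u << 1)
#define SK_FLAG_NOCOMPRESS	(1u << 2)
#define SK_FLAG_NODATACOW	(1u << 3)

/* The only flags a new child picks up from its parent directory. */
#define SK_INHERITED_MASK \
	(SK_FLAG_COMPRESS | SK_FLAG_NOCOMPRESS | SK_FLAG_NODATACOW)

static unsigned int inherit_flags(unsigned int parent_flags)
{
	return parent_flags & SK_INHERITED_MASK;
}
```

A directory with both the append and compress bits set would hand only the
compress bit down to its children.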
One of the things that kills us is the fact that our ENOSPC reservations are
horribly over the top in most normal cases. There isn't too much that can be
done about this because when we are completely full we really need them to work
like this so we don't under reserve. However if there are plenty of unallocated
chunks on the disk we can use that to gauge how much we can overcommit. So this
patch adds chunk free space accounting so we always know how much unallocated
space we have. Then if we fail to make a reservation within our allocated
space, check to see if we can overcommit. In the normal flushing case (like
with delalloc metadata reservations) we'll take the free space and divide it by
2 if our metadata profile is set up for DUP or any of those, and then divide it
by 8 to make sure we don't overcommit too much. Then if we're in a non-flushing
case (we really need this reservation now!) we only limit ourselves to half of
the free space. This makes this fio test
[torrent]
filename=torrent-test
rw=randwrite
size=4g
ioengine=sync
directory=/mnt/btrfs-test
go from taking around 45 minutes to 10 seconds on my freshly formatted 3 TiB
file system. This doesn't seem to break my other enospc tests, but could really
use some more testing as this is a super scary change. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
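The overcommit limits described above work out to something like the sketch
below. The function and parameter names are hypothetical. The text only
mentions the DUP halving for the flushing case, so applying it in the
non-flushing case as well is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * How far past our allocated space we are willing to overcommit, given the
 * amount of unallocated chunk space on the disk.
 */
static uint64_t overcommit_limit(uint64_t free_chunk_space,
				 bool mirrored_profile, bool can_flush)
{
	uint64_t avail = free_chunk_space;

	/* DUP and friends store everything twice, so only half is usable. */
	if (mirrored_profile)
		avail >>= 1;

	/*
	 * If we can flush, be conservative and use an eighth.  If we cannot
	 * flush we really need the reservation now, so allow up to half.
	 */
	if (can_flush)
		avail >>= 3;
	else
		avail >>= 1;

	return avail;
}
```

For example, with 1024 units of unallocated space and a DUP metadata profile,
a flushing reservation could overcommit by 64 units and a non-flushing one by
256.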
I noticed while running xfstests 83 that if we didn't have enough space to
delete our inode the orphan cleanup would just loop. This is because it keeps
finding the same orphan item and keeps trying to kill it but can't because we
don't get an error back from iput for deleting the inode. So keep track of the
last guy we tried to kill, if it's the same as the one we're trying to kill
currently we know we are having problems and can just error out. I don't have a
way to test this so look hard and make sure it's right. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
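A small model of the "remember the last guy we tried to kill" check from the
orphan cleanup fix above. The array of orphans stands in for the orphan items
on disk, and all of the names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory stand-in for an orphan item. */
struct orphan_sketch {
	uint64_t objectid;
	int deletable;	/* 0 models "not enough space to delete" */
	int present;	/* 1 while the orphan item still exists */
};

/*
 * Repeatedly pick the first remaining orphan and try to delete it.  The
 * delete itself reports no error, so a failure is only visible as the same
 * orphan turning up twice in a row.  Returns 0 when all orphans are gone
 * and -1 when we detect that we are stuck.
 */
static int orphan_cleanup_sketch(struct orphan_sketch *items, size_t n)
{
	uint64_t last_objectid = 0;
	int have_last = 0;

	for (;;) {
		size_t i;

		for (i = 0; i < n && !items[i].present; i++)
			;
		if (i == n)
			return 0;

		if (have_last && items[i].objectid == last_objectid)
			return -1;
		last_objectid = items[i].objectid;
		have_last = 1;

		if (items[i].deletable)
			items[i].present = 0;
	}
}
```

Without the last_objectid comparison the loop above would spin forever on the
first orphan that cannot be deleted.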
Xfstests 83 really stresses our ENOSPC since it uses a 100mb fs which ends up
with the mixed block group stuff. Because of this we can run into a situation
where we don't have enough space to delete inodes, or even worse we can't free
the inodes when we next mount the fs which causes the orphan code to lose its
mind. So if we fail to make our reservation, steal from the global reserve.
The global reserve will end up taking up the entire rest of the free space on
the fs in this worst case so there really is no other option. With this patch
test 83 doesn't freak out. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
While looking for a performance regression a user was complaining about, I
noticed that we had a regression with the varmail test of filebench. This was
introduced by
0d10ee2e6d
which keeps us from calling writepages in writepage. This is a correct change,
however it happens to help the varmail test because we write out in larger
chunks. This is largely to do with how we write out dirty pages for each
transaction. If you run filebench with
load varmail
set $dir=/mnt/btrfs-test
run 60
prior to that commit you would get ~1420 ops/second, but with it you get
~1200 ops/second. This is roughly a 15.5% decrease. So since we know the range of dirty
pages we want to write out, don't write out in one page chunks, write out in
ranges. So to do this we call filemap_fdatawrite_range() on the range of bytes.
Then we convert the DIRTY extents to NEED_WAIT extents. When we then call
btrfs_wait_marked_extents() we only have to filemap_fdatawait_range() on that
range and clear the NEED_WAIT extents. This doesn't get us back to our original
speeds, but I've been seeing ~1380 ops/second, which is a <5% regression as
opposed to a >15% regression. That is acceptable given that the original commit
greatly reduces our latency to begin with. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
If I have a range where I know a certain bit is set and I want to switch it to
another bit, the only option I have is to call set and then clear bit, which
will result in 2 tree searches. This is inefficient, so introduce
convert_extent_bit which will go through and set the bit I want and clear the
old bit I don't want in a single pass.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
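To illustrate the idea of converting in one pass instead of a set followed by
a clear, here is a sketch over a flat array of per-block state words. The real
extent state lives in a tree, so the array, the function name and the bit
values are simplifications assumed for the example.

```c
#include <stddef.h>

/* Stand-in bit values for the two states mentioned above. */
#define SK_DIRTY	(1u << 0)
#define SK_NEED_WAIT	(1u << 1)

/*
 * Set set_bits and clear clear_bits on every block in [start, end] while
 * walking the range only once.
 */
static void convert_bits_sketch(unsigned int *state, size_t start, size_t end,
				unsigned int set_bits, unsigned int clear_bits)
{
	size_t i;

	for (i = start; i <= end; i++)
		state[i] = (state[i] | set_bits) & ~clear_bits;
}
```

Calling convert_bits_sketch(st, 1, 2, SK_NEED_WAIT, SK_DIRTY) turns the DIRTY
blocks in that range into NEED_WAIT blocks and leaves the rest alone.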
There is a bug that may lead to early ENOSPC in our reservation code. We've
been checking against num_bytes which may be above and beyond what we want to
actually reserve, which could give us a false ENOSPC. Fix this by making sure
the unused space is above how much we want to reserve and not how much we're
trying to flush. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
In fixing how we deal with bad inodes, we had a regression in the orphan cleanup
code, since it expects to get a bad inode back. So fix it to deal with getting
-ESTALE back by deleting the orphan item manually and moving on. Thanks,
Reported-by: Simon Kirby <sim@hostway.ca>
Signed-off-by: Josef Bacik <josef@redhat.com>
Johannes pointed out we were allocating only kernel pages for doing writes,
which is kind of a big deal if you are on 32bit and have more than a gig of ram.
So fix our allocations to use the mapping's gfp but still clear __GFP_FS so we
don't re-enter. Thanks,
Reported-by: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Josef Bacik <josef@redhat.com>
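The allocation mask change above boils down to taking the mapping's mask and
dropping the filesystem re-entry bit. The bit values below are stand-ins, the
real __GFP_* constants come from the kernel headers, and the function name is
hypothetical.

```c
/* Stand-in bit values, not the real kernel constants. */
#define SK_GFP_FS	(1u << 0)	/* allocator may call back into the fs */
#define SK_GFP_IO	(1u << 1)
#define SK_GFP_HIGHMEM	(1u << 2)	/* highmem pages are allowed */

/* Keep whatever the mapping allows, minus the ability to re-enter the fs. */
static unsigned int write_alloc_mask(unsigned int mapping_gfp)
{
	return mapping_gfp & ~SK_GFP_FS;
}
```

Because the highmem bit survives the mask, a 32bit box is no longer limited to
kernel pages for these allocations.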
I kept getting warnings from evict because we were calling
btrfs_start_transaction() with a transaction already started when doing a
balance. This is because we remove a block group which requires a transaction,
and then put the last reference on the cache inode. Instead of doing this we
need to delay the iput so it is not done while a transaction is running.
This gets rid of our warnings. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Checksums are charged in 2 different ways. The first case is when we're writing
to the disk, we account for the new checksums with the delalloc block rsv. In
order for this to work we check if we're allocating a block for the csum root
and if trans->block_rsv == the delalloc block rsv. But when we're deleting the
csums because of cow, this is charged to the global block rsv, and is done when
we run the delayed refs. So we need to make sure that trans->block_rsv == NULL
when running the delayed refs. So set it to NULL and reset it in
should_end_transaction, and set it to NULL in commit_transaction. This got rid
of the ridiculous amount of warnings I was seeing when trying to do a balance.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
The only thing that we need to have a trans handle for is in
reserve_metadata_bytes and that's to know how much flushing we can do. So
instead of passing it around, just check current->journal_info for a
trans_handle so we know if we can commit a transaction to try and free up space
or not. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Since the durable block rsv stuff has been killed there is no need to get the
block_rsv in btrfs_free_tree_block anymore.
Signed-off-by: Josef Bacik <josef@redhat.com>
The alloc warnings everybody has been seeing are because we have been reserving
space for csums, but we weren't actually using that space. So make
get_block_rsv() return the trans->block_rsv if we're modifying the csum root.
Also set the trans->block_rsv to NULL so that if we modify the csum root when
running delayed refs, that comes out of the global reserve like it's supposed
to. With this patch I'm not seeing those alloc warnings anymore. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Since free space inodes now use normal checksumming we need to make sure to
account for their metadata use. So reserve metadata space, and then if we fail
to write out the metadata we can just release it, otherwise it will be freed up
when the io completes. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
In moving some enospc stuff around I noticed that when we unmount we are often
evicting the free space cache inodes before we do our last commit. This isn't
bad, but it makes us constantly have to re-read the inodes back. So instead
don't evict the cache until after we do our last commit, this will make things a
little less crappy and makes a future enospc change work properly. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
While debugging a different issue I noticed that we were always reserving space
when we tried to use our truncate block rsv's. This is because they didn't have
a ->size value, so use_block_rsv just assumes there is nothing reserved and it
does a reserve_metadata_bytes. This is because btrfs_check_block_rsv() doesn't
actually add to the size of the block rsv. That seems to be the right thing to
do so set ->size to the minimum truncate size we need, since we will always only
refill to that size anyway, and this way everything works out correctly.
Signed-off-by: Josef Bacik <josef@redhat.com>
If we have to emergency reserve space we need to not increase the block_rsv
size, otherwise we'll leak space. Take for instance delalloc, say we reserve
4k, and we use that 4k, and then we have to emergency allocate another 4k, we
bump the size up to 8k, however we've only accounted for 4k in reservations in
all of our supporting logic, so we'll go to free the 4k and end up having a size
of 4k, which will cause us to later not free as much space. I saw this doing
testing where I wasn't reserving enough space for something but was still
leaking space, very frustrating. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
When changing back to using a spin_lock to protect the extent counters I decided
that since we would only be dropping our original extent, it was ok to just drop
the extent and return. However since somebody else could have come in and done
a reservation, we need to do the normal song and dance to clear the reservation
out properly. So calculate how much space we need to free, and then subtract
what we just attempted to reserve. If it's more then we know we need to drop
those bytes from the delalloc block rsv. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We are setting ins_len to 1 even though we are just modifying an item that should
be there already. This may cause the search stuff to split nodes on the way
down needlessly. Set this to 0 since we aren't inserting anything. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
If you run xfstests 224 you will get lots of messages about not being able to
delete inodes and that they will be cleaned up next mount. This is because
btrfs_block_rsv_check was not calling reserve_metadata_bytes with the ability to
flush, so if there was not enough space, it simply failed. But in truncate and
evict case we could easily flush space to try and get enough space to do our
work, so make btrfs_block_rsv_check take a flush argument to pass down to
reserve_metadata_bytes. Now xfstests 224 runs fine without all those
complaints. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
With btrfs_truncate_inode_items we always return if we have to go to another
leaf, which makes us do our reservation again. This means we will only ever
modify one leaf at a time, so we only need 1 item's worth of slack space. Also,
since we are deleting we will not be creating nodes as we go down, if anything
we'll be freeing them as we merge them together, so make a different
calculation for truncate which will only have the worst case usage of COW'ing
the entire path down to the leaf. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
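A sketch of what a truncate-only worst case could look like: one leaf plus one
node per remaining level of the path. The function name, the exact formula and
the maximum tree height of 8 are assumptions for illustration.

```c
#include <stdint.h>

#define SK_MAX_LEVEL 8	/* assumed maximum tree height */

/*
 * Worst case for truncate: COW a single path from the root down to one
 * leaf, once per item.  No allowance is made for splitting nodes, since a
 * delete does not create new ones.
 */
static uint64_t trunc_metadata_size(uint64_t leafsize, uint64_t nodesize,
				    unsigned int num_items)
{
	return (leafsize + nodesize * (SK_MAX_LEVEL - 1)) * num_items;
}
```

With 4096 byte leaves and nodes, one item's worth of slack comes to 32768
bytes under these assumptions.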
Lukas found a problem where if he tries to fallocate over the same region twice
and the first fallocate took up all the space we would fail with ENOSPC. This
is because we reserve the total space we want to use for fallocate, regardless
of whether or not we will have to actually preallocate. So instead move the
check into the loop where we actually have to do the preallocate. Thanks,
Tested-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Josef Bacik <josef@redhat.com>
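The effect of moving the check into the loop can be sketched as only counting
the pieces of the range that still need preallocating. The range
representation and the names here are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one piece of the fallocate range. */
struct range_sketch {
	uint64_t len;
	int is_hole;	/* 1 if nothing is allocated here yet */
};

/* Only pieces that are still holes need space reserved for them. */
static uint64_t prealloc_bytes_needed(const struct range_sketch *r, size_t n)
{
	uint64_t need = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (r[i].is_hole)
			need += r[i].len;
	return need;
}
```

On the second fallocate over the same region every piece is already
allocated, so nothing is needed and the call no longer fails with ENOSPC.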
Currently we're starting and stopping a transaction for no real reason, so kill
that and just reserve enough space as if we can truncate all in one transaction.
Also use btrfs_block_rsv_check() for our reserve to minimize the amount of space
we may have to allocate for our slack space. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We will try and reserve metadata bytes in btrfs_block_rsv_check and if we cannot
because we have a transaction open it will return EAGAIN, so we do not need to
try and commit the transaction again.
Signed-off-by: Josef Bacik <josef@redhat.com>
The priority and refill_used flags are not used anymore, and neither is the
usage counter, so just remove them from btrfs_block_rsv.
Signed-off-by: Josef Bacik <josef@redhat.com>
A user reported getting spammed when moving to 3.0 by this message. Since we
switched to the normal checksumming infrastructure all old free space caches
will be wrong and need to be regenerated so people are likely to see this
message a lot, so ratelimit it so it doesn't fill up their logs and freak them
out. Thanks,
Reported-by: Andrew Lutomirski <luto@mit.edu>
Signed-off-by: Josef Bacik <josef@redhat.com>
I converted btrfs_truncate to do sane reservations for truncate, but didn't
convert btrfs_evict_inode. Basically we need to save the orphan_rsv for
deleting the orphan item, and do normal reservations for our truncate. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
This patch kills off the calculation for the amount of space needed for the
orphan operations during a snapshot. The thing is we only do snapshots on
commit, so any space that is in the block_rsv->freed[] isn't going to be in the
new snapshot anyway, so there isn't any reason to require that space to be
reserved for the snapshot to occur. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We have not been reserving enough space for checksums. We were just reserving
bytes for the checksum items themselves, we were not taking into account having
to cow the tree and such. This patch adds a csum_bytes counter to the inode for
keeping track of the number of bytes outstanding we have for checksums. Then we
calculate how many leaves would be required for the checksums we are given and
use that to reserve space. This adds a significant amount of bytes to our
reservations, but we will handle this later. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
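A rough sketch of turning outstanding checksum bytes into a number of leaves.
It assumes one checksum per sector of outstanding data and a fixed amount of
usable space per leaf. The function and its parameters are hypothetical, and
sectorsize and the per-leaf capacity must be non-zero.

```c
#include <stdint.h>

/*
 * Number of leaves needed to hold the checksums for csum_bytes of data,
 * rounding up so a partly filled leaf is still counted.
 */
static uint64_t csum_leaves_needed(uint64_t csum_bytes, uint64_t sectorsize,
				   uint64_t csum_size, uint64_t leaf_data_size)
{
	uint64_t num_csums = csum_bytes / sectorsize;
	uint64_t per_leaf = leaf_data_size / csum_size;

	return (num_csums + per_leaf - 1) / per_leaf;
}
```

The reservation would then be made per leaf (including the cost of cowing down
to it) rather than just for the checksum bytes themselves.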
We always look for delalloc bytes in our io_tree so we can fill in delalloc.
This is fine in most cases, but if we're writing out the btree_inode this is
just a superfluous tree search on the io_tree, and if we have a lot of metadata
dirty this could be an expensive check. So instead check to see if our io_tree
has a ->fill_delalloc op, and if not don't even bother doing the lookup.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We have been using bytes_reserved for metadata reservations, which is wrong
since we use that to keep track of outstanding reservations from the allocator.
This resulted in us doing a lot of silly things to make sure we don't allocate a
bunch of metadata chunks since we never had a real view of how much space was
actually in use by metadata.
This passes Arne's enospc test and xfstests as well as my own enospc tests.
Hopefully this will get us moving in the right direction. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We've only been able to mount with subvol=<whatever> where whatever was a subvol
within whatever root we had as the default. This allows us to mount -o
subvol=path/to/subvol/you/want relative from the normal fs_tree root. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Currently what we do is just wrong. We either
1) Alloc a new "root" dentry with sb->s_root as its parent which is just wrong
as we could walk into this subvol later on via another path and hilarity could
ensue. Also we don't check the return value of d_splice_alias which isn't good
either.
or
2) Do a d_find_alias(), which can find nothing if our dentry has been dropped
from cache by this point.
So use d_obtain_alias(). In the case that we already have the inode/dentry in
cache we will get the correct dentry. If not we will get a disconnected dentry
tree so if we walk into it later on everything will be connected up properly.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Moving things around to give us better packing in the btrfs_inode. This reduces
the size of our inode by 8 bytes. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
A user reported a problem where ceph was getting into 100% cpu usage while doing
some writing. It turns out it's because we were doing a short write on a not
uptodate page, which means we'd fall back to one page at a time and fault the
page in. The problem is our position is on the page boundary, so our fault in
logic wasn't actually reading the page, so we'd just spin forever or until the
page got read in by somebody else. This will force a readpage if we end up
doing a short copy. Alexandre could reproduce this easily with ceph and reports
it fixes his problem. I also wrote a reproducer that no longer hangs my box
with this patch. Thanks,
Reported-and-tested-by: Alexandre Oliva <aoliva@redhat.com>
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Fix a crash/BUG_ON in the clone ioctl due to insufficient reservation. We
need to reserve space for:
- adjusting the old extent (possibly splitting it)
- adding the new extent
- updating the inode
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We can race with readdir and the RCU path walking stuff. This is because we
clear the need lookup flag before actually instantiating the inode. This will
lead the RCU path walk stuff to find a dentry it thinks is valid without a
d_inode attached. So instead unhash the dentry when we first start the lookup,
and then clear the flag after we've instantiated the dentry so we're guaranteed
to either try the slow lookup, or have the d_inode set properly.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The recent reworking of btrfs' lseek led to incorrect
values being returned. This adds checks for seeking
beyond EOF in SEEK_HOLE and makes sure the error
values come back correct.
Andi Kleen also sent in similar patches.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reported-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
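A minimal sketch of the EOF check for SEEK_HOLE described above. The helper
name is hypothetical. Returning ENXIO for an offset at or beyond the end of
the file follows the documented lseek behaviour for SEEK_HOLE and SEEK_DATA on
Linux.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Validate the starting offset of a SEEK_HOLE request against the file
 * size.  Returns 0 if the seek may proceed, -ENXIO if offset is at or past
 * EOF.
 */
static int check_seek_hole_offset(int64_t offset, int64_t i_size)
{
	if (offset >= i_size)
		return -ENXIO;
	return 0;
}
```

So on a 4096 byte file a SEEK_HOLE starting at offset 4096 fails with ENXIO,
while one starting at offset 0 goes on to look for a hole.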