linux


Author	SHA1	Message	Date
Sage Weil	c5c9cd4d1b	Btrfs: allow clone of an arbitrary file range This patch adds an additional CLONE_RANGE ioctl to clone an arbitrary (block-aligned) file range to another file. The original CLONE ioctl becomes a special case of cloning the entire file range. The logic is a bit more complex now since ranges may be cloned to different offsets, and because we may only be cloning the beginning or end of a particular extent or checksum item. An additional sanity check ensures the source and destination files aren't the same (which would previously deadlock), although eventually this could be extended to allow the duplication of file data at a different offset within the same file. Any extents within the destination range in the target file are dropped. We currently do not cope with the case where a compressed inline extent needs to be split. This will probably require decompressing the extent into a temporary address_space, and inserting just the cloned portion as a new compressed inline extent. For now, just return -EINVAL in this case. Note that this never comes up in the more common case of cloning an entire file. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-11-12 14:32:25 -05:00
Chris Mason	2ed6d66408	Btrfs: Fix handling of space info full during allocations When we fail to allocate a new block group, we should still do the checks to make sure allocations try again with the minimum requested allocation size. This also fixes a deadlock that comes from a missed down_read in the chunk allocation failure handling. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-11-13 09:59:33 -05:00
Chris Mason	6f3577bdc7	Btrfs: Improve metadata read latencies This fixes latency problems on metadata reads by making sure they don't go through the async submit queue, and by tuning down the amount of readahead done during btree searches. Also, the btrfs bdi congestion function is tuned to ignore the number of pending async bios and checksums pending. There is additional code that throttles new async bios now and the congestion function doesn't need to worry about it anymore. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-11-13 09:59:36 -05:00
Chris Mason	5b050f04c8	Btrfs: Fix compile warnings on 32 bit machines Simple casting here and there to fix things up. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-11-11 09:34:41 -05:00
Yan Zheng	8247b41ac9	Btrfs: Fix starting search offset inside btrfs_drop_extents btrfs_drop_extents will drop paths and search again when it needs to force COW of higher nodes. It was using the key it found during the last search as the offset for the next search. But, this wasn't always correct. The key could be from before our desired range, and because we're dropping the path, it is possible for file's items to change while we do the search again. The fix here is to make sure we don't search for something smaller than the offset btrfs_drop_extents was called with. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-11-11 09:33:29 -05:00
Chris Mason	8a1413a296	Btrfs: empty_size allocation fixes again The allocator wasn't catching all of the cases where it needed to do extra loops because the check to enforce them wasn't happening early enough. When the allocator decided to increase the size of the allocation for metadata clustering, it wasn't always setting the empty_size to include the extra (optional) bytes. This also fixes the empty_size field to be correct. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-11-10 16:13:54 -05:00
Chris Mason	240d5d482b	Btrfs: tune btrfs unplug functions for a small number of devices When btrfs unplugs, it tries to find the correct device to unplug via search through the extent_map tree. This avoids unplugging a device that doesn't need it, but is a waste of time for filesystems with a small number of devices. This patch checks the total number of devices before doing the search. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-11-10 13:08:31 -05:00
Chris Mason	b47eda8690	Btrfs: Turn off extent state leak debugging The extent_io.c code has a #define to find and cleanup extent state leaks on module unmount. This adds a very highly contended spinlock to a hot path for most FS operations. Turn it off by default. A later changeset will add a .config option for it. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-11-10 12:34:40 -05:00
Chris Mason	445a694499	Btrfs: Fix usage of struct extent_map->orig_start This makes sure the orig_start field in struct extent_map gets set everywhere the extent_map structs are created or modified. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-11-10 11:53:33 -05:00
Chris Mason	39be25cd89	Btrfs: Use invalidatepage when writepage finds a page outside of i_size With all the recent fixes to the delalloc locking, it is now safe again to use invalidatepage inside the writepage code for pages outside of i_size. This used to deadlock against some of the code to write locked ranges of pages, but all of that has been fixed. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-11-10 11:50:50 -05:00
Chris Mason	f5a31e1667	Btrfs: Try harder while searching for free space The loop searching for free space would exit out too soon when metadata clustering was trying to allocate a large extent. This makes sure a full scan of the free space is done searching for only the minimum extent size requested by the higher layers. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-11-10 11:47:09 -05:00
Chris Mason	e04ca626ba	Btrfs: Fix use after free during compressed reads Yan's fix to use the correct file offset during compressed reads used the extent_map struct pointer after it had been freed. This saves the fields we want for later use instead. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-11-10 11:44:58 -05:00
Yan Zheng	ff5b7ee33d	Btrfs: Fix csum error for compressed data The decompress code doesn't take the logical offset in extent pointer into account. If the logical offset isn't zero, data will be decompressed into wrong pages. The solution used here is to record the starting offset of the extent in the file separately from the logical start of the extent_map struct. This allows us to avoid problems inserting overlapping extents. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>	2008-11-10 07:34:43 -05:00
Chris Mason	f2b1c41cf9	Btrfs: Make sure pages are dirty before doing delalloc for them This adds a PageDirty check to the writeback path that locks pages for delalloc. If a page wasn't dirty at this point, it is in the process of being truncated away. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-11-10 07:31:30 -05:00
Chris Mason	5b7c3fcc46	Btrfs: Don't subtract too much from the allocation target (avoid wrapping) When metadata allocation clustering has to fall back to unclustered allocs because large free areas could not be found, it was sometimes subtracting too much from the total bytes to allocate. This would make it wrap below zero. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-11-10 07:26:33 -05:00
Chris Mason	5f2cc086cc	Btrfs: Avoid unplug storms during commit While doing a commit, btrfs makes sure all the metadata blocks were properly written to disk, calling wait_on_page_writeback for each page. This writeback happens after allowing another transaction to start, so it competes for the disk with other processes in the FS. If the page writeback bit is still set, each wait_on_page_writeback might trigger an unplug, even though the page might be waiting for checksumming to finish or might be waiting for the async work queue to submit the bio. This trades wait_on_page_writeback for waiting on the extent writeback bits. It won't trigger any unplugs and substantially improves performance in a number of workloads. This also changes the async bio submission to avoid requeueing if there is only one device. The requeue just wastes CPU time because there are no other devices to service. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-11-07 18:22:45 -05:00
Chris Mason	42e70e7a2f	Btrfs: Fix more false enospc errors and an oops from empty clustering In some cases the empty cluster was added twice to the total number of bytes the allocator was trying to find. With empty clustering on, the hint byte was sometimes outside of the block group. Add an extra goto to find the correct block group. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-11-07 18:17:11 -05:00
Chris Mason	af09abfece	Btrfs: make sure compressed bios don't complete too soon When writing a compressed extent, a number of bios are created that point to a single struct compressed_bio. At end_io time an atomic counter in the compressed_bio struct makes sure that all of the bios have finished before final end_io processing is done. But when multiple bios are needed to write a compressed extent, the counter was being incremented after the first bio was sent to submit_bio. It is possible the bio will complete before the counter is incremented, making the end_io handler free the compressed_bio struct before processing is finished. The fix is to increment the atomic counter before bio submission, both for compressed reads and writes. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-11-07 12:35:44 -05:00
Chris Mason	4366211ccd	Btrfs: More metadata allocator optimizations This lowers the empty cluster target for metadata allocations. The lower target makes it easier to do allocations and still seems to perform well. It also fixes the allocator loop to drop the empty cluster when things start getting difficult, avoiding false enospc warnings. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-11-07 09:06:11 -05:00
Chris Mason	3b7885bf96	Btrfs: enforce metadata allocation clustering The allocator uses the last allocation as a starting point for metadata allocations, and tries to allocate in clusters of at least 256k. If the search for a free block fails to find the expected block, this patch forces a new cluster to be found in the free list. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-11-06 21:48:27 -05:00
Chris Mason	771ed689d2	Btrfs: Optimize compressed writeback and reads When reading compressed extents, try to put pages into the page cache for any pages covered by the compressed extent that readpages didn't already preload. Add an async work queue to handle transformations at delayed allocation processing time. Right now this is just compression. The workflow is: 1) Find offsets in the file marked for delayed allocation 2) Lock the pages 3) Lock the state bits 4) Call the async delalloc code The async delalloc code clears the state lock bits and delalloc bits. It is important this happens before the range goes into the work queue because otherwise it might deadlock with other work queue items that try to lock those extent bits. The file pages are compressed, and if the compression doesn't work the pages are written back directly. An ordered work queue is used to make sure the inodes are written in the same order that pdflush or writepages sent them down. This changes extent_write_cache_pages to let the writepage function update the wbc nr_written count. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-11-06 22:02:51 -05:00
Chris Mason	4a69a41009	Btrfs: Add ordered async work queues Btrfs uses kernel threads to create async work queues for cpu intensive operations such as checksumming and decompression. These work well, but they make it difficult to keep IO order intact. A single writepages call from pdflush or fsync will turn into a number of bios, and each bio is checksummed in parallel. Once the checksum is computed, the bio is sent down to the disk, and since we don't control the order in which the parallel operations happen, they might go down to the disk in almost any order. The code deals with this somewhat by having deep work queues for a single kernel thread, making it very likely that a single thread will process all the bios for a single inode. This patch introduces an explicitly ordered work queue. As work structs are placed into the queue they are put onto the tail of a list. They have three callbacks: ->func (cpu intensive processing here), ->ordered_func (order sensitive processing here), and ->ordered_free (free the work struct, all processing is done). The func callback does the cpu intensive work, and when it completes the work struct is marked as done. Every time a work struct completes, the list is checked to see if the head is marked as done. If so the ordered_func callback is used to do the order sensitive processing and the ordered_free callback is used to do any cleanup. Then we loop back and check the head of the list again. This patch also changes the checksumming code to use the ordered workqueues. On a 4 drive array, it increases streaming writes from 280MB/s to 350MB/s. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-11-06 22:03:00 -05:00
Chris Mason	537fb06715	Btrfs: rev the disk format for fallocate Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-10-31 12:54:14 -04:00
Chris Mason	70b99e6959	Btrfs: Compression corner fixes Make sure we keep page->mapping NULL on the pages we're getting via alloc_page. It gets set so a few of the callbacks can do the right thing, but in general these pages don't have a mapping. Don't try to truncate compressed inline items in btrfs_drop_extents. The whole compressed item must be preserved. Don't try to create multipage inline compressed items. When we try to overwrite just the first page of the file, we would have to read in and recow all the pages after it in the same compressed inline items. For now, only create single page inline items. Make sure we lock pages in the correct order during delalloc. The search into the state tree for delalloc bytes can return bytes before the page we already have locked. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-10-31 12:46:39 -04:00
Yan Zheng	d899e05215	Btrfs: Add fallocate support v2 This patch updates btrfs-progs for fallocate support. fallocate is a little different in Btrfs because we need to tell the COW system that a given preallocated extent doesn't need to be cow'd as long as there are no snapshots of it. This leverages the -o nodatacow checks. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>	2008-10-30 14:25:28 -04:00
Yan Zheng	80ff385665	Btrfs: update nodatacow code v2 This patch simplifies the nodatacow checker. If all references were created after the latest snapshot, then we can avoid COW safely. This patch also updates run_delalloc_nocow to do more fine-grained checking. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>	2008-10-30 14:20:02 -04:00
Yan Zheng	6643558db2	Btrfs: Fix bookend extent race v2 When dropping the middle part of an extent, btrfs_drop_extents first truncates the extent, then inserts a bookend extent. Since truncation and insertion can't be done atomically, there is a small period during which the bookend extent isn't in the tree. This causes problems for functions that search the tree for file extent items. The way to fix this is to lock the range of the bookend extent before truncation. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>	2008-10-30 14:19:50 -04:00
Yan Zheng	9036c10208	Btrfs: update hole handling v2 This patch splits the hole insertion code out of btrfs_setattr into btrfs_cont_expand and updates btrfs_get_extent to properly handle the case that file extent items are not continuous. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>	2008-10-30 14:19:41 -04:00
Chris Mason	19b9bdb054	Btrfs: Fix logic to avoid reading checksums for -o nodatasum,compress When compression was on, we were improperly ignoring -o nodatasum. This reworks the logic a bit to properly honor all the flags. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-10-30 14:23:13 -04:00
Chris Mason	cfbc246eaa	Btrfs: walk compressed pages based on the nr_pages count instead of bytes The byte walk counting was awkward and error prone. This uses the number of pages sent the higher layer to build bios. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-10-30 13:22:14 -04:00
Chris Mason	87ef2bb46b	Btrfs: prevent looping forever in finish_current_insert and del_pending_extents finish_current_insert and del_pending_extents process extent tree modifications that build up while we are changing the extent tree. It is a confusing bit of code that prevents recursion. Both functions run through a list of pending operations and both funcs add to the list of pending operations. If you have two procs in either one of them, they can end up looping forever making more work for each other. This patch makes them walk forward through the list of pending changes instead of always trying to process the entire list. At transaction commit time, we catch any changes that were left over. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-10-30 11:23:27 -04:00
Chris Mason	09fde3c9ba	Btrfs: Rev the disk format for compression and root pointer generation fields	2008-10-29 14:49:04 -04:00
Yan Zheng	84234f3a1f	Btrfs: Add root tree pointer transaction ids This patch adds transaction IDs to root tree pointers. Transaction IDs in tree pointers are compared with the generation numbers in block headers when reading root blocks of trees. This can detect some types of IO errors. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>	2008-10-29 14:49:05 -04:00
Josef Bacik	2517920135	Btrfs: nuke fs wide allocation mutex V2 This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch of little locks. There is now a pinned_mutex, which is used when messing with the pinned_extents extent io tree, and the extent_ins_mutex which is used with the pending_del and extent_ins extent io trees. The locking for the extent tree stuff was inspired by a patch that Yan Zheng wrote to fix a race condition, I cleaned it up some and changed the locking around a little bit, but the idea remains the same. Basically instead of holding the extent_ins_mutex throughout the processing of an extent on the extent_ins or pending_del trees, we just hold it while we're searching and when we clear the bits on those trees, and lock the extent for the duration of the operations on the extent. Also to keep from getting hung up waiting to lock an extent, I've added a try_lock_extent so if we cannot lock the extent, move on to the next one in the tree and we'll come back to that one. I have tested this heavily and it does not appear to break anything. This has to be applied on top of my find_free_extent redo patch. I tested this patch on top of Yan's space rebalancing code and it worked fine. The only thing that has changed since the last version is I pulled out all my debugging stuff, apparently I forgot to run guilt refresh before I sent the last patch out. Thank you, Signed-off-by: Josef Bacik <jbacik@redhat.com>	2008-10-29 14:49:05 -04:00
Josef Bacik	80eb234af0	Btrfs: fix enospc when there is plenty of space So there is an odd case where we can possibly return -ENOSPC when there is in fact space to be had. It only happens with Metadata writes, and happens _very_ infrequently. What has to happen is that we have to have allocated out of the first logical byte on the disk, which would set last_alloc to first_logical_byte(root, 0), so search_start == orig_search_start. We then need to allocate for normal metadata, so BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_DUP. We will do a block lookup for the given search_start, block_group_bits() won't match and we'll go to choose another block group. However because search_start matches orig_search_start we go to see if we can allocate a chunk. If we are in the situation that we cannot allocate a chunk, we fail and return -ENOSPC. This is kind of a big flaw of the way find_free_extent works, as it along with find_free_space loop through _all_ of the block groups, not just the ones that we want to allocate out of. This patch completely kills find_free_space and rolls it into find_free_extent. I've introduced a sort of state machine into this, which will make it easier to get cache miss information out of the allocator, and will work well with my locking changes. The basic flow is this: We have the variable loop which is 0, meaning we are in the hint phase. We lookup the block group for the hint, and lookup the space_info for what we want to allocate out of. If the block group we were pointed at by the hint either isn't of the correct type, or just doesn't have the space we need, we set head to space_info->block_groups, so we start at the beginning of the block groups for this particular space info, and loop through. This is also where we add the empty_cluster to total_needed. At this point loop is set to 1 and we just loop through all of the block groups for this particular space_info looking for the space we need, just as find_free_space would have done, except we only hit the block groups we want and not _all_ of the block groups. If we come full circle we see if we can allocate a chunk. If we cannot, we of course exit with -ENOSPC and we are good. Otherwise we start over at space_info->block_groups and loop through again, with loop == 2. If we come full circle and haven't found what we need then we exit with -ENOSPC. I've been running this for a couple of days now and it seems stable, and I haven't yet hit a -ENOSPC when there was plenty of space left. Also I've made a groups_sem to handle the group list for the space_info. This is part of my locking changes, but is relatively safe and seems better than holding the space_info spinlock over that entire search time. Thanks, Signed-off-by: Josef Bacik <jbacik@redhat.com>	2008-10-29 14:49:05 -04:00
Yan Zheng	f82d02d9d8	Btrfs: Improve space balancing code This patch improves the space balancing code to keep more sharing of tree blocks. The only case that breaks sharing of tree blocks is data extents get fragmented during balancing. The main changes in this patch are: Add a 'drop sub-tree' function. This solves the problem in old code that BTRFS_HEADER_FLAG_WRITTEN check breaks sharing of tree block. Remove relocation mapping tree. Relocation mappings are stored in struct btrfs_ref_path and updated dynamically during walking up/down the reference path. This reduces CPU usage and simplifies code. This patch also fixes a bug. Root items for reloc trees should be updated in btrfs_free_reloc_root. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>	2008-10-29 14:49:05 -04:00
Chris Mason	c8b978188c	Btrfs: Add zlib compression support This is a large change for adding compression on reading and writing, both for inline and regular extents. It does some fairly large surgery to the writeback paths. Compression is off by default and enabled by mount -o compress. Even when the -o compress mount option is not used, it is possible to read compressed extents off the disk. If compression for a given set of pages fails to make them smaller, the file is flagged to avoid future compression attempts later. * While finding delalloc extents, the pages are locked before being sent down to the delalloc handler. This allows the delalloc handler to do complex things such as cleaning the pages, marking them writeback and starting IO on their behalf. * Inline extents are inserted at delalloc time now. This allows us to compress the data before inserting the inline extent, and it allows us to insert an inline extent that spans multiple pages. * All of the in-memory extent representations (extent_map.c, ordered-data.c etc) are changed to record both an in-memory size and an on disk size, as well as a flag for compression. From a disk format point of view, the extent pointers in the file are changed to record the on disk size of a given extent and some encoding flags. Space in the disk format is allocated for compression encoding, as well as encryption and a generic 'other' field. Neither the encryption nor the 'other' field is currently used. In order to limit the amount of data read for a single random read in the file, the size of a compressed extent is limited to 128k. This is a software only limit, the disk format supports u64 sized compressed extents. In order to limit the ram consumed while processing extents, the uncompressed size of a compressed extent is limited to 256k. This is a software only limit and will be subject to tuning later. Checksumming is still done on compressed extents, and it is done on the uncompressed version of the data. This way additional encodings can be layered on without having to figure out which encoding to checksum. Compression happens at delalloc time, which is basically single threaded because it is usually done by a single pdflush thread. This makes it tricky to spread the compression load across all the cpus on the box. We'll have to look at parallel pdflush walks of dirty inodes at a later time. Decompression is hooked into readpages and it does spread across CPUs nicely. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-10-29 14:49:59 -04:00
Josef Bacik	37d3cdddf5	Btrfs: make tree_search_offset more flexible in its searching Sometimes we end up freeing a reserved extent because we don't need it, however this means that it's possible for transaction->last_alloc to point to the middle of a free area. When we search for free space in find_free_space we do a tree_search_offset with contains set to 0, because we want it to find the next best free area if we do not have an offset starting on the given offset. Unfortunately that currently means that if the offset we were given as a hint points to the middle of a free area, we won't find anything. This is especially bad if we happened to last allocate from the big huge chunk of a newly formed block group, since we won't find anything and have to go back and search the long way around. This fixes this problem by making it so that we return the free space area regardless of the contains variable. This made cache missing happen _a lot_ less, and speeds things up considerably. Signed-off-by: Josef Bacik <jbacik@redhat.com>	2008-10-10 10:24:32 -04:00
Chris Mason	a3dddf3fc8	Btrfs: Don't call security_inode_mkdir during subvol creation Subvol creation already requires privs, and security_inode_mkdir isn't exported. For now we don't need it. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-10-10 10:23:22 -04:00
Christoph Hellwig	cb8e70901d	Btrfs: Fix subvolume creation locking rules Creating a subvolume is in many ways like a normal VFS ->mkdir, and we really need to play with the VFS topology locking rules. So instead of just creating the snapshot on disk and then later getting rid of conflicting aliases, do it correctly from the start. This will become especially important once we allow for subvolumes anywhere in the tree, and not just below a hidden root. Note that snapshots will need the same treatment, but due to the delay in creating them we can't do it currently. Chris promised to fix that issue, so I'll wait on that. Signed-off-by: Christoph Hellwig <hch@lst.de>	2008-10-09 13:39:39 -04:00
Chris Mason	833023e46c	Btrfs: Rev the disk format for the new back reference format Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-10-09 11:55:03 -04:00
Sage Weil	61f8c86ee8	Btrfs: Fix makefile for building btrfs static This fixes the btrfs makefile for building in the tree and out of the tree both as a module and static. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-10-09 11:52:35 -04:00
Yan Zheng	5b84e8d6ee	Btrfs: Fix leaf reference cache miss Due to the optimization for truncate, tree leaves only containing checksum items can be deleted without being COW'ed first. This causes reference cache misses. The way to fix the miss is to create cache entries for tree leaves that only contain checksums. This patch also fixes a -EEXIST issue in the shared reference cache. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>	2008-10-09 11:46:19 -04:00
Yan Zheng	3bb1a1bc42	Btrfs: Remove offset field from struct btrfs_extent_ref The offset field in struct btrfs_extent_ref records the position inside the file at which the file extent is referenced. In the new back reference system, tree leaves holding references to file extents are recorded explicitly. We can scan these tree leaves very quickly, so the offset field is not required. This patch also makes the back reference system check the objectid when extents are being deleted. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>	2008-10-09 11:46:24 -04:00
Yan Zheng	a76a3cd40c	Btrfs: Count space allocated to file in bytes This patch makes btrfs count space allocated to file in bytes instead of 512 byte sectors. Everything else in btrfs uses a byte count instead of sector sizes or blocks sizes, so this fits better. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>	2008-10-09 11:46:29 -04:00
Chris Mason	a62b940160	Btrfs: cast bio->bi_sector to a u64 before shifting On 32 bit machines without CONFIG_LBD, the bi_sector field is only 32 bits. Btrfs needs to cast it before shifting up, or we end up doing IO into the wrong place. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-10-03 16:31:08 -04:00
Chris Mason	30c43e2444	Btrfs: remove last_log_alloc allocator optimization The tree logging code was trying to separate tree log allocations from normal metadata allocations to improve writeback patterns during an fsync. But, the code was not effective and ended up just mixing tree log blocks with regular metadata. That seems to be working fairly well, so the last_log_alloc code can be removed. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-10-03 12:24:01 -04:00
Chris Mason	cb843a6f51	Btrfs: O_DIRECT writes via buffered writes + invalidate This reworks the btrfs O_DIRECT write code a bit. It had always fallen back to buffered IO and done an invalidate, but needed to be updated for the data=ordered code. The invalidate wasn't actually removing pages because they were still inside an ordered extent. This also combines the O_DIRECT/O_SYNC paths where possible, and kicks off IO in the main btrfs_file_write loop to keep the pipe down to the disk full as we process long writes. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-10-03 12:30:02 -04:00
Chris Mason	323ac95bce	Btrfs: don't read leaf blocks containing only checksums during truncate Checksum items take up a significant portion of the metadata for large files. It is possible to avoid reading them during truncates by checking the keys in the higher level nodes. If a given leaf is followed by another leaf where the lowest key is a checksum item from the same file, we know we can safely delete the leaf without reading it. For a 32GB file on a 6 drive raid0 array, Btrfs needs 8s to delete the file with a cold cache. It is read bound during the run. With this change, Btrfs is able to delete the file in 0.5s Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-10-01 19:05:46 -04:00
Josef Bacik	cf74982385	Btrfs: fix deadlock between alloc_mutex/chunk_mutex This fixes a deadlock that happens between the alloc_mutex and chunk_mutex. Process A comes in, decides to do a do_chunk_alloc, which takes the chunk_mutex, and is holding the alloc_mutex because the only way you get to do_chunk_alloc is by holding the alloc_mutex. btrfs_alloc_chunk does its thing and goes to insert a new item, which results in a cow of the block. We get into del_pending_extents from there, where if we need to be rescheduled we drop the alloc_mutex and schedule. At this point process B comes in to do an allocation and gets the alloc_mutex, and because process A did not do the chunk allocation completely it thinks it's a good time to do a chunk allocation as well, and hangs on the chunk_mutex. Process A wakes up and tries to take the alloc_mutex and cannot. The way to fix this is to do a mutex_trylock() on chunk_mutex. If we return 0 we didn't get the lock, and if this is just a "hey it may be a good time to allocate a chunk" then we just exit. If we are trying to force an allocation then we reschedule and keep trying to acquire the chunk_mutex. If once we acquire it the space is already full then we can just exit, otherwise we can continue with the chunk allocation. Thank you, Signed-off-by: Josef Bacik <jbacik@redhat.com>	2008-10-01 19:11:18 -04:00
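
The trylock approach described in the cf74982385 entry above can be illustrated with a small user-space sketch. This is not the btrfs code: `struct space_state`, `try_chunk_alloc`, and the result enum are hypothetical names invented for the illustration, the alloc_mutex side of the deadlock is not modeled, and POSIX `pthread_mutex_trylock` stands in for the kernel's `mutex_trylock()`. Note the differing return conventions: the kernel call returns 0 when the lock was not taken (as the commit message says), while the POSIX call returns 0 on success and `EBUSY` when the lock is held.

```c
#include <errno.h>
#include <pthread.h>
#include <sched.h>
#include <stdbool.h>

/* Hypothetical stand-in for allocator state; not a btrfs structure. */
struct space_state {
    pthread_mutex_t chunk_mutex;
    bool full;   /* true once no further chunk can be allocated */
    int chunks;  /* number of chunks allocated so far */
};

enum chunk_result { CHUNK_ALLOCATED, CHUNK_SKIPPED, CHUNK_FULL, CHUNK_ERROR };

/*
 * Try to allocate a chunk without ever blocking on chunk_mutex.
 * force == false: the caller only thinks it may be a good time to
 *                 allocate, so give up if someone else holds the lock.
 * force == true:  yield the CPU and retry until the lock is acquired.
 */
static enum chunk_result try_chunk_alloc(struct space_state *s, bool force)
{
    for (;;) {
        /* POSIX: 0 means the lock was acquired, EBUSY means it is held. */
        int rc = pthread_mutex_trylock(&s->chunk_mutex);
        if (rc == 0)
            break;
        if (rc != EBUSY)
            return CHUNK_ERROR;
        if (!force)
            return CHUNK_SKIPPED;
        sched_yield(); /* user-space stand-in for rescheduling */
    }

    /* Re-check under the lock: the space may already be full by now. */
    if (s->full) {
        pthread_mutex_unlock(&s->chunk_mutex);
        return CHUNK_FULL;
    }

    s->chunks++; /* placeholder for the real chunk allocation work */
    pthread_mutex_unlock(&s->chunk_mutex);
    return CHUNK_ALLOCATED;
}
```

The point of the sketch is that an opportunistic caller never sleeps on the second lock while holding the first, which is what removes the lock-order cycle described in the entry.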


773 Commits (c5c9cd4d1b827fe545ed2a945e91e3a6909f3886)