linux

Commit Graph

Author	SHA1	Message	Date
Josef Bacik	02c24a8218	fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers Btrfs needs to be able to control how filemap_write_and_wait_range() is called in fsync to make it less of a painful operation, so push down taking i_mutex and the calling of filemap_write_and_wait() down into the ->fsync() handlers. Some file systems can drop taking the i_mutex altogether it seems, like ext3 and ocfs2. For correctness sake I just pushed everything down in all cases to make sure that we keep the current behavior the same for everybody, and then each individual fs maintainer can make up their mind about what to do from there. Thanks, Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:47:59 -04:00
Josef Bacik	c334b1138b	Ext4: handle SEEK_HOLE/SEEK_DATA generically Since Ext4 has its own lseek we need to make sure it handles SEEK_HOLE/SEEK_DATA. For now just do the same thing that is done in the generic case, somebody else can come along and make it do fancy things later. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:47:57 -04:00
Christoph Hellwig	72c5052ddc	fs: move inode_dio_done to the end_io handler For filesystems that delay their end_io processing we should keep our i_dio_count until the the processing is done. Enable this by moving the inode_dio_done call to the end_io handler if one exist. Note that the actual move to the workqueue for ext4 and XFS is not done in this patch yet, but left to the filesystem maintainers. At least for XFS it's not needed yet either as XFS has an internal equivalent to i_dio_count. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:47:50 -04:00
Christoph Hellwig	aacfc19c62	fs: simplify the blockdev_direct_IO prototype Simple filesystems always pass inode->i_sb_bdev as the block device argument, and never need a end_io handler. Let's simply things for them and for my grepping activity by dropping these arguments. The only thing not falling into that scheme is ext4, which passes and end_io handler without needing special flags (yet), but given how messy the direct I/O code there is use of __blockdev_direct_IO in one instead of two out of three cases isn't going to make a large difference anyway. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:47:49 -04:00
Christoph Hellwig	562c72aa57	fs: move inode_dio_wait calls into ->setattr Let filesystems handle waiting for direct I/O requests themselves instead of doing it beforehand. This means filesystem-specific locks to prevent new dio referenes from appearing can be held. This is important to allow generalizing i_dio_count to non-DIO_LOCKING filesystems. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:47:47 -04:00
Jan Kara	9ea7df534e	ext4: Rewrite ext4_page_mkwrite() to use generic helpers Rewrite ext4_page_mkwrite() to use __block_page_mkwrite() helper. This removes the need of using i_alloc_sem to avoid races with truncate which seems to be the wrong locking order according to lock ordering documented in mm/rmap.c. Also calling ext4_da_write_begin() as used by the old code seems to be problematic because we can decide to flush delay-allocated blocks which will acquire s_umount semaphore - again creating unpleasant lock dependency if not directly a deadlock. Also add a check for frozen filesystem so that we don't busyloop in page fault when the filesystem is frozen. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:47:45 -04:00
Al Viro	a9049376ee	make d_splice_alias(ERR_PTR(err), dentry) = ERR_PTR(err) ... and simplify the living hell out of callers Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:44:26 -04:00
Al Viro	7e40145eb1	->permission() sanitizing: don't pass flags to ->check_acl() not used in the instances anymore. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:21 -04:00
Al Viro	9c2c703929	->permission() sanitizing: pass MAY_NOT_BLOCK to ->check_acl() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:43:19 -04:00
Mimi Zohar	9d8f13ba3f	security: new security_inode_init_security API adds function callback This patch changes the security_inode_init_security API by adding a filesystem specific callback to write security extended attributes. This change is in preparation for supporting the initialization of multiple LSM xattrs and the EVM xattr. Initially the callback function walks an array of xattrs, writing each xattr separately, but could be optimized to write multiple xattrs at once. For existing security_inode_init_security() calls, which have not yet been converted to use the new callback function, such as those in reiserfs and ocfs2, this patch defines security_old_inode_init_security(). Signed-off-by: Mimi Zohar <zohar@us.ibm.com>	2011-07-18 12:29:38 -04:00
Robin Dong	d46203159e	ext4: avoid eh_entries overflow before insert extent_idx If eh_entries is equal to (or greater than) eh_max, the operation of inserting new extent_idx will make number of entries overflow. So check eh_entries before inserting the new extent_idx. Although there is no bug case according the code (function ext4_ext_insert_index is called by ext4_ext_split and ext4_ext_split is called only if the index block has free space), the right logic should be "lookup the capacity before insertion". Signed-off-by: Robin Dong <sanbai@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-17 23:43:42 -04:00
Robin Dong	015861badd	ext4: avoid wasted extent cache lookup if !PUNCH_OUT_EXT This patch avoids an extraneous lookup of the extent cache in ext4_ext_map_blocks() when the flag EXT4_GET_BLOCKS_PUNCH_OUT_EXT is absent. The existing logic was performing the lookup but not making use of the result. The patch simply reverses the order of evaluation in the condition. Since ext4_ext_in_cache() does not initialize newex on misses, bypassing its invocation does not introduce any new issue in this regard. Signed-off-by: Robin Dong <sanbai@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Eric Gouriou <egouriou@google.com>	2011-07-17 23:27:43 -04:00
Allison Henderson	c6a0371cbe	ext4: remove unneeded parameter to ext4_ext_remove_space() This patch removes the extra parameter in ext4_ext_remove_space() which is no longer needed. Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-17 23:21:03 -04:00
Allison Henderson	f7d0d3797f	ext4: punch hole optimizations: skip un-needed extent lookup This patch optimizes the punch hole operation by skipping the tree walking code that is used by truncate. Since punch hole is done through map blocks, the path to the extent is already known in this function, so we do not need to look it up again. Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-17 23:17:02 -04:00
Dan Ehrenberg	3eb0865843	ext4: ignore a stripe width of 1 If the stripe width was set to 1, then this patch will ignore that stripe width and ext4 will act as if the stripe width were 0 with respect to optimizing allocations. Signed-off-by: Dan Ehrenberg <dehrenberg@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-17 21:18:51 -04:00
Dan Ehrenberg	d7a1fee135	ext4: make the preallocation size be a multiple of stripe size Previously, if a stripe width was provided, then it would be used as the preallocation granularity, with no santiy checking and no way to override this. Now, mb_prealloc_size defaults to the smallest multiple of stripe size that is greater than or equal to the old default mb_prealloc_size, and this can be overridden with the sysfs interface. Signed-off-by: Dan Ehrenberg <dehrenberg@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-17 21:11:30 -04:00
Bernd Schubert	265c6a0f92	ext4: fix compilation with -DDX_DEBUG Compilation of ext4/namei.c brought up an error and warning messages when compiled with -DDX_DEBUG Signed-off-by: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-16 19:41:23 -04:00
Lukas Czerner	afb86178cb	ext4: remove unnecessary comments in ext4_orphan_add() The comment from Al Viro about possible race in the ext4_orphan_add() is not justified. There is no race possible as we always have either i_mutex locked, or the inode can not be referenced from outside hence the J_ASSERS should not be hit from the reason described in comment. This commit replaces it with notion that we are holding i_mutex so it should not be possible for i_nlink to be changed while waiting for s_orphan_lock. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-11 18:47:04 -04:00
Tao Ma	caaf7a29d3	ext4: Fix a double free of sbi->s_group_info in ext4_mb_init_backend If we meet with an error in ext4_mb_add_groupinfo, we kfree sbi->s_group_info[group >> EXT4_DESC_PER_BLOCK_BITS(sb)], but fail to reset it to NULL. So the caller ext4_mb_init_backend will try to kfree it again and causes a double free. So fix it by resetting it to NULL. Some typo in comments of mballoc.c are also changed. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-11 18:42:42 -04:00
Tao Ma	823ba01fc0	ext4: fix a race which could leak memory in ext4_groupinfo_create_slab() In ext4_groupinfo_create_slab, we create ext4_groupinfo_caches within ext4_grpinfo_slab_create_mutex, but set it outside the lock, and there does exist some case that we may create it twice and causes a memory leak. So set it before we call mutex_unlock. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-11 18:26:01 -04:00
Robin Dong	598dbdf243	ext4: avoid unneeded ext4_ext_next_leaf_block() while inserting extents Optimize ext4_ext_insert_extent() by avoiding ext4_ext_next_leaf_block() when the result is not used/needed. Signed-off-by: Robin Dong <sanbai@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-11 18:24:01 -04:00
Robin Dong	ffb505ff0f	ext4: remove redundant goto in ext4_ext_insert_extent() If eh->eh_entries is smaller than eh->eh_max, the routine will go to the "repeat" and then go to "has_space" directlly , since argument "depth" and "eh" are not even changed. Therefore, goto "has_space" directly and remove redundant "repeat" tag. Signed-off-by: Robin Dong <sanbai@taobao.com>	2011-07-11 11:43:59 -04:00
Tao Ma	22612283f7	ext4: Change the wrong param comment for ext4_trim_all_free at ext4_trim_all_free() comment, there is no longer an @e4b parameter, instead it is @group. Reported-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-11 00:04:34 -04:00
Tao Ma	3d56b8d2c7	ext4: Speed up FITRIM by recording flags in ext4_group_info In ext4, when FITRIM is called every time, we iterate all the groups and do trim one by one. It is a bit time wasting if the group has been trimmed and there is no change since the last trim. So this patch adds a new flag in ext4_group_info->bb_state to indicate that the group has been trimmed, and it will be cleared if some blocks is freed(in release_blocks_on_commit). Another trim_minlen is added in ext4_sb_info to record the last minlen we use to trim the volume, so that if the caller provide a small one, we will go on the trim regardless of the bb_state. A simple test with my intel x25m ssd: df -h shows: /dev/sdb1 40G 21G 17G 56% /mnt/ext4 Block size: 4096 run the FITRIM with the following parameter: range.start = 0; range.len = UINT64_MAX; range.minlen = 1048576; without the patch: [root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a real 0m5.505s user 0m0.000s sys 0m1.224s [root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a real 0m5.359s user 0m0.000s sys 0m1.178s [root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a real 0m5.228s user 0m0.000s sys 0m1.151s with the patch: [root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a real 0m5.625s user 0m0.000s sys 0m1.269s [root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a real 0m0.002s user 0m0.000s sys 0m0.001s [root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a real 0m0.002s user 0m0.000s sys 0m0.001s A big improvement for the 2nd and 3rd run. Even after I delete some big image files, it is still much faster than iterating the whole disk. [root@boyu-tm test]# time ./ftrim /mnt/ext4/a real 0m1.217s user 0m0.000s sys 0m0.196s Cc: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Andreas Dilger <adilger.kernel@dilger.ca> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-11 00:03:38 -04:00
Tao Ma	b3d4c2b10b	ext4: Add new ext4 trim tracepoints Add ext4_trim_extent and ext4_trim_all_free. Reviewed-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-11 00:01:52 -04:00
Tao Ma	169ddc3ec8	ext4: speed up group trim with the right free block count When we trim some free blocks in a group of ext4, we need to calculate the free blocks properly and check whether there are enough freed blocks left for us to trim. Current solution will only calculate free spaces if they are large for a trim which isn't appropriate. Let us see a small example: a group has 1.5M free which are 300k, 300k, 300k, 300k, 300k. And minblocks is 1M. With current solution, we have to iterate the whole group since these 300k will never be subtracted from 1.5M. But actually we should exit after we find the first 2 free spaces since the left 3 chunks only sum up to 900K if we subtract the first 600K although they can't be trimed. Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-11 00:00:07 -04:00
Tao Ma	22f1045743	ext4: fix trim length underflow with small trim length In `0f0a25b`, we adjust 'len' with s_first_data_block - start, but it could underflow in case blocksize=1K, fstrim_range.len=512 and fstrim_range.start = 0. In this case, when we run the code: len -= first_data_blk - start; len will be underflow to -1ULL. In the end, although we are safe that last_group check later will limit the trim to the whole volume, but that isn't what the user really want. So this patch fix it. It also adds the check for 'start' like ext3 so that we can break immediately if the start is invalid. Cc: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-10 23:52:37 -04:00
Theodore Ts'o	12706394bc	ext4: add tracepoint for ext4_journal_start This will help debug who is responsible for starting a jbd2 transaction. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-07-10 22:37:50 -04:00
Jiaying Zhang	575a1d4bdf	ext4: free allocated and pre-allocated blocks when check_eofblocks_fl fails Upon corrupted inode or disk failures, we may fail after we already allocate some blocks from the inode or take some blocks from the inode's preallocation list, but before we successfully insert the corresponding extent to the extent tree. In this case, we should free any allocated blocks and discard the inode's preallocated blocks because the entries in the inode's preallocation list may be in an inconsistent state. Signed-off-by: Jiaying Zhang <jiayingz@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2011-07-10 20:07:25 -04:00
Maxim Patlasov	7132de744b	ext4: fix i_blocks/quota accounting when extent insertion fails The current implementation of ext4_free_blocks() always calls dquot_free_block This looks quite sensible in the most cases: blocks to be freed are associated with inode and were accounted in quota and i_blocks some time ago. However, there is a case when blocks to free were not accounted by the time calling ext4_free_blocks() yet: 1. delalloc is on, write_begin pre-allocated some space in quota 2. write-back happens, ext4 allocates some blocks in ext4_ext_map_blocks() 3. then ext4_ext_map_blocks() gets an error (e.g. ENOSPC) from ext4_ext_insert_extent() and calls ext4_free_blocks(). In this scenario, ext4_free_blocks() calls dquot_free_block() who, in turn, decrements i_blocks for blocks which were not accounted yet (due to delalloc) After clean umount, e2fsck reports something like: > Inode 21, i_blocks is 5080, should be 5128. Fix<y>? because i_blocks was erroneously decremented as explained above. The patch fixes the problem by passing the new flag EXT4_FREE_BLOCKS_NO_QUOT_UPDATE to ext4_free_blocks(), to request that the dquot_free_block() call be skipped. Signed-off-by: Maxim Patlasov <maxim.patlasov@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2011-07-10 19:37:48 -04:00
Theodore Ts'o	275d3ba6b4	ext4: remove loop around bio_alloc() These days, bio_alloc() is guaranteed to never fail (as long as nvecs is less than BIO_MAX_PAGES), so we don't need the loop around the struct bio allocation. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-06-29 21:44:45 -04:00
Yongqiang Yang	9331b62610	ext4: quiet 'unused variables' compile warnings Unused variables was deleted. Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-06-28 10:19:05 -04:00
Eric Sandeen	f86186b44b	ext4: refactor duplicated block placement code I found that ext4_ext_find_goal() and ext4_find_near() share the same code for returning a coloured start block based on i_block_group. We can refactor this into a common function so that they don't diverge in the future. Thanks to adilger for suggesting the new function name. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-06-28 10:01:31 -04:00
Amir Goldstein	dae1e52cb1	ext4: move ext4_ind_* functions from inode.c to indirect.c This patch moves functions from inode.c to indirect.c. The moved functions are ext4_ind_* functions and their helpers. Functions called from inode.c are declared extern. Signed-off-by: Amir Goldstein <amir73il@users.sf.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-06-27 19:40:50 -04:00
Theodore Ts'o	9f125d641b	ext4: move common truncate functions to header file Move two functions that will be needed by the indirect functions to be moved to indirect.c as well as inode.c to truncate.h as inline functions, so that we can avoid having duplicate copies of the function (which can be a maintenance problem) without having to expose them as globally functions. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-06-27 19:16:04 -04:00
Theodore Ts'o	1f7d1e7741	ext4: move __ext4_check_blockref to block_validity.c In preparation for moving the indirect functions to a separate file, move __ext4_check_blockref() to block_validity.c and rename it to ext4_check_blockref() which is exported as globally visible function. Also, rename the cpp macro ext4_check_inode_blockref() to ext4_ind_check_inode(), to make it clear that it is only valid for use with non-extent mapped inodes. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-06-27 19:16:02 -04:00
Amir Goldstein	8bb2b24712	ext4: rename ext4_indirect_* funcs to ext4_ind_* We are going to move all ext4_ind_* functions to indirect.c. Before we do that, let's rename 2 functions called ext4_indirect_* to ext4_ind_*, to keep to the naming convention. Signed-off-by: Amir Goldstein <amir73il@users.sf.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-06-27 17:10:28 -04:00
Amir Goldstein	ff9893dc8a	ext4: split ext4_ind_truncate from ext4_truncate We are about to move all indirect inode functions to a new file. Before we do that, let's split ext4_ind_truncate() out of ext4_truncate() leaving only generic code in the latter, so we will be able to move ext4_ind_truncate() to the new file. Signed-off-by: Amir Goldstein <amir73il@users.sf.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-06-27 16:36:31 -04:00
Robin Dong	ed7a7e1672	ext4: fix incorrect error msg in ext4_ext_insert_index In function ext4_ext_insert_index when eh_entries of curp is bigger than eh_max, error messages will be printed out, but the content is about logical and ei_block, that's incorret. Signed-off-by: Robin Dong <sanbai@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-06-27 15:35:53 -04:00
Wu Fengguang	6e6938b6d3	writeback: introduce .tagged_writepages for the WB_SYNC_NONE sync stage sync(2) is performed in two stages: the WB_SYNC_NONE sync and the WB_SYNC_ALL sync. Identify the first stage with .tagged_writepages and do livelock prevention for it, too. Jan's commit `f446daaea9` ("mm: implement writeback livelock avoidance using page tagging") is a partial fix in that it only fixed the WB_SYNC_ALL phase livelock. Although ext4 is tested to no longer livelock with commit `f446daaea9`, it may due to some "redirty_tail() after pages_skipped" effect which is by no means a guarantee for _all_ the file systems. Note that writeback_inodes_sb() is called by not only sync(), they are treated the same because the other callers also need livelock prevention. Impact: It changes the order in which pages/inodes are synced to disk. Now in the WB_SYNC_NONE stage, it won't proceed to write the next inode until finished with the current inode. Acked-by: Jan Kara <jack@suse.cz> CC: Dave Chinner <david@fromorbit.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>	2011-06-08 08:25:20 +08:00
Lukas Czerner	a9c667f8f0	ext4: fixed tracepoints cleanup While creating fixed tracepoints for ext3, basically by porting them from ext4, I found a lot of useless retyping, wrong type usage, useless variable passing and other inconsistencies in the ext4 fixed tracepoint code. This patch cleans the fixed tracepoint code for ext4 and also simplify some of them. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-06-06 09:51:52 -04:00
Lukas Czerner	c03f8aa9ab	ext4: use FIEMAP_EXTENT_LAST flag for last extent in fiemap Currently we are not marking the extent as the last one (FIEMAP_EXTENT_LAST) if there is a hole at the end of the file. This is because we just do not check for it right now and continue searching for next extent. But at the point we hit the hole at the end of the file, it is too late. This commit adds check for the allocated block in subsequent extent and if there is no more extents (block = EXT_MAX_BLOCKS) just flag the current one as the last one. This behaviour has been spotted unintentionally by 252 xfstest, when the test hangs out, because of wrong loop condition. However on other filesystems (like xfs) it will exit anyway, because we notice the last extent flag and exit. With this patch xfstest 252 does not hang anymore, ext4 fiemap implementation still reports bad extent type in some cases, however this seems to be different issue. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-06-06 00:06:52 -04:00
Lukas Czerner	f17722f917	ext4: Fix max file size and logical block counting of extent format file Kazuya Mio reported that he was able to hit BUG_ON(next == lblock) in ext4_ext_put_gap_in_cache() while creating a sparse file in extent format and fill the tail of file up to its end. We will hit the BUG_ON when we write the last block (2^32-1) into the sparse file. The root cause of the problem lies in the fact that we specifically set s_maxbytes so that block at s_maxbytes fit into on-disk extent format, which is 32 bit long. However, we are not storing start and end block number, but rather start block number and length in blocks. It means that in order to cover extent from 0 to EXT_MAX_BLOCK we need EXT_MAX_BLOCK+1 to fit into len (because we counting block 0 as well) - and it does not. The only way to fix it without changing the meaning of the struct ext4_extent members is, as Kazuya Mio suggested, to lower s_maxbytes by one fs block so we can cover the whole extent we can get by the on-disk extent format. Also in many places EXT_MAX_BLOCK is used as length instead of maximum logical block number as the name suggests, it is all a bit messy. So this commit renames it to EXT_MAX_BLOCKS and change its usage in some places to actually be maximum number of blocks in the extent. The bug which this commit fixes can be reproduced as follows: dd if=/dev/zero of=/mnt/mp1/file bs=<blocksize> count=1 seek=$((232-2)) sync dd if=/dev/zero of=/mnt/mp1/file bs=<blocksize> count=1 seek=$((232-1)) Reported-by: Kazuya Mio <k-mio@sx.jp.nec.com> Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-06-06 00:05:17 -04:00
Yongqiang Yang	5def136025	ext4: correct comments for ext4_free_blocks() metadata is not parameter of ext4_free_blocks() any more. Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-06-05 23:26:40 -04:00
Christoph Hellwig	aa38572954	fs: pass exact type of data dirties to ->dirty_inode Tell the filesystem if we just updated timestamp (I_DIRTY_SYNC) or anything else, so that the filesystem can track internally if it needs to push out a transaction for fdatasync or not. This is just the prototype change with no user for it yet. I plan to push large XFS changes for the next merge window, and getting this trivial infrastructure in this window would help a lot to avoid tree interdependencies. Also remove incorrect comments that ->dirty_inode can't block. That has been changed a long time ago, and many implementations rely on it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-05-27 07:04:40 -04:00
Linus Torvalds	f8d613e2a6	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/djm/tmem * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/djm/tmem: xen: cleancache shim to Xen Transcendent Memory ocfs2: add cleancache support ext4: add cleancache support btrfs: add cleancache support ext3: add cleancache support mm/fs: add hooks to support cleancache mm: cleancache core ops functions and config fs: add field to superblock to support cleancache mm/fs: cleancache documentation Fix up trivial conflict in fs/btrfs/extent_io.c due to includes	2011-05-26 10:50:56 -07:00
Dan Magenheimer	7abc52c2ed	ext4: add cleancache support This seventh patch of eight in this cleancache series "opts-in" cleancache for ext4. Filesystems must explicitly enable cleancache by calling cleancache_init_fs anytime an instance of the filesystem is mounted. For ext4, all other cleancache hooks are in the VFS layer including the matching cleancache_flush_fs hook which must be called on unmount. Details and a FAQ can be found in Documentation/vm/cleancache.txt [v6-v8: no changes] [v5: jeremy@goop.org: simplify init hook and any future fs init changes] Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com> Reviewed-by: Jeremy Fitzhardinge <jeremy@goop.org> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Andreas Dilger <adilger@sun.com> Cc: Ted Ts'o <tytso@mit.edu> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Matthew Wilcox <matthew@wil.cx> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik Van Riel <riel@redhat.com> Cc: Jan Beulich <JBeulich@novell.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <joel.becker@oracle.com> Cc: Nitin Gupta <ngupta@vflare.org>	2011-05-26 10:02:03 -06:00
Yongqiang Yang	1b16da77f9	ext4: teach ext4_ext_split to calculate extents efficiently Make ext4_ext_split() get extents to be moved by calculating in a statement instead of counting in a loop. Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-05-25 17:41:48 -04:00
Jan Kara	ae24f28d39	ext4: Convert ext4 to new truncate calling convention Trivial conversion. Fixup one error handling case calling vmtruncate() and remove ->truncate callback. We also fix a bug that IS_IMMUTABLE and IS_APPEND files could not be truncated during failed writes. In fact, the test can be completely removed as upper layers do necessary permission checks for truncate in do_sys_[f]truncate() and may_open() anyway. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-05-25 17:39:48 -04:00
Vivek Haldar	556b27abf7	ext4: do not normalize block requests from fallocate() Currently, an fallocate request of size slightly larger than a power of 2 is turned into two block requests, each a power of 2, with the extra blocks pre-allocated for future use. When an application calls fallocate, it already has an idea about how large the file may grow so there is usually little benefit to reserve extra blocks on the preallocation list. This reduces disk fragmentation. Tested: fsstress. Also verified manually that fallocat'ed files are contiguously laid out with this change (whereas without it they begin at power-of-2 boundaries, leaving blocks in between). CPU usage of fallocate is not appreciably higher. In a tight fallocate loop, CPU usage hovers between 5%-8% with this change, and 5%-7% without it. Using a simulated file system aging program which the file system to 70%, the percentage of free extents larger than 8MB (as measured by e2freefrag) increased from 38.8% without this change, to 69.4% with this change. Signed-off-by: Vivek Haldar <haldar@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-05-25 07:41:54 -04:00
Allison Henderson	a4bb6b64e3	ext4: enable "punch hole" functionality This patch adds new routines: "ext4_punch_hole" "ext4_ext_punch_hole" and "ext4_ext_check_cache" fallocate has been modified to call ext4_punch_hole when the punch hole flag is passed. At the moment, we only support punching holes in extents, so this routine is pretty much a wrapper for the ext4_ext_punch_hole routine. The ext4_ext_punch_hole routine first completes all outstanding writes with the associated pages, and then releases them. The unblock aligned data is zeroed, and all blocks in between are punched out. The ext4_ext_check_cache routine is very similar to ext4_ext_in_cache except it accepts a ext4_ext_cache parameter instead of a ext4_extent parameter. This routine is used by ext4_ext_punch_hole to check and see if a block in a hole that has been cached. The ext4_ext_cache parameter is necessary because the members ext4_extent structure are not large enough to hold a 32 bit value. The existing ext4_ext_in_cache routine has become a wrapper to this new function. [ext4 punch hole patch series 5/5 v7] Signed-off-by: Allison Henderson <achender@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Mingming Cao <cmm@us.ibm.com>	2011-05-25 07:41:50 -04:00
Allison Henderson	e861304b8e	ext4: add "punch hole" flag to ext4_map_blocks() This patch adds a new flag to ext4_map_blocks() that specifies the given range of blocks should be punched out. Extents are first converted to uninitialized extents before they are punched out. Because punching a hole may require that the extent be split, it is possible that the splitting may need more blocks than are available. To deal with this, use of reserved blocks are enabled to allow the split to proceed. The routine then returns the number of blocks successfully punched out. [ext4 punch hole patch series 4/5 v7] Signed-off-by: Allison Henderson <achender@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Mingming Cao <cmm@us.ibm.com>	2011-05-25 07:41:46 -04:00
Allison Henderson	d583fb87a3	ext4: punch out extents This patch modifies the truncate routines to support hole punching Below is a brief summary of the patches changes: - Added end param to ext_ext4_rm_leaf This function has been modified to accept an end parameter which enables it to punch holes in leafs instead of just truncating them. - Implemented the "remove head" case in the ext_remove_blocks routine This routine is used by ext_ext4_rm_leaf to remove the tail of an extent during a truncate. The new ext_ext4_rm_leaf routine will now also use it to remove the head of an extent in the case that the hole covers a region of blocks at the beginning of an extent. - Added "end" param to ext4_ext_remove_space routine This function has been modified to accept a stop parameter, which is passed through to ext4_ext_rm_leaf. [ext4 punch hole patch series 3/5 v6] Signed-off-by: Allison Henderson <achender@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-05-25 07:41:43 -04:00
Allison Henderson	308488518d	ext4: add new function ext4_block_zero_page_range() This patch modifies the existing ext4_block_truncate_page() function which was used by the truncate code path, and which zeroes out block unaligned data, by adding a new length parameter, and renames it to ext4_block_zero_page_rage(). This function can now be used to zero out the head of a block, the tail of a block, or the middle of a block. The ext4_block_truncate_page() function is now a wrapper to ext4_block_zero_page_range(). [ext4 punch hole patch series 2/5 v7] Signed-off-by: Allison Henderson <achender@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Mingming Cao <cmm@us.ibm.com>	2011-05-25 07:41:32 -04:00
Allison Henderson	55f020db66	ext4: add flag to ext4_has_free_blocks This patch adds an allocation request flag to the ext4_has_free_blocks function which enables the use of reserved blocks. This will allow a punch hole to proceed even if the disk is full. Punching a hole may require additional blocks to first split the extents. Because ext4_has_free_blocks is a low level function, the flag needs to be passed down through several functions listed below: ext4_ext_insert_extent ext4_ext_create_new_leaf ext4_ext_grow_indepth ext4_ext_split ext4_ext_new_meta_block ext4_mb_new_blocks ext4_claim_free_blocks ext4_has_free_blocks [ext4 punch hole patch series 1/5 v7] Signed-off-by: Allison Henderson <achender@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Mingming Cao <cmm@us.ibm.com>	2011-05-25 07:41:26 -04:00
Aditya Kali	ae81230686	ext4: reserve inodes and feature code for 'quota' feature I am working on patch to add quota as a built-in feature for ext4 filesystem. The implementation is based on the design given at https://ext4.wiki.kernel.org/index.php/Design_For_1st_Class_Quota_in_Ext4. This patch reserves the inode numbers 3 and 4 for quota purposes and also reserves EXT4_FEATURE_RO_COMPAT_QUOTA feature code. Signed-off-by: Aditya Kali <adityakali@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-05-24 19:00:39 -04:00
Johann Lombardi	c5e06d101a	ext4: add support for multiple mount protection Prevent an ext4 filesystem from being mounted multiple times. A sequence number is stored on disk and is periodically updated (every 5 seconds by default) by a mounted filesystem. At mount time, we now wait for s_mmp_update_interval seconds to make sure that the MMP sequence does not change. In case of failure, the nodename, bdevname and the time at which the MMP block was last updated is displayed. Signed-off-by: Andreas Dilger <adilger@whamcloud.com> Signed-off-by: Johann Lombardi <johann@whamcloud.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-05-24 18:31:25 -04:00
Kazuya Mio	d02a9391f7	ext4: ensure f_bfree returned by ext4_statfs() is non-negative I found the issue that the number of free blocks went negative. # stat -f /mnt/mp1/ File: "/mnt/mp1/" ID: e175ccb83a872efe Namelen: 255 Type: ext2/ext3 Block size: 4096 Fundamental block size: 4096 Blocks: Total: 258022 Free: -15 Available: -13122 Inodes: Total: 65536 Free: 63029 f_bfree in struct statfs will go negative when the filesystem has few free blocks. Because the number of dirty blocks is bigger than the number of free blocks in the following two cases. CASE 1: ext4_da_writepages mpage_da_map_and_submit ext4_map_blocks ext4_ext_map_blocks ext4_mb_new_blocks ext4_mb_diskspace_used percpu_counter_sub(&sbi->s_freeblocks_counter, ac->ac_b_ex.fe_len); <--- interrupt statfs systemcall ---> ext4_da_update_reserve_space percpu_counter_sub(&sbi->s_dirtyblocks_counter, used + ei->i_allocated_meta_blocks); CASE 2: ext4_write_begin __block_write_begin ext4_map_blocks ext4_ext_map_blocks ext4_mb_new_blocks ext4_mb_diskspace_used percpu_counter_sub(&sbi->s_freeblocks_counter, ac->ac_b_ex.fe_len); <--- interrupt statfs systemcall ---> percpu_counter_sub(&sbi->s_dirtyblocks_counter, reserv_blks); To avoid the issue, this patch ensures that f_bfree is non-negative. Signed-off-by: Kazuya Mio <k-mio@sx.jp.nec.com>	2011-05-24 18:30:07 -04:00
Lukas Czerner	28739eea9c	ext4: protect bb_first_free in ext4_trim_all_free() with group lock We should protect reading bd_info->bb_first_free with the group lock because otherwise we might miss some free blocks. This is not a big deal at all, but the change to do right thing is really simple, so lets do that. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-05-24 18:28:07 -04:00
Lukas Czerner	7894408666	ext4: only load buddy bitmap in ext4_trim_fs() when it is needed Currently we are loading buddy ext4_mb_load_buddy() for every block group we are going through in ext4_trim_fs() in many cases just to find out that there is not enough space to be bothered with. As Amir Goldstein suggested we can use bb_free information directly from ext4_group_info. This commit removes ext4_mb_load_buddy() from ext4_trim_fs() and rather get the ext4_group_info via ext4_get_group_info() and use the bb_free information directly from that. This avoids unnecessary call to load buddy in the case the group does not have enough free space to trim. Loading buddy is now moved to ext4_trim_all_free(). Tested by me with xfstests 251. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-05-24 18:16:27 -04:00
Jan Kara	93628ffb9b	ext4: fix waiting and sending of a barrier in ext4_sync_file() jbd2_log_start_commit() returns 1 only when we really start a transaction. But we also need to wait for a transaction when the commit is already running. Fix this problem by waiting for transaction commit unconditionally (which is just a quick check if the transaction is already committed). Also we have to be more careful with sending of a barrier because when transaction is being committed in parallel to ext4_sync_file() running, we cannot be sure that the barrier the journalling code sends happens after we wrote all the data for fsync (note that not every data writeout needs to trigger metadata changes thus commit of some metadata changes can be running while other data is still written out). So use jbd2_will_send_data_barrier() helper to detect the common cases when we can be sure barrier will be issued by the commit code and issue the barrier ourselves in the remaining cases. Reported-by: Edward Goggin <egoggin@vmware.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-05-24 12:00:54 -04:00
Yongqiang Yang	b221349fa8	ext4: fix ext4_ext_fiemap_cb() to handle blocks before request range correctly To get delayed-extent information, ext4_ext_fiemap_cb() looks up pagecache, it thus collects information starting from a page's head block. If blocksize < pagesize, the beginning blocks of a page may lies before the request range. So ext4_ext_fiemap_cb() should proceed ignoring them, because they has been handled before. If no mapped buffer in the range is found in the 1st page, we need to look up the 2nd page, otherwise delayed-extents after a hole will be ignored. Without this patch, xfstests 225 will hung on ext4 with 1K block. Reported-by: Amir Goldstein <amir73il@users.sourceforge.net> Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-05-24 11:36:58 -04:00
Theodore Ts'o	072bd7ea74	ext4: use truncate_setsize() unconditionally In commit `c8d46e41` (ext4: Add flag to files with blocks intentionally past EOF), if the EOFBLOCKS_FL flag is set, we call ext4_truncate() before calling vmtruncate(). This caused any allocated but unwritten blocks created by calling fallocate() with the FALLOC_FL_KEEP_SIZE flag to be dropped. This was done to make to make sure that EOFBLOCKS_FL would not be cleared while still leaving blocks past i_size allocated. This was not necessary, since ext4_truncate() guarantees that blocks past i_size will be dropped, even in the case where truncate() has increased i_size before calling ext4_truncate(). So fix this by removing the EOFBLOCKS_FL special case treatment in ext4_setattr(). In addition, use truncate_setsize() followed by a call to ext4_truncate() instead of using vmtruncate(). This is more efficient since it skips the call to inode_newsize_ok(), which has been checked already by inode_change_ok(). This is also in a win in the case where EOFBLOCKS_FL is set since it avoids calling ext4_truncate() twice. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-05-23 15:13:02 -04:00
Eric Gouriou	f6d2f6b327	ext4: fix unbalanced up_write() in ext4_ext_truncate() error path ext4_ext_truncate() should not invoke up_write(&EXT4_I(inode)->i_data_sem) when ext4_orphan_add() returns an error, as it hasn't performed a down_write() yet. This trivial patch fixes this by moving the up_write() invocation above the out_stop label. Signed-off-by: Eric Gouriou <egouriou@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-05-22 21:33:00 -04:00
Vivek Haldar	77f4135f2a	ext4: count hits/misses of extent cache and expose in sysfs The number of hits and misses for each filesystem is exposed in /sys/fs/ext4/<dev>/extent_cache_{hits, misses}. Tested: fsstress, manual checks. Signed-off-by: Vivek Haldar <haldar@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-05-22 21:24:16 -04:00
Yongqiang Yang	93917411be	ext4: make ext4_split_extent() handle error correctly Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Mingming Cao <cmm@us.ibm.com>	2011-05-22 20:49:12 -04:00
Theodore Ts'o	373cd5c53d	ext4: don't show mount options in /proc/mounts if there is no journal After creating an ext4 file system without a journal: # mke2fs -t ext4 -O ^has_journal /dev/sda # mount -t ext4 /dev/sda /test the /proc/mounts will show: "/dev/sda /test ext4 rw,relatime,user_xattr,acl,barrier=1,data=writeback 0 0" which can fool users into thinking that the fs is using writeback mode. So don't set the writeback option when the journal has not been enabled; we don't depend on the writeback option being set, since ext4_should_writeback_data() in ext4_jbd2.h tests to see if the journal is not present before returning true. Reported-by: Robin Dong <sanbai@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-05-22 16:12:35 -04:00
Lukas Czerner	1bb933fb1f	ext4: fix possible use-after-free in ext4_remove_li_request() We need to take reference to the s_li_request after we take a mutex, because it might be freed since then, hence result in accessing old already freed memory. Also we should protect the whole ext4_remove_li_request() because ext4_li_info might be in the process of being freed in ext4_lazyinit_thread(). Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Eric Sandeen <sandeen@redhat.com>	2011-05-20 13:55:29 -04:00
Lukas Czerner	51ce651156	ext4: fix the mount option "init_itable=n" to work as expected for n=0 For some reason, when we set the mount option "init_itable=0" it behaves as we would set init_itable=20 which is not right at all. Basically when we set it to zero we are saying to lazyinit thread not to wait between zeroing the inode table (except of cond_resched()) so this commit fixes that and removes the unnecessary condition. The 'n' should be also properly used on remount. When the n is not set at all, it means that the default miltiplier EXT4_DEF_LI_WAIT_MULT is set instead. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reported-by: Eric Sandeen <sandeen@redhat.com>	2011-05-20 13:55:16 -04:00
Lukas Czerner	e1290b3e62	ext4: Remove unnecessary wait_event ext4_run_lazyinit_thread() For some reason we have been waiting for lazyinit thread to start in the ext4_run_lazyinit_thread() but it is not needed since it was jus unnecessary complexity, so get rid of it. We can also remove li_task and li_wait_task since it is not used anymore. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Eric Sandeen <sandeen@redhat.com>	2011-05-20 13:49:51 -04:00
Lukas Czerner	4ed5c033c1	ext4: Use schedule_timeout_interruptible() for waiting in lazyinit thread In order to make lazyinit eat approx. 10% of io bandwidth at max, we are sleeping between zeroing each single inode table. For that purpose we are using timer which wakes up thread when it expires. It is set via add_timer() and this may cause troubles in the case that thread has been woken up earlier and in next iteration we call add_timer() on still running timer hence hitting BUG_ON in add_timer(). We could fix that by using mod_timer() instead however we can use schedule_timeout_interruptible() for waiting and hence simplifying things a lot. This commit exchange the old "waiting mechanism" with simple schedule_timeout_interruptible(), setting the time to sleep. Hence we do not longer need li_wait_daemon waiting queue and others, so get rid of it. Addresses-Red-Hat-Bugzilla: #699708 Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Eric Sandeen <sandeen@redhat.com>	2011-05-20 13:49:04 -04:00
Darrick J. Wong	0e499890c1	ext4: wait for writeback to complete while making pages writable In order to stabilize pages during disk writes, ext4_page_mkwrite must wait for writeback operations to complete before making a page writable. Furthermore, the function must return locked pages, and recheck the writeback status if the page lock is ever dropped. The "someone could wander in" part of this patch was suggested by Chris Mason. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-05-18 13:55:20 -04:00
Darrick J. Wong	7cb1a5351d	ext4: clean up some wait_on_page_writeback calls wait_on_page_writeback already checks the writeback bit, so callers of it needn't do that test. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-05-18 13:53:20 -04:00
Tao Ma	ed3ce80a52	ext4: don't warn about mnt_count if it has been disabled Currently, if we mkfs a new ext4 volume with s_max_mnt_count set to zero, and mount it for the first time, we will get the warning: maximal mount count reached, running e2fsck is recommended It is really misleading. So change the check so that it won't warn in that case. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-05-18 13:29:57 -04:00
Allison Henderson	9b940f8e8c	ext4: ext4_ext_convert_to_initialized bug found in extended FSX testing This patch addresses bugs found while testing punch hole with the fsx test. The patch corrects the number of blocks that are zeroed out while splitting an extent, and also corrects the return value to return the number of blocks split out, instead of the number of blocks zeroed out. This patch has been tested in addition to the following patches: [Ext4 punch hole v7] [XFS Tests Punch Hole 1/1 v2] Add Punch Hole Testing to FSX The test ran successfully for 24 hours. Signed-off-by: Allison Henderson <achender@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-05-16 10:11:09 -04:00
Amir Goldstein	0b26859027	ext4: fix oops in ext4_quota_off() If quota is not enabled when ext4_quota_off() is called, we must not dereference quota file inode since it is NULL. Check properly for this. This fixes a bug in commit `21f976975c` (ext4: remove unnecessary [cm]time update of quota file), which was merged for 2.6.39-rc3. Reported-by: Amir Goldstein <amir73il@users.sf.net> Signed-off-by: Amir Goldstein <amir73il@users.sf.net> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-05-16 09:59:13 -04:00
Allison Henderson	6976a6f2ac	ext4: don't dereference null pointer when make_indexed_dir() fails Fix for a null pointer bug found while running punch hole tests Signed-off-by: Allison Henderson <achender@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-05-15 00:19:41 -04:00
Amir Goldstein	44183d4231	ext4: remove alloc_semp After taking care of all group init races, all that remains is to remove alloc_semp from ext4_allocation_context and ext4_buddy structs. Signed-off-by: Amir Goldstein <amir73il@users.sf.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-05-09 21:52:36 -04:00
Amir Goldstein	9b8b7d353f	ext4: teach ext4_mb_init_cache() to skip uptodate buddy caches After online resize which adds new groups, some of the groups in a buddy page may be initialized and uptodate, while other (new ones) may be uninitialized. The indication for init of new block groups is when ext4_mb_init_cache() is called with an uptodate buddy page. In this case, initialized groups on that buddy page must be skipped when initializing the buddy cache. Signed-off-by: Amir Goldstein <amir73il@users.sf.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-05-09 21:49:42 -04:00
Amir Goldstein	2de8807b25	ext4: synchronize ext4_mb_init_group() with buddy page lock The old routines ext4_mb_[get\|put]_buddy_cache_lock(), which used to take grp->alloc_sem for all groups on the buddy page have been replaced with the routines ext4_mb_[get\|put]_buddy_page_lock(). The new routines take both buddy and bitmap page locks to protect against concurrent init of groups on the same buddy page. The GROUP_NEED_INIT flag is tested again under page lock to check if the group was initialized by another caller. Signed-off-by: Amir Goldstein <amir73il@users.sf.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-05-09 21:48:13 -04:00
Amir Goldstein	e73a347b77	ext4: implement ext4_add_groupblocks() by freeing blocks The old imlementation used to take grp->alloc_sem and set the GROUP_NEED_INIT flag, so that the buddy cache would be reloaded. The new implementation updates the buddy cache by freeing the added blocks and making them available for use, so there is no need to reload the buddy cache and there is no need to take grp->alloc_sem. Signed-off-by: Amir Goldstein <amir73il@users.sf.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-05-09 21:40:01 -04:00
Theodore Ts'o	2cd05cc393	ext4: remove unneeded ext4_journal_get_undo_access The block allocation code used to use jbd2_journal_get_undo_access as a way to make changes that wouldn't show up until the commit took place. The new multi-block allocation code has a its own way of preventing newly freed blocks from getting reused until the commit takes place (it avoids updating the buddy bitmaps until the commit is done), so we don't need to use jbd2_journal_get_undo_access(), which has extra overhead compared to jbd2_journal_get_write_access(). There was one last vestigal use of ext4_journal_get_undo_access() in ext4_add_groupblocks(); change it to use ext4_journal_get_write_access() and then remove the ext4_journal_get_undo_access() support. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-05-09 10:58:45 -04:00
Amir Goldstein	2846e82004	ext4: move ext4_add_groupblocks() to mballoc.c In preparation for the next patch, the function ext4_add_groupblocks() is moved to mballoc.c, where it could use some static functions. Signed-off-by: Amir Goldstein <amir73il@users.sf.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-05-09 10:46:41 -04:00
Amerigo Wang	66bb82798d	ext4: remove redundant #ifdef in super.c There is already an #ifdef CONFIG_QUOTA some lines above, so this one is totally useless. Signed-off-by: WANG Cong <amwang@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-05-09 10:30:41 -04:00
Tao Ma	55ff3840a2	ext4: remove redundant check for first_not_zeroed in ext4_register_li_request We have checked first_not_zeroed == ngroups already above, so remove this redundant check. sbi->s_li_request = NULL above is also removed since it is NULL already. Cc: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-05-09 10:28:41 -04:00
Tao Ma	00d098822f	ext4: use s_inodes_per_block directly in __ext4_get_inode_loc In __ext4_get_inode_loc, we calculate inodes_per_block every time by EXT4_BLOCK_SIZE(sb) / EXT4_INODE_SIZE(sb). AFAICS, this function is a hot path for ext4, so we'd better use s_inodes_per_block directly instead of calculating every time. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-05-09 10:26:41 -04:00
Tao Ma	e8bbe8c401	ext4: use EXT4FS_DEBUG instead of EXT4_DEBUG in fsync.c We have EXT4FS_DEBUG for some old debug and CONFIG_EXT4_DEBUG for the new mballoc debug, but there isn't any EXT4_DEBUG. As CONFIG_EXT4_DEBUG seems to be only used in mballoc, use EXT4FS_DEBUG in fsync.c. [ It doesn't really matter; although I'm including this commit for consistency's sake. The whole point of the #ifdef's is to disable the debugging code. In general you're not going to want to enable all of the code protected by EXT4FS_DEBUG at the same time. -- Ted ] Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-05-09 10:25:54 -04:00
Yongqiang Yang	667eff35a1	ext4: reimplement convert and split_unwritten Reimplement ext4_ext_convert_to_initialized() and ext4_split_unwritten_extents() using ext4_split_extent() Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Tested-by: Allison Henderson <achender@linux.vnet.ibm.com>	2011-05-03 12:25:07 -04:00
Yongqiang Yang	47ea3bb59c	ext4: add ext4_split_extent_at() and ext4_split_extent() Add two functions: ext4_split_extent_at(), which splits an extent into two extents at given logical block, and ext4_split_extent() which splits an extent into three extents. Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Tested-by: Allison Henderson <achender@linux.vnet.ibm.com>	2011-05-03 12:23:07 -04:00
Yongqiang Yang	197217a5af	ext4: add a function merging extents right and left 1) Rename ext4_ext_try_to_merge() to ext4_ext_try_to_merge_right(). 2) Add a new function ext4_ext_try_to_merge() which tries to merge an extent both left and right. 3) Use the new function in ext4_ext_convert_unwritten_endio() and ext4_ext_insert_extent(). Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Tested-by: Allison Henderson <achender@linux.vnet.ibm.com>	2011-05-03 11:45:29 -04:00
Jan Kara	df5e622340	ext4: fix deadlock in ext4_symlink() in ENOSPC conditions ext4_symlink() cannot call __page_symlink() with transaction open. __page_symlink() calls ext4_write_begin() which can wait for transaction commit if we are running out of space thus causing a deadlock. Also error recovery in ext4_truncate_failed_write() does not count with the transaction being already started (although I'm not aware of any particular deadlock here). Fix the problem by stopping a transaction before calling __page_symlink() (we have to be careful and put inode to orphan list so that it gets deleted in case of crash) and starting another one after __page_symlink() returns for addition of symlink into a directory. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-05-03 11:12:58 -04:00
Jan Kara	7ad8e4e6ae	ext4: Fix fs corruption when make_indexed_dir() fails When make_indexed_dir() fails (e.g. because of ENOSPC) after it has allocated block for index tree root, we did not properly mark all changed buffers dirty. This lead to only some of these buffers being written out and thus effectively corrupting the directory. Fix the issue by marking all changed data dirty even in the error failure case. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-05-03 11:05:55 -04:00
Theodore Ts'o	74e4e6db38	ext4: set extents flag when migrating file to use extents Fix a typo that was introduced in commit `07a038245b` (in 2.6.36) which caused the extents flag not to be set at the conclusion of converting an inode to use extents. Reported-by: Peter Uchno <peter.uchno@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-05-03 09:34:42 -04:00
Shaohua Li	dc2070a241	ext4: remove dead code in ext4_has_free_blocks() percpu_counter_sum_positive() never returns a negative value. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-05-01 18:11:18 -04:00
Theodore Ts'o	d9f34504e6	ext4: ignore errors when issuing discards This is an effective revert of commit a30eec2a8: "ext4: stop issuing discards if not supported by device". The problem is that there are some devices that may return errors in response to a discard request some times but not others. (One example would be a hybrid dm device which concatenates an SSD and an HDD device). By this logic, I also removed the error checking from ext4's FITRIM code; so that an error from a discard will not stop the FITRIM from trying to trim the rest of the file system. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-04-30 13:47:24 -04:00
Curt Wohlgemuth	39db00f1c4	ext4: don't set PageUptodate in ext4_end_bio() In the bio completion routine, we should not be setting PageUptodate at all -- it's set at sys_write() time, and is unaffected by success/failure of the write to disk. This can cause a page corruption bug when the file system's block size is less than the architecture's VM page size. if we have only written a single block -- we might end up setting the page's PageUptodate flag, indicating that page is completely read into memory, which may not be true. This could cause subsequent reads to get bad data. This commit also takes the opportunity to clean up error handling in ext4_end_bio(), and remove some extraneous code: - fixes ext4_end_bio() to set AS_EIO in the page->mapping->flags on error, which was left out by mistake. This is needed so that fsync() will return an error if there was an I/O error. - remove the clear_buffer_dirty() call on unmapped buffers for each page. - consolidate page/buffer error handling in a single section. Signed-off-by: Curt Wohlgemuth <curtw@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reported-by: Jim Meyering <jim@meyering.net> Reported-by: Hugh Dickins <hughd@google.com> Cc: Mingming Cao <cmm@us.ibm.com>	2011-04-30 13:26:26 -04:00
Theodore Ts'o	2035e77605	ext4: check for ext[23] file system features when mounting as ext[23] Provide better emulation for ext[23] mode by enforcing that the file system does not have any unsupported file system features as defined by ext[23] when emulating the ext[23] file system driver when CONFIG_EXT4_USE_FOR_EXT23 is defined. This causes the file system type information in /proc/mounts to be correct for the automatically mounted root file system. This also means that "mount -t ext2 /dev/sda /mnt" will fail if /dev/sda contains an ext3 or ext4 file system, just as one would expect if the original ext2 file system driver were in use. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-04-18 17:29:14 -04:00
Yang Ruirui	26626f1172	ext4: release page cache in ext4_mb_load_buddy error path Add missing page_cache_release in the error path of ext4_mb_load_buddy Signed-off-by: Yang Ruirui <ruirui.r.yang@tieto.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2011-04-16 19:17:48 -04:00
Linus Torvalds	a97b52022a	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: fix data corruption regression by reverting commit `6de9843dab` ext4: Allow indirect-block file to grow the file size to max file size ext4: allow an active handle to be started when freezing ext4: sync the directory inode in ext4_sync_parent() ext4: init timer earlier to avoid a kernel panic in __save_error_info jbd2: fix potential memory leak on transaction commit ext4: fix a double free in ext4_register_li_request ext4: fix credits computing for indirect mapped files ext4: remove unnecessary [cm]time update of quota file jbd2: move bdget out of critical section	2011-04-11 15:45:47 -07:00
Theodore Ts'o	c820563602	ext4: fix data corruption regression by reverting commit `6de9843dab` Revert commit `6de9843dab`, since it caused a data corruption regression with BitTorrent downloads. Thanks to Damien for discovering and bisecting to find the problem commit. https://bugzilla.kernel.org/show_bug.cgi?id=32972 Reported-by: Damien Grassart <damien@grassart.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-04-10 22:30:07 -04:00
Kazuya Mio	f80da1e70f	ext4: Allow indirect-block file to grow the file size to max file size We can create 4402345721856 byte file with indirect block mapping. However, if we grow an indirect-block file to the size with ftruncate(), we can see an ext4 warning. The following patch fixes this problem. How to reproduce: # dd if=/dev/zero of=/mnt/mp1/hoge bs=1 count=0 seek=4402345721856 0+0 records in 0+0 records out 0 bytes (0 B) copied, 0.000221428 s, 0.0 kB/s # tail -n 1 /var/log/messages Nov 25 15:10:27 test kernel: EXT4-fs warning (device sda8): ext4_block_to_path:345: block 1074791436 > max in inode 12 Signed-off-by: Kazuya Mio <k-mio@sx.jp.nec.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-04-10 22:06:36 -04:00
Yongqiang Yang	be4f27d324	ext4: allow an active handle to be started when freezing ext4_journal_start_sb() should not prevent an active handle from being started due to s_frozen. Otherwise, deadlock is easy to happen, below is a situation. ================================================ freeze \| truncate ================================================ \| ext4_ext_truncate() freeze_super() \| starts a handle sets s_frozen \| \| ext4_ext_truncate() \| holds i_data_sem ext4_freeze() \| waits for updates \| \| ext4_free_blocks() \| calls dquot_free_block() \| \| dquot_free_blocks() \| calls ext4_dirty_inode() \| \| ext4_dirty_inode() \| trys to start an active \| handle \| \| block due to s_frozen ================================================ Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reported-by: Amir Goldstein <amir73il@users.sf.net> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Andreas Dilger <adilger@dilger.ca>	2011-04-10 22:06:07 -04:00
Curt Wohlgemuth	0893ed458b	ext4: sync the directory inode in ext4_sync_parent() ext4 has taken the stance that, in the absence of a journal, when an fsync/fdatasync of an inode is done, the parent directory should be sync'ed if this inode entry is new. ext4_sync_parent(), which implements this, does indeed sync the dirent pages for parent directories, but it does not sync the directory inode. This patch fixes this. Also now return error status from ext4_sync_parent(). I tested this using a power fail test, which panics a machine running a file server getting requests from a client. Without this patch, on about every other test run, the server is missing many, many files that had been synced. With this patch, on > 6 runs, I see zero files being lost. Google-Bug-Id: 4179519 Signed-off-by: Curt Wohlgemuth <curtw@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-04-10 22:05:31 -04:00
Tao Ma	0449641130	ext4: init timer earlier to avoid a kernel panic in __save_error_info During mount, when we fail to open journal inode or root inode, the __save_error_info will mod_timer. But actually s_err_report isn't initialized yet and the kernel oops. The detailed information can be found https://bugzilla.kernel.org/show_bug.cgi?id=32082. The best way is to check whether the timer s_err_report is initialized or not. But it seems that in include/linux/timer.h, we can't find a good function to check the status of this timer, so this patch just move the initializtion of s_err_report earlier so that we can avoid the kernel panic. The corresponding del_timer is also added in the error path. Reported-by: Sami Liedes <sliedes@cc.hut.fi> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-04-05 19:55:28 -04:00
Tao Ma	46e4690bbd	ext4: fix a double free in ext4_register_li_request In ext4_register_li_request, we malloc a ext4_li_request and inserts it into ext4_li_info->li_request_list. In case of any error later, we free it in the end. But if we have some error in ext4_run_lazyinit_thread, the whole li_request_list will be dropped and freed in it. So we will double free this ext4_li_request. This patch just sets elr to NULL after it is inserted to the list so that the latter kfree won't double free it. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2011-04-04 16:00:49 -04:00
Yongqiang Yang	5b41395fcc	ext4: fix credits computing for indirect mapped files When writing a contiguous set of blocks, two indirect blocks could be needed depending on how the blocks are aligned, so we need to increase the number of credits needed by one. [ Also fixed a another bug which could further underestimate the number of journal credits needed by 1; the code was using integer division instead of DIV_ROUND_UP() -- tytso] Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2011-04-04 15:40:24 -04:00
Jan Kara	21f976975c	ext4: remove unnecessary [cm]time update of quota file It is not necessary to update [cm]time of quota file on each quota file write and it wastes journal space and IO throughput with inode writes. So just remove the updating from ext4_quota_write() and only update times when quotas are being turned off. Userspace cannot get anything reliable from quota files while they are used by the kernel anyway. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-04-04 15:33:39 -04:00
Lucas De Marchi	25985edced	Fix common misspellings Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>	2011-03-31 11:26:23 -03:00
Linus Torvalds	ae005cbed1	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (43 commits) ext4: fix a BUG in mb_mark_used during trim. ext4: unused variables cleanup in fs/ext4/extents.c ext4: remove redundant set_buffer_mapped() in ext4_da_get_block_prep() ext4: add more tracepoints and use dev_t in the trace buffer ext4: don't kfree uninitialized s_group_info members ext4: add missing space in printk's in __ext4_grp_locked_error() ext4: add FITRIM to compat_ioctl. ext4: handle errors in ext4_clear_blocks() ext4: unify the ext4_handle_release_buffer() api ext4: handle errors in ext4_rename jbd2: add COW fields to struct jbd2_journal_handle jbd2: add the b_cow_tid field to journal_head struct ext4: Initialize fsync transaction ids in ext4_new_inode() ext4: Use single thread to perform DIO unwritten convertion ext4: optimize ext4_bio_write_page() when no extent conversion is needed ext4: skip orphan cleanup if fs has unknown ROCOMPAT features ext4: use the nblocks arg to ext4_truncate_restart_trans() ext4: fix missing iput of root inode for some mount error paths ext4: make FIEMAP and delayed allocation play well together ext4: suppress verbose debugging information if malloc-debug is off ... Fi up conflicts in fs/ext4/super.c due to workqueue changes	2011-03-25 09:57:41 -07:00
Linus Torvalds	6c51038900	Merge branch 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block * 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits) Documentation/iostats.txt: bit-size reference etc. cfq-iosched: removing unnecessary think time checking cfq-iosched: Don't clear queue stats when preempt. blk-throttle: Reset group slice when limits are changed blk-cgroup: Only give unaccounted_time under debug cfq-iosched: Don't set active queue in preempt block: fix non-atomic access to genhd inflight structures block: attempt to merge with existing requests on plug flush block: NULL dereference on error path in __blkdev_get() cfq-iosched: Don't update group weights when on service tree fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away block: Require subsystems to explicitly allocate bio_set integrity mempool jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging fs: make fsync_buffers_list() plug mm: make generic_writepages() use plugging blk-cgroup: Add unaccounted time to timeslice_used. block: fixup plugging stubs for !CONFIG_BLOCK block: remove obsolete comments for blkdev_issue_zeroout. blktrace: Use rq->cmd_flags directly in blk_add_trace_rq. ... Fix up conflicts in fs/{aio.c,super.c}	2011-03-24 10:16:26 -07:00
Serge E. Hallyn	2e14967075	userns: rename is_owner_or_cap to inode_owner_or_capable And give it a kernel-doc comment. [akpm@linux-foundation.org: btrfs changed in linux-next] Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Daniel Lezcano <daniel.lezcano@free.fr> Acked-by: David Howells <dhowells@redhat.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:47:13 -07:00
Akinobu Mita	50e0168cc3	ext4: use little-endian bitops As a preparation for removing ext2 non-atomic bit operations from asm/bitops.h. This converts ext2 non-atomic bit operations to little-endian bit operations. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Acked-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:46:17 -07:00
Tao Ma	0ba0851714	ext4: fix a BUG in mb_mark_used during trim. In a bs=4096 volume, if we call FITRIM with the following parameter as fstrim_range(start = 102400, len = 134144000, minlen = 10240), we will trigger this BUG_ON: BUG_ON(start + len > (e4b->bd_sb->s_blocksize << 3)); Mar 4 00:55:52 boyu-tm kernel: ------------[ cut here ]------------ Mar 4 00:55:52 boyu-tm kernel: kernel BUG at fs/ext4/mballoc.c:1506! Mar 4 01:21:09 boyu-tm kernel: Code: d4 00 00 00 00 49 89 fe 8b 56 0c 44 8b 7e 04 89 55 c4 48 8b 4f 28 89 d6 44 01 fe 48 63 d6 48 8b 41 18 48 c1 e0 03 48 39 c2 76 04 <0f> 0b eb fe 48 8b 55 b0 8b 47 34 3b 42 08 74 04 0f 0b eb fe 48 Mar 4 01:21:09 boyu-tm kernel: RIP [<ffffffffa053eb42>] mb_mark_used+0x47/0x26c [ext4] Mar 4 01:21:09 boyu-tm kernel: RSP <ffff880121e45c38> Mar 4 01:21:09 boyu-tm kernel: ---[ end trace 9f461696f6a9dcf2 ]--- Fix this bug by doing the accounting correctly. Cc: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-03-23 15:48:11 -04:00
Sergey Senozhatsky	65922cb5ce	ext4: unused variables cleanup in fs/ext4/extents.c ext4 extents cleanup: . remove unused `ex' from check_eofblocks_fl . remove unused `eh' from ext4_ext_map_blocks Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-03-23 14:08:27 -04:00
Feng Tang	6de9843dab	ext4: remove redundant set_buffer_mapped() in ext4_da_get_block_prep() The map_bh() call will have already set the buffer_head to mapped. Signed-off-by: Feng Tang <feng.tang@intel.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-03-23 14:05:03 -04:00
Jiaying Zhang	0562e0bad4	ext4: add more tracepoints and use dev_t in the trace buffer - Add more ext4 tracepoints. - Change ext4 tracepoints to use dev_t field with MAJOR/MINOR macros so that we can save 4 bytes in the ring buffer on some platforms. - Add sync_mode to ext4_da_writepages, ext4_da_write_pages, and ext4_da_writepages_result tracepoints. Also remove for_reclaim field from ext4_da_writepages since it is usually not very useful. Signed-off-by: Jiaying Zhang <jiayingz@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-03-21 21:38:05 -04:00
Eric Sandeen	4596fe0767	ext4: don't kfree uninitialized s_group_info members We can call kfree on uninitialized members of the s_group_info array on an the error path. We can avoid this by kzalloc'ing the array. This doesn't entirely solve the oops on mount if we fail down this path; failed_mount4: frees the sbi, for one, which gets referenced later in the failed mount paths - I haven't worked that out yet. https://bugzilla.kernel.org/show_bug.cgi?id=30872 Reported-by: Eugene A. Shatokhin <dame_eugene@mail.ru> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-03-21 21:25:13 -04:00
Robin Dong	21149d611e	ext4: add missing space in printk's in __ext4_grp_locked_error() When we do performence-testing on ext4 filesystem, we observed a warning like this: EXT4-fs error (device sda7): ext4_mb_generate_buddy:718: group 259825901 blocks in bitmap, 26057 in gd instead, it should be "group 2598, 25901 blocks in bitmap, 26057 in gd" Reviewed-by: Coly Li <bosong.ly@taobao.com> Cc: Tao Ma <boyu.mt@taobao.com> Signed-off-by: Robin Dong <sanbai@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-03-21 20:39:22 -04:00
Tao Ma	a56e69c28a	ext4: add FITRIM to compat_ioctl. FITRIM isn't added in compat_ioctl. So a 32 bit program can't be executed in a 64 bit platform. Add it in the compat_ioctl. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-03-20 23:16:58 -04:00
Amir Goldstein	d67d121834	ext4: handle errors in ext4_clear_blocks() Checking return code from ext4_journal_get_write_access() is important with snapshots, because this function invokes COW, so may return new errors, such as ENOSPC. ext4_clear_blocks() now returns < 0 for fatal errors, in which case, ext4_free_data() is aborted. Signed-off-by: Amir Goldstein <amir73il@users.sf.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-03-20 22:59:02 -04:00
Amir Goldstein	537a03103c	ext4: unify the ext4_handle_release_buffer() api There are two wrapper functions which do exactly the same thing: ext4_journal_release_buffer(), and ext4_handle_release_buffer(). In addition, ext4_xattr_block_set() calls jbd2_journal_release_buffer() directly. Unify all of the code to use ext4_handle_release_buffer(), and get rid of ext4_journal_release_buffer(). Signed-off-by: Amir Goldstein <amir73il@users.sf.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-03-20 22:57:02 -04:00
Amir Goldstein	ef60789302	ext4: handle errors in ext4_rename Checking return code from ext4_journal_get_write_access() is important with snapshots, because this function invokes COW, so may return new errors, such as ENOSPC. We move the call to ext4_journal_get_write_access earlier in the function, to simplify error handling in the case that this function returns returns an error. Signed-off-by: Amir Goldstein <amir73il@users.sf.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-03-20 21:18:44 -04:00
Linus Torvalds	e16b396ce3	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (47 commits) doc: CONFIG_UNEVICTABLE_LRU doesn't exist anymore Update cpuset info & webiste for cgroups dcdbas: force SMI to happen when expected arch/arm/Kconfig: remove one to many l's in the word. asm-generic/user.h: Fix spelling in comment drm: fix printk typo 'sracth' Remove one to many n's in a word Documentation/filesystems/romfs.txt: fixing link to genromfs drivers:scsi Change printk typo initate -> initiate serial, pch uart: Remove duplicate inclusion of linux/pci.h header fs/eventpoll.c: fix spelling mm: Fix out-of-date comments which refers non-existent functions drm: Fix printk typo 'failled' coh901318.c: Change initate to initiate. mbox-db5500.c Change initate to initiate. edac: correct i82975x error-info reported edac: correct i82975x mci initialisation edac: correct commented info fs: update comments to point correct document target: remove duplicate include of target/target_core_device.h from drivers/target/target_core_hba.c ... Trivial conflict in fs/eventpoll.c (spelling vs addition)	2011-03-18 10:37:40 -07:00
Theodore Ts'o	688f869ce3	ext4: Initialize fsync transaction ids in ext4_new_inode() When allocating a new inode, we need to make sure i_sync_tid and i_datasync_tid are initialized. Otherwise, one or both of these two values could be left initialized to zero, which could potentially result in BUG_ON in jbd2_journal_commit_transaction. (This could happen by having journal->commit_request getting set to zero, which could wake up the kjournald process even though there is no running transaction, which then causes a BUG_ON via the J_ASSERT(j_ruinning_transaction != NULL) statement. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-03-16 17:16:31 -04:00
Linus Torvalds	0f6e0e8448	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (33 commits) AppArmor: kill unused macros in lsm.c AppArmor: cleanup generated files correctly KEYS: Add an iovec version of KEYCTL_INSTANTIATE KEYS: Add a new keyctl op to reject a key with a specified error code KEYS: Add a key type op to permit the key description to be vetted KEYS: Add an RCU payload dereference macro AppArmor: Cleanup make file to remove cruft and make it easier to read SELinux: implement the new sb_remount LSM hook LSM: Pass -o remount options to the LSM SELinux: Compute SID for the newly created socket SELinux: Socket retains creator role and MLS attribute SELinux: Auto-generate security_is_socket_class TOMOYO: Fix memory leak upon file open. Revert "selinux: simplify ioctl checking" selinux: drop unused packet flow permissions selinux: Fix packet forwarding checks on postrouting selinux: Fix wrong checks for selinux_policycap_netpeer selinux: Fix check for xfrm selinux context algorithm ima: remove unnecessary call to ima_must_measure IMA: remove IMA imbalance checking ...	2011-03-16 09:15:43 -07:00
Linus Torvalds	bd2895eead	Merge branch 'for-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq * 'for-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: fix build failure introduced by s/freezeable/freezable/ workqueue: add system_freezeable_wq rds/ib: use system_wq instead of rds_ib_fmr_wq net/9p: replace p9_poll_task with a work net/9p: use system_wq instead of p9_mux_wq xfs: convert to alloc_workqueue() reiserfs: make commit_wq use the default concurrency level ocfs2: use system_wq instead of ocfs2_quota_wq ext4: convert to alloc_workqueue() scsi/scsi_tgt_lib: scsi_tgtd isn't used in memory reclaim path scsi/be2iscsi,qla2xxx: convert to alloc_workqueue() misc/iwmc3200top: use system_wq instead of dedicated workqueues i2o: use alloc_workqueue() instead of create_workqueue() acpi: kacpi*_wq don't need WQ_MEM_RECLAIM fs/aio: aio_wq isn't used in memory reclaim path input/tps6507x-ts: use system_wq instead of dedicated workqueue cpufreq: use system_wq instead of dedicated workqueues wireless/ipw2x00: use system_wq instead of dedicated workqueues arm/omap: use system_wq in mailbox workqueue: use WQ_MEM_RECLAIM instead of WQ_RESCUER	2011-03-16 08:20:19 -07:00
Aneesh Kumar K.V	f2fa2ffc20	ext4: Copy fs UUID to superblock File system UUID is made available to application via /proc/<pid>/mountinfo Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-15 02:21:45 -04:00
Aneesh Kumar K.V	f17b604207	fs: Remove i_nlink check from file system link callback Now that VFS check for inode->i_nlink == 0 and returns proper error, remove similar check from file system Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-15 02:21:44 -04:00
Jens Axboe	721a9602e6	block: kill off REQ_UNPLUG With the plugging now being explicitly controlled by the submitter, callers need not pass down unplugging hints to the block layer. If they want to unplug, it's because they manually plugged on their own - in which case, they should just unplug at will. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-10 08:52:27 +01:00
Jens Axboe	7eaceaccab	block: remove per-queue plugging Code has been converted over to the new explicit on-stack plugging, and delay users have been converted to use the new API for that. So lets kill off the old plugging along with aops->sync_page(). Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-10 08:52:07 +01:00
James Morris	fe3fa43039	Merge branch 'master' of git://git.infradead.org/users/eparis/selinux into next	2011-03-08 11:38:10 +11:00
Mingming Cao	198868f35d	ext4: Use single thread to perform DIO unwritten convertion While running ext4 testing on multiple core, we found there are per cpu ext4-dio-unwritten threads processing conversion from unwritten extents to written for IOs completed from async direct IO patch. Per filesystem is enough, we don't need per cpu threads to work on conversion. Signed-off-by: Mingming Cao <cmm@us.ibm.com>	2011-03-05 11:52:45 -05:00
Theodore Ts'o	b616844310	ext4: optimize ext4_bio_write_page() when no extent conversion is needed If no extent conversion is required, wake up any processes waiting for the page's writeback to be complete and free the ext4_io_end structure directly in ext4_end_bio() instead of dropping it on the linked list (which requires taking a spinlock to queue and dequeue the io_end structure), and waiting for the workqueue to do this work. This removes an extra scheduling delay before process waiting for an fsync() to complete gets woken up, and it also reduces the CPU overhead for a random write workload. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-02-28 13:12:38 -05:00
Amir Goldstein	d39195c33b	ext4: skip orphan cleanup if fs has unknown ROCOMPAT features Orphan cleanup is currently executed even if the file system has some number of unknown ROCOMPAT features, which deletes inodes and frees blocks, which could be very bad for some RO_COMPAT features, especially the SNAPSHOT feature. This patch skips the orphan cleanup if it contains readonly compatible features not known by this ext4 implementation, which would prevent the fs from being mounted (or remounted) readwrite. Signed-off-by: Amir Goldstein <amir73il@users.sf.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-02-28 00:53:45 -05:00
Amir Goldstein	8e8eaabefe	ext4: use the nblocks arg to ext4_truncate_restart_trans() nblocks is passed into ext4_truncate_restart_trans() from ext4_ext_truncate_extend_restart() with a value different from the default blocks_for_truncate(), but is being ignored. The two other calls to ext4_truncate_restart_trans() already pass the default value, which is then being recalculated inside the function. Fix the problem by using the passed argument. Signed-off-by: Amir Goldstein <amir73il@users.sf.net>	2011-02-27 23:32:12 -05:00
Manish Katiyar	32a9bb57d7	ext4: fix missing iput of root inode for some mount error paths This assures that the root inode is not leaked, and that sb->s_root is NULL, which will prevent generic_shutdown_super() from doing extra work, including call sync_filesystem, which ultimately results in ext4_sync_fs() getting called with an uninitialized struct super, which is the cause of the crash noted in Kernel Bugzilla #26752. https://bugzilla.kernel.org/show_bug.cgi?id=26752 Signed-off-by: Manish Katiyar <mkatiyar@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-02-27 20:42:06 -05:00
Yongqiang Yang	6d9c85eb70	ext4: make FIEMAP and delayed allocation play well together Fix the FIEMAP ioctl so that it returns all of the page ranges which are still subject to delayed allocation. We were missing some cases if the file was sparse. Reported by Chris Mason <chris.mason@oracle.com>: >We've had reports on btrfs that cp is giving us files full of zeros >instead of actually copying them. It was tracked down to a bug with >the btrfs fiemap implementation where it was returning holes for >delalloc ranges. > >Newer versions of cp are trusting fiemap to tell it where the holes >are, which does seem like a pretty neat trick. > >I decided to give xfs and ext4 a shot with a few tests cases too, xfs >passed with all the ones btrfs was getting wrong, and ext4 got the basic >delalloc case right. >$ mkfs.ext4 /dev/xxx >$ mount /dev/xxx /mnt >$ dd if=/dev/zero of=/mnt/foo bs=1M count=1 >$ fiemap-test foo >ext: 0 logical: [ 0.. 255] phys: 0.. 255 >flags: 0x007 tot: 256 > >Horray! But once we throw a hole in, things go bad: >$ mkfs.ext4 /dev/xxx >$ mount /dev/xxx /mnt >$ dd if=/dev/zero of=/mnt/foo bs=1M count=1 seek=1 >$ fiemap-test foo >< no output > > >We've got a delalloc extent after the hole and ext4 fiemap didn't find >it. If I run sync to kick the delalloc out: >$sync >$ fiemap-test foo >ext: 0 logical: [ 256.. 511] phys: 34048.. 34303 >flags: 0x001 tot: 256 > >fiemap-test is sitting in my /usr/local/bin, and I have no idea how it >got there. It's full of pretty comments so I know it isn't mine, but >you can grab it here: > >http://oss.oracle.com/~mason/fiemap-test.c > >xfsqa has a fiemap program too. After Fix, test results are as follows: ext: 0 logical: [ 256.. 511] phys: 0.. 255 flags: 0x007 tot: 256 ext: 0 logical: [ 256.. 511] phys: 33280.. 33535 flags: 0x001 tot: 256 $ mkfs.ext4 /dev/xxx $ mount /dev/xxx /mnt $ dd if=/dev/zero of=/mnt/foo bs=1M count=1 seek=1 $ sync $ dd if=/dev/zero of=/mnt/foo bs=1M count=1 seek=3 $ dd if=/dev/zero of=/mnt/foo bs=1M count=1 seek=5 $ fiemap-test foo ext: 0 logical: [ 256.. 511] phys: 33280.. 33535 flags: 0x000 tot: 256 ext: 1 logical: [ 768.. 1023] phys: 0.. 255 flags: 0x006 tot: 256 ext: 2 logical: [ 1280.. 1535] phys: 0.. 255 flags: 0x007 tot: 256 Tested-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-02-27 17:25:47 -05:00
Theodore Ts'o	4dd89fc625	ext4: suppress verbose debugging information if malloc-debug is off If CONFIG_EXT4_DEBUG is enabled, then if a block allocation fails due to disk being full, a verbose debugging message is printed, even if the malloc-debug switch has not been enabled. Suppress the debugging message so that nothing is printed unless malloc-debug has been turned on. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-02-27 17:23:47 -05:00
Theodore Ts'o	a54aa76108	ext4: don't leave PageWriteback set after memory failure In ext4_bio_write_page(), if the memory allocation for the struct ext4_io_page fails, it returns with the page's PageWriteback flag set. This will end up causing the page not to skip writeback in WB_SYNC_NONE mode, and in WB_SYNC_ALL mode (i.e., on a sync, fsync, or umount) the writeback daemon will get stuck forever on the wait_on_page_writeback() function in write_cache_pages_da(). Or, if journalling is enabled and the file gets deleted, it the journal thread can get stuck in journal_finish_inode_data_buffers() call to filemap_fdatawait(). Another place where things can get hung up is in truncate_inode_pages(), called out of ext4_evict_inode(). Fix this by not setting PageWriteback until after we have successfully allocated the struct ext4_io_page. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-02-27 16:43:24 -05:00
Theodore Ts'o	168fc0223c	ext4: move setup of the mpd structure to write_cache_pages_da() Move the initialization of all of the fields of the mpd structure to write_cache_pages_da(). This simplifies the code considerably. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-02-26 14:09:20 -05:00
Theodore Ts'o	78aaced340	ext4: don't lock the next page in write_cache_pages if not needed If we have accumulated a contiguous region of memory to be written out, and the next page can added to this region, don't bother locking (and then unlocking the page) before writing out the memory. In the unlikely event that the next page was being written back by some other CPU, we can also skip waiting that page to finish writeback. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-02-26 14:09:14 -05:00
Theodore Ts'o	ee6ecbcc5d	ext4: remove page_skipped hackery in ext4_da_writepages() Because the ext4 page writeback codepath had been prematurely calling clear_page_dirty_for_io(), if it turned out that a particular page couldn't be written out during a particular pass of write_cache_pages_da(), the page would have to get redirtied by calling redirty_pages_for_writeback(). Not only was this wasted work, but redirty_page_for_writeback() would increment wbc->pages_skipped to signal to writeback_sb_inodes() that buffers were locked, and that it should skip this inode until later. Since this signal was incorrect in ext4's case --- which was caused by ext4's historically incorrect use of write_cache_pages() --- ext4_da_writepages() saved and restored wbc->skipped_pages to avoid confusing writeback_sb_inodes(). Now that we've fixed ext4 to call clear_page_dirty_for_io() right before initiating the page I/O, we can nuke the page_skipped save/restore hackery, and breathe a sigh of relief. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-02-26 14:08:11 -05:00
Theodore Ts'o	9749895644	ext4: clear the dirty bit for a page in writeback at the last minute Move when we call clear_page_dirty_for_io() to just before we actually write the page. This simplifies the code somewhat, and avoids marking pages as clean and then needing to remark them as dirty later. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-02-26 14:08:01 -05:00
Theodore Ts'o	4f01b02c8c	ext4: simple cleanups to write_cache_pages_da() Eliminate duplicate code, unneeded variables, etc., to make it easier to understand the code. No behavioral changes were made in this patch. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-02-26 14:07:37 -05:00
Theodore Ts'o	8eb9e5ce21	ext4: fold __mpage_da_writepage() into write_cache_pages_da() Fold the __mpage_da_writepage() function into write_cache_pages_da(). This will give us opportunities to clean up and simplify the resulting code. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-02-26 14:07:31 -05:00
Theodore Ts'o	6fd7a46781	ext4: enable mblk_io_submit by default Now that we've fixed the file corruption bug in commit `d50bdd5aa5`, it's time to enable mblk_io_submit by default. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-02-26 13:53:09 -05:00
Curt Wohlgemuth	c7f5938adc	ext4: fix ext4_da_block_invalidatepages() to handle page range properly If ext4_da_block_invalidatepages() is called because of a failure from ext4_map_blocks() in mpage_da_map_and_submit(), it's supposed to clean up -- including unlock -- all the pages in the mpd structure. But these values may not match up, even on a system in which block size == page size: mpd->b_blocknr != mpd->first_page mpd->b_size != (mpd->next_page - mpd->first_page) ext4_da_block_invalidatepages() has been using b_blocknr and b_size; this patch changes it to use first_page and next_page. Tested: I injected a small number (5%) of failures in ext4_map_blocks() in the case that the flags contain EXT4_GET_BLOCKS_DELALLOC_RESERVE, and ran fsstress on this kernel. Without this patch, I got hung tasks every time. With this patch, I see no hangs in many runs of fsstress. Signed-off-by: Curt Wohlgemuth <curtw@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-02-26 12:27:52 -05:00
Curt Wohlgemuth	e0fd9b9076	ext4: mark multi-page IO complete on mapping failure In mpage_da_map_and_submit(), if we have a delayed block allocation failure from ext4_map_blocks(), we need to mark the IO as complete, by setting mpd->io_done = 1; Otherwise, we could end up submitting the pages in an outer loop; since they are unlocked on mapping failure in ext4_da_block_invalidatepages(), this will cause a bug check in mpage_da_submit_io(). I tested this by injected failures into ext4_map_blocks(). Without this patch, a simple fsstress run will bug check; with the patch, it works fine. Signed-off-by: Curt Wohlgemuth <curtw@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-02-26 12:25:52 -05:00
Coly Li	5a54b2f199	ext4: mballoc: don't replace the current preallocation group unnecessarily In ext4_mb_check_group_pa(), the current preallocation space is replaced with a new preallocation space when the two have the same distance from the goal block. This doesn't actually gain us anything, so change things so that the function only switches to the new preallocation group if its distance from the goal block is strictly smaller than the current preallocaiton group's distance from the goal block. Signed-off-by: Coly Li <bosong.ly@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-02-24 14:10:05 -05:00
Coly Li	58696f3ab2	ext4: clarify description of ac_g_ex in struct ext4_allocation_context Signed-off-by: Coly Li <bosong.ly@taobao.com> Cc: Alex Tomas <alex@clusterfs.com> Cc: Theodore Tso <tytso@google.com>	2011-02-24 14:10:00 -05:00
Coly Li	7c78605929	mballoc: add comments to ext4_mb_mark_free_simple() This patch adds comments to ext4_mb_mark_free_simple to make it more understandable. Signed-off-by: Coly Li <bosong.ly@taobao.com> Cc: Alex Tomas <alex@clusterfs.com> Cc: Theodore Tso <tytso@google.com>	2011-02-24 13:24:25 -05:00
Coly Li	235772da3e	ext4: remove unncessary call mb_find_buddy() in debugging code In __mb_check_buddy(), look at the code below: 591 fstart = -1; 592 buddy = mb_find_buddy(e4b, 0, &max); 593 for (i = 0; i < max; i++) { 594 if (!mb_test_bit(i, buddy)) { 595 MB_CHECK_ASSERT(i >= e4b->bd_info->bb_first_free); 596 if (fstart == -1) { 597 fragments++; 598 fstart = i; 599 } 600 continue; 601 } 602 fstart = -1; 603 /* check used bits only */ 604 for (j = 0; j < e4b->bd_blkbits + 1; j++) { 605 buddy2 = mb_find_buddy(e4b, j, &max2); 606 k = i >> j; 607 MB_CHECK_ASSERT(k < max2); 608 MB_CHECK_ASSERT(mb_test_bit(k, buddy2)); 609 } 610 } 611 MB_CHECK_ASSERT(!EXT4_MB_GRP_NEED_INIT(e4b->bd_info)); 612 MB_CHECK_ASSERT(e4b->bd_info->bb_fragments == fragments); 613 614 grp = ext4_get_group_info(sb, e4b->bd_group); 615 buddy = mb_find_buddy(e4b, 0, &max); On line 592, buddy is fetched by mb_find_buddy() with order 0, between line 593 to line 615, buddy is not changed, therefore there is no need to fetch buddy again from mb_find_buddy() with order 0 again. We can safely remove the second mb_find_buddy() on line 615. Signed-off-by: Coly Li <bosong.ly@taobao.com> Cc: Alex Tomas <alex@clusterfs.com> Cc: Theodore Tso <tytso@google.com>	2011-02-24 13:24:18 -05:00
Coly Li	84b775a354	ext4: code cleanup in mb_find_buddy() Current code calculate max no matter whether order is zero, it's unnecessary. This cleanup patch sets max to "1 << (e4b->bd_blkbits + 3)" only when order == 0. Signed-off-by: Coly Li <bosong.ly@taobao.com> Cc: Alex Tomas <alex@clusterfs.com> Cc: Theodore Tso <tytso@google.com>	2011-02-24 12:51:59 -05:00
Eric Sandeen	ea66333694	ext4: enable acls and user_xattr by default There's no good reason to require the extra step of providing a mount option for acl or user_xattr once the feature is configured on; no other filesystem that I know of requires this. Userspace patches have set these options in default mount options, and this patch makes them default in the kernel. At some point we can start to deprecate the options, perhaps. For now I've removed default mount option checks in show_options() to be explicit about what's set, since it's changing the default, but I'm open to alternatives if desired. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-02-23 17:51:51 -05:00
Lukas Czerner	5c2ed62fd4	ext4: Adjust minlen with discard_granularity in the FITRIM ioctl Discard granularity tells us the minimum size of extent that can be discarded by the device. If the user supplies a minimum extent that should be discarded (range.minlen) which is smaller than the discard granularity, increase minlen to the discard granularity, since there's no point submitting trim requests that the device will reject anyway. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-02-23 17:49:51 -05:00
Lukas Czerner	4143179218	ext4: check if device support discard in FITRIM ioctl For a device that does not support discard, the FITRIM ioctl returns -EOPNOTSUPP when blkdev_issue_discard() returns this error code, which is how the user is informed that the device does not support discard. If there are no suitable free extents to be trimmed, then FITRIM will return success even though the device does not support discard, which could confuse the user. So check explicitly if the device supports discard and return an error code at the beginning of the FITRIM ioctl processing. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-02-23 12:42:32 -05:00
Lukas Czerner	0b75a84012	ext4: mark file-local functions and variables as static Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-02-23 12:22:49 -05:00
Alexander V. Lukyanov	5dbd571d87	ext4: allow inode_readahead_blks=0 (linux-2.6.37) I cannot disable inode-read-ahead feature of ext4 (on 2.6.37): # echo 0 > /sys/fs/ext4/sda2/inode_readahead_blks bash: echo: write error: Invalid argument On a server with lots of small files and random access this read-ahead makes performance worse, and I'd like to disable it. I work around this problem by using value of 1, but it still reads an extra block. This patch fixes the problem by checking for zero explicitly. Signed-off-by: Alexander V. Lukyanov <lav@netis.ru> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-02-21 21:33:21 -05:00
Peter Huewe	7dc576158d	ext4: Fix sparse warning: Using plain integer as NULL pointer This patch fixes the warning "Using plain integer as NULL pointer", generated by sparse, by replacing the offending 0s with NULL. Signed-off-by: Peter Huewe <peterhuewe@gmx.de> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-02-21 21:01:42 -05:00
Theodore Ts'o	da488945f4	ext4: fix compile warnings with EXT4FS_DEBUG enabled Compile 2.6.38-rc1 with turning EXT4FS_DEBUG on, we get following compile warnings. This patch fixes them. CC fs/ext4/hash.o CC fs/ext4/resize.o fs/ext4/resize.c: In function 'setup_new_group_blocks': fs/ext4/resize.c:233:2: warning: format '%#04llx' expects type 'long long unsigned int', but argument 3 has type 'long unsigned int' fs/ext4/resize.c:251:2: warning: format '%#04llx' expects type 'long long unsigned int', but argument 3 has type 'long unsigned int' CC fs/ext4/extents.o CC fs/ext4/ext4_jbd2.o CC fs/ext4/migrate.o Reported-by: Akira Fujita <a-fujita@rs.jp.nec.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-02-21 20:39:58 -05:00
Tejun Heo	43d133c18b	Merge branch 'master' into for-2.6.39	2011-02-21 09:43:56 +01:00
Paul Bolle	fd018fe823	ext4: fix comment typo uninitized Signed-off-by: Paul Bolle <pebolle@tiscali.nl> Reviewed-by: Jesper Juhl <jj@chaosbits.net> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2011-02-15 10:28:59 +01:00
Jiri Kosina	0a9d59a246	Merge branch 'master' into for-next	2011-02-15 10:24:31 +01:00
Eric Sandeen	e9e3bcecf4	ext4: serialize unaligned asynchronous DIO ext4 has a data corruption case when doing non-block-aligned asynchronous direct IO into a sparse file, as demonstrated by xfstest 240. The root cause is that while ext4 preallocates space in the hole, mappings of that space still look "new" and dio_zero_block() will zero out the unwritten portions. When more than one AIO thread is going, they both find this "new" block and race to zero out their portion; this is uncoordinated and causes data corruption. Dave Chinner fixed this for xfs by simply serializing all unaligned asynchronous direct IO. I've done the same here. The difference is that we only wait on conversions, not all IO. This is a very big hammer, and I'm not very pleased with stuffing this into ext4_file_write(). But since ext4 is DIO_LOCKING, we need to serialize it at this high level. I tried to move this into ext4_ext_direct_IO, but by then we have the i_mutex already, and we will wait on the work queue to do conversions - which must also take the i_mutex. So that won't work. This was originally exposed by qemu-kvm installing to a raw disk image with a normal sector-63 alignment. I've tested a backport of this patch with qemu, and it does avoid the corruption. It is also quite a lot slower (14 min for package installs, vs. 8 min for well-aligned) but I'll take slow correctness over fast corruption any day. Mingming suggested that we can track outstanding conversions, and wait on those so that non-sparse files won't be affected, and I've implemented that here; unaligned AIO to nonsparse files won't take a perf hit. [tytso@mit.edu: Keep the mutex as a hashed array instead of bloating the ext4 inode] [tytso@mit.edu: Fix up namespace issues so that global variables are protected with an "ext4_" prefix.] Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-02-12 08:17:34 -05:00
Eric Sandeen	2892c15ddd	ext4: make grpinfo slab cache names static In 2.6.37 I was running into oopses with repeated module loads & unloads. I tracked this down to: `fb1813f4` ext4: use dedicated slab caches for group_info structures (this was in addition to the features advert unload problem) The kstrdup & subsequent kfree of the cache name was causing a double free. In slub, at least, if I read it right it allocates & frees the name itself, slab seems to do something different... so in slub I think we were leaking -our- cachep->name, and double freeing the one allocated by slub. After getting lost in slab/slub/slob a bit, I just looked at other sized-caches that get allocated. jbd2, biovec, sgpool all do it more or less the way jbd2 does. Below patch follows the jbd2 method of dynamically allocating a cache at mount time from a list of static names. (This might also possibly fix a race creating the caches with parallel mounts running). [Folded in a fix from Dan Carpenter which fixed an off-by-one error in the original patch] Cc: stable@kernel.org Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-02-12 08:12:18 -05:00
Curt Wohlgemuth	d50bdd5aa5	ext4: Fix data corruption with multi-block writepages support This fixes a corruption problem with the multi-block writepages submittal change for ext4, from commit `bd2d0210cf` ("ext4: use bio layer instead of buffer layer in mpage_da_submit_io"). (Note that this corruption is not present in 2.6.37 on ext4, because the corruption was detected after the feature was merged in 2.6.37-rc1, and so it was turned off by adding a non-default mount option, mblk_io_submit. With this commit, which hopefully fixes the last of the bugs with this feature, we'll be able to turn on this performance feature by default in 2.6.38, and remove the mblk_io_submit option.) The ext4 code path to bundle multiple pages for writeback in ext4_bio_write_page() had a bug: we should be clearing buffer head dirty flags before we submit the bio, not in the completion routine. The patch below was tested on 2.6.37 under KVM with the postgresql script which was submitted by Jon Nelson as documented in commit `1449032be1`. Without the patch, I'd hit the corruption problem about 50-70% of the time. With the patch, I executed the script > 100 times with no corruption seen. I also fixed a bug to make sure ext4_end_bio() doesn't dereference the bio after the bio_put() call. Reported-by: Jon Nelson <jnelson@jamponi.net> Reported-by: Matthias Bayer <jackdachef@gmail.com> Signed-off-by: Curt Wohlgemuth <curtw@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2011-02-07 12:46:14 -05:00
Theodore Ts'o	dd68314ccf	ext4: fix up ext4 error handling Make sure we the correct cleanup happens if we die while trying to load the ext4 file system. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-02-03 14:33:49 -05:00
Lukas Czerner	8f021222c1	ext4: unregister features interface on module unload Ext4 features interface was not properly unregistered which led to problems while unloading/reloading ext4 module. This commit fixes that by adding proper kobject unregistration code into ext4_exit_fs() as well as fail-path of ext4_init_fs() Reported-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2011-02-03 14:33:33 -05:00
Eric Sandeen	8f1f745331	ext4: fix panic on module unload when stopping lazyinit thread https://bugzilla.kernel.org/show_bug.cgi?id=27652 If the lazyinit thread is running, the teardown function ext4_destroy_lazyinit_thread() has problems: ext4_clear_request_list(); while (ext4_li_info->li_task) { wake_up(&ext4_li_info->li_wait_daemon); wait_event(ext4_li_info->li_wait_task, ext4_li_info->li_task == NULL); } Clearing the request list will cause the thread to exit and free ext4_li_info, so then we're waiting on something which is getting freed. Fix this up by making the thread respond to kthread_stop, and exit, without the need to wait for that exit in some other homegrown way. Cc: stable@kernel.org Reported-and-Tested-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-02-03 14:33:15 -05:00
Eric Paris	2a7dba391e	fs/vfs/security: pass last path component to LSM on inode creation SELinux would like to implement a new labeling behavior of newly created inodes. We currently label new inodes based on the parent and the creating process. This new behavior would also take into account the name of the new object when deciding the new label. This is not the (supposed) full path, just the last component of the path. This is very useful because creating /etc/shadow is different than creating /etc/passwd but the kernel hooks are unable to differentiate these operations. We currently require that userspace realize it is doing some difficult operation like that and than userspace jumps through SELinux hoops to get things set up correctly. This patch does not implement new behavior, that is obviously contained in a seperate SELinux patch, but it does pass the needed name down to the correct LSM hook. If no such name exists it is fine to pass NULL. Signed-off-by: Eric Paris <eparis@redhat.com>	2011-02-01 11:12:29 -05:00
Tejun Heo	fd89d5f203	ext4: convert to alloc_workqueue() Convert create_workqueue() to alloc_workqueue(). This is an identity conversion. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: linux-ext4@vger.kernel.org	2011-02-01 11:42:42 +01:00
Linus Torvalds	4843456c5c	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: quota: Fix deadlock during path resolution	2011-01-21 07:33:37 -08:00
Tao Ma	b8d6568a12	ext4: Fix comment typo "especiially". Change "especiially" to "especially". Cc: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2011-01-21 16:27:01 +01:00
Christoph Hellwig	2fe17c1075	fallocate should be a file operation Currently all filesystems except XFS implement fallocate asynchronously, while XFS forced a commit. Both of these are suboptimal - in case of O_SYNC I/O we really want our allocation on disk, especially for the !KEEP_SIZE case where we actually grow the file with user-visible zeroes. On the other hand always commiting the transaction is a bad idea for fast-path uses of fallocate like for example in recent Samba versions. Given that block allocation is a data plane operation anyway change it from an inode operation to a file operation so that we have the file structure available that lets us check for O_SYNC. This also includes moving the code around for a few of the filesystems, and remove the already unnedded S_ISDIR checks given that we only wire up fallocate for regular files. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-01-17 02:25:31 -05:00
Christoph Hellwig	64c23e8687	make the feature checks in ->fallocate future proof Instead of various home grown checks that might need updates for new flags just check for any bit outside the mask of the features supported by the filesystem. This makes the check future proof for any newly added flag. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-01-17 02:25:30 -05:00
Linus Torvalds	275220f0fc	Merge branch 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block * 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block: (43 commits) block: ensure that completion error gets properly traced blktrace: add missing probe argument to block_bio_complete block cfq: don't use atomic_t for cfq_group block cfq: don't use atomic_t for cfq_queue block: trace event block fix unassigned field block: add internal hd part table references block: fix accounting bug on cross partition merges kref: add kref_test_and_get bio-integrity: mark kintegrityd_wq highpri and CPU intensive block: make kblockd_workqueue smarter Revert "sd: implement sd_check_events()" block: Clean up exit_io_context() source code. Fix compile warnings due to missing removal of a 'ret' variable fs/block: type signature of major_to_index(int) to major_to_index(unsigned) block: convert !IS_ERR(p) && p to !IS_ERR_NOR_NULL(p) cfq-iosched: don't check cfqg in choose_service_tree() fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors cdrom: export cdrom_check_events() sd: implement sd_check_events() sr: implement sr_check_events() ...	2011-01-13 10:45:01 -08:00
Linus Torvalds	b2034d474b	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (41 commits) fs: add documentation on fallocate hole punching Gfs2: fail if we try to use hole punch Btrfs: fail if we try to use hole punch Ext4: fail if we try to use hole punch Ocfs2: handle hole punching via fallocate properly XFS: handle hole punching via fallocate properly fs: add hole punching to fallocate vfs: pass struct file to do_truncate on O_TRUNC opens (try #2) fix signedness mess in rw_verify_area() on 64bit architectures fs: fix kernel-doc for dcache::prepend_path fs: fix kernel-doc for dcache::d_validate sanitize ecryptfs ->mount() switch afs move internal-only parts of ncpfs headers to fs/ncpfs switch ncpfs switch 9p pass default dentry_operations to mount_pseudo() switch hostfs switch affs switch configfs ...	2011-01-13 10:27:28 -08:00
Linus Torvalds	008d23e485	Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits) Documentation/trace/events.txt: Remove obsolete sched_signal_send. writeback: fix global_dirty_limits comment runtime -> real-time ppc: fix comment typo singal -> signal drivers: fix comment typo diable -> disable. m68k: fix comment typo diable -> disable. wireless: comment typo fix diable -> disable. media: comment typo fix diable -> disable. remove doc for obsolete dynamic-printk kernel-parameter remove extraneous 'is' from Documentation/iostats.txt Fix spelling milisec -> ms in snd_ps3 module parameter description Fix spelling mistakes in comments Revert conflicting V4L changes i7core_edac: fix typos in comments mm/rmap.c: fix comment sound, ca0106: Fix assignment to 'channel'. hrtimer: fix a typo in comment init/Kconfig: fix typo anon_inodes: fix wrong function name in comment fix comment typos concerning "consistent" poll: fix a typo in comment ... Fix up trivial conflicts in: - drivers/net/wireless/iwlwifi/iwl-core.c (moved to iwl-legacy.c) - fs/ext4/ext4.h Also fix missed 'diabled' typo in drivers/net/bnx2x/bnx2x.h while at it.	2011-01-13 10:05:56 -08:00
Andrew Morton	6db26ffc91	fs/ext4/inode.c: use pr_warn_ratelimited() pr_warning_ratelimited() doesn't exist. Also include printk.h, which defines these things. Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 08:03:05 -08:00
Josef Bacik	d6dc8462f4	Ext4: fail if we try to use hole punch Ext4 doesn't have the ability to punch holes yet, so make sure we return EOPNOTSUPP if we try to use hole punching through fallocate. This support can be added later. Thanks, Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-01-12 20:16:44 -05:00
Jan Kara	f00c9e44ad	quota: Fix deadlock during path resolution As Al Viro pointed out path resolution during Q_QUOTAON calls to quotactl is prone to deadlocks. We hold s_umount semaphore for reading during the path resolution and resolution itself may need to acquire the semaphore for writing when e. g. autofs mountpoint is passed. Solve the problem by performing the resolution before we get hold of the superblock (and thus s_umount semaphore). The whole thing is complicated by the fact that some filesystems (OCFS2) ignore the path argument. So to distinguish between filesystem which want the path and which do not we introduce new .quota_on_meta callback which does not get the path. OCFS2 then uses this callback instead of old .quota_on. CC: Al Viro <viro@ZenIV.linux.org.uk> CC: Christoph Hellwig <hch@lst.de> CC: Ted Ts'o <tytso@mit.edu> CC: Joel Becker <joel.becker@oracle.com> Signed-off-by: Jan Kara <jack@suse.cz>	2011-01-12 19:14:55 +01:00
Linus Torvalds	e9688f6aca	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (44 commits) ext4: fix trimming starting with block 0 with small blocksize ext4: revert buggy trim overflow patch ext4: don't pass entire map to check_eofblocks_fl ext4: fix memory leak in ext4_free_branches ext4: remove ext4_mb_return_to_preallocation() ext4: flush the i_completed_io_list during ext4_truncate ext4: add error checking to calls to ext4_handle_dirty_metadata() ext4: fix trimming of a single group ext4: fix uninitialized variable in ext4_register_li_request ext4: dynamically allocate the jbd2_inode in ext4_inode_info as necessary ext4: drop i_state_flags on architectures with 64-bit longs ext4: reorder ext4_inode_info structure elements to remove unneeded padding ext4: drop ec_type from the ext4_ext_cache structure ext4: use ext4_lblk_t instead of sector_t for logical blocks ext4: replace i_delalloc_reserved_flag with EXT4_STATE_DELALLOC_RESERVED ext4: fix 32bit overflow in ext4_ext_find_goal() ext4: add more error checks to ext4_mkdir() ext4: ext4_ext_migrate should use NULL not 0 ext4: Use ext4_error_file() to print the pathname to the corrupted inode ext4: use IS_ERR() to check for errors in ext4_error_file ...	2011-01-11 14:37:31 -08:00
Jan Kara	0f0a25bf51	ext4: fix trimming starting with block 0 with small blocksize When s_first_data_block is not zero (which happens e.g. when block size is 1KB) and trim ioctl is called to start trimming from block 0, the math in ext4_get_group_no_and_offset() overflows. The overall result is that ioctl returns EINVAL which is kind of unexpected and we probably don't want userspace tools to bother with internal details of filesystem structure. So just silently increase starting offset (and shorten length) when starting block is below s_first_data_block. CC: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-01-11 15:16:31 -05:00
Theodore Ts'o	0a2179b169	ext4: revert buggy trim overflow patch This reverts commit 4f531501e44: ext4: fix possible overflow in ext4_trim_fs() Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-01-11 14:42:29 -05:00
Eric Sandeen	d002ebf1d8	ext4: don't pass entire map to check_eofblocks_fl Since check_eofblocks_fl() only uses the m_lblk portion of the map structure, we may as well pass that directly, rather than passing the entire map, which IMHO obfuscates what parameters check_eofblocks_fl() cares about. Not a big deal, but seems tidier and less confusing, to me. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-01-10 13:03:35 -05:00
Theodore Ts'o	1c5b9e9065	ext4: fix memory leak in ext4_free_branches Commit `40389687` moved a call to ext4_forget() out of ext4_free_branches and let ext4_free_blocks() handle calling bforget(). But that change unfortunately did not replace the call to ext4_forget() with brelse(), which was needed to drop the in-use count of the indirect block's buffer head, which lead to a memory leak when deleting files that used indirect blocks. Fix this. Thanks to Hugh Dickins for pointing this out. Cc: stable@kernel.org Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-01-10 12:51:28 -05:00
Theodore Ts'o	a5196f8cdf	ext4: remove ext4_mb_return_to_preallocation() This function was never implemented, except for a BUG_ON which was tripping when ext4 is run without a journal. The problem is that although the comment asserts that "truncate (which is the only way to free block) discards all preallocations", ext4_free_blocks() is also called in various error recovery paths when blocks have been allocated, but for various reasons, we were not able to use those data blocks (for example, because we ran out of memory while trying to manipulate the extent tree, or some other similar situation). In addition to the fact that this function isn't implemented except for the incorrect BUG_ON, the single caller of this function, ext4_free_blocks(), doesn't use it all if the journal is enabled. So remove the (stub) function entirely for now. If we decide it's better to add it back, it's only going to be useful with a relatively large number of code changes anyway. Google-Bug-Id: 3236408 Cc: Jiaying Zhang <jiayingz@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-01-10 12:47:07 -05:00
Jiaying Zhang	3889fd57ea	ext4: flush the i_completed_io_list during ext4_truncate Ted first found the bug when running 2.6.36 kernel with dioread_nolock mount option that xfstests #13 complained about wrong file size during fsck. However, the bug exists in the older kernels as well although it is somehow harder to trigger. The problem is that ext4_end_io_work() can happen after we have truncated an inode to a smaller size. Then when ext4_end_io_work() calls ext4_convert_unwritten_extents(), we may reallocate some blocks that have been truncated, so the inode size becomes inconsistent with the allocated blocks. The following patch flushes the i_completed_io_list during truncate to reduce the risk that some pending end_io requests are executed later and convert already truncated blocks to initialized. Note that although the fix helps reduce the problem a lot there may still be a race window between vmtruncate() and ext4_end_io_work(). The fundamental problem is that if vmtruncate() is called without either i_mutex or i_alloc_sem held, it can race with an ongoing write request so that the io_end request is processed later when the corresponding blocks have been truncated. Ted and I have discussed the problem offline and we saw a few ways to fix the race completely: a) We guarantee that i_mutex lock and i_alloc_sem write lock are both hold whenever vmtruncate() is called. The i_mutex lock prevents any new write requests from entering writeback and the i_alloc_sem prevents the race from ext4_page_mkwrite(). Currently we hold both locks if vmtruncate() is called from do_truncate(), which is probably the most common case. However, there are places where we may call vmtruncate() without holding either i_mutex or i_alloc_sem. I would like to ask for other people's opinions on what locks are expected to be held before calling vmtruncate(). There seems a disagreement among the callers of that function. b) We change the ext4 write path so that we change the extent tree to contain the newly allocated blocks and update i_size both at the same time --- when the write of the data blocks is completed. c) We add some additional locking to synchronize vmtruncate() and ext4_end_io_work(). This approach may have performance implications so we need to be careful. All of the above proposals may require more substantial changes, so we may consider to take the following patch as a bandaid. Signed-off-by: Jiaying Zhang <jiayingz@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-01-10 12:47:05 -05:00
Theodore Ts'o	b40971426a	ext4: add error checking to calls to ext4_handle_dirty_metadata() Call ext4_std_error() in various places when we can't bail out cleanly, so the file system can be marked as in error. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-01-10 12:46:59 -05:00
Jan Kara	ca6e909f9b	ext4: fix trimming of a single group When ext4_trim_fs() is called to trim a part of a single group, the logic will wrongly set last block of the interval to 'len' instead of 'first_block + len'. Thus a shorter interval is possibly trimmed. Fix it. CC: Lukas Czerner <lczerner@redhat.com> Cc: stable@kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-01-10 12:30:39 -05:00
Andrew Morton	6c5a6cb998	ext4: fix uninitialized variable in ext4_register_li_request fs/ext4/super.c: In function 'ext4_register_li_request': fs/ext4/super.c:2936: warning: 'ret' may be used uninitialized in this function It looks buggy to me, too. Cc: Lukas Czerner <lczerner@redhat.com> Cc: stable@kernel.org Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-01-10 12:30:17 -05:00
Theodore Ts'o	8aefcd557d	ext4: dynamically allocate the jbd2_inode in ext4_inode_info as necessary Replace the jbd2_inode structure (which is 48 bytes) with a pointer and only allocate the jbd2_inode when it is needed --- that is, when the file system has a journal present and the inode has been opened for writing. This allows us to further slim down the ext4_inode_info structure. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-01-10 12:29:43 -05:00
Theodore Ts'o	353eb83c14	ext4: drop i_state_flags on architectures with 64-bit longs We can store the dynamic inode state flags in the high bits of EXT4_I(inode)->i_flags, and eliminate i_state_flags. This saves 8 bytes from the size of ext4_inode_info structure, which when multiplied by the number of the number of in the inode cache, can save a lot of memory. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-01-10 12:18:25 -05:00
Theodore Ts'o	8a2005d3f8	ext4: reorder ext4_inode_info structure elements to remove unneeded padding By reordering the elements in the ext4_inode_info structure, we can reduce the padding needed on an x86_64 system by 16 bytes. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-01-10 12:13:42 -05:00
Theodore Ts'o	b05e6ae58a	ext4: drop ec_type from the ext4_ext_cache structure We can encode the ec_type information by using ee_len == 0 to denote EXT4_EXT_CACHE_NO, ee_start == 0 to denote EXT4_EXT_CACHE_GAP, and if neither is true, then the cache type must be EXT4_EXT_CACHE_EXTENT. This allows us to reduce the size of ext4_ext_inode by another 8 bytes. (ec_type is 4 bytes, plus another 4 bytes of padding) Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-01-10 12:13:26 -05:00
Theodore Ts'o	01f49d0b9d	ext4: use ext4_lblk_t instead of sector_t for logical blocks This fixes a number of places where we used sector_t instead of ext4_lblk_t for logical blocks, which for ext4 are still 32-bit data types. No point wasting space in the ext4_inode_info structure, and requiring 64-bit arithmetic on 32-bit systems, when it isn't necessary. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-01-10 12:13:03 -05:00
Theodore Ts'o	f232109773	ext4: replace i_delalloc_reserved_flag with EXT4_STATE_DELALLOC_RESERVED Remove the short element i_delalloc_reserved_flag from the ext4_inode_info structure and replace it a new bit in i_state_flags. Since we have an ext4_inode_info for every ext4 inode cached in the inode cache, any savings we can produce here is a very good thing from a memory utilization perspective. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-01-10 12:12:36 -05:00
Kazuya Mio	ad4fb9cafe	ext4: fix 32bit overflow in ext4_ext_find_goal() ext4_ext_find_goal() returns an ideal physical block number that the block allocator tries to allocate first. However, if a required file offset is smaller than the existing extent's one, ext4_ext_find_goal() returns a wrong block number because it may overflow at "block - le32_to_cpu(ex->ee_block)". This patch fixes the problem. ext4_ext_find_goal() will also return a wrong block number in case a file offset of the existing extent is too big. In this case, the ideal physical block number is fixed in ext4_mb_initialize_context(), so it's no problem. reproduce: # dd if=/dev/zero of=/mnt/mp1/tmp bs=127M count=1 oflag=sync # dd if=/dev/zero of=/mnt/mp1/file bs=512K count=1 seek=1 oflag=sync # filefrag -v /mnt/mp1/file Filesystem type is: ef53 File size of /mnt/mp1/file is 1048576 (256 blocks, blocksize 4096) ext logical physical expected length flags 0 128 67456 128 eof /mnt/mp1/file: 2 extents found # rm -rf /mnt/mp1/tmp # echo $((512*4096)) > /sys/fs/ext4/loop0/mb_stream_req # dd if=/dev/zero of=/mnt/mp1/file bs=512K count=1 oflag=sync conv=notrunc result (linux-2.6.37-rc2 + ext4 patch queue): # filefrag -v /mnt/mp1/file Filesystem type is: ef53 File size of /mnt/mp1/file is 1048576 (256 blocks, blocksize 4096) ext logical physical expected length flags 0 0 33280 128 1 128 67456 33407 128 eof /mnt/mp1/file: 2 extents found result(apply this patch): # filefrag -v /mnt/mp1/file Filesystem type is: ef53 File size of /mnt/mp1/file is 1048576 (256 blocks, blocksize 4096) ext logical physical expected length flags 0 0 66560 128 1 128 67456 66687 128 eof /mnt/mp1/file: 2 extents found Signed-off-by: Kazuya Mio <k-mio@sx.jp.nec.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-01-10 12:12:28 -05:00
Namhyung Kim	dabd991f9d	ext4: add more error checks to ext4_mkdir() Check return value of ext4_journal_get_write_access, ext4_journal_dirty_metadata and ext4_mark_inode_dirty. Move brelse() under 'out_stop' to release bh properly in case of journal error. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-01-10 12:11:16 -05:00
Eric Paris	f1dffc4c54	ext4: ext4_ext_migrate should use NULL not 0 ext4_ext_migrate() calls ext4_new_inode() and passes 0 instead of a pointer to a struct qstr. This patch uses NULL, to make it obvious to the caller that this was a pointer. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-01-10 12:11:00 -05:00

... 2 3 4 5 6 ...

1417 Commits (f6f8285132907757ef84ef8dae0a1244b8cde6ac)