Commit graph

263266 commits

Author SHA1 Message Date
Dmitry Monakhov
fba90ffee8 ext4: migrate cleanup
This patch cleanup code a bit, actual logic not changed
- Move current block pointer to migrate_structure, let's all
  walk info will be in one structure.
- Get rid of usless null ind-block ptr checks, caller already
  does that check.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-29 09:03:00 -04:00
Theodore Ts'o
6cdbb0effc fs: optimize out 16 bytes worth of padding in struct inode
Rearrange the fields in struct inode so that on an x86_64 system,
fields that require 8-byte alignment don't end up causing 4-byte holes
in the structure.  It reduces the size of struct inode from 568 bytes
to 552 bytes.

Also move the fields protected by i_lock (i_blocks, i_bytes, and
i_size) into the same cache line as i_lock.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-29 08:24:18 -04:00
Eric Gouriou
80e675f906 ext4: optimize memmmove lengths in extent/index insertions
ext4_ext_insert_extent() (respectively ext4_ext_insert_index())
was using EXT_MAX_EXTENT() (resp. EXT_MAX_INDEX()) to determine
how many entries needed to be moved beyond the insertion point.
In practice this means that (320 - I) * 24 bytes were memmove()'d
when I is the insertion point, rather than (#entries - I) * 24 bytes.

This patch uses EXT_LAST_EXTENT() (resp. EXT_LAST_INDEX()) instead
to only move existing entries. The code flow is also simplified
slightly to highlight similarities and reduce code duplication in
the insertion logic.

This patch reduces system CPU consumption by over 25% on a 4kB
synchronous append DIO write workload when used with the
pre-2.6.39 x86_64 memmove() implementation. With the much faster
2.6.39 memmove() implementation we still see a decrease in
system CPU usage between 2% and 7%.

Note that the ext_debug() output changes with this patch, splitting
some log information between entries. Users of the ext_debug() output
should note that the "move %d" units changed from reporting the number
of bytes moved to reporting the number of entries moved.

Signed-off-by: Eric Gouriou <egouriou@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-27 11:52:18 -04:00
Eric Gouriou
6f91bc5fda ext4: optimize ext4_ext_convert_to_initialized()
This patch introduces a fast path in ext4_ext_convert_to_initialized()
for the case when the conversion can be performed by transferring
the newly initialized blocks from the uninitialized extent into
an adjacent initialized extent. Doing so removes the expensive
invocations of memmove() which occur during extent insertion and
the subsequent merge.

In practice this should be the common case for clients performing
append writes into files pre-allocated via
fallocate(FALLOC_FL_KEEP_SIZE). In such a workload performed via
direct IO and when using a suboptimal implementation of memmove()
(x86_64 prior to the 2.6.39 rewrite), this patch reduces kernel CPU
consumption by 32%.

Two new trace points are added to ext4_ext_convert_to_initialized()
to offer visibility into its operations. No exit trace point has
been added due to the multiplicity of return points. This can be
revisited once the upstream cleanup is backported.

Signed-off-by: Eric Gouriou <egouriou@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-27 11:43:23 -04:00
Thomas Gleixner
446066724c jdb/jbd2: factor out common functions from the jbd[2] header files
The state bits and the lock functions of jbd and jbd2 are
identical.  Share them.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-27 04:38:18 -04:00
Randy Dunlap
4470575461 jbd2: fix build when CONFIG_BUG is not enabled
Fix build error when CONFIG_BUG is not enabled:

fs/jbd2/transaction.c:1175:3: error: implicit declaration of function '__WARN'

by changing __WARN() to WARN_ON(), as suggested by
Arnaud Lacombe <lacombar@gmail.com>.

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Arnaud Lacombe <lacombar@gmail.com>
2011-10-27 04:05:13 -04:00
Tao Ma
b3ff056908 ext4: don't check io->flag when setting EXT4_STATE_DIO_UNWRITTEN inode state
When we want to convert the unitialized extent in direct write, we can
either do it in ext4_end_io_nolock(AIO case) or in
ext4_ext_direct_IO(non AIO case) and EXT4_I(inode)->cur_aio_dio is a
guard for ext4_ext_map_blocks to find the right case.  In e9e3bcecf,
we mistakenly change it by:

-			if (io)
+			if (io && !(io->flag & EXT4_IO_END_UNWRITTEN)) {
 				io->flag = EXT4_IO_END_UNWRITTEN;
-			else
+				atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten);
+			} else
 				ext4_set_inode_state(inode,
 						     EXT4_STATE_DIO_UNWRITTEN);

So now if we map 2 blocks, and the first one set the
EXT_IO_END_UNWRITTEN, the 2nd mapping will set inode state because of
the check for the flag. This is wrong.

Cc: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-26 11:08:39 -04:00
Robin Dong
0a10da73e1 ext4: fix a wrong comment in __mb_check_buddy()
The comment says the bit should be 0, but the after code assert the
bit to be 1.  This makes people confused, so fix it.

Signed-off-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-26 08:48:54 -04:00
Robin Dong
b051d8dc4e ext4: remove unused variable in mb_find_extent()
The variable 'ord' in function mb_find_extent() is redundant, so
remove it.

Signed-off-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-26 05:30:30 -04:00
Robin Dong
66a83cde47 ext4: remove unused variable in ext4_mb_generate_from_pa()
The variable 'count' in function ext4_mb_generate_from_pa() looks
useless, so remove it.

Signed-off-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-26 05:29:21 -04:00
Robin Dong
ebbe027797 ext4: use stream-alloc when mb_group_prealloc set to zero
The kernel will crash on 

ext4_mb_mark_diskspace_used:
	BUG_ON(ac->ac_b_ex.fe_len <= 0);

after we set /sys/fs/ext4/sda/mb_group_prealloc to zero and create new files in an ext4 filesystem.

The reason is: ac_b_ex.fe_len also set to zero(mb_group_prealloc) in ext4_mb_normalize_group_request
because the ac_flags contains EXT4_MB_HINT_GROUP_ALLOC.

I think when someone set mb_group_prealloc to zero, it means DO NOT USE GROUP PREALLOCATION,
so we should set alloc-strategy to STREAM in this case.

Signed-off-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-26 05:14:27 -04:00
Yongqiang Yang
fcbb551582 ext4: let ext4_page_mkwrite stop started handle in failure
The started journal handle should be stopped in failure case.

Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Acked-by: Jan Kara <jack@suse.cz>
Cc: stable@kernel.org
2011-10-26 05:00:19 -04:00
Curt Wohlgemuth
6f8ff53726 ext4: handle NULL p_ext in ext4_ext_next_allocated_block()
In ext4_ext_next_allocated_block(), the path[depth] might
have a p_ext that is NULL -- see ext4_ext_binsearch().  In
such a case, dereferencing it will crash the machine.

This patch checks for p_ext == NULL in
ext4_ext_next_allocated_block() before dereferencinging it.

Tested using a hand-crafted an inode with eh_entries == 0 in
an extent block, verified that running FIEMAP on it crashes
without this patch, works fine with it.

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-26 04:38:59 -04:00
Dan Carpenter
f85b287a01 ext4: error handling fix in ext4_ext_convert_to_initialized()
When allocated is unsigned it breaks the error handling at the end
of the function when we call:
	allocated = ext4_split_extent(...);
	if (allocated < 0)
		err = allocated;

I've made it a signed int instead of unsigned.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-26 03:42:36 -04:00
Eric Sandeen
665436175c ext4: use ext4_reserve_inode_write in ext4_xattr_set_handle
ext4_mark_iloc_dirty() says:

 * The caller must have previously called ext4_reserve_inode_write().
 * Give this, we know that the caller already has write access to iloc->bh.

ext4_xattr_set_handle, however, just open-codes it.  May as well use
the helper function for consistency.

No bug here, just tidiness.

(Note: on cleanup path, ext4_reserve_inode_write sets
the bh to NULL if it returns an error, and brelse() of 
a null bh is handled gracefully).

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-26 03:32:07 -04:00
Andreas Dilger
909a4cf1ff ext4: avoid setting directory i_nlink to zero
If a directory with more than EXT4_LINK_MAX subdirectories, the nlink
count is set to 1.  Subsequently, if any subdirectories are deleted,
ext4_dec_count() decrements the i_nlink count, which may go to 0
temporarily before being incremented back to 1.

While this is done under i_mutex, which prevents races for directory
and inode operations that check i_nlink, the temporary i_nlink == 0
case is exposed to userspace via stat() and similar calls that do not
hold i_mutex.

Instead, change the code to not decrement i_nlink count for any
directories that do not already have i_nlink larger than 2.

Reported-by: Cliff White <cliffw@whamcloud.com>
Reviewed-by: Johann Lombardi <johann@whamcloud.com>
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-26 03:22:31 -04:00
Darrick J. Wong
cf8039036a ext4: prevent stack overrun in ext4_file_open
In ext4_file_open, the filesystem records the mountpoint of the first
file that is opened after mounting the filesystem.  It does this by
allocating a 64-byte stack buffer, calling d_path() to grab the mount
point through which this file was accessed, and then memcpy()ing 64
bytes into the superblock's s_last_mounted field, starting from the
return value of d_path(), which is stored as "cp".  However, if cp >
buf (which it frequently is since path components are prepended
starting at the end of buf) then we can end up copying stack data into
the superblock.

Writing stack variables into the superblock doesn't sound like a great
idea, so use strlcpy instead.  Andi Kleen suggested using strlcpy
instead of strncpy.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-25 09:18:41 -04:00
Dmitry Monakhov
a4e5d88b1b ext4: update EOFBLOCKS flag on fallocate properly
EOFBLOCK_FL should be updated if called w/o FALLOCATE_FL_KEEP_SIZE
Currently it happens only if new extent was allocated.

TESTCASE:
fallocate test_file -n -l4096
fallocate test_file -l4096
Last fallocate cmd has updated size, but keept EOFBLOCK_FL set. And
fsck will complain about that.

Also remove ping pong in ext4_fallocate() in case of new extents,
where ext4_ext_map_blocks() clear EOFBLOCKS bit, and later
ext4_falloc_update_inode() restore it again.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-25 08:15:12 -04:00
Dmitry Monakhov
750c9c47a5 ext4: remove messy logic from ext4_ext_rm_leaf
- Both callers(truncate and punch_hole) already aligned left end point
  so we no longer need split logic here.
- Remove dead duplicated code.
- Call ext4_ext_dirty only after we have updated eh_entries, otherwise
  we'll loose entries update. Regression caused by d583fb87a3
  266'th testcase in xfstests (http://patchwork.ozlabs.org/patch/120872)

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-25 05:35:05 -04:00
Dmitry Monakhov
1939dd84b3 ext4: cleanup ext4_ext_grow_indepth code
Currently code make an impression what grow procedure is very complicated
and some mythical paths, blocks are involved. But in fact grow in depth
it relatively simple procedure:
 1) Just create new meta block and copy root data to that block.
 2) Convert root from extent to index if old depth == 0
 3) Update root block pointer

This patch does:
 - Reorganize code to make it more self explanatory
 - Do not pass path parameter to new_meta_block() in order to
   provoke allocation from inode's group because top-level block
   should site closer to it's inode, but not to leaf data block.

   [ This happens anyway, due to logic in mballoc; we should drop
     the path parameter from new_meta_block() entirely.  -- tytso ]

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-22 01:26:05 -04:00
Dmitry Monakhov
45dc63e7d8 ext4: Allow quota file use root reservation
Quota file is fs's metadata, so it is reasonable  to permit use
root resevation if necessary. This patch fix 265'th xfstest failure

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-20 20:07:23 -04:00
Kazuya Mio
8de49e674a ext4: fix the deadlock in mpage_da_map_and_submit()
If ext4_jbd2_file_inode() in mpage_da_map_and_submit() fails due to
journal abort, this function returns to caller without unlocking the
page.  It leads to the deadlock, and the patch fixes this issue by
calling mpage_da_submit_io().

Signed-off-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-20 19:23:08 -04:00
Akira Fujita
09e0834fb0 ext4: fix deadlock in ext4_ordered_write_end()
If ext4_jbd2_file_inode() in ext4_ordered_write_end() fails for some
reasons, this function returns to caller without unlocking the page.
It leads to the deadlock, and the patch fixes this issue.

Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-20 18:56:10 -04:00
H Hartley Sweeten
ee90d57e20 ext4: quiet sparse noise about plain integer as NULL pointer
The third parameter to ext4_free_blocks is a struct buffer_head *.  This
parameter should be NULL not 0.

This quiets the sparse noise:

warning: Using plain integer as NULL pointer

Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-18 11:01:51 -04:00
H Hartley Sweeten
e6705f7c25 ext4: add __user decoration to calls of copy_{from,to}_user()
This quiets the sparse noise:

warning: incorrect type in argument 2 (different address spaces)
   expected void const [noderef] <asn:1>*from
   got struct fstrim_range *<noident>
warning: incorrect type in argument 1 (different address spaces)
   expected void [noderef] <asn:1>*to
   got struct fstrim_range *<noident>

Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-18 10:59:51 -04:00
H Hartley Sweeten
e0cbee3e14 ext4: functions should not be declared extern
The function declarations in ext4.h are already marked extern, so it's
not necessary to do so in the .c files.

This quiets the sparse noise:

warning: function 'ext4_flush_completed_IO' with external linkage has definition
warning: function 'ext4_init_inode_table' with external linkage has definition

Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-18 10:57:51 -04:00
Shaohua Li
1bce63d1a2 ext4: add block plug for .writepages
Add block plug for ext4 .writepages. Though ext4 .writepages
already handles request merge very well, block plug is still
helpful to reduce block lock contention.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-18 10:55:51 -04:00
Darrick J. Wong
f6f96fdb8c ext4: Fix comparison endianness problem in MMP initialization
As part of startup, the MMP initialization code does this:

mmp->mmp_seq = seq = cpu_to_le32(mmp_new_seq());

Next, mmp->mmp_seq is written out to disk, a delay happens, and then
the MMP block is read back in and the sequence value is tested:

if (seq != le32_to_cpu(mmp->mmp_seq)) {
	/* fail the mount */

On a LE system such as x86, the *le32* functions do nothing and this
works.  Unfortunately, on a BE system such as ppc64, this comparison
becomes:

if (cpu_to_le32(new_seq) != le32_to_cpu(cpu_to_le32(new_seq)) {
	/* fail the mount */

Except for a few palindromic sequence numbers, this test always causes
the mount to fail, which makes MMP filesystems generally unmountable
on ppc64.  The attached patch fixes this situation.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-18 10:53:51 -04:00
Nikitas Angelinas
bdfc230f33 ext4: MMP: fix error message rate-limiting logic in kmmpd
Current logic would print an error message only once, and then
'failed_writes' would stay at 1.  Rework the loop to increment
'failed_writes' and print the error message every
s_mmp_update_interval * 60 seconds, as intended according to the
comment.

Signed-off-by: Nikitas Angelinas <nikitas_angelinas@xyratex.com>
Signed-off-by: Andrew Perepechko <andrew_perepechko@xyratex.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Acked-by: Andreas Dilger <adilger@dilger.ca>
2011-10-18 10:51:51 -04:00
Nikitas Angelinas
215fc6af73 ext4: MMP: kmmpd should use nodename from init_uts_ns.name, not sysname
sysname holds "Linux" by default, i.e. what appears when doing a "uname
-s"; nodename should be used to print the machine's hostname, i.e. what
is returned when doing a "uname -n" or "hostname", and what
gethostname(2)/sethostname(2) manipulate, in order to notify the
administrator of the node which is contending to mount the filesystem.

Acked-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Nikitas Angelinas <nikitas_angelinas@xyratex.com>
Signed-off-by: Andrew Perepechko <andrew_perepechko@xyratex.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-18 10:49:51 -04:00
Tao Ma
f472e02669 ext4: avoid stamping on other memories in ext4_ext_insert_index()
Add a sanity check to make sure ix hasn't gone beyond the valid bounds
of the extent block.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-17 10:13:46 -04:00
Fabrice Jouhaud
d44651d0f9 ext4: fix ext4 so it works without CONFIG_PROC_FS
This fixes a bug which was introduced in dd68314ccf.  The problem
came from the test of the return value of proc_mkdir which is always
false without procfs, and this would initialization of ext4.

Signed-off-by: Fabrice Jouhaud <yargil@free.fr>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-08 16:26:03 -04:00
Tao Ma
6ee3b21224 ext4: use le32_to_cpu for ext4_extent_idx.ei_block in ext4_ext_search_left()
ext4_extent_idx.e_block is __le32, so use le32_to_cpu() in
ext4_ext_search_left().

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-08 16:08:34 -04:00
Tao Ma
7fd59c83b0 ext4: remove the obsolete/broken EXT4_IOC_WAIT_FOR_READONLY ioctl
There are no users of the EXT4_IOC_WAIT_FOR_READONLY ioctl, and it is
also broken.  No one sets the set_ro_timer, no one wakes up us and our
state is set to TASK_INTERRUPTIBLE not RUNNING.  So remove it.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-08 15:56:35 -04:00
Tao Ma
df3ab17072 ext4: fix the comment describing ext4_ext_search_right()
The comment describing what ext4_ext_search_right() does is incorrect.
We return 0 in *phys when *logical is the 'largest' allocated block,
not smallest.  

Fix a few other typos while we're at it.

Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
2011-10-08 15:53:49 -04:00
Lukas Czerner
4113c4caa4 ext4: remove deprecated oldalloc
For a long time now orlov is the default block allocator in the
ext4. It performs better than the old one and no one seems to claim
otherwise so we can safely drop it and make oldalloc and orlov mount
option deprecated.

This is a part of the effort to reduce number of ext4 options hence the
test matrix.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-08 14:34:47 -04:00
Theodore Ts'o
af909a57fd ext4: documentation: remove acl and user_xattr mount options
Acl and user_xattr mount options are no longer needed since those
features are enabled by default if configured in (seee commit
ea66333694). We can not easily deprecate
mount options itself (since it is probably too early), but we can
remove it from documentation first.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-08 14:01:08 -04:00
Tao Ma
dcf2d804ed ext4: Free resources in some error path in ext4_fill_super
Some of the error path in ext4_fill_super don't release the
resouces properly. So this patch just try to release them
in the right way.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-06 12:10:11 -04:00
Tao Ma
7aa0baeaba ext4: Free resources in ext4_mb_init()'s error paths
In commit 79a77c5ac, we move ext4_mb_init_backend after the allocation
of s_locality_group to avoid memory leak in error path, but there are
still some other error paths in ext4_mb_init that need to do the same
work. So this patch adds all the error patch for ext4_mb_init. And all
the pointers are reset to NULL in case the caller may double free them.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-06 10:22:28 -04:00
Aditya Kali
5356f2615c ext4: attempt to fix race in bigalloc code path
Currently, there exists a race between delayed allocated writes and
the writeback when bigalloc feature is in use. The race was because we
wanted to determine what blocks in a cluster are under delayed
allocation and we were using buffer_delayed(bh) check for it. But, the
writeback codepath clears this bit without any synchronization which
resulted in a race and an ext4 warning similar to:

EXT4-fs (ram1): ext4_da_update_reserve_space: ino 13, used 1 with only 0
		reserved data blocks

The race existed in two places.
(1) between ext4_find_delalloc_range() and ext4_map_blocks() when called from
    writeback code path.
(2) between ext4_find_delalloc_range() and ext4_da_get_block_prep() (where
    buffer_delayed(bh) is set.

To fix (1), this patch introduces a new buffer_head state bit -
BH_Da_Mapped.  This bit is set under the protection of
EXT4_I(inode)->i_data_sem when we have actually mapped the delayed
allocated blocks during the writeout time. We can now reliably check
for this bit inside ext4_find_delalloc_range() to determine whether
the reservation for the blocks have already been claimed or not.

To fix (2), it was necessary to set buffer_delay(bh) under the
protection of i_data_sem.  So, I extracted the very beginning of
ext4_map_blocks into a new function - ext4_da_map_blocks() - and
performed the required setting of bh_delay bit and the quota
reservation under the protection of i_data_sem.  These two fixes makes
the checking of buffer_delay(bh) and buffer_da_mapped(bh) consistent,
thus removing the race.

Tested: I was able to reproduce the problem by running 'dd' and
'fsync' in parallel. Also, xfstests sometimes used to reproduce this
race. After the fix both my test and xfstests were successful and no
race (warning message) was observed.

Google-Bug-Id: 4997027

Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 19:20:51 -04:00
Aditya Kali
d8990240d8 ext4: add some tracepoints in ext4/extents.c
This patch adds some tracepoints in ext4/extents.c and updates a tracepoint in
ext4/inode.c.

Tested: Built and ran the kernel and verified that these tracepoints work.
Also ran xfstests.

Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 19:18:51 -04:00
Theodore Ts'o
df55c99dc8 ext4: rename ext4_has_free_blocks() to ext4_has_free_clusters()
Rename the function so it is more clear what is going on.  Also rename
the various variables so it's clearer what's happening.

Also fix a missing blocks to cluster conversion when reading the
number of reserved blocks for root.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 19:16:51 -04:00
Theodore Ts'o
e7d5f3156e ext4: rename ext4_claim_free_blocks() to ext4_claim_free_clusters()
This function really claims a number of free clusters, not blocks, so
rename it so it's clearer what's going on.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 19:14:51 -04:00
Theodore Ts'o
cff1dfd767 ext4: rename ext4_free_blocks_after_init() to ext4_free_clusters_after_init()
This function really returns the number of clusters after initializing
an uninitalized block bitmap has been initialized.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 19:12:51 -04:00
Theodore Ts'o
5dee54372c ext4: rename ext4_count_free_blocks() to ext4_count_free_clusters()
This function really counts the free clusters reported in the block
group descriptors, so rename it to reduce confusion.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 19:10:51 -04:00
Theodore Ts'o
021b65bb1e ext4: Rename ext4_free_blks_{count,set}() to refer to clusters
The field bg_free_blocks_count_{lo,high} in the block group
descriptor has been repurposed to hold the number of free clusters for
bigalloc functions.  So rename the functions so it makes it easier to
read and audit the block allocation and block freeing code.

Note: at this point in bigalloc development we doesn't support
online resize, so this also makes it really obvious all of the places
we need to fix up to add support for online resize.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 19:08:51 -04:00
Theodore Ts'o
6f16b60690 ext4: enable mounting bigalloc as read/write
Now that we have implemented all of the changes needed for bigalloc,
we can finally enable it!

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 19:06:51 -04:00
Aditya Kali
7b415bf60f ext4: Fix bigalloc quota accounting and i_blocks value
With bigalloc changes, the i_blocks value was not correctly set (it was still
set to number of blocks being used, but in case of bigalloc, we want i_blocks
to represent the number of clusters being used). Since the quota subsystem sets
the i_blocks value, this patch fixes the quota accounting and makes sure that
the i_blocks value is set correctly.

Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 19:04:51 -04:00
Theodore Ts'o
27baebb849 ext4: tune mballoc's default group prealloc size for bigalloc file systems
The default group preallocation size had been previously set to 512
blocks/clusters, regardless of the block/cluster size.  This is
probably too big for large cluster sizes.  So adjust the default so
that it is 2 megabytes or 32 clusters, whichever is larger.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 19:02:51 -04:00
Theodore Ts'o
f975d6bcc7 ext4: teach ext4_statfs() to deal with clusters if bigalloc is enabled
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 19:00:51 -04:00