Commit graph

12798 commits

Author SHA1 Message Date
Jens Axboe
78f707bfc7 block: revert part of 18ce3751cc
The above commit added WRITE_SYNC and switched various places to using
that for committing writes that will be waited upon immediately after
submission. However, this causes a performance regression with AS and CFQ
for ext3 at least, since sync_dirty_buffer() will submit some writes with
WRITE_SYNC while ext3 has sumitted others dependent writes without the sync
flag set. This causes excessive anticipation/idling in the IO scheduler
because sync and async writes get interleaved, causing a big performance
regression for the below test case (which is meant to simulate sqlite
like behaviour).

---- test case ----

int main(int argc, char **argv)
{

	int fdes, i;
	FILE *fp;
	struct timeval start;
	struct timeval end;
	struct timeval res;

	gettimeofday(&start, NULL);
	for (i=0; i<ROWS; i++) {
		fp = fopen("test_file", "a");
		fprintf(fp, "Some Text Data\n");
		fdes = fileno(fp);
		fsync(fdes);
		fclose(fp);
	}
	gettimeofday(&end, NULL);

	timersub(&end, &start, &res);
	fprintf(stdout, "time to write %d lines is %ld(msec)\n", ROWS,
			(res.tv_sec*1000000 + res.tv_usec)/1000);

	return 0;
}

-------------------

Thanks to Sean.White@APCC.com for tracking down this performance
regression and providing a test case.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-02-18 10:32:01 +01:00
Subhash Peddamallu
a60e78e57a fs/bio: bio_alloc_bioset: pass right object ptr to mempool_free
When freeing from bio pool use right ptr to account for bs->front_pad,
instead of bio ptr,

Signed-off-by: Subhash Peddamallu <subhash.peddamallu@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-02-18 10:32:01 +01:00
Linus Torvalds
48c0d9ece3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: hold trans_mutex when using btrfs_record_root_in_trans
  Btrfs: make a lockdep class for the extent buffer locks
  Btrfs: fs/btrfs/volumes.c: remove useless kzalloc
  Btrfs: remove unused code in split_state()
  Btrfs: remove btrfs_init_path
  Btrfs: balance_level checks !child after access
  Btrfs: Avoid using __GFP_HIGHMEM with slab allocator
  Btrfs: don't clean old snapshots on sync(1)
  Btrfs: use larger metadata clusters in ssd mode
  Btrfs: process mount options on mount -o remount,
  Btrfs: make sure all pending extent operations are complete
2009-02-17 14:19:14 -08:00
Linus Torvalds
3512a79dbc Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: Fix NULL dereference in ext4_ext_migrate()'s error handling
  ext4: Implement range_cyclic in ext4_da_writepages instead of write_cache_pages
  ext4: Initialize preallocation list_head's properly
  ext4: Fix lockdep warning
  ext4: Fix to read empty directory blocks correctly in 64k
  jbd2: Avoid possible NULL dereference in jbd2_journal_begin_ordered_truncate()
  Revert "ext4: wait on all pending commits in ext4_sync_fs()"
  jbd2: Fix return value of jbd2_journal_start_commit()
2009-02-17 14:05:05 -08:00
Al Viro
1a88b5364b Fix incomplete __mntput locking
Getting this wrong caused

	WARNING: at fs/namespace.c:636 mntput_no_expire+0xac/0xf2()

due to optimistically checking cpu_writer->mnt outside the spinlock.

Here's what we really want:
 * we know that nobody will set cpu_writer->mnt to mnt from now on
 * all changes to that sucker are done under cpu_writer->lock
 * we want the laziest equivalent of
	spin_lock(&cpu_writer->lock);
	if (likely(cpu_writer->mnt != mnt)) {
		spin_unlock(&cpu_writer->lock);
		continue;
	}
	/* do stuff */
  that would make sure we won't miss earlier setting of ->mnt done by
  another CPU.

Anyway, for now we just move the spin_lock() earlier and move the test
into the properly locked region.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reported-and-tested-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-17 14:02:08 -08:00
Patrick Ohly
d24fff22d8 net: pass new SIOCSHWTSTAMP through to device drivers
Signed-off-by: Patrick Ohly <patrick.ohly@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-02-15 22:43:38 -08:00
Dan Carpenter
090542641d ext4: Fix NULL dereference in ext4_ext_migrate()'s error handling
This was found through a code checker (http://repo.or.cz/w/smatch.git/). 
It looks like you might be able to trigger the error by trying to migrate 
a readonly file system.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-02-15 20:02:19 -05:00
David S. Miller
5e30589521 Merge branch 'master' of /home/davem/src/GIT/linux-2.6/
Conflicts:
	drivers/net/wireless/iwlwifi/iwl-agn.c
	drivers/net/wireless/iwlwifi/iwl3945-base.c
2009-02-14 23:12:00 -08:00
Aneesh Kumar K.V
2acf2c261b ext4: Implement range_cyclic in ext4_da_writepages instead of write_cache_pages
With delayed allocation we lock the page in write_cache_pages() and
try to build an in memory extent of contiguous blocks.  This is needed
so that we can get large contiguous blocks request.  If range_cyclic
mode is enabled, write_cache_pages() will loop back to the 0 index if
no I/O has been done yet, and try to start writing from the beginning
of the range.  That causes an attempt to take the page lock of lower
index page while holding the page lock of higher index page, which can
cause a dead lock with another writeback thread.

The solution is to implement the range_cyclic behavior in
ext4_da_writepages() instead.

http://bugzilla.kernel.org/show_bug.cgi?id=12579

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-02-14 10:42:58 -05:00
Aneesh Kumar K.V
d794bf8e09 ext4: Initialize preallocation list_head's properly
When creating a new ext4_prealloc_space structure, we have to
initialize its list_head pointers before we add them to any prealloc
lists.  Otherwise, with list debug enabled, we will get list
corruption warnings.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-02-14 10:31:16 -05:00
Andres Salomon
efab0b5d3e [JFFS2] force the jffs2 GC daemon to behave a bit better
I've noticed some pretty poor behavior on OLPC machines after bootup, when
gdm/X are starting.  The GCD monopolizes the scheduler (which in turns
means it gets to do more nand i/o), which results in processes taking much
much longer than they should to start.

As an example, on an OLPC machine going from OFW to a usable X (via
auto-login gdm) takes 2m 30s.  The majority of this time is consumed by
the switch into graphical mode.  With this patch, we cut a full 60s off of
bootup time.  After bootup, things are much snappier as well.

Note that we have seen a CRC node error with this patch that causes the machine
to fail to boot, but we've also seen that problem without this patch.

Signed-off-by: Andres Salomon <dilinger@debian.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2009-02-14 08:59:04 +00:00
Yan Zheng
2456242530 Btrfs: hold trans_mutex when using btrfs_record_root_in_trans
btrfs_record_root_in_trans needs the trans_mutex held to make sure two
callers don't race to setup the root in a given transaction.  This adds
it to all the places that were missing it.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2009-02-12 14:14:53 -05:00
Chris Mason
4008c04a07 Btrfs: make a lockdep class for the extent buffer locks
Btrfs is currently using spin_lock_nested with a nested value based
on the tree depth of the block.  But, this doesn't quite work because
the max tree depth is bigger than what spin_lock_nested can deal with,
and because locks are sometimes taken before the level field is filled in.

The solution here is to use lockdep_set_class_and_name instead, and to
set the class before unlocking the pages when the block is read from the
disk and just after init of a freshly allocated tree block.

btrfs_clear_path_blocking is also changed to take the locks in the proper
order, and it also makes sure all the locks currently held are properly
set to blocking before it tries to retake the spinlocks.  Otherwise, lockdep
gets upset about bad lock orderin.

The lockdep magic cam from Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-12 14:09:45 -05:00
Julia Lawall
3f3420df50 Btrfs: fs/btrfs/volumes.c: remove useless kzalloc
The call to kzalloc is followed by a kmalloc whose result is stored in the
same variable.

The semantic match that finds the problem is as follows:
(http://www.emn.fr/x-info/coccinelle/)

// <smpl>
@r exists@
local idexpression x;
statement S;
expression E;
identifier f,l;
position p1,p2;
expression *ptr != NULL;
@@

(
if ((x@p1 = \(kmalloc\|kzalloc\|kcalloc\)(...)) == NULL) S
|
x@p1 = \(kmalloc\|kzalloc\|kcalloc\)(...);
...
if (x == NULL) S
)
<... when != x
     when != if (...) { <+...x...+> }
x->f = E
...>
(
 return \(0\|<+...x...+>\|ptr\);
|
 return@p2 ...;
)

@script:python@
p1 << r.p1;
p2 << r.p2;
@@

print "* file: %s kmalloc %s return %s" % (p1[0].file,p1[0].line,p2[0].line)
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-12 10:16:03 -05:00
Qinghuang Feng
a48ddf08ba Btrfs: remove unused code in split_state()
These two lines are not used, remove them.

Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-12 14:25:23 -05:00
Jeff Mahoney
e00f730865 Btrfs: remove btrfs_init_path
btrfs_init_path was initially used when the path objects were on the
stack.  Now all the work is done by btrfs_alloc_path and btrfs_init_path
isn't required.

This patch removes it, and just uses kmem_cache_zalloc to zero out the object.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-12 14:11:25 -05:00
Jeff Mahoney
7951f3cefb Btrfs: balance_level checks !child after access
The BUG_ON() is in the wrong spot.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-12 10:06:15 -05:00
Yan Zheng
b335b0034e Btrfs: Avoid using __GFP_HIGHMEM with slab allocator
btrfs_releasepage may call kmem_cache_alloc indirectly,
and provide same GFP flags it gets to kmem_cache_alloc.
So it's possible to use __GFP_HIGHMEM with the slab
allocator.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2009-02-12 10:06:04 -05:00
Chris Mason
e1df36d2f1 Btrfs: don't clean old snapshots on sync(1)
Cleaning old snapshots can make sync(1) somewhat slow, and some users
and applications still use it in a global fsync kind of workload.

This patch changes btrfs not to clean old snapshots during sync, which is
safe from a FS consistency point of view.  The major downside is that it
makes it difficult to tell when old snapshots have been reaped and
the space they were using has been reclaimed.  A new ioctl will be added
for this purpose instead.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-12 09:45:08 -05:00
Chris Mason
536ac8ae86 Btrfs: use larger metadata clusters in ssd mode
Larger metadata clusters can significantly improve writeback performance
on ssd drives with large erasure blocks.  The larger clusters make it
more likely a given IO will completely overwrite the ssd block, so it
doesn't have to do an internal rwm cycle.

On spinning media, lager metadata clusters end up spreading out the
metadata more over time, which makes fsck slower, so we don't want this
to be the default.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-12 09:41:38 -05:00
Chris Mason
b288052e17 Btrfs: process mount options on mount -o remount,
Btrfs wasn't parsing any new mount options during remount, making it
difficult to set mount options on a root drive.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-12 09:37:35 -05:00
Josef Bacik
eb09967089 Btrfs: make sure all pending extent operations are complete
Theres a slight problem with finish_current_insert, if we set all to 1 and then
go through and don't actually skip any of the extents on the pending list, we
could exit right after we've added new extents.

This is a problem because by inserting the new extents we could have gotten new
COW's to happen and such, so we may have some pending updates to do or even
more inserts to do after that.

So this patch will only exit if we have never skipped any of the extents in the
pending list, and we have no extents to insert, this will make sure that all of
the pending work is truly done before we return.  I've been running with this
patch for a few days with all of my other testing and have not seen issues.
Thanks,

Signed-off-by: Josef Bacik <jbacik@redhat.com>
2009-02-12 09:27:38 -05:00
Carsten Otte
0e4a9b5928 ext2/xip: refuse to change xip flag during remount with busy inodes
For a reason that I was unable to understand in three months of debugging,
mount ext2 -o remount stopped working properly when remounting from
regular operation to xip, or the other way around.  According to a git
bisect search, the problem was introduced with the VM_MIXEDMAP/PTE_SPECIAL
rework in the vm:

commit 70688e4dd1
Author: Nick Piggin <npiggin@suse.de>
Date:   Mon Apr 28 02:13:02 2008 -0700

    xip: support non-struct page backed memory

In the failing scenario, the filesystem is mounted read only via root=
kernel parameter on s390x.  During remount (in rc.sysinit), the inodes of
the bash binary and its libraries are busy and cannot be invalidated (the
bash which is running rc.sysinit resides on subject filesystem).
Afterwards, another bash process (running ifup-eth) recurses into a
subshell, runs dup_mm (via fork).  Some of the mappings in this bash
process were created from inodes that could not be invalidated during
remount.

Both parent and child process crash some time later due to inconsistencies
in their address spaces.  The issue seems to be timing sensitive, various
attempts to recreate it have failed.

This patch refuses to change the xip flag during remount in case some
inodes cannot be invalidated.  This patch keeps users from running into
that issue.

[akpm@linux-foundation.org: cleanup]
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Jared Hulbert <jaredeh@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-11 14:25:36 -08:00
Jan Kara
02ac597c9b ext3: revert "ext3: wait on all pending commits in ext3_sync_fs"
This reverts commit c87591b719.

Since journal_start_commit() is now fixed to return 1 when we started a
transaction commit, there's some transaction waiting to be committed or
there's a transaction already committing, we don't need to call
ext3_force_commit() in ext3_sync_fs().  Furthermore ext3_force_commit()
can unnecessarily create sync transaction which is expensive so it's
worthwhile to remove it when we can.

Cc: Eric Sandeen <sandeen@redhat.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-11 14:25:35 -08:00
Jan Kara
8fe4cd0dc5 jbd: fix return value of journal_start_commit()
journal_start_commit() returns 1 if either a transaction is committing or
the function has queued a transaction commit.  But it returns 0 if we
raced with somebody queueing the transaction commit as well.  This
resulted in ext3_sync_fs() not functioning correctly (description from
Arthur Jones): In the case of a data=ordered umount with pending long
symlinks which are delayed due to a long list of other I/O on the backing
block device, this causes the buffer associated with the long symlinks to
not be moved to the inode dirty list in the second phase of fsync_super.
Then, before they can be dirtied again, kjournald exits, seeing the UMOUNT
flag and the dirty pages are never written to the backing block device,
causing long symlink corruption and exposing new or previously freed block
data to userspace.

This can be reproduced with a script created by Eric Sandeen
<sandeen@redhat.com>:

        #!/bin/bash

        umount /mnt/test2
        mount /dev/sdb4 /mnt/test2
        rm -f /mnt/test2/*
        dd if=/dev/zero of=/mnt/test2/bigfile bs=1M count=512
        touch /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename
        ln -s /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename
        /mnt/test2/link
        umount /mnt/test2
        mount /dev/sdb4 /mnt/test2
        ls /mnt/test2/

This patch fixes journal_start_commit() to always return 1 when there's
a transaction committing or queued for commit.

Cc: Eric Sandeen <sandeen@redhat.com>
Cc: Mike Snitzer <snitzer@gmail.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-11 14:25:35 -08:00
Mel Gorman
5a6fe12595 Do not account for the address space used by hugetlbfs using VM_ACCOUNT
When overcommit is disabled, the core VM accounts for pages used by anonymous
shared, private mappings and special mappings. It keeps track of VMAs that
should be accounted for with VM_ACCOUNT and VMAs that never had a reserve
with VM_NORESERVE.

Overcommit for hugetlbfs is much riskier than overcommit for base pages
due to contiguity requirements. It avoids overcommiting on both shared and
private mappings using reservation counters that are checked and updated
during mmap(). This ensures (within limits) that hugepages exist in the
future when faults occurs or it is too easy to applications to be SIGKILLed.

As hugetlbfs makes its own reservations of a different unit to the base page
size, VM_ACCOUNT should never be set. Even if the units were correct, we would
double account for the usage in the core VM and hugetlbfs. VM_NORESERVE may
be set because an application can request no reserves be made for hugetlbfs
at the risk of getting killed later.

With commit fc8744adc8, VM_NORESERVE and
VM_ACCOUNT are getting unconditionally set for hugetlbfs-backed mappings. This
breaks the accounting for both the core VM and hugetlbfs, can trigger an
OOM storm when hugepage pools are too small lockups and corrupted counters
otherwise are used. This patch brings hugetlbfs more in line with how the
core VM treats VM_NORESERVE but prevents VM_ACCOUNT being set.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-10 10:48:42 -08:00
Aneesh Kumar K.V
ba4439165f ext4: Fix lockdep warning
We should not call ext4_mb_add_n_trim while holding alloc_semp.

    =============================================
    [ INFO: possible recursive locking detected ]
    2.6.29-rc4-git1-dirty #124
    ---------------------------------------------
    ffsb/3116 is trying to acquire lock:
     (&meta_group_info[i]->alloc_sem){----}, at: [<ffffffff8035a6e8>]
     ext4_mb_load_buddy+0xd2/0x343

    but task is already holding lock:
     (&meta_group_info[i]->alloc_sem){----}, at: [<ffffffff8035a6e8>]
     ext4_mb_load_buddy+0xd2/0x343

http://bugzilla.kernel.org/show_bug.cgi?id=12672

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-02-10 11:14:34 -05:00
Wei Yongjun
7be2baaa03 ext4: Fix to read empty directory blocks correctly in 64k
The rec_len field in the directory entry is 16 bits, so there was a
problem representing rec_len for filesystems with a 64k block size in
the case where the directory entry takes the entire 64k block.
Unfortunately, there were two schemes that were proposed; one where
all zeros meant 65536 and one where all ones (65535) meant 65536.
E2fsprogs used 0, whereas the kernel used 65535.  Oops.  Fortunately
this case happens extremely rarely, with the most common case being
the lost+found directory, created by mke2fs.

So we will be liberal in what we accept, and accept both encodings,
but we will continue to encode 65536 as 65535.  This will require a
change in e2fsprogs, but with fortunately ext4 filesystems normally
have the dir_index feature enabled, which precludes having a
completely empty directory block.

Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-02-10 09:53:42 -05:00
Jan Kara
7f5aa21508 jbd2: Avoid possible NULL dereference in jbd2_journal_begin_ordered_truncate()
If we race with commit code setting i_transaction to NULL, we could
possibly dereference it.  Proper locking requires the journal pointer
(to access journal->j_list_lock), which we don't have.  So we have to
change the prototype of the function so that filesystem passes us the
journal pointer.  Also add a more detailed comment about why the
function jbd2_journal_begin_ordered_truncate() does what it does and
how it should be used.

Thanks to Dan Carpenter <error27@gmail.com> for pointing to the
suspitious code.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Acked-by: Joel Becker <joel.becker@oracle.com>
CC: linux-ext4@vger.kernel.org
CC: ocfs2-devel@oss.oracle.com
CC: mfasheh@suse.de
CC: Dan Carpenter <error27@gmail.com>
2009-02-10 11:15:34 -05:00
Jan Kara
9eddacf9e9 Revert "ext4: wait on all pending commits in ext4_sync_fs()"
This undoes commit 14ce0cb411.

Since jbd2_journal_start_commit() is now fixed to return 1 when we
started a transaction commit, there's some transaction waiting to be
committed or there's a transaction already committing, we don't
need to call ext4_force_commit() in ext4_sync_fs(). Furthermore
ext4_force_commit() can unnecessarily create sync transaction which is
expensive so it's worthwhile to remove it when we can.

http://bugzilla.kernel.org/show_bug.cgi?id=12224

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Eric Sandeen <sandeen@redhat.com>
Cc: linux-ext4@vger.kernel.org
2009-02-10 06:46:05 -05:00
Jan Kara
c88ccea314 jbd2: Fix return value of jbd2_journal_start_commit()
The function jbd2_journal_start_commit() returns 1 if either a
transaction is committing or the function has queued a transaction
commit. But it returns 0 if we raced with somebody queueing the
transaction commit as well. This resulted in ext4_sync_fs() not
functioning correctly (description from Arthur Jones): 

   In the case of a data=ordered umount with pending long symlinks
   which are delayed due to a long list of other I/O on the backing
   block device, this causes the buffer associated with the long
   symlinks to not be moved to the inode dirty list in the second
   phase of fsync_super.  Then, before they can be dirtied again,
   kjournald exits, seeing the UMOUNT flag and the dirty pages are
   never written to the backing block device, causing long symlink
   corruption and exposing new or previously freed block data to
   userspace.

This can be reproduced with a script created by Eric Sandeen
<sandeen@redhat.com>:

        #!/bin/bash

        umount /mnt/test2
        mount /dev/sdb4 /mnt/test2
        rm -f /mnt/test2/*
        dd if=/dev/zero of=/mnt/test2/bigfile bs=1M count=512
        touch /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename
        ln -s /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename
        /mnt/test2/link
        umount /mnt/test2
        mount /dev/sdb4 /mnt/test2
        ls /mnt/test2/

This patch fixes jbd2_journal_start_commit() to always return 1 when
there's a transaction committing or queued for commit.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
CC: Eric Sandeen <sandeen@redhat.com>
CC: linux-ext4@vger.kernel.org
2009-02-10 11:27:46 -05:00
Linus Torvalds
4c098bcd55 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: don't use spin_is_contended
2009-02-09 14:00:16 -08:00
Chris Mason
284b066af4 Btrfs: don't use spin_is_contended
Btrfs was using spin_is_contended to see if it should drop locks before
doing extent allocations during btrfs_search_slot.  The idea was to avoid
expensive searches in the tree unless the lock was actually contended.

But, spin_is_contended is specific to the ticket spinlocks on x86, so this
is causing compile errors everywhere else.

In practice, the contention could easily appear some time after we started
doing the extent allocation, and it makes more sense to always drop the lock
instead.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-09 16:22:03 -05:00
Linus Torvalds
896abeb743 Merge branch 'for-2.6.29' of git://linux-nfs.org/~bfields/linux
* 'for-2.6.29' of git://linux-nfs.org/~bfields/linux:
  lockd: fix regression in lockd's handling of blocked locks
2009-02-09 10:30:19 -08:00
J. Bruce Fields
9d9b87c121 lockd: fix regression in lockd's handling of blocked locks
If a client requests a blocking lock, is denied, then requests it again,
then here in nlmsvc_lock() we will call vfs_lock_file() without FL_SLEEP
set, because we've already queued a block and don't need the locks code
to do it again.

But that means vfs_lock_file() will return -EAGAIN instead of
FILE_LOCK_DENIED.  So we still need to translate that -EAGAIN return
into a nlm_lck_blocked error in this case, and put ourselves back on
lockd's block list.

The bug was introduced by bde74e4bc6 "locks: add special return
value for asynchronous locks".

Thanks to Frank van Maarseveen for the report; his original test
case was essentially

	for i in `seq 30`; do flock /nfsmount/foo sleep 10 & done

Tested-by: Frank van Maarseveen <frankvm@frankvm.com>
Reported-by: Frank van Maarseveen <frankvm@frankvm.com>
Cc: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-02-09 13:19:46 -05:00
Cornelia Huck
766ccb9ed4 async: Rename _special -> _domain for clarity.
Rename the async_*_special() functions to async_*_domain(), which
describes the purpose of these functions much better.
[Broke up long lines to silence checkpatch]

Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
2009-02-08 09:56:11 -08:00
Linus Torvalds
ccfef64621 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6:
  CRED: Fix SUID exec regression
2009-02-06 18:52:55 -08:00
Linus Torvalds
ae1a25da84 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (37 commits)
  Btrfs: Make sure dir is non-null before doing S_ISGID checks
  Btrfs: Fix memory leak in cache_drop_leaf_ref
  Btrfs: don't return congestion in write_cache_pages as often
  Btrfs: Only prep for btree deletion balances when nodes are mostly empty
  Btrfs: fix btrfs_unlock_up_safe to walk the entire path
  Btrfs: change btrfs_del_leaf to drop locks earlier
  Btrfs: Change btrfs_truncate_inode_items to stop when it hits the inode
  Btrfs: Don't try to compress pages past i_size
  Btrfs: join the transaction in __btrfs_setxattr
  Btrfs: Handle SGID bit when creating inodes
  Btrfs: Make btrfs_drop_snapshot work in larger and more efficient chunks
  Btrfs: Change btree locking to use explicit blocking points
  Btrfs: hash_lock is no longer needed
  Btrfs: disable leak debugging checks in extent_io.c
  Btrfs: sort references by byte number during btrfs_inc_ref
  Btrfs: async threads should try harder to find work
  Btrfs: selinux support
  Btrfs: make btrfs acls selectable
  Btrfs: Catch missed bios in the async bio submission thread
  Btrfs: fix readdir on 32 bit machines
  ...
2009-02-06 18:37:22 -08:00
Tyler Hicks
fd9fc842bb eCryptfs: Regression in unencrypted filename symlinks
The addition of filename encryption caused a regression in unencrypted
filename symlink support.  ecryptfs_copy_filename() is used when dealing
with unencrypted filenames and it reported that the new, copied filename
was a character longer than it should have been.

This caused the return value of readlink() to count the NULL byte of the
symlink target.  Most applications don't care about the extra NULL byte,
but a version control system (bzr) helped in discovering the bug.

Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-06 18:36:40 -08:00
Linus Torvalds
1d87b0d388 Merge branch 'to-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/frob/linux-2.6-roland
* 'to-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/frob/linux-2.6-roland:
  elf core dump: fix get_user use
2009-02-06 18:10:04 -08:00
Roland McGrath
92dc07b1f9 elf core dump: fix get_user use
The elf_core_dump() code does its work with set_fs(KERNEL_DS) in force,
so vma_dump_size() needs to switch back with set_fs(USER_DS) to safely
use get_user() for a normal user-space address.

Checking for VM_READ optimizes out the case where get_user() would fail
anyway.  The vm_file check here was already superfluous given the control
flow earlier in the function, so that is a cleanup/optimization unrelated
to other changes but an obvious and trivial one.

Reported-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Roland McGrath <roland@redhat.com>
2009-02-06 17:34:07 -08:00
David Howells
0bf2f3aec5 CRED: Fix SUID exec regression
The patch:

	commit a6f76f23d2
	CRED: Make execve() take advantage of copy-on-write credentials

moved the place in which the 'safeness' of a SUID/SGID exec was performed to
before de_thread() was called.  This means that LSM_UNSAFE_SHARE is now
calculated incorrectly.  This flag is set if any of the usage counts for
fs_struct, files_struct and sighand_struct are greater than 1 at the time the
determination is made.  All of which are true for threads created by the
pthread library.

However, since we wish to make the security calculation before irrevocably
damaging the process so that we can return it an error code in the case where
we decide we want to reject the exec request on this basis, we have to make the
determination before calling de_thread().

So, instead, we count up the number of threads (CLONE_THREAD) that are sharing
our fs_struct (CLONE_FS), files_struct (CLONE_FILES) and sighand_structs
(CLONE_SIGHAND/CLONE_THREAD) with us.  These will be killed by de_thread() and
so can be discounted by check_unsafe_exec().

We do have to be careful because CLONE_THREAD does not imply FS or FILES.

We _assume_ that there will be no extra references to these structs held by the
threads we're going to kill.

This can be tested with the attached pair of programs.  Build the two programs
using the Makefile supplied, and run ./test1 as a non-root user.  If
successful, you should see something like:

	[dhowells@andromeda tmp]$ ./test1
	--TEST1--
	uid=4043, euid=4043 suid=4043
	exec ./test2
	--TEST2--
	uid=4043, euid=0 suid=0
	SUCCESS - Correct effective user ID

and if unsuccessful, something like:

	[dhowells@andromeda tmp]$ ./test1
	--TEST1--
	uid=4043, euid=4043 suid=4043
	exec ./test2
	--TEST2--
	uid=4043, euid=4043 suid=4043
	ERROR - Incorrect effective user ID!

The non-root user ID you see will depend on the user you run as.

[test1.c]
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>

static void *thread_func(void *arg)
{
	while (1) {}
}

int main(int argc, char **argv)
{
	pthread_t tid;
	uid_t uid, euid, suid;

	printf("--TEST1--\n");
	getresuid(&uid, &euid, &suid);
	printf("uid=%d, euid=%d suid=%d\n", uid, euid, suid);

	if (pthread_create(&tid, NULL, thread_func, NULL) < 0) {
		perror("pthread_create");
		exit(1);
	}

	printf("exec ./test2\n");
	execlp("./test2", "test2", NULL);
	perror("./test2");
	_exit(1);
}

[test2.c]
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	uid_t uid, euid, suid;

	getresuid(&uid, &euid, &suid);
	printf("--TEST2--\n");
	printf("uid=%d, euid=%d suid=%d\n", uid, euid, suid);

	if (euid != 0) {
		fprintf(stderr, "ERROR - Incorrect effective user ID!\n");
		exit(1);
	}
	printf("SUCCESS - Correct effective user ID\n");
	exit(0);
}

[Makefile]
CFLAGS = -D_GNU_SOURCE -Wall -Werror -Wunused
all: test1 test2

test1: test1.c
	gcc $(CFLAGS) -o test1 test1.c -lpthread

test2: test2.c
	gcc $(CFLAGS) -o test2 test2.c
	sudo chown root.root test2
	sudo chmod +s test2

Reported-by: David Smith <dsmith@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: David Smith <dsmith@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
2009-02-07 08:46:18 +11:00
Dave Kleikamp
d4cf109f05 vfs: Don't call attach_nobh_buffers() with an empty list
This is a modification of a patch by Bill Pemberton <wfp5p@virginia.edu>

nobh_write_end() could call attach_nobh_buffers() with head == NULL.
This would result in a trap when attach_nobh_buffers() attempted to
access bh->b_this_page.

This can be illustrated by running the writev01 testcase from LTP on jfs.

This error was introduced by commit 5b41e74a "vfs: fix data leak in
nobh_write_end()".  That patch did not take into account that if
PageMappedToDisk() is true upon entry to nobh_write_begin(), then no
buffers will be allocated for the page.  In that case, we won't have to
worry about a failed write leaving unitialized data in the page.

Of course, head != NULL implies !page_has_buffers(page), so no need to
test both.

Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: Bill Pemberton <wfp5p@virginia.edu>
Cc: Dmitri Monakhov <dmonakhov@openvz.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-06 13:34:22 -08:00
Chris Mason
42f15d77df Btrfs: Make sure dir is non-null before doing S_ISGID checks
The S_ISGID check in btrfs_new_inode caused an oops during subvol creation
because sometimes the dir is null.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-06 11:35:57 -05:00
Pablo Neira Ayuso
ff491a7334 netlink: change return-value logic of netlink_broadcast()
Currently, netlink_broadcast() reports errors to the caller if no
messages at all were delivered:

1) If, at least, one message has been delivered correctly, returns 0.
2) Otherwise, if no messages at all were delivered due to skb_clone()
   failure, return -ENOBUFS.
3) Otherwise, if there are no listeners, return -ESRCH.

With this patch, the caller knows if the delivery of any of the
messages to the listeners have failed:

1) If it fails to deliver any message (for whatever reason), return
   -ENOBUFS.
2) Otherwise, if all messages were delivered OK, returns 0.
3) Otherwise, if no listeners, return -ESRCH.

In the current ctnetlink code and in Netfilter in general, we can add
reliable logging and connection tracking event delivery by dropping the
packets whose events were not successfully delivered over Netlink. Of
course, this option would be settable via /proc as this approach reduces
performance (in terms of filtered connections per seconds by a stateful
firewall) but providing reliable logging and event delivery (for
conntrackd) in return.

This patch also changes some clients of netlink_broadcast() that
may report ENOBUFS errors via printk. This error handling is not
of any help. Instead, the userspace daemons that are listening to
those netlink messages should resync themselves with the kernel-side
if they hit ENOBUFS.

BTW, netlink_broadcast() clients include those that call
cn_netlink_send(), nlmsg_multicast() and genlmsg_multicast() since they
internally call netlink_broadcast() and return its error value.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-02-05 23:56:36 -08:00
Herbert Xu
33dccbb050 tun: Limit amount of queued packets per device
Unlike a normal socket path, the tuntap device send path does
not have any accounting.  This means that the user-space sender
may be able to pin down arbitrary amounts of kernel memory by
continuing to send data to an end-point that is congested.

Even when this isn't an issue because of limited queueing at
most end points, this can also be a problem because its only
response to congestion is packet loss.  That is, when those
local queues at the end-point fills up, the tuntap device will
start wasting system time because it will continue to send
data there which simply gets dropped straight away.

Of course one could argue that everybody should do congestion
control end-to-end, unfortunately there are people in this world
still hooked on UDP, and they don't appear to be going away
anywhere fast.  In fact, we've always helped them by performing
accounting in our UDP code, the sole purpose of which is to
provide congestion feedback other than through packet loss.

This patch attempts to apply the same bandaid to the tuntap device.
It creates a pseudo-socket object which is used to account our
packets just as a normal socket does for UDP.  Of course things
are a little complex because we're actually reinjecting traffic
back into the stack rather than out of the stack.

The stack complexities however should have been resolved by preceding
patches.  So this one can simply start using skb_set_owner_w.

For now the accounting is essentially disabled by default for
backwards compatibility.  In particular, we set the cap to INT_MAX.
This is so that existing applications don't get confused by the
sudden arrival EAGAIN errors.

In future we may wish (or be forced to) do this by default.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-02-05 21:25:32 -08:00
Al Viro
767b5828ad braino in sg_ioctl_trans()
... and yes, gcc is insane enough to eat that without complaint.
We probably want sparse to scream on those...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-05 16:35:52 -08:00
Linus Torvalds
082256333f Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2:
  Revert "configfs: Silence lockdep on mkdir(), rmdir() and configfs_depend_item()"
2009-02-05 16:12:38 -08:00
Alexey Dobriyan
f01d1d546a seq_file: fix big-enough lseek() + read()
lseek() further than length of the file will leave stale ->index
(second-to-last during iteration). Next seq_read() will not notice
that ->f_pos is big enough to return 0, but will print last item
as if ->f_pos is pointing to it.

Introduced in commit cb510b8172
aka "seq_file: more atomicity in traverse()".

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-05 14:18:14 -08:00
Eric Biederman
33da8892a2 seq_file: move traverse so it can be used from seq_read
In 2.6.25 some /proc files were converted to use the seq_file
infrastructure.  But seq_files do not correctly support pread(), which
broke some usersapce applications.

To handle pread correctly we can't assume that f_pos is where we left it
in seq_read.  So move traverse() so that we can eventually use it in
seq_read and do thus some day support pread().

Signed-off-by: Eric Biederman <ebiederm@xmission.com>
Cc: Paul Turner <pjt@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-05 12:56:49 -08:00
Chris Mason
806638bce9 Btrfs: Fix memory leak in cache_drop_leaf_ref
The code wasn't doing a kfree on the sorted array

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-05 09:08:14 -05:00
Mark Fasheh
436443f0f7 Revert "configfs: Silence lockdep on mkdir(), rmdir() and configfs_depend_item()"
This reverts commit 0e0333429a.

I committed this by accident - Joel and Louis are working with the lockdep
maintainer to provide a better solution than just turning lockdep off.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Acked-by: <Joel Becker <joel.becker@oracle.com>
2009-02-04 09:46:25 -08:00
Chris Mason
9b0d3ace33 Btrfs: don't return congestion in write_cache_pages as often
On fast devices that go from congested to uncongested very quickly, pdflush
is waiting too often in congestion_wait, and the FS is backing off to
easily in write_cache_pages.

For now, fix this on the btrfs side by only checking congestion after
some bios have already gone down.  Longer term a real fix is needed
for pdflush, but that is a larger project.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 09:33:00 -05:00
Chris Mason
7b78c170dc Btrfs: Only prep for btree deletion balances when nodes are mostly empty
Whenever an item deletion is done, we need to balance all the nodes
in the tree to make sure we don't end up with an empty node if a pointer
is deleted.  This balance prep happens from the root of the tree down
so we can drop our locks as we go.

reada_for_balance was triggering read-ahead on neighboring nodes even
when no balancing was required.  This adds an extra check to avoid
calling balance_level() and avoid reada_for_balance() when a balance
won't be required.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 09:12:46 -05:00
Chris Mason
12f4daccfc Btrfs: fix btrfs_unlock_up_safe to walk the entire path
btrfs_unlock_up_safe would break out at the first NULL node entry or
unlocked node it found in the path.

Some of the callers have missing nodes at the lower levels of the path, so this
commit fixes things to check all the nodes in the path before returning.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 09:31:42 -05:00
Chris Mason
4d081c41a4 Btrfs: change btrfs_del_leaf to drop locks earlier
btrfs_del_leaf does two things.  First it removes the pointer in the
parent, and then it frees the block that has the leaf.  It has the
parent node locked for both operations.

But, it only needs the parent locked while it is deleting the pointer.
After that it can safely free the block without the parent locked.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 09:31:28 -05:00
Chris Mason
06d9a8d7c2 Btrfs: Change btrfs_truncate_inode_items to stop when it hits the inode
btrfs_truncate_inode_items is setup to stop doing btree searches when
it has finished removing the items for the inode.  It used to detect the
end of the inode by looking for an objectid that didn't match the
one we were searching for.

But, this would result in an extra search through the btree, which
adds extra balancing and cow costs to the operation.

This commit adds a check to see if we found the inode item, which means
we can stop searching early.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 09:30:58 -05:00
Chris Mason
f03d9301f1 Btrfs: Don't try to compress pages past i_size
The compression code had some checks to make sure we were only
compressing bytes inside of i_size, but it wasn't catching every
case.  To make things worse, some incorrect math about the number
of bytes remaining would make it try to compress more pages than the
file really had.

The fix used here is to fall back to the non-compression code in this
case, which does all the proper cleanup of delalloc and other accounting.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 09:31:06 -05:00
Josef Bacik
811449496b Btrfs: join the transaction in __btrfs_setxattr
With selinux on we end up calling __btrfs_setxattr when we create an inode,
which calls btrfs_start_transaction().  The problem is we've already called
that in btrfs_new_inode, and in btrfs_start_transaction we end up doing a
wait_current_trans().  If btrfs-transaction has started committing it will wait
for all handles to finish, while the other process is waiting for the
transaction to commit.  This is fixed by using btrfs_join_transaction, which
won't wait for the transaction to commit.  Thanks,

Signed-off-by: Josef Bacik <jbacik@redhat.com>
2009-02-04 09:18:33 -05:00
Chris Ball
8c087b5183 Btrfs: Handle SGID bit when creating inodes
Before this patch, new files/dirs would ignore the SGID bit on their
parent directory and always be owned by the creating user's uid/gid.

Signed-off-by: Chris Ball <cjb@laptop.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 09:29:54 -05:00
Chris Mason
bd56b30205 Btrfs: Make btrfs_drop_snapshot work in larger and more efficient chunks
Every transaction in btrfs creates a new snapshot, and then schedules the
snapshot from the last transaction for deletion.  Snapshot deletion
works by walking down the btree and dropping the reference counts
on each btree block during the walk.

If if a given leaf or node has a reference count greater than one,
the reference count is decremented and the subtree pointed to by that
node is ignored.

If the reference count is one, walking continues down into that node
or leaf, and the references of everything it points to are decremented.

The old code would try to work in small pieces, walking down the tree
until it found the lowest leaf or node to free and then returning.  This
was very friendly to the rest of the FS because it didn't have a huge
impact on other operations.

But it wouldn't always keep up with the rate that new commits added new
snapshots for deletion, and it wasn't very optimal for the extent
allocation tree because it wasn't finding leaves that were close together
on disk and processing them at the same time.

This changes things to walk down to a level 1 node and then process it
in bulk.  All the leaf pointers are sorted and the leaves are dropped
in order based on their extent number.

The extent allocation tree and commit code are now fast enough for
this kind of bulk processing to work without slowing the rest of the FS
down.  Overall it does less IO and is better able to keep up with
snapshot deletions under high load.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 09:27:02 -05:00
Chris Mason
b4ce94de9b Btrfs: Change btree locking to use explicit blocking points
Most of the btrfs metadata operations can be protected by a spinlock,
but some operations still need to schedule.

So far, btrfs has been using a mutex along with a trylock loop,
most of the time it is able to avoid going for the full mutex, so
the trylock loop is a big performance gain.

This commit is step one for getting rid of the blocking locks entirely.
btrfs_tree_lock takes a spinlock, and the code explicitly switches
to a blocking lock when it starts an operation that can schedule.

We'll be able get rid of the blocking locks in smaller pieces over time.
Tracing allows us to find the most common cause of blocking, so we
can start with the hot spots first.

The basic idea is:

btrfs_tree_lock() returns with the spin lock held

btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
the extent buffer flags, and then drops the spin lock.  The buffer is
still considered locked by all of the btrfs code.

If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
the spin lock and waits on a wait queue for the blocking bit to go away.

Much of the code that needs to set the blocking bit finishes without actually
blocking a good percentage of the time.  So, an adaptive spin is still
used against the blocking bit to avoid very high context switch rates.

btrfs_clear_lock_blocking() clears the blocking bit and returns
with the spinlock held again.

btrfs_tree_unlock() can be called on either blocking or spinning locks,
it does the right thing based on the blocking bit.

ctree.c has a helper function to set/clear all the locked buffers in a
path as blocking.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 09:25:08 -05:00
Chris Mason
c487685d7c Btrfs: hash_lock is no longer needed
Before metadata is written to disk, it is updated to reflect that writeout
has begun.  Once this update is done, the block must be cow'd before it
can be modified again.

This update was originally synchronized by using a per-fs spinlock.  Today
the buffers for the metadata blocks are locked before writeout begins,
and everyone that tests the flag has the buffer locked as well.

So, the per-fs spinlock (called hash_lock for no good reason) is no
longer required.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 09:24:25 -05:00
Chris Mason
3935127c50 Btrfs: disable leak debugging checks in extent_io.c
extent_io.c has debugging code to report and free leaked extent_state
and extent_buffer objects at rmmod time.  This helps track down
leaks and it saves you from rebooting just to properly remove the
kmem_cache object.

But, the code runs under a fairly expensive spinlock and the checks to
see if it is currently enabled are not entirely consistent.  Some use
#ifdef and some #if.

This changes everything to #if and disables the leak checking.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 09:24:05 -05:00
Chris Mason
b7a9f29fcf Btrfs: sort references by byte number during btrfs_inc_ref
When a block goes through cow, we update the reference counts of
everything that block points to.  The internal pointers of the block
can be in just about any order, and it is likely to have clusters of
things that are close together and clusters of things that are not.

To help reduce the seeks that come with updating all of these reference
counts, sort them by byte number before actual updates are done.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 09:23:45 -05:00
Chris Mason
b51912c91f Btrfs: async threads should try harder to find work
Tracing shows the delay between when an async thread goes to sleep
and when more work is added is often very short.  This commit adds
a little bit of delay and extra checking to the code right before
we schedule out.

It allows more work to be added to the worker
without requiring notifications from other procs.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 09:23:24 -05:00
Jim Owens
0279b4cd86 Btrfs: selinux support
Add call to LSM security initialization and save
resulting security xattr for new inodes.

Add xattr support to symlink inode ops.

Set inode->i_op for existing special files.

Signed-off-by: jim owens <jowens@hp.com>
2009-02-04 09:29:13 -05:00
Christian Hesse
bef62ef339 Btrfs: make btrfs acls selectable
This patch adds a menu entry to kconfig to enable acls for btrfs.
This allows you to enable FS_POSIX_ACL at kernel compile time.

(updated by Jeff Mahoney to make the changes in fs/btrfs/Kconfig instead)

Signed-off-by: Christian Hesse <mail@earthworm.de>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
2009-02-04 09:28:28 -05:00
Chris Mason
a683705153 Btrfs: Catch missed bios in the async bio submission thread
The async bio submission thread was missing some bios that were
added after it had decided there was no work left to do.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 09:19:41 -05:00
Linus Torvalds
f96c08e8c5 Merge branch 'linux-next' of git://git.infradead.org/ubifs-2.6
* 'linux-next' of git://git.infradead.org/ubifs-2.6:
  UBIFS: remove fast unmounting
  UBIFS: return sensible error codes
  UBIFS: remount ro fixes
  UBIFS: spelling fix 'date' -> 'data'
  UBIFS: sync wbufs after syncing inodes and pages
  UBIFS: fix LPT out-of-space bug (again)
  UBIFS: fix no_chk_data_crc
  UBIFS: fix assertions
  UBIFS: ensure orphan area head is initialized
  UBIFS: always clean up GC LEB space
  UBIFS: add re-mount debugging checks
  UBIFS: fix LEB list freeing
  UBIFS: simplify locking
  UBIFS: document dark_wm and dead_wm better
  UBIFS: do not treat all data as short term
  UBIFS: constify operations
  UBIFS: do not commit twice
2009-02-03 16:52:44 -08:00
Linus Torvalds
3e1c400513 Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2:
  ocfs2: add quota call to ocfs2_remove_btree_range()
  ocfs2: Wakeup the downconvert thread after a successful cancel convert
  ocfs2: Access the xattr bucket only before modifying it.
  configfs: Silence lockdep on mkdir(), rmdir() and configfs_depend_item()
  ocfs2: Fix possible deadlock in ocfs2_write_dquot()
  ocfs2: Push out dropping of dentry lock to ocfs2_wq
2009-02-03 16:50:20 -08:00
Felix Blyakher
43f3f057c5 [XFS] Warn on transaction in flight on read-only remount
Till VFS can correctly support read-only remount without racing,
use WARN_ON instead of BUG_ON on detecting transaction in flight
after quiescing filesystem.

Signed-off-by: Felix Blyakher <felixb@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2009-02-03 11:04:54 -06:00
Dave Chinner
6139a23609 xfs: Check buffer lengths in log recovery
Before trying to obtain, read or write a buffer,
check that the buffer length is actually valid. If
it is not valid, then something read in the recovery
process has been corrupted and we should abort
recovery.

Reported-by: Eric Sesterhenn <snakebyte@gmx.de>
Tested-by: Eric Sesterhenn <snakebyte@gmx.de>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-02-03 11:01:32 -06:00
Felix Blyakher
6d2160bfe7 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6 into for-linus 2009-02-03 10:38:41 -06:00
Steve French
e1f81c8a41 Merge branch 'master' of /pub/scm/linux/kernel/git/torvalds/linux-2.6 2009-02-03 15:19:23 +00:00
Mark Fasheh
fd4ef23196 ocfs2: add quota call to ocfs2_remove_btree_range()
We weren't reclaiming the clusters which get free'd from this function,
so any user punching holes in a file would still have those bytes accounted
against him/her. Add the call to vfs_dq_free_space_nodirty() to fix this.
Interestingly enough, the journal credits calculation already took this into
account.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Acked-by: Jan Kara <jack@suse.cz>
2009-02-02 14:20:20 -08:00
Sunil Mushran
a4b91965d3 ocfs2: Wakeup the downconvert thread after a successful cancel convert
When two nodes holding PR locks on a resource concurrently attempt to
upconvert the locks to EX, the master sends a BAST to one of the nodes. This
message tells that node to first cancel convert the upconvert request,
followed by downconvert to a NL. Only when this lock is downconverted to NL,
can the master upconvert the first node's lock to EX.

While the fs was doing the cancel convert, it was forgetting to wake up the
dc thread after a successful cancel, leading to a deadlock.

Reported-and-Tested-by: David Teigland <teigland@redhat.com>
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-02-02 14:20:19 -08:00
Tao Ma
554e7f9e04 ocfs2: Access the xattr bucket only before modifying it.
In ocfs2_xattr_value_truncate, we may call b-tree codes which will
extend the journal transaction. It has a potential problem that it
may let the already-accessed-but-not-dirtied buffers gone. So we'd
better access the bucket after we call ocfs2_xattr_value_truncate.
And as for the root buffer for the xattr value, b-tree code will
acess and dirty it, so we don't need to worry about it.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-02-02 14:20:18 -08:00
Joel Becker
0e0333429a configfs: Silence lockdep on mkdir(), rmdir() and configfs_depend_item()
When attaching default groups (subdirs) of a new group (in mkdir() or
in configfs_register()), configfs recursively takes inode's mutexes
along the path from the parent of the new group to the default
subdirs. This is needed to ensure that the VFS will not race with
operations on these sub-dirs. This is safe for the following reasons:

- the VFS allows one to lock first an inode and second one of its
  children (The lock subclasses for this pattern are respectively
  I_MUTEX_PARENT and I_MUTEX_CHILD);
- from this rule any inode path can be recursively locked in
  descending order as long as it stays under a single mountpoint and
  does not follow symlinks.

Unfortunately lockdep does not know (yet?) how to handle such
recursion.

I've tried to use Peter Zijlstra's lock_set_subclass() helper to
upgrade i_mutexes from I_MUTEX_CHILD to I_MUTEX_PARENT when we know
that we might recursively lock some of their descendant, but this
usage does not seem to fit the purpose of lock_set_subclass() because
it leads to several i_mutex locked with subclass I_MUTEX_PARENT by
the same task.

>From inside configfs it is not possible to serialize those recursive
locking with a top-level one, because mkdir() and rmdir() are already
called with inodes locked by the VFS. So using some
mutex_lock_nest_lock() is not an option.

I am proposing two solutions:
1) one that wraps recursive mutex_lock()s with
   lockdep_off()/lockdep_on().
2) (as suggested earlier by Peter Zijlstra) one that puts the
   i_mutexes recursively locked in different classes based on their
   depth from the top-level config_group created. This
   induces an arbitrary limit (MAX_LOCK_DEPTH - 2 == 46) on the
   nesting of configfs default groups whenever lockdep is activated
   but this limit looks reasonably high. Unfortunately, this alos
   isolates VFS operations on configfs default groups from the others
   and thus lowers the chances to detect locking issues.

This patch implements solution 1).

Solution 2) looks better from lockdep's point of view, but fails with
configfs_depend_item(). This needs to rework the locking
scheme of configfs_depend_item() by removing the variable lock recursion
depth, and I think that it's doable thanks to the configfs_dirent_lock.
For now, let's stick to solution 1).

Signed-off-by: Louis Rilling <louis.rilling@kerlabs.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-02-02 14:20:18 -08:00
Jan Kara
f8afead716 ocfs2: Fix possible deadlock in ocfs2_write_dquot()
It could happen that some limit has been set via quotactl() and in parallel
->mark_dirty() is called from another thread doing e.g. dquot_alloc_space(). In
such case ocfs2_write_dquot() must not try to sync the dquot because that needs
global quota lock but that ranks above transaction start.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-02-02 14:20:17 -08:00
Jan Kara
ea455f8ab6 ocfs2: Push out dropping of dentry lock to ocfs2_wq
Dropping of last reference to dentry lock is a complicated operation involving
dropping of reference to inode. This can get complicated and quota code in
particular needs to obtain some quota locks which leads to potential deadlock.
Thus we defer dropping of inode reference to ocfs2_wq.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-02-02 14:20:16 -08:00
Steve French
0e2bedaa39 [CIFS] ipv6_addr_equal for address comparison
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-01-30 21:24:41 +00:00
Linus Torvalds
c01a25e7cf Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: Remove bogus BUG() check in ext4_bmap()
  ext4: Fix building with EXT4FS_DEBUG
  ext4: Initialize the new group descriptor when resizing the filesystem
  ext4: Fix ext4_free_blocks() w/o a journal when files have indirect blocks
  jbd2: On a __journal_expect() assertion failure printk "JBD2", not "EXT3-fs"
  ext3: Add sanity check to make_indexed_dir
  ext4: Add sanity check to make_indexed_dir
  ext4: only use i_size_high for regular files
  ext4: fix wrong use of do_div
2009-01-30 08:54:29 -08:00
Linus Torvalds
ae704e9f92 Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  cfq-iosched: Allow RT requests to pre-empt ongoing BE timeslice
  block: add sysfs file for controlling io stats accounting
  Mark mandatory elevator functions in the biodoc.txt
  include/linux: Add bsg.h to the Kernel exported headers
  block: silently error an unsupported barrier bio
  block: Fix documentation for blkdev_issue_flush()
  block: add bio_rw_flagged() for testing bio->bi_rw
  block: seperate bio/request unplug and sync bits
  block: export SSD/non-rotational queue flag through sysfs
  Fix small typo in bio.h's documentation
  block: get rid of the manual directory counting in blktrace
  block: Allow empty integrity profile
  block: Remove obsolete BUG_ON
  block: Don't verify integrity metadata on read error
2009-01-30 08:46:42 -08:00
Linus Torvalds
dbeb17016e Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (29 commits)
  tulip: fix 21142 with 10Mbps without negotiation
  drivers/net/skfp: if !capable(CAP_NET_ADMIN): inverted logic
  gianfar: Fix Wake-on-LAN support
  smsc911x: timeout reaches -1
  smsc9420: fix interrupt signalling test failures
  ucc_geth: Change uec phy id to the same format as gianfar's
  wimax: fix build issue when debugfs is disabled
  netxen: fix memory leak in drivers/net/netxen_nic_init.c
  tun: Add some missing TUN compat ioctl translations.
  ipv4: fix infinite retry loop in IP-Config
  net: update documentation ip aliases
  net: Fix OOPS in skb_seq_read().
  net: Fix frag_list handling in skb_seq_read
  netxen: revert jumbo ringsize
  ath5k: fix locking in ath5k_config
  cfg80211: print correct intersected regulatory domain
  cfg80211: Fix sanity check on 5 GHz when processing country IE
  iwlwifi: fix kernel oops when ucode DMA memory allocation failure
  rtl8187: Fix error in setting OFDM power settings for RTL8187L
  mac80211: remove Michael Wu as maintainer
  ...
2009-01-30 08:41:36 -08:00
Martin K. Petersen
8ae372e3bb block: Remove obsolete BUG_ON
Now that bio_vecs are no longer cleared in bvec_alloc_bs() the following
BUG_ON must go.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-01-30 12:34:36 +01:00
Martin K. Petersen
7b24fc4d7e block: Don't verify integrity metadata on read error
If we get an I/O error on a read request there is no point in doing a
verify pass on the integrity buffer.  Adjust the completion path
accordingly.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-01-30 12:34:36 +01:00
Theodore Ts'o
b9ec63f78b ext4: Remove bogus BUG() check in ext4_bmap()
The code to support journal-less ext4 operation added a BUG to
ext4_bmap() which fired if there was no journal and the
EXT4_STATE_JDATA bit was set in the i_state field.  This caused
running the filefrag program (which uses the FIMBAP ioctl) to trigger
a BUG().

The EXT4_STATE_JDATA bit is only used for ext4_bmap(), and it's
harmless for the bit to be set.  We could add a check in
__ext4_journalled_writepage() and ext4_journalled_write_end() to only
set the EXT4_STATE_JDATA bit if the journal is present, but that adds
an extra test and jump instruction.  It's easier to simply remove the
BUG check.

http://bugzilla.kernel.org/show_bug.cgi?id=12568

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2009-01-30 00:00:24 -05:00
Linus Torvalds
f2257b70b0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  cifs: make sure we allocate enough storage for socket address
  [CIFS] Make socket retry timeouts consistent between blocking and nonblocking cases
  [CIFS] some cleanup to dir.c prior to addition of posix_open
  [CIFS] revalidate parent inode when rmdir done within that directory
  [CIFS] Rename md5 functions to avoid collision with new rt modules
  cifs: turn smb_send into a wrapper around smb_sendv
2009-01-29 18:21:14 -08:00
Davide Libenzi
9df04e1f25 epoll: drop max_user_instances and rely only on max_user_watches
Linus suggested to put limits where the money is, and max_user_watches
already does that w/out the need of max_user_instances.  That has the
advantage to mitigate the potential DoS while allowing pretty generous
default behavior.

Allowing top 4% of low memory (per user) to be allocated in epoll watches,
we have:

LOMEM    MAX_WATCHES (per user)
512MB    ~178000
1GB      ~356000
2GB      ~712000

A box with 512MB of lomem, will meet some challenge in hitting 180K
watches, socket buffers math teaches us.  No more max_user_instances
limits then.

Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: Bron Gondwana <brong@fastmail.fm>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-29 18:04:45 -08:00
David S. Miller
df1c46b2b6 tun: Add some missing TUN compat ioctl translations.
Based upon a report from Michael Tokarev <mjt@tls.msk.ru>:

	Just saw in dmesg:

	ioctl32(kvm:4408): Unknown cmd fd(9) cmd(800454cf){t:'T';sz:4} arg(ffc668e4) on /dev/net/tun

Signed-off-by: David S. Miller <davem@davemloft.net>
2009-01-29 16:53:35 -08:00
Artem Bityutskiy
27ad279933 UBIFS: remove fast unmounting
This UBIFS feature has never worked properly, and it was a mistake
to add it because we simply have no use-cases. So, lets still accept
the fast_unmount mount option, but ignore it. This does not change
much, because UBIFS commit in sync_fs anyway, and sync_fs is called
while unmounting.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-01-29 16:34:30 +02:00
Artem Bityutskiy
a2b9df3ff6 UBIFS: return sensible error codes
When mounting/re-mounting, UBIFS returns EINVAL even if the ENOSPC
or EROFS codes are are much better, just because we have not found
references to ENOSPC/EROFS in mount (2) man pages. This patch
changes this behaviour and makes UBIFS return real error code,
because:

1. It is just less confusing and more logical
2. mount is not described in SuSv3, so it seems to be not really
   well-standartized
3. we do not cover all cases, and any random undocumented in man
   pages error code may be returned anyway

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-01-29 16:22:54 +02:00
Adrian Hunter
b466f17d78 UBIFS: remount ro fixes
- preserve the idx_gc list - it will be needed in the same
state, should UBIFS be remounted rw again
- prevent remounting ro if we have switched to read only
mode (due to a fatal error)

Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-01-29 16:19:36 +02:00
Adrian Hunter
227c75c91d UBIFS: spelling fix 'date' -> 'data'
Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-01-29 16:15:51 +02:00
Adrian Hunter
3eb14297c4 UBIFS: sync wbufs after syncing inodes and pages
All writes go through wbufs so they must be sync'd
after syncing inodes and pages.

Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-01-29 16:15:39 +02:00
Jeff Layton
a9ac49d303 cifs: make sure we allocate enough storage for socket address
The sockaddr declared on the stack in cifs_get_tcp_session is too small
for IPv6 addresses. Change it from "struct sockaddr" to "struct
sockaddr_storage" to prevent stack corruption when IPv6 is used.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-01-29 03:32:13 +00:00
Steve French
da505c386c [CIFS] Make socket retry timeouts consistent between blocking and nonblocking cases
We have used approximately 15 second timeouts on nonblocking sends in the past, and
also 15 second SMB timeout (waiting for server responses, for most request types).
Now that we can do blocking tcp sends,
make blocking send timeout approximately the same (15 seconds).

Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-01-29 03:32:13 +00:00
Steve French
f818dd55c4 [CIFS] some cleanup to dir.c prior to addition of posix_open
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-01-29 03:32:13 +00:00
Steve French
42c245447c [CIFS] revalidate parent inode when rmdir done within that directory
When a search is pending of a parent directory, and a child directory
within it is removed, we need to reset the parent directory's time
so that we don't reuse the (now stale) search results.

Thanks to Gunter Kukkukk for reporting this:

> got the following failure notification on irc #samba:
>
> A user was updating from subversion 1.4 to 1.5, where the
> repository is located on a samba share (independent of
> unix extensions = Yes or No).
> svn 1.4 did work, 1.5 does not.
>
> The user did a lot of stracing of subversion - and wrote a
> testapplet to simulate the failing behaviour.
> I've converted the C++ source to C and added some error cases.
>
> When using "./testdir" on a local file system, "result2"
> is always (nil) as expected - cifs vfs behaves different here!
>
>   ./testdir /mnt/cifs/mounted/share
>
> returns a (failing) valid pointer.

Acked-by: Dave Kleikamp <shaggy@us.ibm.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-01-29 03:32:12 +00:00