Commit graph

12813 commits

Author SHA1 Message Date
Martin K. Petersen
8ae372e3bb block: Remove obsolete BUG_ON
Now that bio_vecs are no longer cleared in bvec_alloc_bs() the following
BUG_ON must go.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-01-30 12:34:36 +01:00
Martin K. Petersen
7b24fc4d7e block: Don't verify integrity metadata on read error
If we get an I/O error on a read request there is no point in doing a
verify pass on the integrity buffer.  Adjust the completion path
accordingly.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-01-30 12:34:36 +01:00
Theodore Ts'o
b9ec63f78b ext4: Remove bogus BUG() check in ext4_bmap()
The code to support journal-less ext4 operation added a BUG to
ext4_bmap() which fired if there was no journal and the
EXT4_STATE_JDATA bit was set in the i_state field.  This caused
running the filefrag program (which uses the FIMBAP ioctl) to trigger
a BUG().

The EXT4_STATE_JDATA bit is only used for ext4_bmap(), and it's
harmless for the bit to be set.  We could add a check in
__ext4_journalled_writepage() and ext4_journalled_write_end() to only
set the EXT4_STATE_JDATA bit if the journal is present, but that adds
an extra test and jump instruction.  It's easier to simply remove the
BUG check.

http://bugzilla.kernel.org/show_bug.cgi?id=12568

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2009-01-30 00:00:24 -05:00
Linus Torvalds
f2257b70b0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  cifs: make sure we allocate enough storage for socket address
  [CIFS] Make socket retry timeouts consistent between blocking and nonblocking cases
  [CIFS] some cleanup to dir.c prior to addition of posix_open
  [CIFS] revalidate parent inode when rmdir done within that directory
  [CIFS] Rename md5 functions to avoid collision with new rt modules
  cifs: turn smb_send into a wrapper around smb_sendv
2009-01-29 18:21:14 -08:00
Davide Libenzi
9df04e1f25 epoll: drop max_user_instances and rely only on max_user_watches
Linus suggested to put limits where the money is, and max_user_watches
already does that w/out the need of max_user_instances.  That has the
advantage to mitigate the potential DoS while allowing pretty generous
default behavior.

Allowing top 4% of low memory (per user) to be allocated in epoll watches,
we have:

LOMEM    MAX_WATCHES (per user)
512MB    ~178000
1GB      ~356000
2GB      ~712000

A box with 512MB of lomem, will meet some challenge in hitting 180K
watches, socket buffers math teaches us.  No more max_user_instances
limits then.

Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: Bron Gondwana <brong@fastmail.fm>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-29 18:04:45 -08:00
David S. Miller
df1c46b2b6 tun: Add some missing TUN compat ioctl translations.
Based upon a report from Michael Tokarev <mjt@tls.msk.ru>:

	Just saw in dmesg:

	ioctl32(kvm:4408): Unknown cmd fd(9) cmd(800454cf){t:'T';sz:4} arg(ffc668e4) on /dev/net/tun

Signed-off-by: David S. Miller <davem@davemloft.net>
2009-01-29 16:53:35 -08:00
Artem Bityutskiy
27ad279933 UBIFS: remove fast unmounting
This UBIFS feature has never worked properly, and it was a mistake
to add it because we simply have no use-cases. So, lets still accept
the fast_unmount mount option, but ignore it. This does not change
much, because UBIFS commit in sync_fs anyway, and sync_fs is called
while unmounting.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-01-29 16:34:30 +02:00
Artem Bityutskiy
a2b9df3ff6 UBIFS: return sensible error codes
When mounting/re-mounting, UBIFS returns EINVAL even if the ENOSPC
or EROFS codes are are much better, just because we have not found
references to ENOSPC/EROFS in mount (2) man pages. This patch
changes this behaviour and makes UBIFS return real error code,
because:

1. It is just less confusing and more logical
2. mount is not described in SuSv3, so it seems to be not really
   well-standartized
3. we do not cover all cases, and any random undocumented in man
   pages error code may be returned anyway

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-01-29 16:22:54 +02:00
Adrian Hunter
b466f17d78 UBIFS: remount ro fixes
- preserve the idx_gc list - it will be needed in the same
state, should UBIFS be remounted rw again
- prevent remounting ro if we have switched to read only
mode (due to a fatal error)

Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-01-29 16:19:36 +02:00
Adrian Hunter
227c75c91d UBIFS: spelling fix 'date' -> 'data'
Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-01-29 16:15:51 +02:00
Adrian Hunter
3eb14297c4 UBIFS: sync wbufs after syncing inodes and pages
All writes go through wbufs so they must be sync'd
after syncing inodes and pages.

Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-01-29 16:15:39 +02:00
Jeff Layton
a9ac49d303 cifs: make sure we allocate enough storage for socket address
The sockaddr declared on the stack in cifs_get_tcp_session is too small
for IPv6 addresses. Change it from "struct sockaddr" to "struct
sockaddr_storage" to prevent stack corruption when IPv6 is used.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-01-29 03:32:13 +00:00
Steve French
da505c386c [CIFS] Make socket retry timeouts consistent between blocking and nonblocking cases
We have used approximately 15 second timeouts on nonblocking sends in the past, and
also 15 second SMB timeout (waiting for server responses, for most request types).
Now that we can do blocking tcp sends,
make blocking send timeout approximately the same (15 seconds).

Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-01-29 03:32:13 +00:00
Steve French
f818dd55c4 [CIFS] some cleanup to dir.c prior to addition of posix_open
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-01-29 03:32:13 +00:00
Steve French
42c245447c [CIFS] revalidate parent inode when rmdir done within that directory
When a search is pending of a parent directory, and a child directory
within it is removed, we need to reset the parent directory's time
so that we don't reuse the (now stale) search results.

Thanks to Gunter Kukkukk for reporting this:

> got the following failure notification on irc #samba:
>
> A user was updating from subversion 1.4 to 1.5, where the
> repository is located on a samba share (independent of
> unix extensions = Yes or No).
> svn 1.4 did work, 1.5 does not.
>
> The user did a lot of stracing of subversion - and wrote a
> testapplet to simulate the failing behaviour.
> I've converted the C++ source to C and added some error cases.
>
> When using "./testdir" on a local file system, "result2"
> is always (nil) as expected - cifs vfs behaves different here!
>
>   ./testdir /mnt/cifs/mounted/share
>
> returns a (failing) valid pointer.

Acked-by: Dave Kleikamp <shaggy@us.ibm.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-01-29 03:32:12 +00:00
Steve French
6a7f8d36c0 [CIFS] Rename md5 functions to avoid collision with new rt modules
When rt modules were added they (each) included their own md5
with names which collided with the existing names of cifs's md5 functions.

Renaming cifs's md5 modules so we don't collide with them.

> Stephen Rothwell wrote:
> When CIFS is built-in (=y) and staging/rt28[67]0 =y, there are multiple
> definitions of:
>
> build-r8250.out:(.text+0x1d8ad0): multiple definition of `MD5Init'
> build-r8250.out:(.text+0x1dbb30): multiple definition of `MD5Update'
> build-r8250.out:(.text+0x1db9b0): multiple definition of `MD5Final'
>
> all of which need to have more unique identifiers for their global
> symbols (e.g., rt28_md5_init, cifs_md5_init, foo, blah, bar).
>

CC: Greg K-H <gregkh@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-01-29 03:32:12 +00:00
Jeff Layton
0496e02d87 cifs: turn smb_send into a wrapper around smb_sendv
cifs: turn smb_send into a wrapper around smb_sendv

Rename smb_send2 to smb_sendv to make it consistent with kernel naming
conventions for functions that take a vector.

There's no need to have 2 functions to handle sending SMB calls. Turn
smb_send into a wrapper around smb_sendv. This also allows us to
properly mark the socket as needing to be reconnected when there's a
partial send from smb_send.

Also, in practice we always use the address and noblocksnd flag
that's attached to the TCP_Server_Info. There's no need to pass
them in as separate args to smb_sendv.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-01-29 03:32:12 +00:00
Chris Mason
89f135d8b5 Btrfs: fix readdir on 32 bit machines
After btrfs_readdir has gone through all the directory items, it
sets the directory f_pos to the largest possible int.  This way
applications that mix readdir with creating new files don't
end up in an endless loop finding the new directory items as they go.

It was a workaround for a bug in git, but the assumption was that if git
could make this looping mistake than it would be a common problem.

The largest possible int chosen was INT_LIMIT(typeof(file->f_pos),
and it is possible for that to be a larger number than 32 bit glibc
expects to come out of readdir.

This patches switches that to INT_LIMIT(off_t), which should keep
applications happy on 32 and 64 bit machines.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-28 15:34:27 -05:00
Chris Mason
e4f722fa42 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
Fix fs/btrfs/super.c conflict around #includes
2009-01-28 20:29:43 -05:00
Adrian Hunter
4a29d2005b UBIFS: fix LPT out-of-space bug (again)
The function to traverse and dirty the LPT was still not
dirtying all nodes, with the result that the LPT could
run out of space.

Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-01-28 16:02:07 +02:00
Jeff Layton
fa82a49127 nfsd: only set file_lock.fl_lmops in nfsd4_lockt if a stateowner is found
nfsd4_lockt does a search for a lockstateowner when building the lock
struct to test. If one is found, it'll set fl_owner to it. Regardless of
whether that happens, it'll also set fl_lmops. Given that this lock is
basically a "lightweight" lock that's just used for checking conflicts,
setting fl_lmops is probably not appropriate for it.

This behavior exposed a bug in DLM's GETLK implementation where it
wasn't clearing out the fields in the file_lock before filling in
conflicting lock info. While we were able to fix this in DLM, it
still seems pointless and dangerous to set the fl_lmops this way
when we may have a NULL lockstateowner.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@pig.fieldses.org>
2009-01-27 17:26:59 -05:00
J. Bruce Fields
b914152a6f nfsd: fix cred leak on every rpc
Since override_creds() took its own reference on new, we need to release
our own reference.

(Note the put_cred on the return value puts the *old* value of
current->creds, not the new passed-in value).

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-01-27 17:26:59 -05:00
J. Bruce Fields
bf935a7881 nfsd: fix null dereference on error path
We're forgetting to check the return value from groups_alloc().

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-01-27 17:26:58 -05:00
Eric Sandeen
f0e0059b9c don't reallocate sxp variable passed into xfs_swapext
fixes kernel.org bugzilla 12538, xfs_fsr fails on 2.6.29-rc kernels

Regression caused by 743bb4650d

This was an embarrasing mistake, reallocating the sxp pointer passed
in from the main ioctl switch.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net
Reported-by: Paul Martin <pm@debian.org>
Tested-by: Paul Martin <pm@debian.org>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-01-27 14:51:39 -06:00
Artem Bityutskiy
6f7ab6d458 UBIFS: fix no_chk_data_crc
When data CRC checking is disabled, UBIFS returns incorrect return
code from the 'try_read_node()' function (0 instead of 1, which means
CRC error), which make the caller re-read the data node again, but using
a different code patch, so the second read is fine. Thus, we read the
same node twice. And the result of this is that UBIFS is slower
with no_chk_data_crc option than it is with chk_data_crc option.
This patches fixes the problem.

Reported-by: Reuben Dowle <Reuben.Dowle@navico.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-01-27 16:25:10 +02:00
Thadeu Lima de Souza Cascardo
9fd9784c91 ext4: Fix building with EXT4FS_DEBUG
When bg_free_blocks_count was renamed to bg_free_blocks_count_lo in
560671a0, its uses under EXT4FS_DEBUG were not changed to the helper
ext4_free_blks_count.

Another commit, 498e5f24, also did not change everything needed under
EXT4FS_DEBUG, thus making it spill some warnings related to printing
format.

This commit fixes both issues and makes ext4 build again when
EXT4FS_DEBUG is enabled.

Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-01-26 19:26:26 -05:00
Theodore Ts'o
fdff73f094 ext4: Initialize the new group descriptor when resizing the filesystem
Make sure all of the fields of the group descriptor are properly
initialized.  Previously, we allowed bg_flags field to be contain
random garbage, which could trigger non-deterministic behavior,
including a kernel OOPS.

http://bugzilla.kernel.org/show_bug.cgi?id=12433

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2009-01-26 19:06:41 -05:00
Linus Torvalds
a90e8a75fb Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm:
  dlm: initialize file_lock struct in GETLK before copying conflicting lock
  dlm: fix plock notify callback to lockd
2009-01-26 10:42:05 -08:00
Linus Torvalds
cc597bc3d3 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-quota-2.6
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-quota-2.6:
  ocfs2: Remove ocfs2_dquot_initialize() and ocfs2_dquot_drop()
  quota: Improve locking
2009-01-26 10:41:00 -08:00
Linus Torvalds
ed80386295 Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6:
  klist.c: bit 0 in pointer can't be used as flag
  debugfs: introduce stub for debugfs_create_size_t() when DEBUG_FS=n
  sysfs: fix problems with binary files
  PNP: fix broken pnp lowercasing for acpi module aliases
  driver core: Convert '/' to '!' in dev_set_name()
2009-01-26 10:40:28 -08:00
Linus Torvalds
a1c70a756f Merge branch 'Kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/adobriyan/misc
* 'Kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/adobriyan/misc: (36 commits)
  fs/Kconfig: move 9p out
  fs/Kconfig: move afs out
  fs/Kconfig: move coda out
  fs/Kconfig: move the rest of ncpfs out
  fs/Kconfig: move smbfs out
  fs/Kconfig: move sunrpc out
  fs/Kconfig: move nfsd out
  fs/Kconfig: move nfs out
  fs/Kconfig: move ufs out
  fs/Kconfig: move sysv out
  fs/Kconfig: move romfs out
  fs/Kconfig: move qnx4 out
  fs/Kconfig: move hpfs out
  fs/Kconfig: move omfs out
  fs/Kconfig: move minix out
  fs/Kconfig: move vxfs out
  fs/Kconfig: move squashfs out
  fs/Kconfig: move cramfs out
  fs/Kconfig: move efs out
  fs/Kconfig: move bfs out
  ...
2009-01-26 10:08:50 -08:00
Vegard Nossum
3632dee2f8 inotify: clean up inotify_read and fix locking problems
If userspace supplies an invalid pointer to a read() of an inotify
instance, the inotify device's event list mutex is unlocked twice.
This causes an unbalance which effectively leaves the data structure
unprotected, and we can trigger oopses by accessing the inotify
instance from different tasks concurrently.

The best fix (contributed largely by Linus) is a total rewrite
of the function in question:

On Thu, Jan 22, 2009 at 7:05 AM, Linus Torvalds wrote:
> The thing to notice is that:
>
>  - locking is done in just one place, and there is no question about it
>   not having an unlock.
>
>  - that whole double-while(1)-loop thing is gone.
>
>  - use multiple functions to make nesting and error handling sane
>
>  - do error testing after doing the things you always need to do, ie do
>   this:
>
>        mutex_lock(..)
>        ret = function_call();
>        mutex_unlock(..)
>
>        .. test ret here ..
>
>   instead of doing conditional exits with unlocking or freeing.
>
> So if the code is written in this way, it may still be buggy, but at least
> it's not buggy because of subtle "forgot to unlock" or "forgot to free"
> issues.
>
> This _always_ unlocks if it locked, and it always frees if it got a
> non-error kevent.

Cc: John McCutchan <ttb@tentacle.dhs.org>
Cc: Robert Love <rlove@google.com>
Cc: <stable@kernel.org>
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-26 10:08:05 -08:00
Linus Torvalds
2d07d4d1bb Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: fix poll notify
  fuse: destroy bdi on umount
  fuse: fuse_fill_super error handling cleanup
  fuse: fix missing fput on error
  fuse: fix NULL deref in fuse_file_alloc()
2009-01-26 09:49:22 -08:00
Artem Bityutskiy
6ba87c9b92 UBIFS: fix assertions
I introduce wrong assertions in one of the previous commits, this
patch fixes them.

Also, initialize debugfs after the debugging check. This is a little
nicer because we want the FS data to be accessible to external users
after everything has been initialized.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-01-26 18:22:47 +02:00
Miklos Szeredi
f6d47a1761 fuse: fix poll notify
Move fuse_copy_finish() to before calling fuse_notify_poll_wakeup().
This is not a big issue because fuse_notify_poll_wakeup() should be
atomic, but it's cleaner this way, and later uses of notification will
need to be able to finish the copying before performing some actions.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-01-26 15:00:59 +01:00
Miklos Szeredi
26c3679101 fuse: destroy bdi on umount
If a fuse filesystem is unmounted but the device file descriptor
remains open and a new mount reuses the old device number, then the
mount fails with EEXIST and the following warning is printed in the
kernel log:

  WARNING: at fs/sysfs/dir.c:462 sysfs_add_one+0x35/0x3d()
  sysfs: duplicate filename '0:15' can not be created

The cause is that the bdi belonging to the fuse filesystem was
destoryed only after the device file was released.  Fix this by
calling bdi_destroy() from fuse_put_super() instead.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@kernel.org
2009-01-26 15:00:59 +01:00
Miklos Szeredi
c2b8f00690 fuse: fuse_fill_super error handling cleanup
Clean up error handling for the whole of fuse_fill_super() function.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-01-26 15:00:58 +01:00
Miklos Szeredi
3ddf1e7f57 fuse: fix missing fput on error
Fix the leaking file reference if allocation or initialization of
fuse_conn failed.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@kernel.org
2009-01-26 15:00:58 +01:00
Dan Carpenter
bb875b38dc fuse: fix NULL deref in fuse_file_alloc()
ff is set to NULL and then dereferenced on line 65.  Compile tested only.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@kernel.org
2009-01-26 15:00:58 +01:00
Adrian Hunter
49d128aa60 UBIFS: ensure orphan area head is initialized
When mounting read-only the orphan area head is
not initialized.  It must be initialized when
remounting read/write, but it was not.  This patch
fixes that.

[Artem: sorry, added comment tweaking noise]
Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-01-26 12:54:11 +02:00
Artem Bityutskiy
b4978e9491 UBIFS: always clean up GC LEB space
When we mount UBIFS, GC LEB may contain out-of-date information,
and UBIFS should update lprops and set free space for thei LEB.
Currently UBIFS does this only if mounted R/W. But for R/O mount
we have to do the same, because otherwise we will have incorrect
FS free space reported to user-space.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-01-26 12:54:11 +02:00
Artem Bityutskiy
84abf972cc UBIFS: add re-mount debugging checks
We observe space corrupted accounting when re-mounting. So add some
debbugging checks to catch problems like this.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-01-26 12:54:11 +02:00
Artem Bityutskiy
e4d9b6cbfc UBIFS: fix LEB list freeing
When freeing the c->idx_lebs list, we have to release the LEBs as well,
because we might be called from mount to read-only mode code. Otherwise
the LEBs stay taken forever, which may cause problems when we re-mount
back ro RW mode.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-01-26 12:54:11 +02:00
Artem Bityutskiy
82c1593cad UBIFS: simplify locking
This patch simplifies lock_[23]_inodes functions. We do not have
to care about locking order, because UBIFS does this for @i_mutex
and this is enough. Thanks to Al Viro for suggesting this.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-01-26 12:54:11 +02:00
Chris Mason
a717531942 Btrfs: do less aggressive btree readahead
Just before reading a leaf, btrfs scans the node for blocks that are
close by and reads them too.  It tries to build up a large window
of IO looking for blocks that are within a max distance from the top
and bottom of the IO window.

This patch changes things to just look for blocks within 64k of the
target block.  It will trigger less IO and make for lower latencies on
the read size.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-22 09:23:10 -05:00
Alexey Dobriyan
0fcb440889 fs/Kconfig: move 9p out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:16:01 +03:00
Alexey Dobriyan
b2480c7fbf fs/Kconfig: move afs out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:16:01 +03:00
Alexey Dobriyan
33a1a6fedf fs/Kconfig: move coda out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:16:01 +03:00
Alexey Dobriyan
9d7d6447ef fs/Kconfig: move the rest of ncpfs out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:16:01 +03:00
Alexey Dobriyan
213a41d404 fs/Kconfig: move smbfs out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:16:01 +03:00
Alexey Dobriyan
9098c24f35 fs/Kconfig: move sunrpc out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:16:00 +03:00
Alexey Dobriyan
e2b329e200 fs/Kconfig: move nfsd out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:16:00 +03:00
Alexey Dobriyan
97afe47ac3 fs/Kconfig: move nfs out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:16:00 +03:00
Alexey Dobriyan
a276a52f9f fs/Kconfig: move ufs out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:16:00 +03:00
Alexey Dobriyan
8af915ba1d fs/Kconfig: move sysv out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:59 +03:00
Alexey Dobriyan
41810246df fs/Kconfig: move romfs out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:59 +03:00
Alexey Dobriyan
4c7415830c fs/Kconfig: move qnx4 out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:59 +03:00
Alexey Dobriyan
928ea19295 fs/Kconfig: move hpfs out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:59 +03:00
Alexey Dobriyan
da55e6f928 fs/Kconfig: move omfs out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:58 +03:00
Alexey Dobriyan
8b1cd7d3c5 fs/Kconfig: move minix out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:58 +03:00
Alexey Dobriyan
22135169dd fs/Kconfig: move vxfs out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:58 +03:00
Alexey Dobriyan
22635ec9e0 fs/Kconfig: move squashfs out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:58 +03:00
Alexey Dobriyan
2a22783be0 fs/Kconfig: move cramfs out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:58 +03:00
Alexey Dobriyan
571f0a0bde fs/Kconfig: move efs out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:57 +03:00
Alexey Dobriyan
0ff423849d fs/Kconfig: move bfs out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:57 +03:00
Alexey Dobriyan
0b09eb3298 fs/Kconfig: move befs out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:57 +03:00
Alexey Dobriyan
b08bac1f18 fs/Kconfig: move hfs, hfsplus out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:57 +03:00
Alexey Dobriyan
295c896cb9 fs/Kconfig: move ecryptfs out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:56 +03:00
Alexey Dobriyan
10951bf05d fs/Kconfig: move affs out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:56 +03:00
Alexey Dobriyan
bc2de2ae67 fs/Kconfig: move adfs out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:56 +03:00
Alexey Dobriyan
4591dabe27 fs/Kconfig: move configfs out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:56 +03:00
Alexey Dobriyan
5f3a211a8b fs/Kconfig: move sysfs out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:56 +03:00
Alexey Dobriyan
9d73ac9e8f fs/Kconfig: move ntfs out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:55 +03:00
Alexey Dobriyan
1c6ace019b fs/Kconfig: move fat out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:55 +03:00
Alexey Dobriyan
ddfaccd995 fs/Kconfig: move iso9660, udf out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:55 +03:00
Alexey Dobriyan
3ef7784e47 fs/Kconfig: move fuse out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:55 +03:00
Alexey Dobriyan
90ffd46793 fs/Kconfig: move autofs, autofs4 out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:54 +03:00
Alexey Dobriyan
335debee07 fs/Kconfig: move btrfs out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:54 +03:00
Alexey Dobriyan
2fe4371dff fs/Kconfig: move ocfs2 out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:54 +03:00
Alexey Dobriyan
f5c77969b3 fs/Kconfig: move jfs out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:54 +03:00
Alexey Dobriyan
b16ecfe2f9 fs/Kconfig: move reiserfs out
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-01-22 13:15:53 +03:00
Dave Chinner
74e2d06521 Long btree pointers are still 64 bit on disk
[XFS] Long btree pointers are still 64 bit on disk

On 32 bit machines with CONFIG_LBD=n, XFS reduces the
in memory size of xfs_fsblock_t to 32 bits so that it
will fit within 32 bit addressing. However, the disk format
for long btree pointers are still 64 bits in size.

The recent btree rewrite failed to take this into account
when initialising new btree blocks, setting sibling pointers
to NULL and checking if they are NULL. Hence checking whether
a 64 bit NULL was the same as a 32 bit NULL was failingi
resulting in NULL sibling pointers failing to be detected
correctly. This showed up as WANT_CORRUPTED_GOTO shutdowns
in xfs_btree_delrec.

Fix this by making all the comparisons and setting of long
pointer btree NULL blocks to the disk format, not the
in memory format. i.e. use NULLDFSBNO.

Reported-by: Alexander Beregalov <a.beregalov@gmail.com>
Reported-by: Jacek Luczak <difrost.kernel@gmail.com>
Reported-by: Danny ter Haar <dth@dth.net>
Tested-by: Jacek Luczak <difrost.kernel@gmail.com>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-01-22 01:23:11 -06:00
Jeff Layton
20d5a39929 dlm: initialize file_lock struct in GETLK before copying conflicting lock
dlm_posix_get fills out the relevant fields in the file_lock before
returning when there is a lock conflict, but doesn't clean out any of
the other fields in the file_lock.

When nfsd does a NFSv4 lockt call, it sets the fl_lmops to
nfsd_posix_mng_ops before calling the lower fs. When the lock comes back
after testing a lock on GFS2, it still has that field set. This confuses
nfsd into thinking that the file_lock is a nfsd4 lock.

Fix this by making DLM reinitialize the file_lock before copying the
fields from the conflicting lock.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2009-01-21 15:28:45 -06:00
David Teigland
24179f4880 dlm: fix plock notify callback to lockd
We should use the original copy of the file_lock, fl, instead
of the copy, flc in the lockd notify callback.  The range in flc has
been modified by posix_lock_file(), so it will not match a copy of the
lock in lockd.

Signed-off-by: David Teigland <teigland@redhat.com>
2009-01-21 15:28:45 -06:00
Yehuda Sadeh
1506fcc818 Btrfs: fiemap support
Now that bmap support is gone, this is the only way to get extent
mappings for userland.  These are still not valid for IO, but they
can tell us if a file has holes or how much fragmentation there is.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
2009-01-21 14:39:14 -05:00
Chris Mason
35054394c4 Btrfs: stop providing a bmap operation to avoid swapfile corruptions
Swapfiles use bmap to build a list of extents belonging to the file,
and they assume these extents won't change over the life of the file.
They also use resulting list to do IO directly to the block device.

This causes problems for btrfs in a few ways:

btrfs returns logical block numbers through bmap, and these are not suitable
for IO.  They might translate to different devices, raid etc.

COW means that file block mappings are going to change frequently.

Using swapfiles on btrfs will lead to corruption, so we're avoiding the
problem for now by dropping bmap support entirely.  A later commit
will add fiemap support for people that really want to know how
a file is laid out.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-21 13:11:13 -05:00
Yan Zheng
7237f18336 Btrfs: fix tree logs parallel sync
To improve performance, btrfs_sync_log merges tree log sync
requests. But it wrongly merges sync requests for different
tree logs. If multiple tree logs are synced at the same time,
only one of them actually gets synced.

This patch has following changes to fix the bug:

Move most tree log related fields in btrfs_fs_info to
btrfs_root. This allows merging sync requests separately
for each tree log.

Don't insert root item into the log root tree immediately
after log tree is allocated. Root item for log tree is
inserted when log tree get synced for the first time. This
allows syncing the log root tree without first syncing all
log trees.

At tree-log sync, btrfs_sync_log first sync the log tree;
then updates corresponding root item in the log root tree;
sync the log root tree; then update the super block.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2009-01-21 12:54:03 -05:00
Qinghuang Feng
7e6628544a Btrfs: open_ctree() error handling can oops on fs_info
a bug in open_ctree:

struct btrfs_root *open_ctree(..)
{
....
	if (!extent_root || !tree_root || !fs_info ||
	    !chunk_root || !dev_root || !csum_root) {
		err = -ENOMEM;
		goto fail;
//When code flow goes to "fail", fs_info may be NULL or uninitialized.
	}
....

fail:
	btrfs_close_devices(fs_info->fs_devices);// !
	btrfs_mapping_tree_free(&fs_info->mapping_tree);// !

	kfree(extent_root);
	kfree(tree_root);
	bdi_destroy(&fs_info->bdi);// !
...
)

Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-21 10:49:16 -05:00
Yan Zheng
86288a198d Btrfs: fix stop searching test in replace_one_extent
replace_one_extent searches tree leaves for references to a given extent. It
stops searching if it goes beyond the last possible position.

The last possible position is computed by adding the starting offset of a found
file extent to the full size of the extent. The code uses physical size of the
extent as the full size. This is incorrect when compression is used.

The fix is get the full size from ram_bytes field of file extent item.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2009-01-21 10:49:16 -05:00
Jan Engelhardt
95029d7d59 Btrfs: change/remove typedef
Change one typedef to a regular enum, and remove an unused one.

Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-21 10:49:16 -05:00
Huang Weiyi
653249ff9a Btrfs: remove duplicated #include
Removed duplicated #include "compat.h"in
fs/btrfs/extent-tree.c

Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-21 10:49:16 -05:00
Yan Zheng
5a7be515b1 Btrfs: Fix infinite loop in btrfs_extent_post_op
btrfs_extent_post_op calls finish_current_insert and del_pending_extents. They
both may enter infinite loops.

finish_current_insert enters infinite loop if it only finds some backrefs to
update.  The fix is to check for pending backref updates before restarting the
loop.

The infinite loop in del_pending_extents is due to a the skipped variable
not being properly reset before looping around.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2009-01-21 10:49:16 -05:00
Yan Zheng
3dfdb9348a Btrfs: fix locking issue in btrfs_remove_block_group
We should hold the block_group_cache_lock while modifying the
block groups red-black tree. Thank you,

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2009-01-21 10:49:16 -05:00
Qinghuang Feng
c6e308713a Btrfs: simplify iteration codes
Merge list_for_each* and list_entry to list_for_each_entry*

Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-21 10:59:08 -05:00
Qinghuang Feng
57506d50ed Btrfs: check return value for kthread_run() correctly
kthread_run() returns the kthread or ERR_PTR(-ENOMEM), not NULL.

Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-21 10:49:16 -05:00
Roland Dreier
119e10cf1b Btrfs: Remove extra KERN_INFO in the middle of a line
The "devid <xxx> transid <xxx>" printk in btrfs_scan_one_device()
actually follows another printk that doesn't end in a newline (since the
intention is for the two printks to make one line of output), so the
KERN_INFO just ends up messing up the output:

    device label exp <6>devid 1 transid 9 /dev/sda5

Fix this by changing the extra KERN_INFO to KERN_CONT.

Signed-off-by: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-21 10:49:16 -05:00
Huang Weiyi
7eaebe7d50 Btrfs: removed unused #include <version.h>'s
Removed unused #include <version.h>'s in btrfs

Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-21 10:49:16 -05:00
Josef Bacik
070604040b Btrfs: cleanup xattr code
Andrew's review of the xattr code revealed some minor issues that this patch
addresses.  Just an error return fix, got rid of a useless statement and
commented one of the trickier parts of __btrfs_getxattr.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-21 10:49:16 -05:00
Wang Cong
19d00cc196 Btrfs: cleanup fs/btrfs/super.c::btrfs_control_ioctl()
- Remove the unused local variable 'len';
- Check return value of kmalloc().

Signed-off-by: Wang Cong <wangcong@zeuux.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-21 10:49:16 -05:00
Jan Kara
c475146d8f ocfs2: Remove ocfs2_dquot_initialize() and ocfs2_dquot_drop()
Since ->acquire_dquot and ->release_dquot callbacks aren't called under
dqptr_sem anymore, we don't have to start a transaction and obtain locks
so early. So we can just remove all this complicated stuff.

Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Mark Fasheh <mfasheh@suse.de>
2009-01-21 15:25:57 +01:00
Greg Kroah-Hartman
4503efd089 sysfs: fix problems with binary files
Some sysfs binary files don't like having 0 passed to them as a size.
Fix this up at the root by just returning to the vfs if userspace asks
us for a zero sized buffer.

Thanks to Pavel Roskin for pointing this out.

Reported-by: Pavel Roskin <proski@gnu.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-20 20:52:09 -08:00
Theodore Ts'o
e7f07968c1 ext4: Fix ext4_free_blocks() w/o a journal when files have indirect blocks
When trying to unlink a file with indirect blocks on a filesystem
without a journal, the "circular indirect block" sanity test was
getting falsely triggered.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-01-20 09:50:19 -05:00
Artem Bityutskiy
7078202e55 UBIFS: document dark_wm and dead_wm better
Just add more commentaries. Also some commentary fixes for
lprops flags.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-01-20 10:10:47 +02:00
Artem Bityutskiy
a50412e3f8 UBIFS: do not treat all data as short term
UBIFS wrongly tells UBI that all data is short term. Use proper
hints instead. Thanks to Xiaochuan-Xu for noticing this.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-01-20 10:10:31 +02:00
Eric Sandeen
b6e3222732 [XFS] Remove the rest of the macro-to-function indirections.
Remove the last of the macros-defined-to-static-functions.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
2009-01-19 14:45:55 +11:00
Christoph Hellwig
b828d8c338 xfs: sanity check attr fork size
Recently we have quite a few kerneloops reports about dereferencing a NULL
if_data in the attribute fork.  From looking over the code this can only
happen if we pass a 0 size argument to xfs_iformat_local.  This implies some
sort of corruption and in fact the only mailinglist report about this from
earlier this year was after a powerfail presumably on a system with write
cache and without barriers.

Add a quick sanity check for the attr fork size in xfs_iformat to catch
these early and without an oops.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
2009-01-19 14:45:11 +11:00
Christoph Hellwig
49739140e5 xfs: fix bad_features2 fixups for the root filesystem
Currently the bad_features2 fixup and the alignment updates in the superblock
are skipped if we mount a filesystem read-only.  But for the root filesystem
the typical case is to mount read-only first and only later remount writeable
so we'll never perform this update at all.  It's not a big problem but means
the logs of people needing the fixup get spammed at every boot because they
never happen on disk.

Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
2009-01-19 14:45:04 +11:00
Christoph Hellwig
5aa2dc0a06 xfs: add a lock class for group/project dquots
We can have both a user and a group/project dquot locked at the same time,
as long as the user dquot is locked first.  Tell lockdep about that fact
by making the group/project dquots a different lock class.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
2009-01-19 14:44:59 +11:00
Christoph Hellwig
4f2d4ac6e5 xfs: lockdep annotations for xfs_dqlock2
xfs_dqlock2 locks two xfs_dquots, which is fine as it always locks the
dquot with the lower id first.  Use mutex_lock_nested to tell lockdep
about this fact.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
2009-01-19 14:44:52 +11:00
Christoph Hellwig
080dda7f5e xfs: add a separate lock class for the per-mount list of dquots
We can have both a a quota hash chain and the per-mount list locked at
the same time.  But given that both use the same struct dqhash as list
head we have to tell lockdep that they are different lock classes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
2009-01-19 14:44:44 +11:00
Christoph Hellwig
62e194ecda xfs: use mnt_want_write in compat_attrmulti ioctl
The compat version of the attrmulti ioctl needs to ask for and then
later release write access to the mount just like the native version,
otherwise we could potentially write to read-only mounts.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
2009-01-19 14:44:30 +11:00
Christoph Hellwig
ab596ad897 xfs: fix dentry aliasing issues in open_by_handle
Open by handle just grabs an inode by handle and then creates itself
a dentry for it.  While this works for regular files it is horribly
broken for directories, where the VFS locking relies on the fact that
there is only just one single dentry for a given inode, and that
these are always connected to the root of the filesystem so that
it's locking algorithms work (see Documentations/filesystems/Locking)

Remove all the existing open by handle code and replace it with a small
wrapper around the exportfs code which deals with all these issues.
At the same time we also make the checks for a valid handle strict
enough to reject all not perfectly well formed handles - given that
we never hand out others that's okay and simplifies the code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
2009-01-19 14:43:18 +11:00
Artem Bityutskiy
e8b815663b UBIFS: constify operations
Mark super, file, and inode operation structcutes with 'const'.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-01-18 14:05:08 +02:00
Artem Bityutskiy
dedb0d48a9 UBIFS: do not commit twice
VFS calls '->sync_fs()' twice - first time with @wait = 0, second
time with @wait = 1. As a result, we may commit and synchronize
write-buffers twice. Avoid doing this by returning immediatelly if
@wait = 0.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-01-18 14:04:57 +02:00
Linus Torvalds
4b48d9d44e Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: fix ioctl arg size (userland incompatible change!)
  Btrfs: Clear the device->running_pending flag before bailing on congestion
2009-01-16 09:32:33 -08:00
Jan Kara
cc33412fb1 quota: Improve locking
We implement dqget() and dqput() that need neither dqonoff_mutex nor dqptr_sem.
Then move dqget() and dqput() calls so that they are not called from under
dqptr_sem. This is important because filesystem callbacks aren't called from
under dqptr_sem which used to cause *lots* of problems with lock ranking
(and with OCFS2 they became close to unsolvable).

The patch also removes two functions which were introduced solely because OCFS2
needed them to cope with the old locking scheme. As time showed, they were not
enough for OCFS2 anyway and it would be unnecessary work to adapt them to the
new locking scheme in which they aren't needed.  As a result OCFS2 needs the
following patch to compile properly with quotas.  Sorry to any bisecters which
hit this in advance.

Signed-off-by: Jan Kara <jack@suse.cz>
2009-01-16 18:02:10 +01:00
Chris Mason
c071fcfdb6 Btrfs: fix ioctl arg size (userland incompatible change!)
The structure used to send device in btrfs ioctl calls was not
properly aligned, and so 32 bit ioctls would not work properly on
64 bit kernels.

We could fix this with compat ioctls, but we're just one byte away
and it doesn't make sense at this stage to carry about the compat ioctls
forever at this stage in the project.

This patch brings the ioctl arg up to an evenly aligned 4k.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-16 11:59:08 -05:00
Chris Mason
1d9e2ae949 Btrfs: Clear the device->running_pending flag before bailing on congestion
Btrfs maintains a queue of async bio submissions so the checksumming
threads don't have to wait on get_request_wait.  In order to avoid
extra wakeups, this code has a running_pending flag that is used
to tell new submissions they don't need to wake the thread.

When the threads notice congestion on a single device, they
may decide to requeue the job and move on to other devices.  This
makes sure the running_pending flag is cleared before the
job is requeued.

It should help avoid IO stalls by making sure the task is woken up
when new submissions come in.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-16 11:58:19 -05:00
Theodore Ts'o
a21102b55c ext3: Add sanity check to make_indexed_dir
Make sure the rec_len field in the '..' entry is sane, lest we overrun
the directory block and cause a kernel oops on a purposefully
corrupted filesystem.

This fixes a bug related to a bug originally reported by Sami Liedes
for ext4 at:

http://bugzilla.kernel.org/show_bug.cgi?id=12430

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2009-01-16 11:13:47 -05:00
Theodore Ts'o
e6b8bc09ba ext4: Add sanity check to make_indexed_dir
Make sure the rec_len field in the '..' entry is sane, lest we overrun
the directory block and cause a kernel oops on a purposefully
corrupted filesystem.

Thanks to Sami Liedes for reporting this bug.

http://bugzilla.kernel.org/show_bug.cgi?id=12430

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2009-01-16 11:13:40 -05:00
Theodore Ts'o
06a279d636 ext4: only use i_size_high for regular files
Directories are not allowed to be bigger than 2GB, so don't use
i_size_high for anything other than regular files.  E2fsck should
complain about these inodes, but the simplest thing to do for the
kernel is to only use i_size_high for regular files.

This prevents an intentially corrupted filesystem from causing the
kernel to burn a huge amount of CPU and issuing error messages such
as:

EXT4-fs warning (device loop0): ext4_block_to_path: block 135090028 > max

Thanks to David Maciejak from Fortinet's FortiGuard Global Security
Research Team for reporting this issue.

http://bugzilla.kernel.org/show_bug.cgi?id=12375

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2009-01-17 18:41:37 -05:00
Jan Kara
6b7021ef7e ext2: also update the inode on disk when dir is IS_DIRSYNC
We used to just write changed page for IS_DIRSYNC inodes.  But we also
have to update the directory inode itself just for the case that we've
allocated a new block and changed i_size.

[akpm@linux-foundation.org: still sync the data page]
Signed-off-by: Jan Kara <jack@suse.cz>
Tested-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-15 16:39:42 -08:00
Qinghuang Feng
1bcbf31337 btrfs & squashfs: Move btrfs and squashfsto's magic number to <linux/magic.h>
Use the standard magic.h for btrfs and squashfs.

Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com>
Cc: Phillip Lougher <phillip@lougher.demon.co.uk>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-15 16:39:38 -08:00
Linus Torvalds
bca268565f Merge branch 'syscalls' of git://git390.osdl.marist.edu/pub/scm/linux-2.6
* 'syscalls' of git://git390.osdl.marist.edu/pub/scm/linux-2.6: (44 commits)
  [CVE-2009-0029] s390 specific system call wrappers
  [CVE-2009-0029] System call wrappers part 33
  [CVE-2009-0029] System call wrappers part 32
  [CVE-2009-0029] System call wrappers part 31
  [CVE-2009-0029] System call wrappers part 30
  [CVE-2009-0029] System call wrappers part 29
  [CVE-2009-0029] System call wrappers part 28
  [CVE-2009-0029] System call wrappers part 27
  [CVE-2009-0029] System call wrappers part 26
  [CVE-2009-0029] System call wrappers part 25
  [CVE-2009-0029] System call wrappers part 24
  [CVE-2009-0029] System call wrappers part 23
  [CVE-2009-0029] System call wrappers part 22
  [CVE-2009-0029] System call wrappers part 21
  [CVE-2009-0029] System call wrappers part 20
  [CVE-2009-0029] System call wrappers part 19
  [CVE-2009-0029] System call wrappers part 18
  [CVE-2009-0029] System call wrappers part 17
  [CVE-2009-0029] System call wrappers part 16
  [CVE-2009-0029] System call wrappers part 15
  ...
2009-01-14 19:58:40 -08:00
Heiko Carstens
2b66421995 [CVE-2009-0029] System call wrappers part 33
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:32 +01:00
Heiko Carstens
d4e82042c4 [CVE-2009-0029] System call wrappers part 32
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:31 +01:00
Heiko Carstens
836f92adf1 [CVE-2009-0029] System call wrappers part 31
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:31 +01:00
Heiko Carstens
6559eed8ca [CVE-2009-0029] System call wrappers part 30
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:30 +01:00
Heiko Carstens
2e4d0924eb [CVE-2009-0029] System call wrappers part 29
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:30 +01:00
Heiko Carstens
938bb9f5e8 [CVE-2009-0029] System call wrappers part 28
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:30 +01:00
Heiko Carstens
1e7bfb2134 [CVE-2009-0029] System call wrappers part 27
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:29 +01:00
Heiko Carstens
5a8a82b1d3 [CVE-2009-0029] System call wrappers part 23
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:28 +01:00
Heiko Carstens
20f37034fb [CVE-2009-0029] System call wrappers part 21
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:26 +01:00
Heiko Carstens
3cdad42884 [CVE-2009-0029] System call wrappers part 20
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:26 +01:00
Heiko Carstens
003d7ab479 [CVE-2009-0029] System call wrappers part 19
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:26 +01:00
Heiko Carstens
ca013e945b [CVE-2009-0029] System call wrappers part 17
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:25 +01:00
Heiko Carstens
002c8976ee [CVE-2009-0029] System call wrappers part 16
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:25 +01:00
Heiko Carstens
a26eab2400 [CVE-2009-0029] System call wrappers part 15
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:24 +01:00
Heiko Carstens
3480b25743 [CVE-2009-0029] System call wrappers part 14
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:24 +01:00
Heiko Carstens
6a6160a7b5 [CVE-2009-0029] System call wrappers part 13
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:23 +01:00
Heiko Carstens
64fd1de3d8 [CVE-2009-0029] System call wrappers part 12
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:23 +01:00
Heiko Carstens
257ac264d6 [CVE-2009-0029] System call wrappers part 11
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:23 +01:00
Heiko Carstens
bdc480e3be [CVE-2009-0029] System call wrappers part 10
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:22 +01:00
Heiko Carstens
a5f8fa9e9b [CVE-2009-0029] System call wrappers part 09
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:21 +01:00
Heiko Carstens
6673e0c3fb [CVE-2009-0029] System call wrapper special cases
System calls with an unsigned long long argument can't be converted with
the standard wrappers since that would include a cast to long, which in
turn means that we would lose the upper 32 bit on 32 bit architectures.
Also semctl can't use the standard wrapper since it has a 'union'
parameter.

So we handle them as special case and add some extra wrappers instead.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:18 +01:00
Heiko Carstens
c9da9f2129 [CVE-2009-0029] Make sys_pselect7 static
Not a single architecture has wired up sys_pselect7 plus it is the
only system call with seven parameters. Just make it static and
rename it to do_pselect which will do the work for sys_pselect6.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:16 +01:00
Heiko Carstens
1134723e96 [CVE-2009-0029] Remove __attribute__((weak)) from sys_pipe/sys_pipe2
Remove __attribute__((weak)) from common code sys_pipe implemantation.
IA64, ALPHA, SUPERH (32bit) and SPARC (32bit) have own implemantations
with the same name. Just rename them.
For sys_pipe2 there is no architecture specific implementation.

Cc: Richard Henderson <rth@twiddle.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:15 +01:00
Heiko Carstens
e55380edf6 [CVE-2009-0029] Rename old_readdir to sys_old_readdir
This way it matches the generic system call name convention.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:15 +01:00
Heiko Carstens
2ed7c03ec1 [CVE-2009-0029] Convert all system calls to return a long
Convert all system calls to return a long. This should be a NOP since all
converted types should have the same size anyway.
With the exception of sys_exit_group which returned void. But that doesn't
matter since the system call doesn't return.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:14 +01:00
Lachlan McIlroy
cb7a97d015 Merge git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6 into for-linus 2009-01-14 16:29:51 +11:00
Bernd Schmidt
62568510b8 Fix timeouts in sys_pselect7
Since we (Analog Devices) updated our Blackfin kernel to 2.6.28, we've
seen occasional 5-second hangs from telnet.  telnetd calls select with a
NULL timeout, but with the new kernel, the system call occasionally
returns 0, which causes telnet to call sleep (5).  This did not happen
with earlier kernels.

The code in sys_pselect7 looks a bit strange, in particular the variable
"to" is initialized to NULL, then changed if a non-null timeout was
passed in, but not used further.  It needs to be passed to
core_sys_select instead of &end_time.

This bug was introduced by 8ff3e8e85f
("select: switch select() and poll() over to hrtimers").

Signed-off-by: Bernd Schmidt <bernd.schmidt@analog.com>
Reviewed-by: Ulrich Drepper <drepper@redhat.com>
Tested-by: Robin Getz <rgetz@blackfin.uclinux.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-13 14:45:17 -08:00
Linus Torvalds
c69e8839c2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm:
  dlm: change rsbtbl rwlock to spinlock
  dlm: fix seq_file usage in debugfs lock dump
2009-01-12 15:54:27 -08:00
Simon Holm Thøgersen
c225aa57ff ext4: fix wrong use of do_div
the following warning:

fs/jbd2/journal.c: In function ‘jbd2_seq_info_show’:
fs/jbd2/journal.c:850: warning: format ‘%lu’ expects type ‘long
unsigned int’, but argument 3 has type ‘uint32_t’

is caused by wrong usage of do_div that modifies the dividend in-place
and returns the quotient. So not only would an incorrect value be
displayed, but s->journal->j_average_commit_time would also be changed
to a wrong value!

Fix it by using div_u64 instead.

Signed-off-by: Simon Holm Thøgersen <odie@cs.aau.dk>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-01-11 22:34:01 -05:00
Linus Torvalds
0176260fc3 btrfs: fix for write_super_lockfs/unlockfs error handling
Commit c4be0c1dc4 added the ability for
write_super_lockfs to return errors, and renamed them to match.  But
btrfs didn't get converted.

Do the minimal conversion to make it compile again.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-10 06:09:52 -08:00
Takashi Sato
8e961870bb filesystem freeze: remove XFS specific ioctl interfaces for freeze feature
It removes XFS specific ioctl interfaces and request codes
for freeze feature.

This patch has been supplied by David Chinner.

Signed-off-by: Dave Chinner <dgc@sgi.com>
Signed-off-by: Takashi Sato <t-sato@yk.jp.nec.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: <xfs-masters@oss.sgi.com>
Cc: <linux-ext4@vger.kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Alasdair G Kergon <agk@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-09 16:54:42 -08:00
Takashi Sato
fcccf50254 filesystem freeze: implement generic freeze feature
The ioctls for the generic freeze feature are below.
o Freeze the filesystem
  int ioctl(int fd, int FIFREEZE, arg)
    fd: The file descriptor of the mountpoint
    FIFREEZE: request code for the freeze
    arg: Ignored
    Return value: 0 if the operation succeeds. Otherwise, -1

o Unfreeze the filesystem
  int ioctl(int fd, int FITHAW, arg)
    fd: The file descriptor of the mountpoint
    FITHAW: request code for unfreeze
    arg: Ignored
    Return value: 0 if the operation succeeds. Otherwise, -1
    Error number: If the filesystem has already been unfrozen,
                  errno is set to EINVAL.

[akpm@linux-foundation.org: fix CONFIG_BLOCK=n]
Signed-off-by: Takashi Sato <t-sato@yk.jp.nec.com>
Signed-off-by: Masayuki Hamaguchi <m-hamaguchi@ys.jp.nec.com>
Cc: <xfs-masters@oss.sgi.com>
Cc: <linux-ext4@vger.kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Alasdair G Kergon <agk@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-09 16:54:42 -08:00
Takashi Sato
c4be0c1dc4 filesystem freeze: add error handling of write_super_lockfs/unlockfs
Currently, ext3 in mainline Linux doesn't have the freeze feature which
suspends write requests.  So, we cannot take a backup which keeps the
filesystem's consistency with the storage device's features (snapshot and
replication) while it is mounted.

In many case, a commercial filesystem (e.g.  VxFS) has the freeze feature
and it would be used to get the consistent backup.

If Linux's standard filesystem ext3 has the freeze feature, we can do it
without a commercial filesystem.

So I have implemented the ioctls of the freeze feature.
I think we can take the consistent backup with the following steps.
1. Freeze the filesystem with the freeze ioctl.
2. Separate the replication volume or create the snapshot
   with the storage device's feature.
3. Unfreeze the filesystem with the unfreeze ioctl.
4. Take the backup from the separated replication volume
   or the snapshot.

This patch:

VFS:
Changed the type of write_super_lockfs and unlockfs from "void"
to "int" so that they can return an error.
Rename write_super_lockfs and unlockfs of the super block operation
freeze_fs and unfreeze_fs to avoid a confusion.

ext3, ext4, xfs, gfs2, jfs:
Changed the type of write_super_lockfs and unlockfs from "void"
to "int" so that write_super_lockfs returns an error if needed,
and unlockfs always returns 0.

reiserfs:
Changed the type of write_super_lockfs and unlockfs from "void"
to "int" so that they always return 0 (success) to keep a current behavior.

Signed-off-by: Takashi Sato <t-sato@yk.jp.nec.com>
Signed-off-by: Masayuki Hamaguchi <m-hamaguchi@ys.jp.nec.com>
Cc: <xfs-masters@oss.sgi.com>
Cc: <linux-ext4@vger.kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Alasdair G Kergon <agk@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-09 16:54:42 -08:00
David Brownell
2d96d1053d CORE_DUMP_DEFAULT_ELF_HEADERS depends on ELF_CORE
Kernels that don't support ELF coredumps at all surely can't be supporting
new partial-segment flavored ELF coredumps ...  don't make folk answer
Kconfig questions about that flavor.

Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-09 16:54:41 -08:00
Linus Torvalds
9a100a4464 Merge git://git.kernel.org/pub/scm/linux/kernel/git/arjan/linux-2.6-async-2
* git://git.kernel.org/pub/scm/linux/kernel/git/arjan/linux-2.6-async-2:
  async: make async a command line option for now
  partial revert of asynchronous inode delete
2009-01-09 15:32:26 -08:00
Linus Torvalds
32b838b8cf Merge git://git.infradead.org/mtd-2.6
* git://git.infradead.org/mtd-2.6:
  [JFFS2] remove junk prototypes
2009-01-09 15:29:04 -08:00
Linus Torvalds
31aeb6c815 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus
* git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus:
  MAINTAINERS: squashfs entry
  Squashfs: documentation
  Squashfs: initrd support
  Squashfs: Kconfig entry
  Squashfs: Makefiles
  Squashfs: header files
  Squashfs: block operations
  Squashfs: cache operations
  Squashfs: uid/gid lookup operations
  Squashfs: fragment block operations
  Squashfs: export operations
  Squashfs: super block operations
  Squashfs: symlink operations
  Squashfs: regular file operations
  Squashfs: directory readdir operations
  Squashfs: directory lookup operations
  Squashfs: inode operations
2009-01-09 15:18:49 -08:00
Linus Torvalds
c40f6f8bbc Merge git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-2.6-nommu
* git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-2.6-nommu:
  NOMMU: Support XIP on initramfs
  NOMMU: Teach kobjsize() about VMA regions.
  FLAT: Don't attempt to expand the userspace stack to fill the space allocated
  FDPIC: Don't attempt to expand the userspace stack to fill the space allocated
  NOMMU: Improve procfs output using per-MM VMAs
  NOMMU: Make mmap allocation page trimming behaviour configurable.
  NOMMU: Make VMAs per MM as for MMU-mode linux
  NOMMU: Delete askedalloc and realalloc variables
  NOMMU: Rename ARM's struct vm_region
  NOMMU: Fix cleanup handling in ramfs_nommu_get_umapped_area()
2009-01-09 14:00:58 -08:00
Arjan van de Ven
b32714ba29 partial revert of asynchronous inode delete
let the core of this one bake in -next as well, but leave
some of the infrastructure in place.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
2009-01-09 13:15:49 -08:00
Artem Bityutskiy
ab5610b434 [JFFS2] remove junk prototypes
'rb_prev()', 'rb_next()' and 'rb_replace_node()' are declared in
include/linux/rbtree.h, no need for JFFS2 to re-declare them. I
believe these are left-overs from the old days when the common
RB tree code did not have those call and JFFS2 had private
implementation.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2009-01-09 21:05:21 +00:00
Linus Torvalds
73d59314e6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (864 commits)
  Btrfs: explicitly mark the tree log root for writeback
  Btrfs: Drop the hardware crc32c asm code
  Btrfs: Add Documentation/filesystem/btrfs.txt, remove old COPYING
  Btrfs: kmap_atomic(KM_USER0) is safe for btrfs_readpage_end_io_hook
  Btrfs: Don't use kmap_atomic(..., KM_IRQ0) during checksum verifies
  Btrfs: tree logging checksum fixes
  Btrfs: don't change file extent's ram_bytes in btrfs_drop_extents
  Btrfs: Use btrfs_join_transaction to avoid deadlocks during snapshot creation
  Btrfs: drop remaining LINUX_KERNEL_VERSION checks and compat code
  Btrfs: drop EXPORT symbols from extent_io.c
  Btrfs: Fix checkpatch.pl warnings
  Btrfs: Fix free block discard calls down to the block layer
  Btrfs: avoid orphan inode caused by log replay
  Btrfs: avoid potential super block corruption
  Btrfs: do not call kfree if kmalloc failed in btrfs_sysfs_add_super
  Btrfs: fix a memory leak in btrfs_get_sb
  Btrfs: Fix typo in clear_state_cb
  Btrfs: Fix memset length in btrfs_file_write
  Btrfs: update directory's size when creating subvol/snapshot
  Btrfs: add permission checks to the ioctls
  ...
2009-01-09 13:01:38 -08:00
Linus Torvalds
6ddaab20c3 Merge branch 'for-2.6.29' of git://git.kernel.dk/linux-2.6-block
* 'for-2.6.29' of git://git.kernel.dk/linux-2.6-block:
  block: fix bug in ptbl lookup cache
2009-01-09 12:57:34 -08:00
Neil Brown
54b0d12769 block: fix bug in ptbl lookup cache
Neil writes:

   Hi Jens,

    I've found a little bug for you.  It was introduced by
        a6f23657d3

        block: add one-hit cache for disk partition lookup

    and has the effect of killing my machine whenever I try to assemble
    an md array :-(
    One of the devices in the array has partitions, and mdadm always
    deletes partitions before putting a whole-device in an array (as it
    can cause confusion).  The next IO to that device locks the machine.
    I don't really understand exactly why it locks up, but it happens in
    disk_map_sector_rcu().  This patch fixes it.

Which is due to a missing clear of the (now) stale partition lookup
data. So clear that when we delete a partition.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-01-09 21:46:13 +01:00
Linus Torvalds
7c51d57e9d Merge git://git.infradead.org/mtd-2.6
* git://git.infradead.org/mtd-2.6: (67 commits)
  [MTD] [MAPS] Fix printk format warning in nettel.c
  [MTD] [NAND] add cmdline parsing (mtdparts=) support to cafe_nand
  [MTD] CFI: remove major/minor version check for command set 0x0002
  [MTD] [NAND] ndfc driver
  [MTD] [TESTS] Fix some size_t printk format warnings
  [MTD] LPDDR Makefile and KConfig
  [MTD] LPDDR extended physmap driver to support LPDDR flash
  [MTD] LPDDR added new pfow_base parameter
  [MTD] LPDDR Command set driver
  [MTD] LPDDR PFOW definition
  [MTD] LPDDR QINFO records definitions
  [MTD] LPDDR qinfo probing.
  [MTD] [NAND] pxa3xx: convert from ns to clock ticks more accurately
  [MTD] [NAND] pxa3xx: fix non-page-aligned reads
  [MTD] [NAND] fix nandsim sched.h references
  [MTD] [NAND] alauda: use USB API functions rather than constants
  [MTD] struct device - replace bus_id with dev_name(), dev_set_name()
  [MTD] fix m25p80 64-bit divisions
  [MTD] fix dataflash 64-bit divisions
  [MTD] [NAND] Set the fsl elbc ECCM according the settings in bootloader.
  ...

Fixed up trivial debug conflicts in drivers/mtd/devices/{m25p80.c,mtd_dataflash.c}
2009-01-09 12:37:15 -08:00
Chris Mason
e293e97e36 Btrfs: explicitly mark the tree log root for writeback
Each subvolume has an extent_state_tree used to mark metadata
that needs to be sent to disk while syncing the tree.  This is
used in addition to the dirty bits on the pages themselves so that
a single subvolume can be sent to disk efficiently in disk order.

Normally this marking happens in btrfs_alloc_free_block, which also does
special recording of dirty tree blocks for the tree log roots.

Yan Zheng noticed that when the root of the log tree is allocated, it is added
to the wrong writeback list.  The fix used here is to explicitly set
it dirty as part of tree log creation.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-09 13:14:17 -05:00
Nick Piggin
0087167c9d [XFS] use scalable vmap API
Implement XFS's large buffer support with the new vmap APIs. See the vmap
rewrite (db64fe02) for some numbers. The biggest improvement that comes from
using the new APIs is avoiding the global KVA allocation lock on every call.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
2009-01-09 17:09:47 +11:00
Nick Piggin
958f8c0e4f [XFS] remove old vmap cache
XFS's vmap batching simply defers a number (up to 64) of vunmaps, and keeps
track of them in a list. To purge the batch, it just goes through the list and
calls vunamp on each one. This is pretty poor: a global TLB flush is generally
still performed on each vunmap, with the most expensive parts of the operation
being the broadcast IPIs and locking involved in the SMP callouts, and the
locking involved in the vmap management -- none of these are avoided by just
batching up the calls. I'm actually surprised it ever made much difference.
(Now that the lazy vmap allocator is upstream, this description is not quite
right, but the vunmap batching still doesn't seem to do much)

Rip all this logic out of XFS completely. I will improve vmap performance
and scalability directly in subsequent patch.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
2009-01-09 17:09:25 +11:00
Christoph Hellwig
058652a37d [XFS] make xfs_ino_t an unsigned long long
Currently xfs_ino_t is defined as a u64 which can either be an unsigned
long long or on some 64 bit platforms and unsigned long.  Just making
it and unsigned long long mean's it's still always 64 bits wide, but we
don't need to resort to cases to print it.

Fixes a warning regression on 64 bit powerpc in current git.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
2009-01-09 16:19:14 +11:00
Christoph Hellwig
1544031976 [XFS] truncate readdir offsets to signed 32 bit values
John Stanley reported EOVERFLOW errors in readdir from his self-build
glibc.  I traced this down to glibc enabling d_off overflow checks
in one of the about five million different getdents implementations.

In 2.6.28 Dave Woodhouse moved our readdir double buffering required
for NFS4 readdirplus into nfsd and at that point we lost the capping
of the directory offsets to 32 bit signed values.  Johns glibc used
getdents64 to even implement readdir for normal 32 bit offset dirents,
and failed with EOVERFLOW only if this happens on the first dirent in
a getdents call.  I managed to come up with a testcase that uses
raw getdents and does the EOVERFLOW check manually.  We always hit
it with our last entry due to the special end of directory marker.

The patch below is a dumb version of just putting back the masking,
to make sure we have the same behavior as in 2.6.27 and earlier.

I will work on a better and cleaner fix for 2.6.30.

Reported-by: John Stanley <jpsinthemix@verizon.net>
Tested-by: John Stanley <jpsinthemix@verizon.net>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
2009-01-09 16:18:24 +11:00
Christoph Hellwig
e6edbd1c1c [XFS] fix compile of xfs_btree_readahead_lblock on m68k
Change the left/right variables to the proper always 64bit xfs_dfsbo_t
type because otherwise compilation fails for Geert on m68k without
CONFIG_LBD:

| fs/xfs/xfs_btree.c: In function 'xfs_btree_readahead_lblock':
| fs/xfs/xfs_btree.c:736: warning: comparison is always true due to limited range of data type
| fs/xfs/xfs_btree.c:741: warning: comparison is always true due to limited range of data type

Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
2009-01-09 16:16:51 +11:00
Eric Sandeen
fb82557f16 [XFS] Remove macro-to-function indirections in the mask code
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
2009-01-09 15:53:54 +11:00
Eric Sandeen
c9fb86a917 [XFS] Remove macro-to-function indirections in attr code
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
2009-01-09 15:46:44 +11:00
Eric Sandeen
9800b55035 [XFS] Remove several unused typedefs.
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
2009-01-09 15:46:16 +11:00
Christoph Hellwig
c9a98553d5 [XFS] pass XFS_IGET_BULKSTAT to xfs_iget for handle operations
NFS clients or users of the handle ioctls can pass us arbitrary inode
numbers through the exportfs interface.  Make sure we use the
XFS_IGET_BULKSTAT so that these don't cause shutdowns due to the corruption
checks.  Also translate the EINVAL we get back for invalid inode clusters
into an ESTALE which is more appropinquate, and remove the useless check
for a NULL inode on a successfull xfs_iget return.

I have a testcase to reproduce this using the handle interface which
I will submit to xfsqa.

Reported-by: Mario Becroft <mb@gem.win.co.nz>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
2009-01-09 15:17:17 +11:00
Linus Torvalds
2150edc6c5 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (57 commits)
  jbd2: Fix oops in jbd2_journal_init_inode() on corrupted fs
  ext4: Remove "extents" mount option
  block: Add Kconfig help which notes that ext4 needs CONFIG_LBD
  ext4: Make printk's consistently prefixed with "EXT4-fs: "
  ext4: Add sanity checks for the superblock before mounting the filesystem
  ext4: Add mount option to set kjournald's I/O priority
  jbd2: Submit writes to the journal using WRITE_SYNC
  jbd2: Add pid and journal device name to the "kjournald2 starting" message
  ext4: Add markers for better debuggability
  ext4: Remove code to create the journal inode
  ext4: provide function to release metadata pages under memory pressure
  ext3: provide function to release metadata pages under memory pressure
  add releasepage hooks to block devices which can be used by file systems
  ext4: Fix s_dirty_blocks_counter if block allocation failed with nodelalloc
  ext4: Init the complete page while building buddy cache
  ext4: Don't allow new groups to be added during block allocation
  ext4: mark the blocks/inode bitmap beyond end of group as used
  ext4: Use new buffer_head flag to check uninit group bitmaps initialization
  ext4: Fix the race between read_inode_bitmap() and ext4_new_inode()
  ext4: code cleanup
  ...
2009-01-08 17:14:59 -08:00
Linus Torvalds
cd764695b6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (45 commits)
  [SCSI] qla2xxx: Update version number to 8.03.00-k1.
  [SCSI] qla2xxx: Add ISP81XX support.
  [SCSI] qla2xxx: Use proper request/response queues with MQ instantiations.
  [SCSI] qla2xxx: Correct MQ-chain information retrieval during a firmware dump.
  [SCSI] qla2xxx: Collapse EFT/FCE copy procedures during a firmware dump.
  [SCSI] qla2xxx: Don't pollute kernel logs with ZIO/RIO status messages.
  [SCSI] qla2xxx: Don't fallback to interrupt-polling during re-initialization with MSI-X enabled.
  [SCSI] qla2xxx: Remove support for reading/writing HW-event-log.
  [SCSI] cxgb3i: add missing include
  [SCSI] scsi_lib: fix DID_RESET status problems
  [SCSI] fc transport: restore missing dev_loss_tmo callback to LLDD
  [SCSI] aha152x_cs: Fix regression that keeps driver from using shared interrupts
  [SCSI] sd: Correctly handle 6-byte commands with DIX
  [SCSI] sd: DIF: Fix tagging on platforms with signed char
  [SCSI] sd: DIF: Show app tag on error
  [SCSI] Fix error handling for DIF/DIX
  [SCSI] scsi_lib: don't decrement busy counters when inserting commands
  [SCSI] libsas: fix test for negative unsigned and typos
  [SCSI] a2091, gvp11: kill warn_unused_result warnings
  [SCSI] fusion: Move a dereference below a NULL test
  ...

Fixed up trivial conflict due to moving the async part of sd_probe
around in the async probes vs using dev_set_name() in naming.
2009-01-08 16:27:31 -08:00
Linus Torvalds
894bcdfb1a Merge branch 'for-linus' of git://neil.brown.name/md
* 'for-linus' of git://neil.brown.name/md:
  md: don't retry recovery of raid1 that fails due to error on source drive.
  md: Allow md devices to be created by name.
  md: make devices disappear when they are no longer needed.
  md: centralise all freeing of an 'mddev' in 'md_free'
  md: move allocation of ->queue from mddev_find to md_probe
  md: need another print_sb for mdp_superblock_1
  md: use list_for_each_entry macro directly
  md: raid0: make hash_spacing and preshift sector-based.
  md: raid0: Represent the size of strip zones in sectors.
  md: raid0 create_strip_zones(): Add KERN_INFO/KERN_ERR to printk's.
  md: raid0 create_strip_zones(): Make two local variables sector-based.
  md: raid0: Represent zone->zone_offset in sectors.
  md: raid0: Represent device offset in sectors.
  md: raid0_make_request(): Replace local variable block by sector.
  md: raid0_make_request(): Remove local variable chunk_size.
  md: raid0_make_request(): Replace chunksize_bits by chunksect_bits.
  md: use sysfs_notify_dirent to notify changes to md/sync_action.
  md: fix bitmap-on-external-file bug.
2009-01-08 14:03:34 -08:00
NeilBrown
d3374825ce md: make devices disappear when they are no longer needed.
Currently md devices, once created, never disappear until the module
is unloaded.  This is essentially because the gendisk holds a
reference to the mddev, and the mddev holds a reference to the
gendisk, this a circular reference.

If we drop the reference from mddev to gendisk, then we need to ensure
that the mddev is destroyed when the gendisk is destroyed.  However it
is not possible to hook into the gendisk destruction process to enable
this.

So we drop the reference from the gendisk to the mddev and destroy the
gendisk when the mddev gets destroyed.  However this has a
complication.
Between the call
   __blkdev_get->get_gendisk->kobj_lookup->md_probe
and the call
   __blkdev_get->md_open

there is no obvious way to hold a reference on the mddev any more, so
unless something is done, it will disappear and gendisk will be
destroyed prematurely.

Also, once we decide to destroy the mddev, there will be an unlockable
moment before the gendisk is unlinked (blk_unregister_region) during
which a new reference to the gendisk can be created.  We need to
ensure that this reference can not be used.  i.e. the ->open must
fail.

So:
 1/  in md_probe we set a flag in the mddev (hold_active) which
     indicates that the array should be treated as active, even
     though there are no references, and no appearance of activity.
     This is cleared by md_release when the device is closed if it
     is no longer needed.
     This ensures that the gendisk will survive between md_probe and
     md_open.

 2/  In md_open we check if the mddev we expect to open matches
     the gendisk that we did open.
     If there is a mismatch we return -ERESTARTSYS and modify
     __blkdev_get to retry from the top in that case.
     In the -ERESTARTSYS sys case we make sure to wait until
     the old gendisk (that we succeeded in opening) is really gone so
     we loop at most once.

Some udev configurations will always open an md device when it first
appears.   If we allow an md device that was just created by an open
to disappear on an immediate close, then this can race with such udev
configurations and result in an infinite loop the device being opened
and closed, then re-open due to the 'ADD' even from the first open,
and then close and so on.
So we make sure an md device, once created by an open, remains active
at least until some md 'ioctl' has been made on it.  This means that
all normal usage of md devices will allow them to disappear promptly
when not needed, but the worst that an incorrect usage will do it
cause an inactive md device to be left in existence (it can easily be
removed).

As an array can be stopped by writing to a sysfs attribute
  echo clear > /sys/block/mdXXX/md/array_state
we need to use scheduled work for deleting the gendisk and other
kobjects.  This allows us to wait for any pending gendisk deletion to
complete by simply calling flush_scheduled_work().



Signed-off-by: NeilBrown <neilb@suse.de>
2009-01-09 08:31:10 +11:00
David Teigland
c7be761a81 dlm: change rsbtbl rwlock to spinlock
The rwlock is almost always used in write mode, so there's no reason
to not use a spinlock instead.

Signed-off-by: David Teigland <teigland@redhat.com>
2009-01-08 15:12:39 -06:00
David Teigland
892c4467e3 dlm: fix seq_file usage in debugfs lock dump
The old code would leak iterators and leave reference counts on
rsbs because it was ignoring the "stop" seq callback.  The code
followed an example that used the seq operations differently.
This new code is based on actually understanding how the seq
operations work.  It also improves things by saving the hash bucket
in the position to avoid cycling through completed buckets in start.

Siged-off-by: Davd Teigland <teigland@redhat.com>
2009-01-08 15:12:31 -06:00
Coly Li
73ac36ea14 fix similar typos to successfull
When I review ocfs2 code, find there are 2 typos to "successfull".  After
doing grep "successfull " in kernel tree, 22 typos found totally -- great
minds always think alike :)

This patch fixes all the similar typos. Thanks for Randy's ack and comments.

Signed-off-by: Coly Li <coyli@suse.de>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: Roland Dreier <rolandd@cisco.com>
Cc: Jeremy Kerr <jk@ozlabs.org>
Cc: Jeff Garzik <jeff@garzik.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Vlad Yasevich <vladislav.yasevich@hp.com>
Cc: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-08 08:31:15 -08:00
Wu Fengguang
9a8d5bb4ad generic swap(): dcache: use swap() instead of private do_switch()
Use the new generic implementation.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-08 08:31:15 -08:00
Wu Fengguang
97e133b454 generic swap(): ext4: remove local swap() macro
Use the new generic implementation.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-08 08:31:15 -08:00
Wu Fengguang
be857df1dd generic swap(): ext3: remove local swap() macro
Use the new generic implementation.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-08 08:31:15 -08:00
Fernando Carrijo
c19a28e119 remove lots of double-semicolons
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: James Morris <jmorris@namei.org>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Acked-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-08 08:31:14 -08:00
roel kluin
f15659628b romfs: romfs_iget() - unsigned ino >= 0 is always true
romfs_strnlen() returns int
unsigned X >= 0 is always true

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: roel kluin <roel.kluin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-08 08:31:14 -08:00
Magnus Damm
921d58c0e6 vmcore: remove saved_max_pfn check
Remove the saved_max_pfn check from the /proc/vmcore function
read_from_oldmem().  No need to verify, we should be able to just trust
that "elfcorehdr=" is correctly passed to the crash kernel on the kernel
command line like we do with other parameters.

The read_from_oldmem() function in fs/proc/vmcore.c is quite similar to
read_from_oldmem() in drivers/char/mem.c, but only in the latter it makes
sense to use saved_max_pfn.  For oldmem it is used to determine when to
stop reading.  For vmcore we already have the elf header info pointing out
the physical memory regions, no need to pass the end-of- old-memory twice.

Removing the saved_max_pfn check from vmcore makes it possible for
architectures to skip oldmem but still support crash dump through vmcore -
without the need for the old saved_max_pfn cruft.

Architectures that want to play safe can do the saved_max_pfn check in
copy_oldmem_page().  Not sure why anyone would want to do that, but that's
even safer than today - the saved_max_pfn check in vmcore removed by this
patch only checks the first page.

Signed-off-by: Magnus Damm <damm@igel.co.jp>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Simon Horman <horms@verge.net.au>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-08 08:31:14 -08:00
Kees Cook
f06295b44c ELF: implement AT_RANDOM for glibc PRNG seeding
While discussing[1] the need for glibc to have access to random bytes
during program load, it seems that an earlier attempt to implement
AT_RANDOM got stalled.  This implements a random 16 byte string, available
to every ELF program via a new auxv AT_RANDOM vector.

[1] http://sourceware.org/ml/libc-alpha/2008-10/msg00006.html

Ulrich said:

glibc needs right after startup a bit of random data for internal
protections (stack canary etc).  What is now in upstream glibc is that we
always unconditionally open /dev/urandom, read some data, and use it.  For
every process startup.  That's slow.

...

The solution is to provide a limited amount of random data to the
starting process in the aux vector.  I suggested 16 bytes and this is
what the patch implements.  If we need only 16 bytes or less we use the
data directly.  If we need more we'll use the 16 bytes to see a PRNG.
This avoids the costly /dev/urandom use and it allows the kernel to use
the most adequate source of random data for this purpose.  It might not
be the same pool as that for /dev/urandom.

Concerns were expressed about the depletion of the randomness pool.  But
this patch doesn't make the situation worse, it doesn't deplete entropy
more than happens now.

Signed-off-by: Kees Cook <kees.cook@canonical.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-08 08:31:12 -08:00
KAMEZAWA Hiroyuki
08e552c69c memcg: synchronized LRU
A big patch for changing memcg's LRU semantics.

Now,
  - page_cgroup is linked to mem_cgroup's its own LRU (per zone).

  - LRU of page_cgroup is not synchronous with global LRU.

  - page and page_cgroup is one-to-one and statically allocated.

  - To find page_cgroup is on what LRU, you have to check pc->mem_cgroup as
    - lru = page_cgroup_zoneinfo(pc, nid_of_pc, zid_of_pc);

  - SwapCache is handled.

And, when we handle LRU list of page_cgroup, we do following.

	pc = lookup_page_cgroup(page);
	lock_page_cgroup(pc); .....................(1)
	mz = page_cgroup_zoneinfo(pc);
	spin_lock(&mz->lru_lock);
	.....add to LRU
	spin_unlock(&mz->lru_lock);
	unlock_page_cgroup(pc);

But (1) is spin_lock and we have to be afraid of dead-lock with zone->lru_lock.
So, trylock() is used at (1), now. Without (1), we can't trust "mz" is correct.

This is a trial to remove this dirty nesting of locks.
This patch changes mz->lru_lock to be zone->lru_lock.
Then, above sequence will be written as

        spin_lock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU
	mem_cgroup_add/remove/etc_lru() {
		pc = lookup_page_cgroup(page);
		mz = page_cgroup_zoneinfo(pc);
		if (PageCgroupUsed(pc)) {
			....add to LRU
		}
        spin_lock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU

This is much simpler.
(*) We're safe even if we don't take lock_page_cgroup(pc). Because..
    1. When pc->mem_cgroup can be modified.
       - at charge.
       - at account_move().
    2. at charge
       the PCG_USED bit is not set before pc->mem_cgroup is fixed.
    3. at account_move()
       the page is isolated and not on LRU.

Pros.
  - easy for maintenance.
  - memcg can make use of laziness of pagevec.
  - we don't have to duplicated LRU/Active/Unevictable bit in page_cgroup.
  - LRU status of memcg will be synchronized with global LRU's one.
  - # of locks are reduced.
  - account_move() is simplified very much.
Cons.
  - may increase cost of LRU rotation.
    (no impact if memcg is not configured.)

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-08 08:31:05 -08:00
Jan Kara
e04a88a920 quota: don't set grace time when user isn't above softlimit
do_set_dqblk() allowed SETDQBLK quotactl to set user's grace time even if
user was not above his softlimit.  This does not make much sence and by
coincidence causes quota code to omit softlimit warning when user really
exceeds softlimit.  This patch makes do_set_dqblk() reset user's grace
time if he has not exceeded softlimit.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-08 08:31:01 -08:00
Richard A. Holden III
87d1fda5e2 coda: fix fs/coda/sysctl.c build warnings when !CONFIG_SYSCTL
Fix
fs/coda/sysctl.c:14: warning: 'fs_table_header' defined but not used
fs/coda/sysctl.c:44: warning: 'fs_table' defined but not used

these are only used when CONFIG_SYSCTL is defined.

Signed-off-by: Richard A. Holden III <aciddeath@gmail.com>
Cc: Jan Harkes <jaharkes@cs.cmu.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-08 08:31:01 -08:00
Randy Dunlap
1579c3a15c jbd: remove excess kernel-doc notation
Remove excess kernel-doc from fs/jbd/transaction.c:

Warning(linux-2.6.28-git5//fs/jbd/transaction.c:764): Excess function parameter 'credits' description in 'journal_get_write_access'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-08 08:31:01 -08:00
Duane Griffin
04143e2fb9 ext3: tighten restrictions on inode flags
At the moment there are few restrictions on which flags may be set on
which inodes.  Specifically DIRSYNC may only be set on directories and
IMMUTABLE and APPEND may not be set on links.  Tighten that to disallow
TOPDIR being set on non-directories and only NODUMP and NOATIME to be set
on non-regular file, non-directories.

Introduces a flags masking function which masks flags based on mode and
use it during inode creation and when flags are set via the ioctl to
facilitate future consistency.

Signed-off-by: Duane Griffin <duaneg@dghda.com>
Acked-by: Andreas Dilger <adilger@sun.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-08 08:31:01 -08:00
Duane Griffin
2e8671cb56 ext3: don't inherit inappropriate inode flags from parent
At present INDEX is the only flag that new ext3 inodes do NOT inherit from
their parent.  In addition prevent the flags DIRTY, ECOMPR, IMAGIC and
TOPDIR from being inherited.  List inheritable flags explicitly to prevent
future flags from accidentally being inherited.

This fixes the TOPDIR flag inheritance bug reported at
http://bugzilla.kernel.org/show_bug.cgi?id=9866.

Signed-off-by: Duane Griffin <duaneg@dghda.com>
Acked-by: Andreas Dilger <adilger@sun.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-08 08:31:01 -08:00
Pekka Enberg
5df096d67e ext3: allocate ->s_blockgroup_lock separately
As spotted by kmemtrace, struct ext3_sb_info is 17152 bytes on 64-bit
which makes it a very bad fit for SLAB allocators.  The culprit of the
wasted memory is ->s_blockgroup_lock which can be as big as 16 KB when
NR_CPUS >= 32.

To fix that, allocate ->s_blockgroup_lock, which fits nicely in a order 2
page in the worst case, separately.  This shinks down struct ext3_sb_info
enough to fit a 1 KB slab cache so now we allocate 16 KB + 1 KB instead of
32 KB saving 15 KB of memory.

Acked-by: Andreas Dilger <adilger@sun.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-08 08:31:00 -08:00
Josef Bacik
f420d4dc42 jbd: improve fsync batching
There is a flaw with the way jbd handles fsync batching.  If we fsync() a
file and we were not the last person to run fsync() on this fs then we
automatically sleep for 1 jiffie in order to wait for new writers to join
into the transaction before forcing the commit.  The problem with this is
that with really fast storage (ie a Clariion) the time it takes to commit
a transaction to disk is way faster than 1 jiffie in most cases, so
sleeping means waiting longer with nothing to do than if we just committed
the transaction and kept going.  Ric Wheeler noticed this when using
fs_mark with more than 1 thread, the throughput would plummet as he added
more threads.

This patch attempts to fix this problem by recording the average time in
nanoseconds that it takes to commit a transaction to disk, and what time
we started the transaction.  If we run an fsync() and we have been running
for less time than it takes to commit the transaction to disk, we sleep
for the delta amount of time and then commit to disk.  We acheive
sub-jiffie sleeping using schedule_hrtimeout.  This means that the wait
time is auto-tuned to the speed of the underlying disk, instead of having
this static timeout.  I weighted the average according to somebody's
comments (Andreas Dilger I think) in order to help normalize random
outliers where we take way longer or way less time to commit than the
average.  I also have a min() check in there to make sure we don't sleep
longer than a jiffie in case our storage is super slow, this was requested
by Andrew.

I unfortunately do not have access to a Clariion, so I had to use a
ramdisk to represent a super fast array.  I tested with a SATA drive with
barrier=1 to make sure there was no regression with local disks, I tested
with a 4 way multipathed Apple Xserve RAID array and of course the
ramdisk.  I ran the following command

fs_mark -d /mnt/ext3-test -s 4096 -n 2000 -D 64 -t $i

where $i was 2, 4, 8, 16 and 32.  I mkfs'ed the fs each time.  Here are my
results

type	threads		with patch	without patch
sata	2		24.6		26.3
sata	4		49.2		48.1
sata	8		70.1		67.0
sata	16		104.0		94.1
sata	32		153.6		142.7

xserve	2		246.4		222.0
xserve	4		480.0		440.8
xserve	8		829.5		730.8
xserve	16		1172.7		1026.9
xserve	32		1816.3		1650.5

ramdisk	2		2538.3		1745.6
ramdisk	4		2942.3		661.9
ramdisk	8		2882.5		999.8
ramdisk	16		2738.7		1801.9
ramdisk	32		2541.9		2394.0

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Cc: Andreas Dilger <adilger@sun.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Ric Wheeler <rwheeler@redhat.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-08 08:31:00 -08:00