linux

q3k/linux

Author	SHA1	Message	Date
Curt Wohlgemuth	0893ed458b	ext4: sync the directory inode in ext4_sync_parent() ext4 has taken the stance that, in the absence of a journal, when an fsync/fdatasync of an inode is done, the parent directory should be sync'ed if this inode entry is new. ext4_sync_parent(), which implements this, does indeed sync the dirent pages for parent directories, but it does not sync the directory inode. This patch fixes this. Also now return error status from ext4_sync_parent(). I tested this using a power fail test, which panics a machine running a file server getting requests from a client. Without this patch, on about every other test run, the server is missing many, many files that had been synced. With this patch, on > 6 runs, I see zero files being lost. Google-Bug-Id: 4179519 Signed-off-by: Curt Wohlgemuth <curtw@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-04-10 22:05:31 -04:00
J. Bruce Fields	23fcf2ec93	nfsd4: fix oops on lock failure Lock stateid's can have access_bmap 0 if they were only partially initialized (due to a failed lock request); handle that case in free_generic_stateid. ------------[ cut here ]------------ kernel BUG at fs/nfsd/nfs4state.c:380! invalid opcode: 0000 [#1] SMP last sysfs file: /sys/kernel/mm/ksm/run Modules linked in: nfs fscache md4 nls_utf8 cifs ip6table_filter ip6_tables ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat bridge stp llc nfsd lockd nfs_acl auth_rpcgss sunrpc ipv6 ppdev parport_pc parport pcnet32 mii pcspkr microcode i2c_piix4 BusLogic floppy [last unloaded: mperf] Pid: 1468, comm: nfsd Not tainted 2.6.38+ #120 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform EIP: 0060:[<e24f180d>] EFLAGS: 00010297 CPU: 0 EIP is at nfs4_access_to_omode+0x1c/0x29 [nfsd] EAX: ffffffff EBX: dd758120 ECX: 00000000 EDX: 00000004 ESI: dd758120 EDI: ddfe657c EBP: dd54dde0 ESP: dd54dde0 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 Process nfsd (pid: 1468, ti=dd54c000 task=ddc92580 task.ti=dd54c000) Stack: dd54ddf0 e24f19ca 00000000 ddfe6560 dd54de08 e24f1a5d dd758130 deee3a20 ddfe6560 31270000 dd54df1c e24f52fd 0000000f dd758090 e2505dd0 0be304cf dbb51d68 0000000e ddfe657c ddcd8020 dd758130 dd758128 dd7580d8 dd54de68 Call Trace: [<e24f19ca>] free_generic_stateid+0x1c/0x3e [nfsd] [<e24f1a5d>] release_lockowner+0x71/0x8a [nfsd] [<e24f52fd>] nfsd4_lock+0x617/0x66c [nfsd] [<e24e57b6>] ? nfsd_setuser+0x199/0x1bb [nfsd] [<e24e056c>] ? nfsd_setuser_and_check_port+0x65/0x81 [nfsd] [<c07a0052>] ? _cond_resched+0x8/0x1c [<c04ca61f>] ? slab_pre_alloc_hook.clone.33+0x23/0x27 [<c04cac01>] ? kmem_cache_alloc+0x1a/0xd2 [<c04835a0>] ? __call_rcu+0xd7/0xdd [<e24e0dfb>] ? fh_verify+0x401/0x452 [nfsd] [<e24f0b61>] ? nfsd4_encode_operation+0x52/0x117 [nfsd] [<e24ea0d7>] ? nfsd4_putfh+0x33/0x3b [nfsd] [<e24f4ce6>] ? nfsd4_delegreturn+0xd4/0xd4 [nfsd] [<e24ea2c9>] nfsd4_proc_compound+0x1ea/0x33e [nfsd] [<e24de6ee>] nfsd_dispatch+0xd1/0x1a5 [nfsd] [<e1d6e1c7>] svc_process_common+0x282/0x46f [sunrpc] [<e1d6e578>] svc_process+0xdc/0xfa [sunrpc] [<e24de0fa>] nfsd+0xd6/0x115 [nfsd] [<e24de024>] ? nfsd_shutdown+0x24/0x24 [nfsd] [<c0454322>] kthread+0x62/0x67 [<c04542c0>] ? kthread_worker_fn+0x114/0x114 [<c07a6ebe>] kernel_thread_helper+0x6/0x10 Code: eb 05 b8 00 00 27 4f 8d 65 f4 5b 5e 5f 5d c3 83 e0 03 55 83 f8 02 89 e5 74 17 83 f8 03 74 05 48 75 09 eb 09 b8 02 00 00 00 eb 0b <0f> 0b 31 c0 eb 05 b8 01 00 00 00 5d c3 55 89 e5 57 56 89 d6 8d EIP: [<e24f180d>] nfs4_access_to_omode+0x1c/0x29 [nfsd] SS:ESP 0068:dd54dde0 ---[ end trace 2b0bf6c6557cb284 ]--- The trace route is: -> nfsd4_lock() -> if (lock->lk_is_new) { -> alloc_init_lock_stateid() 3739: stp->st_access_bmap = 0; ->if (status && lock->lk_is_new && lock_sop) -> release_lockowner() -> free_generic_stateid() -> nfs4_access_bmap_to_omode() -> nfs4_access_to_omode() 380: BUG(); ***** This problem was introduced by `0997b17360`. Reported-by: Mi Jinlong <mijinlong@cn.fujitsu.com> Tested-by: Mi Jinlong <mijinlong@cn.fujitsu.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-04-10 12:21:27 -04:00
Linus Torvalds	94c8a984ae	Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6 * 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: NFS: Change initial mount authflavor only when server returns NFS4ERR_WRONGSEC NFS: Fix a signed vs. unsigned secinfo bug Revert "net/sunrpc: Use static const char arrays"	2011-04-08 11:47:35 -07:00
Josef Bacik	93a54bc4c2	Btrfs: check for duplicate iov_base's when doing dio reads Apparently it is ok to submit a read to an IDE device with the same target page for different offsets. This is what Windows does under qemu. The problem is under DIO we expect them to be different buffers for checksumming reasons, and so this sort of thing will result in checksum errors, when in reality the file is fine. So when reading, check to make sure that all iov bases are different, and if they aren't fall back to buffered mode, since that will work out right. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-04-08 13:00:43 -04:00
Josef Bacik	16d299ac74	Btrfs: reuse the extent_map we found when calling btrfs_get_extent In btrfs_get_block_direct we call btrfs_get_extent to lookup the extent for the range that we are looking for. If we don't find an extent, btrfs_get_extent will insert a extent_map for that area and mark it as a hole. So it does the job of allocating a new extent map and inserting it into the io tree. But if we're creating a new extent we free it up and redo all of that work. So instead pass the em to btrfs_new_extent_direct(), and if it will work just allocate the disk space and set it up properly and bypass the freeing/allocating of a new extent map and the expensive operation of inserting the thing into the io_tree. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-04-08 13:00:41 -04:00
Josef Bacik	1ae3993825	Btrfs: do not use async submit for small DIO io's When looking at our DIO performance Chris said that for small IO's doing the async submit stuff tends to be more overhead than it's worth. With this on top of my other fixes I get about a 17-20% speedup doing a sequential dd with 4k IO's. Basically if we don't have to split the bio for the map length it's small enough to be directly submitted, otherwise go back to the async submit. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-04-08 13:00:39 -04:00
Josef Bacik	02f57c7aed	Btrfs: don't split dio bios if we don't have to We have been unconditionally allocating a new bio and re-adding all pages from our original bio to the new bio. This is needed if our original bio is larger than our stripe size, but if it is smaller than the stripe size then there is no need to do this. So check the map length and if we are under that then go ahead and submit the original bio. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-04-08 13:00:37 -04:00
Josef Bacik	1ef30be142	Btrfs: do not call btrfs_update_inode in endio if nothing changed In the DIO code we often don't update the i_disk_size because the i_size isn't updated until after the DIO is completed, so basically we are allocating a path, doing a search, and updating the inode item for no reason since nothing changed. btrfs_ordered_update_i_size will return 1 if it didn't update i_disk_size, so only run btrfs_update_inode if btrfs_ordered_update_i_size returns 0. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-04-08 13:00:36 -04:00
Josef Bacik	12ddb96cb6	Btrfs: map the inode item when doing fill_inode_item Instead of calling kmap_atomic for every thing we set in the inode item, map the entire inode item at the start and unmap it at the end. This makes a sequential dd of 400mb O_DIRECT something like 1% faster. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-04-08 13:00:34 -04:00
Josef Bacik	06d5a5899d	Btrfs: only retry transaction reservation once I saw a lockup where we kept getting into this start transaction->commit transaction loop because of enospce. The fact is if we fail to make our reservation, we've tried _everything_ several times, so we only need to try and commit the transaction once, and if that doesn't work then we really are out of space and need to just exit. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-04-08 13:00:32 -04:00
Josef Bacik	be1a12a0df	Btrfs: deal with the case that we run out of space in the cache Currently we don't handle running out of space in the cache, so to fix this we keep track of how far in the cache we are. Then we only dirty the pages if we successfully modify all of them, otherwise if we have an error or run out of space we can just drop them and not worry about the vm writing them out. Thanks, Tested-by Johannes Hirte <johannes.hirte@fem.tu-ilmenau.de> Signed-off-by: Josef Bacik <josef@redhat.com>	2011-04-08 13:00:27 -04:00
Linus Torvalds	3d762ca1cd	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: quota: Don't write quota info in dquot_commit() ext3: Fix writepage credits computation for ordered mode	2011-04-08 07:35:17 -07:00
Christoph Hellwig	a1b7ea5d58	xfs: use proper interfaces for on-stack plugging Add proper blk_start_plug/blk_finish_plug pairs for the two places where we issue buffer I/O, and remove the blk_flush_plug in xfs_buf_lock and xfs_buf_iowait, given that context switches already flush the per-process plugging lists. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-04-08 08:09:28 -05:00
Christoph Hellwig	957935dcd8	xfs: fix xfs_debug warnings For a CONFIG_XFS_DEBUG=n build gcc complains about statements with no effect in xfs_debug: fs/xfs/quota/xfs_qm_syscalls.c: In function 'xfs_qm_scall_trunc_qfiles': fs/xfs/quota/xfs_qm_syscalls.c:291:3: warning: statement with no effect The reason for that is that the various new xfs message functions have a return value which is never used, and in case of the non-debug build xfs_debug the macro evaluates to a plain 0 which produces the above warnings. This can be fixed by turning xfs_debug into an inline function instead of a macro, but in addition to that I've also changed all the message helpers to return void as we never use their return values. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-04-08 08:09:24 -05:00
Christoph Hellwig	ecb697c16c	xfs: fix variable set but not used warnings GCC 4.6 now warnings about variables set but not used. Fix the trivially fixable warnings of this sort. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-04-08 08:09:12 -05:00
Dave Chinner	da8a1a4a4d	xfs: convert log tail checking to a warning On the Power platform, the log tail debug checks fire excessively causing the system to panic early in testing. The debug checks are known to be racy, though on x86_64 there is no evidence that they trigger at all. We want to keep the checks active on debug systems to alert us to problems with log space accounting, but we need to reduce the impact of a racy check on testing on the Power platform. As a result, convert the ASSERT conditions to warnings, and allow them to fire only once per filesystem mount. This will prevent false positives from interfering with testing, whilst still providing us with the indication that they may be a problem with log space accounting should that occur. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>	2011-04-08 12:45:07 +10:00
Dave Chinner	be65b18a10	xfs: catch bad block numbers freeing extents. A fuzzed filesystem crashed a kernel when freeing an extent with a block number beyond the end of the filesystem. Convert all the debug asserts in xfs_free_extent() to active checks so that we catch bad extents and return that the filesytsem is corrupted rather than crashing. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>	2011-04-08 12:45:07 +10:00
Dave Chinner	fd074841cf	xfs: push the AIL from memory reclaim and periodic sync When we are short on memory, we want to expedite the cleaning of dirty objects. Hence when we run short on memory, we need to kick the AIL flushing into action to clean as many dirty objects as quickly as possible. To implement this, sample the lsn of the log item at the head of the AIL and use that as the push target for the AIL flush. Further, we keep items in the AIL that are dirty that are not tracked any other way, so we can get objects sitting in the AIL that don't get written back until the AIL is pushed. Hence to get the filesystem to the idle state, we might need to push the AIL to flush out any remaining dirty objects sitting in the AIL. This requires the same push mechanism as the reclaim push. This patch also renames xfs_trans_ail_tail() to xfs_ail_min_lsn() to match the new xfs_ail_max_lsn() function introduced in this patch. Similarly for xfs_trans_ail_push -> xfs_ail_push. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com>	2011-04-08 12:45:07 +10:00
Dave Chinner	cd4a3c503c	xfs: clean up code layout in xfs_trans_ail.c This patch rearranges the location of functions in xfs_trans_ail.c to remove the need for forward declarations of those functions in preparation for adding new functions without the need for forward declarations. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com>	2011-04-08 12:45:07 +10:00
Dave Chinner	0bf6a5bd4b	xfs: convert the xfsaild threads to a workqueue Similar to the xfssyncd, the per-filesystem xfsaild threads can be converted to a global workqueue and run periodically by delayed works. This makes sense for the AIL pushing because it uses variable timeouts depending on the work that needs to be done. By removing the xfsaild, we simplify the AIL pushing code and remove the need to spread the code to implement the threading and pushing across multiple files. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>	2011-04-08 12:45:07 +10:00
Dave Chinner	a7b339f1b8	xfs: introduce background inode reclaim work Background inode reclaim needs to run more frequently that the XFS syncd work is run as 30s is too long between optimal reclaim runs. Add a new periodic work item to the xfs syncd workqueue to run a fast, non-blocking inode reclaim scan. Background inode reclaim is kicked by the act of marking inodes for reclaim. When an AG is first marked as having reclaimable inodes, the background reclaim work is kicked. It will continue to run periodically untill it detects that there are no more reclaimable inodes. It will be kicked again when the first inode is queued for reclaim. To ensure shrinker based inode reclaim throttles to the inode cleaning and reclaim rate but still reclaim inodes efficiently, make it kick the background inode reclaim so that when we are low on memory we are trying to reclaim inodes as efficiently as possible. This kick shoul d not be necessary, but it will protect against failures to kick the background reclaim when inodes are first dirtied. To provide the rate throttling, make the shrinker pass do synchronous inode reclaim so that it blocks on inodes under IO. This means that the shrinker will reclaim inodes rather than just skipping over them, but it does not adversely affect the rate of reclaim because most dirty inodes are already under IO due to the background reclaim work the shrinker kicked. These two modifications solve one of the two OOM killer invocations Chris Mason reported recently when running a stress testing script. The particular workload trigger for the OOM killer invocation is where there are more threads than CPUs all unlinking files in an extremely memory constrained environment. Unlike other solutions, this one does not have a performance impact on performance when memory is not constrained or the number of concurrent threads operating is <= to the number of CPUs. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>	2011-04-08 12:45:07 +10:00
Dave Chinner	89e4cb550a	xfs: convert ENOSPC inode flushing to use new syncd workqueue On of the problems with the current inode flush at ENOSPC is that we queue a flush per ENOSPC event, regardless of how many are already queued. Thi can result in hundreds of queued flushes, most of which simply burn CPU scanned and do no real work. This simply slows down allocation at ENOSPC. We really only need one active flush at a time, and we can easily implement that via the new xfs_syncd_wq. All we need to do is queue a flush if one is not already active, then block waiting for the currently active flush to complete. The result is that we only ever have a single ENOSPC inode flush active at a time and this greatly reduces the overhead of ENOSPC processing. On my 2p test machine, this results in tests exercising ENOSPC conditions running significantly faster - 042 halves execution time, 083 drops from 60s to 5s, etc - while not introducing test regressions. This allows us to remove the old xfssyncd threads and infrastructure as they are no longer used. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>	2011-04-08 12:45:07 +10:00
Dave Chinner	c6d09b666d	xfs: introduce a xfssyncd workqueue All of the work xfssyncd does is background functionality. There is no need for a thread per filesystem to do this work - it can al be managed by a global workqueue now they manage concurrency effectively. Introduce a new gglobal xfssyncd workqueue, and convert the periodic work to use this new functionality. To do this, use a delayed work construct to schedule the next running of the periodic sync work for the filesystem. When the sync work is complete, queue a new delayed work for the next running of the sync work. For laptop mode, we wait on completion for the sync works, so ensure that the sync work queuing interface can flush and wait for work to complete to enable the work queue infrastructure to replace the current sequence number and wakeup that is used. Because the sync work does non-trivial amounts of work, mark the new work queue as CPU intensive. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>	2011-04-08 12:45:07 +10:00
Dave Chinner	e828776a8a	xfs: fix extent format buffer allocation size When formatting an inode item, we have to allocate a separate buffer to hold extents when there are delayed allocation extents on the inode and it is in extent format. The allocation size is derived from the in-core data fork representation, which accounts for delayed allocation extents, while the on-disk representation does not contain any delalloc extents. As a result of this mismatch, the allocated buffer can be far larger than needed to hold the real extent list which, due to the fact the inode is in extent format, is limited to the size of the literal area of the inode. However, we can have thousands of delalloc extents, resulting in an allocation size orders of magnitude larger than is needed to hold all the real extents. Fix this by limiting the size of the buffer being allocated to the size of the literal area of the inodes in the filesystem (i.e. the maximum size an inode fork can grow to). Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com>	2011-04-08 12:45:07 +10:00
Bryan Schumaker	37adb89fad	NFS: Change initial mount authflavor only when server returns NFS4ERR_WRONGSEC When attempting an initial mount, we should only attempt other authflavors if AUTH_UNIX receives a NFS4ERR_WRONGSEC error. This allows other errors to be passed back to userspace programs. Signed-off-by: Bryan Schumaker <bjschuma@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-04-07 13:19:40 -07:00
Linus Torvalds	ccfeef0ff7	Merge branch 'for-linus' of git://git.infradead.org/ubifs-2.6 * 'for-linus' of git://git.infradead.org/ubifs-2.6: UBI: do not select KALLSYMS_ALL UBI: do not compare array with NULL UBI: check if we are in RO mode in the erase routine UBIFS: fix debugging failure in dbg_check_space_info UBIFS: fix error path in dbg_debugfs_init_fs UBIFS: unify error path dbg_debugfs_init_fs UBIFS: do not select KALLSYMS_ALL UBIFS: fix assertion warnings UBIFS: fix oops on error path in read_pnode UBIFS: do not read flash unnecessarily	2011-04-07 11:31:03 -07:00
Linus Torvalds	42933bac11	Merge branch 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6 * 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6: Fix common misspellings	2011-04-07 11:14:49 -07:00
Bryan Schumaker	418875900e	NFS: Fix a signed vs. unsigned secinfo bug rpc_authflavor_t is cast from an unsigned int, but the initial code tried to use it as a signed int. I fix this by passing an rpc_authflavor_t pointer around, and returning signed integers from functions. Signed-off-by: Bryan Schumaker <bjschuma@netapp.com> Reported-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-04-06 13:25:04 -07:00
Tao Ma	0449641130	ext4: init timer earlier to avoid a kernel panic in __save_error_info During mount, when we fail to open journal inode or root inode, the __save_error_info will mod_timer. But actually s_err_report isn't initialized yet and the kernel oops. The detailed information can be found https://bugzilla.kernel.org/show_bug.cgi?id=32082. The best way is to check whether the timer s_err_report is initialized or not. But it seems that in include/linux/timer.h, we can't find a good function to check the status of this timer, so this patch just move the initializtion of s_err_report earlier so that we can avoid the kernel panic. The corresponding del_timer is also added in the error path. Reported-by: Sami Liedes <sliedes@cc.hut.fi> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-04-05 19:55:28 -04:00
Zhang Huan	6cba611e60	jbd2: fix potential memory leak on transaction commit There is potential memory leak of journal head in function jbd2_journal_commit_transaction. The problem is that JBD2 will not reclaim the journal head of commit record if error occurs or journal is abotred. I use the following script to reproduce this issue, on a RHEL6 system. I found it very easy to reproduce with async commit enabled. mount /dev/sdb /mnt -o journal_checksum,journal_async_commit touch /mnt/xxx echo offline > /sys/block/sdb/device/state sync umount /mnt rmmod ext4 rmmod jbd2 Removal of the jbd2 module will make slab complaining that "cache `jbd2_journal_head': can't free all objects". Signed-off-by: Zhang Huan <zhhuan@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-04-05 19:16:20 -04:00
Linus Torvalds	44148a667d	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-2.6-block * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-2.6-block: ide: always ensure that blk_delay_queue() is called if we have pending IO block: fix request sorting at unplug dm: improve block integrity support fs: export empty_aops ide: ide_requeue_and_plug() reinstate "always plug" behaviour blk-throttle: don't call xchg on bool ufs: remove unessecary blk_flush_plug block: make the flush insertion use the tail of the dispatch list block: get rid of elv_insert() interface block: dump request state on seeing a corrupted request completion	2011-04-05 15:29:01 -07:00
Eric Paris	d0de4dc584	inotify: fix double free/corruption of stuct user On an error path in inotify_init1 a normal user can trigger a double free of struct user. This is a regression introduced by `a2ae4cc9a1` ("inotify: stop kernel memory leak on file creation failure"). We fix this by making sure that if a group exists the user reference is dropped when the group is cleaned up. We should not explictly drop the reference on error and also drop the reference when the group is cleaned up. The new lifetime rules are that an inotify group lives from inotify_new_group to the last fsnotify_put_group. Since the struct user and inotify_devs are directly tied to this lifetime they are only changed/updated in those two locations. We get rid of all special casing of struct user or user->inotify_devs. Signed-off-by: Eric Paris <eparis@redhat.com> Cc: stable@kernel.org (2.6.37 and up) Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-04-05 15:27:14 -07:00
Jens Axboe	7dcda1c96d	fs: export empty_aops With the ->sync_page() hook gone, we have a few users that add their own static address_space_operations without any functions defined. fs/inode.c already has an empty_aops that it uses for init purposes. Lets export that and use it in the places where an otherwise empty aops was defined. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-04-05 23:51:48 +02:00
Christoph Hellwig	ee3dea3549	ufs: remove unessecary blk_flush_plug We already flush the per-process plugging list when context switching, so a blk_flush_plug call just before a yield() is not needed. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-04-05 23:51:37 +02:00
Linus Torvalds	884b8267d5	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: don't warn in btrfs_add_orphan Btrfs: fix free space cache when there are pinned extents and clusters V2 Btrfs: Fix uninitialized root flags for subvolumes btrfs: clear __GFP_FS flag in the space cache inode Btrfs: fix memory leak in start_transaction() Btrfs: fix memory leak in btrfs_ioctl_start_sync() Btrfs: fix subvol_sem leak in btrfs_rename() Btrfs: Fix oops for defrag with compression turned on Btrfs: fix /proc/mounts info. Btrfs: fix compiler warning in file.c	2011-04-05 12:29:25 -07:00
Artem Bityutskiy	7da6443aca	UBIFS: fix debugging failure in dbg_check_space_info This patch fixes a debugging failure with which looks like this: UBIFS error (pid 32313): dbg_check_space_info: free space changed from 6019344 to 6022654 The reason for this failure is described in the comment this patch adds to the code. But in short - 'c->freeable_cnt' may be different before and after re-mounting, and this is normal. So the debugging code should make sure that free space calculations do not depend on 'c->freeable_cnt'. A similar issue has been reported here: http://lists.infradead.org/pipermail/linux-mtd/2011-April/034647.html This patch should fix it. For the -stable guys: this patch is only relevant for kernels 2.6.30 onwards. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Cc: stable@kernel.org [2.6.30+]	2011-04-05 11:07:37 +03:00
Artem Bityutskiy	9516953511	UBIFS: fix error path in dbg_debugfs_init_fs The debug interface is substandard and on error returns either NULL or an error code packed in the pointer. So using "IS_ERR" for the pointers returned by debugfs function is incorrect. Instead, we should use IS_ERR_OR_NULL. This path is an improved vestion of the original patch from Phil Carmody. Reported-by: Phil Carmody <ext-phil.2.carmody@nokia.com> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Acked-by: Phil Carmody <ext-phil.2.carmody@nokia.com>	2011-04-05 10:46:01 +03:00
Artem Bityutskiy	cc6a86b950	UBIFS: unify error path dbg_debugfs_init_fs This is just a small clean-up patch which simlifies and unifies the error path in the dbg_debugfs_init_fs(). We have common error path for all failure cases in this function except of the very first case. And this patch makes the first failure case use the same error path as the other cases by using the 'fname' and 'dent' variables. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Acked-by: Phil Carmody <ext-phil.2.carmody@nokia.com>	2011-04-05 10:46:01 +03:00
Artem Bityutskiy	81354de3d8	UBIFS: do not select KALLSYMS_ALL All UBIFS needs is to make sure we stacktraces when UBIFS debugging is enabled. It is enough to select KALLSYMS for this, KALLSYMS_ALL is not necessary. Moreover, Randy Dunlap reported that UBIFS causes the following Kconfig dependency warning: warning: (UBIFS_FS_DEBUG && LOCKDEP && LATENCYTOP) selects KALLSYMS_ALL which has unmet direct dependencies (DEBUG_KERNEL && KALLSYMS) The reason is that KALLSYMS_ALL requires DEBUG_KERNEL and KALLSYMS, so ideally, to select KALLSYMS_ALL we'd need to select DEBUG_KERNEL and KALLSYMS first. This seems to be too much to select. The easiest way to go is to forget about KALLSYMS_ALL and just select KALLSYMS when UBIFS debugging is enabled - that should be enough for stackdumps. Reported-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Acked-by: Randy Dunlap <randy.dunlap@oracle.com>	2011-04-05 10:45:45 +03:00
Artem Bityutskiy	c88ac00c5a	UBIFS: fix assertion warnings This patch fixes UBIFS assertion warnings like: UBIFS assert failed in ubifs_leb_unmap at 135 (pid 29365) Pid: 29365, comm: integck Tainted: G I 2.6.37-ubi-2.6+ #34 Call Trace: [<ffffffffa047c663>] ubifs_lpt_init+0x95e/0x9ee [ubifs] [<ffffffffa04623a7>] ubifs_remount_fs+0x2c7/0x762 [ubifs] [<ffffffff810f066e>] do_remount_sb+0xb6/0x101 [<ffffffff81106ff4>] ? do_mount+0x191/0x78e [<ffffffff811070bb>] do_mount+0x258/0x78e [<ffffffff810da1e8>] ? alloc_pages_current+0xa2/0xc5 [<ffffffff81107674>] sys_mount+0x83/0xbd [<ffffffff81009a12>] system_call_fastpath+0x16/0x1b They happen when we re-mount from R/O mode to R/W mode. While re-mounting, we write to the media, but we still have the c->ro_mount flag set. The fix is very simple - just clear the flag before starting re-mounting R/W. These warnings are caused by the following commit: `2ef13294d2` For -stable guys: this bug was introduced in 2.6.38, this is materieal for 2.6.38-stable. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Cc: stable@kernel.org [2.6.38]	2011-04-05 10:45:09 +03:00
Artem Bityutskiy	54acbaaa52	UBIFS: fix oops on error path in read_pnode Thanks to coverity which spotted that UBIFS will oops if 'kmalloc()' in 'read_pnode()' fails and we dereference a NULL 'pnode' pointer when we 'goto out'. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Cc: stable@kernel.org	2011-04-05 10:40:31 +03:00
Artem Bityutskiy	8b229c7676	UBIFS: do not read flash unnecessarily This fix makes the 'dbg_check_old_index()' function return immediately if debugging is disabled, instead of executing incorrect 'goto out' which causes UBIFS to: 1. Allocate memory 2. Read the flash On every commit. OK, we do not commit that often, but it is still silly to do unneeded I/O anyway. Credits to coverity for spotting this silly issue. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Cc: stable@kernel.org	2011-04-05 10:39:40 +03:00
Josef Bacik	c9ddec74aa	Btrfs: don't warn in btrfs_add_orphan When I moved the orphan adding to btrfs_truncate I missed the fact that during orphan cleanup we just add the orphan items to the orphan list without going through btrfs_orphan_add, which results in lots of warnings on mount if you have any orphan items that need to be truncated. Just remove this warning since it's ok, this will allow all of the normal space accounting take place. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-04-05 01:20:24 -04:00
Josef Bacik	43be21462d	Btrfs: fix free space cache when there are pinned extents and clusters V2 I noticed a huge problem with the free space cache that was presenting as an early ENOSPC. Turns out when writing the free space cache out I forgot to take into account pinned extents and more importantly clusters. This would result in us leaking free space everytime we unmounted the filesystem and remounted it. I fix this by making sure to check and see if the current block group has a cluster and writing out any entries that are in the cluster to the cache, as well as writing any pinned extents we currently have to the cache since those will be available for us to use the next time the fs mounts. This patch also adds a check to the end of load_free_space_cache to make sure we got the right amount of free space cache, and if not make sure to clear the cache and re-cache the old fashioned way. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-04-05 01:20:24 -04:00
Li Zefan	08fe4db170	Btrfs: Fix uninitialized root flags for subvolumes root_item->flags and root_item->byte_limit are not initialized when a subvolume is created. This bug is not revealed until we added readonly snapshot support - now you mount a btrfs filesystem and you may find the subvolumes in it are readonly. To work around this problem, we steal a bit from root_item->inode_item->flags, and use it to indicate if those fields have been properly initialized. When we read a tree root from disk, we check if the bit is set, and if not we'll set the flag and initialize the two fields of the root item. Reported-by: Andreas Philipp <philipp.andreas@gmail.com> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Tested-by: Andreas Philipp <philipp.andreas@gmail.com> cc: stable@kernel.org Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-04-05 01:20:24 -04:00
Miao Xie	adae52b94e	btrfs: clear __GFP_FS flag in the space cache inode the object id of the space cache inode's key is allocated from the relative root, just like the regular file. So we can't identify space cache inode by checking the object id of the inode's key, and we have to clear __GFP_FS flag at the time we look up the space cache inode. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-04-05 01:19:43 -04:00
Yoshinori Sano	6e8df2ae89	Btrfs: fix memory leak in start_transaction() Free btrfs_trans_handle when join_transaction() fails in start_transaction() Signed-off-by: Yoshinori Sano <yoshinori.sano@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-04-05 01:19:43 -04:00
Tsutomu Itoh	8b2b2d3cbe	Btrfs: fix memory leak in btrfs_ioctl_start_sync() Call btrfs_end_transaction() if btrfs_commit_transaction_async() fails. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-04-05 01:19:42 -04:00
Johann Lombardi	b44c59a80d	Btrfs: fix subvol_sem leak in btrfs_rename() btrfs_rename() does not release the subvol_sem if the transaction failed to start. Signed-off-by: Johann Lombardi <johann@whamcloud.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-04-05 01:19:42 -04:00
Li Zefan	fe3f566cd1	Btrfs: Fix oops for defrag with compression turned on When we defrag a file, whose size can be fit into an inline extent, with compression enabled, the compress type is set to be fs_info->compress_type, which is 0 if the btrfs filesystem is mounted without compress option. This leads to oops. Reported-by: Daniel Blueman <daniel.blueman@gmail.com> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-04-05 01:19:42 -04:00
Tsutomu Itoh	200da64e0b	Btrfs: fix /proc/mounts info. Some mount options are not displayed by /proc/mounts. This patch displays the option such as compress_type by /proc/mounts. Ex. [before] $ mount \| grep sdc2 /dev/sdc2 on /test12 type btrfs (rw,space_cache,compress=lzo) $ cat /proc/mounts \| grep sdc2 /dev/sdc2 /test12 btrfs rw,relatime,compress 0 0 [after] $ mount \| grep sdc2 /dev/sdc2 on /test12 type btrfs (rw,space_cache,compress=lzo) $ cat /proc/mounts \| grep sdc2 /dev/sdc2 /test12 btrfs rw,relatime,compress=lzo,space_cache 0 0 Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-04-05 01:19:41 -04:00
Tsutomu Itoh	c9149235a4	Btrfs: fix compiler warning in file.c While compiling Btrfs, I got following messages: CC [M] fs/btrfs/file.o fs/btrfs/file.c: In function '__btrfs_buffered_write': fs/btrfs/file.c:909: warning: 'ret' may be used uninitialized in this function CC [M] fs/btrfs/tree-defrag.o This patch fixes compiler warning. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-04-05 01:19:41 -04:00
Tao Ma	46e4690bbd	ext4: fix a double free in ext4_register_li_request In ext4_register_li_request, we malloc a ext4_li_request and inserts it into ext4_li_info->li_request_list. In case of any error later, we free it in the end. But if we have some error in ext4_run_lazyinit_thread, the whole li_request_list will be dropped and freed in it. So we will double free this ext4_li_request. This patch just sets elr to NULL after it is inserted to the list so that the latter kfree won't double free it. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2011-04-04 16:00:49 -04:00
Yongqiang Yang	5b41395fcc	ext4: fix credits computing for indirect mapped files When writing a contiguous set of blocks, two indirect blocks could be needed depending on how the blocks are aligned, so we need to increase the number of credits needed by one. [ Also fixed a another bug which could further underestimate the number of journal credits needed by 1; the code was using integer division instead of DIV_ROUND_UP() -- tytso] Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2011-04-04 15:40:24 -04:00
Jan Kara	21f976975c	ext4: remove unnecessary [cm]time update of quota file It is not necessary to update [cm]time of quota file on each quota file write and it wastes journal space and IO throughput with inode writes. So just remove the updating from ext4_quota_write() and only update times when quotas are being turned off. Userspace cannot get anything reliable from quota files while they are used by the kernel anyway. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-04-04 15:33:39 -04:00
Zhu Yanhai	50f689af01	jbd2: move bdget out of critical section bdget() should not be called when we hold spinlocks since it might sleep. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Zhu Yanhai <gaoyang.zyh@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-04-04 12:58:12 -04:00
Jan Kara	b03f24567c	quota: Don't write quota info in dquot_commit() There's no reason to write quota info in dquot_commit(). The writing is a relict from the old days when we didn't have dquot_acquire() and dquot_release() and thus dquot_commit() could have created / removed quota structures from the file. These days dquot_commit() only updates usage counters / limits in quota structure and thus there's no need to write quota info. This also fixes an issue with journaling filesystem which didn't reserve enough space in the transaction for write of quota info (it could have been dirty at the time of dquot_commit() because of a race with other operation changing it). CC: stable@kernel.org Reported-and-tested-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz>	2011-04-01 00:23:46 +02:00
Lucas De Marchi	25985edced	Fix common misspellings Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>	2011-03-31 11:26:23 -03:00
Dave Chinner	89b3600ccf	xfs: fix unreferenced var error in xfs_buf.c Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2011-03-30 23:34:20 -05:00
Linus Torvalds	85cf0ac38c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2: nilfs2: fix whitespace coding style issues nilfs2: fix oops due to a bad aops initialization nilfs2: fix data loss in mmap page write for hole blocks	2011-03-30 09:46:36 -07:00
Linus Torvalds	50f3515828	Merge git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: libceph: Create a new key type "ceph". libceph: Get secret from the kernel keys api when mounting with key=NAME. ceph: Move secret key parsing earlier. libceph: fix null dereference when unregistering linger requests ceph: unlock on error in ceph_osdc_start_request() ceph: fix possible NULL pointer dereference ceph: flush msgr_wq during mds_client shutdown	2011-03-30 09:46:09 -07:00
Nicolas Kaiser	af71dda0b8	nilfs2: fix whitespace coding style issues Fixes whitespace coding style issues. Signed-off-by: Nicolas Kaiser <nikai@nikai.net> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2011-03-30 17:39:54 +09:00
Ryusuke Konishi	d611b22f1a	nilfs2: fix oops due to a bad aops initialization Nilfs in 2.6.39-rc1 hit the following oops: BUG: unable to handle kernel NULL pointer dereference at 0000000000000048 IP: [<ffffffff810ac235>] try_to_release_page+0x2a/0x3d PGD 234cb6067 PUD 234c72067 PMD 0 Oops: 0000 [#1] SMP <snip> Process truncate (pid: 10995, threadinfo ffff8802353c2000, task ffff880234cfa000) Stack: ffff8802333c77b8 ffffffff810b64b0 0000000000003802 ffffffffa0052cca 0000000000000000 ffff8802353c3b58 0000000000000000 ffff8802353c3b58 0000000000000001 0000000000000000 ffffea0007b92308 ffffea0007b92308 Call Trace: [<ffffffff810b64b0>] ? invalidate_inode_pages2_range+0x15f/0x273 [<ffffffffa0052cca>] ? nilfs_palloc_get_block+0x2d/0xaf [nilfs2] [<ffffffff810589e7>] ? bit_waitqueue+0x14/0xa1 [<ffffffff81058ab1>] ? wake_up_bit+0x10/0x20 [<ffffffffa00433fd>] ? nilfs_forget_buffer+0x66/0x7a [nilfs2] [<ffffffffa00467b8>] ? nilfs_btree_concat_left+0x5c/0x77 [nilfs2] [<ffffffffa00471fc>] ? nilfs_btree_delete+0x395/0x3cf [nilfs2] [<ffffffffa00449a3>] ? nilfs_bmap_do_delete+0x6e/0x79 [nilfs2] [<ffffffffa0045845>] ? nilfs_btree_last_key+0x14b/0x15e [nilfs2] [<ffffffffa00449dd>] ? nilfs_bmap_truncate+0x2f/0x83 [nilfs2] [<ffffffffa0044ab2>] ? nilfs_bmap_last_key+0x35/0x62 [nilfs2] [<ffffffffa003e99b>] ? nilfs_truncate_bmap+0x6b/0xc7 [nilfs2] [<ffffffffa003ee4a>] ? nilfs_truncate+0x79/0xe4 [nilfs2] [<ffffffff810b6c00>] ? vmtruncate+0x33/0x3b [<ffffffffa003e8f1>] ? nilfs_setattr+0x4d/0x8c [nilfs2] [<ffffffff81026106>] ? do_page_fault+0x31b/0x356 [<ffffffff810f9d61>] ? notify_change+0x17d/0x262 [<ffffffff810e5046>] ? do_truncate+0x65/0x80 [<ffffffff810e52af>] ? sys_ftruncate+0xf1/0xf6 [<ffffffff8132c012>] ? system_call_fastpath+0x16/0x1b Code: c3 48 83 ec 08 48 8b 17 48 8b 47 18 80 e2 01 75 04 0f 0b eb fe 48 8b 17 80 e6 20 74 05 31 c0 41 59 c3 48 85 c0 74 11 48 8b 40 58 8b 40 48 48 85 c0 74 04 41 58 ff e0 59 e9 b1 b5 05 00 41 54 RIP [<ffffffff810ac235>] try_to_release_page+0x2a/0x3d RSP <ffff8802353c3b08> CR2: 0000000000000048 This oops was brought in by the change "block: remove per-queue plugging" (commit: `7eaceaccab`). It initializes mapping->a_ops with a NULL pointer for some pages in nilfs (e.g. btree node pages), but mm code doesn't NULL pointer checks against mapping->a_ops. (the check is done for each callback function) This corrects the aops initialization and fixes the oops. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Acked-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-30 17:39:25 +09:00
Ryusuke Konishi	3409453794	nilfs2: fix data loss in mmap page write for hole blocks From the result of a function test of mmap, mmap write to shared pages turned out to be broken for hole blocks. It doesn't write out filled blocks and the data will be lost after umount. This is due to a bug that the target file is not queued for log writer when filling hole blocks. Also, nilfs_page_mkwrite function exits normal code path even after successfully filled hole blocks due to a change of block_page_mkwrite function; just after nilfs was merged into the mainline, block_page_mkwrite() started to return VM_FAULT_LOCKED instead of zero by the patch "mm: close page_mkwrite races" (commit: `b827e496c8`). The current nilfs_page_mkwrite() is not handling this value properly. This corrects nilfs_page_mkwrite() and will resolve the data loss problem in mmap write. [This should be applied to every kernel since 2.6.30 but a fix is needed for 2.6.37 and prior kernels] Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Tested-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Cc: stable <stable@kernel.org> [2.6.38]	2011-03-30 10:45:31 +09:00
Tommi Virtanen	8323c3aa74	ceph: Move secret key parsing earlier. This makes the base64 logic be contained in mount option parsing, and prepares us for replacing the homebew key management with the kernel key retention service. Signed-off-by: Tommi Virtanen <tommi.virtanen@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>	2011-03-29 12:11:16 -07:00
Dave Chinner	0444d76ae6	fs: don't use igrab() while holding i_lock Fix the incorrect use of igrab() inside the i_lock in NFS and Ceph‥ If we are already holding the i_lock, we have a reference to the inode so we can safely use ihold() to gain an extra reference. This avoids hangs due to lock recursion on the i_lock now that the inode_lock is gone and igrab() uses the i_lock itself. Signed-off-by: Dave Chinner <dchinner@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Cc: Ryan Mallon <ryan@bluewatersys.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-29 07:50:34 -07:00
Linus Torvalds	c5850150d0	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: stop using the page cache to back the buffer cache xfs: register the inode cache shrinker before quotachecks xfs: xfs_trans_read_buf() should return an error on failure xfs: introduce inode cluster buffer trylocks for xfs_iflush vmap: flush vmap aliases when mapping fails xfs: preallocation transactions do not need to be synchronous Fix up trivial conflicts in fs/xfs/linux-2.6/xfs_buf.c due to plug removal.	2011-03-28 15:51:02 -07:00
Linus Torvalds	7f5fe3ec8e	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6: eCryptfs: write lock requested keys eCryptfs: move ecryptfs_find_auth_tok_for_sig() call before mutex_lock eCryptfs: verify authentication tokens before their use eCryptfs: modified size of keysig in the ecryptfs_key_sig structure eCryptfs: removed num_global_auth_toks from ecryptfs_mount_crypt_stat eCryptfs: ecryptfs_keyring_auth_tok_for_sig() bug fix eCryptfs: Unlock page in write_begin error path ecryptfs: modify write path to encrypt page in writepage eCryptfs: Remove ECRYPTFS_NEW_FILE crypt stat flag eCryptfs: Remove unnecessary grow_file() function	2011-03-28 15:43:25 -07:00
Linus Torvalds	212a17ab87	Merge branch 'for-linus-unmerged' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * 'for-linus-unmerged' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (45 commits) Btrfs: fix __btrfs_map_block on 32 bit machines btrfs: fix possible deadlock by clearing __GFP_FS flag btrfs: check link counter overflow in link(2) btrfs: don't mess with i_nlink of unlocked inode in rename() Btrfs: check return value of btrfs_alloc_path() Btrfs: fix OOPS of empty filesystem after balance Btrfs: fix memory leak of empty filesystem after balance Btrfs: fix return value of setflags ioctl Btrfs: fix uncheck memory allocations btrfs: make inode ref log recovery faster Btrfs: add btrfs_trim_fs() to handle FITRIM Btrfs: adjust btrfs_discard_extent() return errors and trimmed bytes Btrfs: make btrfs_map_block() return entire free extent for each device of RAID0/1/10/DUP Btrfs: make update_reserved_bytes() public btrfs: return EXDEV when linking from different subvolumes Btrfs: Per file/directory controls for COW and compression Btrfs: add datacow flag in inode flag btrfs: use GFP_NOFS instead of GFP_KERNEL Btrfs: check return value of read_tree_block() btrfs: properly access unaligned checksum buffer ... Fix up trivial conflicts in fs/btrfs/volumes.c due to plug removal in the block layer.	2011-03-28 15:31:05 -07:00
Linus Torvalds	03e4970c10	Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2 * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: (39 commits) Treat writes as new when holes span across page boundaries fs,ocfs2: Move o2net_get_func_run_time under CONFIG_OCFS2_FS_STATS. ocfs2/dlm: Move kmalloc() outside the spinlock ocfs2: Make the left masklogs compat. ocfs2: Remove masklog ML_AIO. ocfs2: Remove masklog ML_UPTODATE. ocfs2: Remove masklog ML_BH_IO. ocfs2: Remove masklog ML_JOURNAL. ocfs2: Remove masklog ML_EXPORT. ocfs2: Remove masklog ML_DCACHE. ocfs2: Remove masklog ML_NAMEI. ocfs2: Remove mlog(0) from fs/ocfs2/dir.c ocfs2: remove NAMEI from symlink.c ocfs2: Remove masklog ML_QUOTA. ocfs2: Remove mlog(0) from quota_local.c. ocfs2: Remove masklog ML_RESERVATIONS. ocfs2: Remove masklog ML_XATTR. ocfs2: Remove masklog ML_SUPER. ocfs2: Remove mlog(0) from fs/ocfs2/heartbeat.c ocfs2: Remove mlog(0) from fs/ocfs2/slot_map.c ... Fix up trivial conflict in fs/ocfs2/super.c	2011-03-28 13:03:31 -07:00
Goldwyn Rodrigues	272b62c1f0	Treat writes as new when holes span across page boundaries When a hole spans across page boundaries, the next write forces a read of the block. This could end up reading existing garbage data from the disk in ocfs2_map_page_blocks. This leads to non-zero holes. In order to avoid this, mark the writes as new when the holes span across page boundaries. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.de> Signed-off-by: jlbec <jlbec@evilplan.org>	2011-03-28 09:44:58 -07:00
Joel Becker	99bdc3880c	Merge branch 'mlog_replace_for_39' of git://repo.or.cz/taoma-kernel into ocfs2-merge-window-fix	2011-03-28 09:44:26 -07:00
Rakib Mullick	ed59992e8d	fs,ocfs2: Move o2net_get_func_run_time under CONFIG_OCFS2_FS_STATS. When CONFIG_DEBUG_FS=y and CONFIG_OCFS2_FS_STATS=n, we get the following warning: fs/ocfs2/cluster/tcp.c:213:16: warning: ‘o2net_get_func_run_time’ defined but not used Since o2net_get_func_run_time is only called from o2net_update_recv_stats, so move it under CONFIG_OCFS2_FS_STATS. Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com> Signed-off-by: jlbec <jlbec@evilplan.org>	2011-03-28 09:43:28 -07:00
Linus Torvalds	1788c208aa	Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6 * 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: NFS: Ensure that rpc_release_resources_task() can be called twice. NFS: Don't leak RPC clients in NFSv4 secinfo negotiation NFS: Fix a hang in the writeback path	2011-03-28 07:52:58 -07:00
Chris Mason	d9d0487932	Btrfs: fix __btrfs_map_block on 32 bit machines Recent changes for discard support didn't compile, this fixes them not to try and % 64 bit numbers. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-03-28 05:37:59 -04:00
Miao Xie	1561deda68	btrfs: fix possible deadlock by clearing __GFP_FS flag Using the GFP_HIGHUSER_MOVABLE flag to allocate the metadata's page may cause deadlock. Task1 open() ... btrfs_search_slot() ... btrfs_cow_block() ... alloc_page() wait for reclaiming shrink_slab() ... shrink_icache_memory() ... btrfs_evict_inode() ... btrfs_search_slot() If the path is locked by task1, the deadlock happens. So the btree's page cache is different with the file's page cache, it can not allocate pages by GFP_HIGHUSER_MOVABLE flag, we must clear __GFP_FS flag in GFP_HIGHUSER_MOVABLE flag. Reported-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-03-28 05:37:58 -04:00
Al Viro	c055e99eea	btrfs: check link counter overflow in link(2) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-03-28 05:37:56 -04:00
Al Viro	92986796d8	btrfs: don't mess with i_nlink of unlocked inode in rename() old_inode is not locked; it's not safe to play with its link count. Instead of bumping it and calling btrfs_unlink_inode(), add a variant of the latter that does not do btrfs_drop_nlink()/ btrfs_update_inode(), call it instead of btrfs_inc_nlink()/ btrfs_unlink_inode() and do btrfs_update_inode() ourselves. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-03-28 05:37:55 -04:00
Tsutomu Itoh	c2db1073fd	Btrfs: check return value of btrfs_alloc_path() Adding the check on the return value of btrfs_alloc_path() to several places. And, some of callers are modified by this change. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-03-28 05:37:54 -04:00
liubo	c59021f846	Btrfs: fix OOPS of empty filesystem after balance btrfs will remove unused block groups after balance. When a empty filesystem is balanced, the block group with tag "DATA" may be dropped, and after umount and mount again, it will not find "DATA" space_info and lead to OOPS. So we initial the necessary space_infos(DATA, SYSTEM, METADATA) to avoid OOPS. Reported-by: Daniel J Blueman <daniel.blueman@gmail.com> Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-03-28 05:37:53 -04:00
liubo	9f7c43c967	Btrfs: fix memory leak of empty filesystem after balance After Josef's patch(commit `3c14874acc`), btrfs will exclude super bytes when reading block groups(by marking a extent state UPTODATE). However, these bytes do not get freed while balance remove unused block groups, and we won't process those removed ones any more, when we do umount and unload the btrfs module, btrfs hits a memory leak. This patch add the missing free operation. Reproduce steps: $ mkfs.btrfs disk $ mount disk /mnt/btrfs -o loop $ btrfs filesystem balance /mnt/btrfs $ umount /mnt/btrfs $ rmmod btrfs Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-03-28 05:37:52 -04:00
liubo	2d4e6f6ad2	Btrfs: fix return value of setflags ioctl setflags ioctl should return error when any checks fail. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-03-28 05:37:51 -04:00
Yoshinori Sano	dac97e516c	Btrfs: fix uncheck memory allocations To make Btrfs code more robust, several return value checks where memory allocation can fail are introduced. I use BUG_ON where I don't know how to handle the error properly, which increases the number of using the notorious BUG_ON, though. Signed-off-by: Yoshinori Sano <yoshinori.sano@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-03-28 05:37:49 -04:00
liubo	c622ae6085	btrfs: make inode ref log recovery faster When we recover from crash via write-ahead log tree and process the inode refs, for each btrfs_inode_ref item, we will 1) check if we already have a perfect match in fs/file tree, if we have, then we're done. 2) search the corresponding back reference in fs/file tree, and check all the names in this back reference to see if they are also in the log to avoid conflict corners. 3) recover the logged inode refs to fs/file tree. In current btrfs, however, - for 2)'s check, once is enough, since the checked back reference will remain unchanged after processing all the inode refs belonged to the key. - it has no need to do another 1) between 2) and 3). I've made a small test to show how it improves, $dd if=/dev/zero of=foobar bs=4K count=1 $sync $make 100 hard links continuously, like ln foobar link_i $fsync foobar $echo b > /proc/sysrq-trigger after reboot $time mount DEV PATH without patch: real 0m0.285s user 0m0.001s sys 0m0.009s with patch: real 0m0.123s user 0m0.000s sys 0m0.010s Changelog v1->v2: - fix double free - pointed by David Sterba Changelog v2->v3: - adjust free order Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-03-28 05:37:48 -04:00
Li Dongyang	f7039b1d5c	Btrfs: add btrfs_trim_fs() to handle FITRIM We take an free extent out from allocator, trim it, then put it back, but before we trim the block group, we should make sure the block group is cached, so plus a little change to make cache_block_group() run without a transaction. Signed-off-by: Li Dongyang <lidongyang@novell.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-03-28 05:37:47 -04:00
Li Dongyang	5378e60734	Btrfs: adjust btrfs_discard_extent() return errors and trimmed bytes Callers of btrfs_discard_extent() should check if we are mounted with -o discard, as we want to make fitrim to work even the fs is not mounted with -o discard. Also we should use REQ_DISCARD to map the free extent to get a full mapping, last we only return errors if 1. the error is not a EOPNOTSUPP 2. no device supports discard Signed-off-by: Li Dongyang <lidongyang@novell.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-03-28 05:37:46 -04:00
Li Dongyang	fce3bb9a1b	Btrfs: make btrfs_map_block() return entire free extent for each device of RAID0/1/10/DUP btrfs_map_block() will only return a single stripe length, but we want the full extent be mapped to each disk when we are trimming the extent, so we add length to btrfs_bio_stripe and fill it if we are mapping for REQ_DISCARD. Signed-off-by: Li Dongyang <lidongyang@novell.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-03-28 05:37:45 -04:00
Li Dongyang	b4d00d569a	Btrfs: make update_reserved_bytes() public Make the function public as we should update the reserved extents calculations after taking out an extent for trimming. Signed-off-by: Li Dongyang <lidongyang@novell.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-03-28 05:37:43 -04:00
Mark Fasheh	3ab3564f01	btrfs: return EXDEV when linking from different subvolumes btrfs_link returns EPERM if a cross-subvolume link is attempted. However, in this case I believe EXDEV to be the more appropriate value. >From the link(2) man page: EXDEV oldpath and newpath are not on the same mounted file system. (Linux permits a file system to be mounted at multiple points, but link() does not work across different mount points, even if the same file system is mounted on both.) This matters because an application may have different behaviors based on return codes. Signed-off-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-03-28 05:37:42 -04:00
Liu Bo	75e7cb7fe0	Btrfs: Per file/directory controls for COW and compression Data compression and data cow are controlled across the entire FS by mount options right now. ioctls are needed to set this on a per file or per directory basis. This has been proposed previously, but VFS developers wanted us to use generic ioctls rather than btrfs-specific ones. According to Chris's comment, there should be just one true compression method(probably LZO) stored in the super. However, before this, we would wait for that one method is stable enough to be adopted into the super. So I list it as a long term goal, and just store it in ram today. After applying this patch, we can use the generic "FS_IOC_SETFLAGS" ioctl to control file and directory's datacow and compression attribute. NOTE: - The compression type is selected by such rules: If we mount btrfs with compress options, ie, zlib/lzo, the type is it. Otherwise, we'll use the default compress type (zlib today). v1->v2: - rebase to the latest btrfs. v2->v3: - fix a problem, i.e. when a file is set NOCOW via mount option, then this NOCOW will be screwed by inheritance from parent directory. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-03-28 05:37:41 -04:00
Miao Xie	fc0e4a314e	btrfs: use GFP_NOFS instead of GFP_KERNEL In the filesystem context, we must allocate memory by GFP_NOFS, or we may start another filesystem operation and make kswap thread hang up. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-03-28 05:37:39 -04:00
Tsutomu Itoh	97d9a8a420	Btrfs: check return value of read_tree_block() This patch is checking return value of read_tree_block(), and if it is NULL, error processing. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-03-28 05:37:37 -04:00
David Sterba	7e75bf3ff3	btrfs: properly access unaligned checksum buffer On Fri, Mar 18, 2011 at 11:56:53AM -0400, Chris Mason wrote: > Thanks for fielding this one. Does put_unaligned_le32 optimize away on > platforms with efficient access? It would be great if we didn't need > the #ifdef. (quicktest: assembly output is same for put_unaligned_le32 and direct assignment on my x86_64) I was originally following examples in Documentation/unaligned-memory-access.txt. From other code it seems to me that the define CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is intended for larger portions of code. Macros/wrappers for {put,get}_unaligned* are chosen via arch/<arch>/include/asm/unaligned.h accordingly, therefore it's safe to use put_unaligned_le32 without the ifdef. dave Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-03-28 05:37:36 -04:00
Tsutomu Itoh	db5b493ac7	Btrfs: cleanup some BUG_ON() This patch changes some BUG_ON() to the error return. (but, most callers still use BUG_ON()) Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-03-28 05:37:35 -04:00
liubo	1abe9b8a13	Btrfs: add initial tracepoint support for btrfs Tracepoints can provide insight into why btrfs hits bugs and be greatly helpful for debugging, e.g dd-7822 [000] 2121.641088: btrfs_inode_request: root = 5(FS_TREE), gen = 4, ino = 256, blocks = 8, disk_i_size = 0, last_trans = 8, logged_trans = 0 dd-7822 [000] 2121.641100: btrfs_inode_new: root = 5(FS_TREE), gen = 8, ino = 257, blocks = 0, disk_i_size = 0, last_trans = 0, logged_trans = 0 btrfs-transacti-7804 [001] 2146.935420: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29368320 (orig_level = 0), cow_buf = 29388800 (cow_level = 0) btrfs-transacti-7804 [001] 2146.935473: btrfs_cow_block: root = 1(ROOT_TREE), refs = 2, orig_buf = 29364224 (orig_level = 0), cow_buf = 29392896 (cow_level = 0) btrfs-transacti-7804 [001] 2146.972221: btrfs_transaction_commit: root = 1(ROOT_TREE), gen = 8 flush-btrfs-2-7821 [001] 2155.824210: btrfs_chunk_alloc: root = 3(CHUNK_TREE), offset = 1103101952, size = 1073741824, num_stripes = 1, sub_stripes = 0, type = DATA flush-btrfs-2-7821 [001] 2155.824241: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29388800 (orig_level = 0), cow_buf = 29396992 (cow_level = 0) flush-btrfs-2-7821 [001] 2155.824255: btrfs_cow_block: root = 4(DEV_TREE), refs = 2, orig_buf = 29372416 (orig_level = 0), cow_buf = 29401088 (cow_level = 0) flush-btrfs-2-7821 [000] 2155.824329: btrfs_cow_block: root = 3(CHUNK_TREE), refs = 2, orig_buf = 20971520 (orig_level = 0), cow_buf = 20975616 (cow_level = 0) btrfs-endio-wri-7800 [001] 2155.898019: btrfs_cow_block: root = 5(FS_TREE), refs = 2, orig_buf = 29384704 (orig_level = 0), cow_buf = 29405184 (cow_level = 0) btrfs-endio-wri-7800 [001] 2155.898043: btrfs_cow_block: root = 7(CSUM_TREE), refs = 2, orig_buf = 29376512 (orig_level = 0), cow_buf = 29409280 (cow_level = 0) Here is what I have added: 1) ordere_extent: btrfs_ordered_extent_add btrfs_ordered_extent_remove btrfs_ordered_extent_start btrfs_ordered_extent_put These provide critical information to understand how ordered_extents are updated. 2) extent_map: btrfs_get_extent extent_map is used in both read and write cases, and it is useful for tracking how btrfs specific IO is running. 3) writepage: __extent_writepage btrfs_writepage_end_io_hook Pages are cirtical resourses and produce a lot of corner cases during writeback, so it is valuable to know how page is written to disk. 4) inode: btrfs_inode_new btrfs_inode_request btrfs_inode_evict These can show where and when a inode is created, when a inode is evicted. 5) sync: btrfs_sync_file btrfs_sync_fs These show sync arguments. 6) transaction: btrfs_transaction_commit In transaction based filesystem, it will be useful to know the generation and who does commit. 7) back reference and cow: btrfs_delayed_tree_ref btrfs_delayed_data_ref btrfs_delayed_ref_head btrfs_cow_block Btrfs natively supports back references, these tracepoints are helpful on understanding btrfs's COW mechanism. 8) chunk: btrfs_chunk_alloc btrfs_chunk_free Chunk is a link between physical offset and logical offset, and stands for space infomation in btrfs, and these are helpful on tracing space things. 9) reserved_extent: btrfs_reserved_extent_alloc btrfs_reserved_extent_free These can show how btrfs uses its space. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-03-28 05:37:33 -04:00
Chris Mason	240f62c875	Btrfs: use RCU instead of a spinlock to protect the root node The pointer to the extent buffer for the root of each tree is protected by a spinlock so that we can safely read the pointer and take a reference on the extent buffer. But now that the extent buffers are freed via RCU, we can safely use rcu_read_lock instead. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-03-28 05:37:22 -04:00
Roberto Sassu	b5695d0463	eCryptfs: write lock requested keys A requested key is write locked in order to prevent modifications on the authentication token while it is being used. Signed-off-by: Roberto Sassu <roberto.sassu@polito.it> Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>	2011-03-28 01:49:43 -05:00
Roberto Sassu	950983fc04	eCryptfs: move ecryptfs_find_auth_tok_for_sig() call before mutex_lock The ecryptfs_find_auth_tok_for_sig() call is moved before the mutex_lock(s->tfm_mutex) instruction in order to avoid possible deadlocks that may occur by holding the lock on the two semaphores 'key->sem' and 's->tfm_mutex' in reverse order. Signed-off-by: Roberto Sassu <roberto.sassu@polito.it> Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>	2011-03-28 01:49:42 -05:00
Roberto Sassu	0e1fc5ef47	eCryptfs: verify authentication tokens before their use Authentication tokens content may change if another requestor calls the update() method of the corresponding key. The new function ecryptfs_verify_auth_tok_from_key() retrieves the authentication token from the provided key and verifies if it is still valid before being used to encrypt or decrypt an eCryptfs file. Signed-off-by: Roberto Sassu <roberto.sassu@polito.it> [tyhicks: Minor formatting changes] Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>	2011-03-28 01:49:41 -05:00
Roberto Sassu	7762e230fd	eCryptfs: modified size of keysig in the ecryptfs_key_sig structure The size of the 'keysig' array is incremented of one byte in order to make room for the NULL character. The 'keysig' variable is used, in the function ecryptfs_generate_key_packet_set(), to find an authentication token with the given signature and is printed a debug message if it cannot be retrieved. Signed-off-by: Roberto Sassu <roberto.sassu@polito.it> Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>	2011-03-28 01:49:40 -05:00
Roberto Sassu	cf35ca6913	eCryptfs: removed num_global_auth_toks from ecryptfs_mount_crypt_stat This patch removes the 'num_global_auth_toks' field of the ecryptfs_mount_crypt_stat structure, used to count the number of items in the 'global_auth_tok_list' list. This variable is not needed because there are no checks based upon it. Signed-off-by: Roberto Sassu <roberto.sassu@polito.it> Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>	2011-03-28 01:49:39 -05:00
Roberto Sassu	1821df040a	eCryptfs: ecryptfs_keyring_auth_tok_for_sig() bug fix The pointer '(*auth_tok_key)' is set to NULL in case request_key() fails, in order to prevent its use by functions calling ecryptfs_keyring_auth_tok_for_sig(). Signed-off-by: Roberto Sassu <roberto.sassu@polito.it> Cc: <stable@kernel.org> Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>	2011-03-28 01:49:15 -05:00
Tyler Hicks	50f198ae16	eCryptfs: Unlock page in write_begin error path Unlock the page in error path of ecryptfs_write_begin(). This may happen, for example, if decryption fails while bring the page up-to-date. Cc: <stable@kernel.org> Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>	2011-03-28 01:47:46 -05:00
Thieu Le	57db4e8d73	ecryptfs: modify write path to encrypt page in writepage Change the write path to encrypt the data only when the page is written to disk in ecryptfs_writepage. Previously, ecryptfs encrypts the page in ecryptfs_write_end which means that if there are multiple write requests to the same page, ecryptfs ends up re-encrypting that page over and over again. This patch minimizes the number of encryptions needed. Signed-off-by: Thieu Le <thieule@chromium.org> [tyhicks: Changed NULL .drop_inode sop pointer to generic_drop_inode] Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>	2011-03-28 01:47:45 -05:00
Tyler Hicks	fed8859b3a	eCryptfs: Remove ECRYPTFS_NEW_FILE crypt stat flag Now that grow_file() is not called in the ecryptfs_create() path, the ECRYPTFS_NEW_FILE flag is no longer needed. It helped ecryptfs_readpage() know not to decrypt zeroes that were read from the lower file in the grow_file() path. Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>	2011-03-28 01:47:44 -05:00
Tyler Hicks	bd4f0fe8bb	eCryptfs: Remove unnecessary grow_file() function When creating a new eCryptfs file, the crypto metadata is written out and then the lower file was being "grown" with 4 kB of encrypted zeroes. I suspect that growing the encrypted file was to prevent an information leak that the unencrypted file was empty. However, the unencrypted file size is stored, in plaintext, in the metadata so growing the file is unnecessary. Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>	2011-03-28 01:47:43 -05:00
Linus Torvalds	a17d47300b	Merge branch 'for-linus-1' of git://git.infradead.org/mtd-2.6 * 'for-linus-1' of git://git.infradead.org/mtd-2.6: (49 commits) mtd: mtdswap: fix compilation warning mtdswap: kill strict error handling option mtd: nand: enable software BCH ECC in nand simulator mtd: nand: add software BCH ECC support mtd: fix printf format warnings, mostly lack of %zd for size_t, in mtdswap mtd: sm_rtl: check kmalloc return value mtd: cfi: add support for AMIC flashes (e.g. A29L160AT) lib: add shared BCH ECC library mtd: mxc_nand: fix OOB corruption when page size > 2KiB mtd: DaVinci: Removed header file that is not required mtd: pxa3xx_nand: clean the keep configure code mtd: pxa3xx_nand: mtd scan id process could be defined by driver itself mtd: pxa3xx_nand: unify prepare command mtd: pxa3xx_nand: discard wait_for_event,write_cmd,__readid function mtd: pxa3xx_nand: rework irq logic mtd: pxa3xx_nand: make scan procedure more clear mtd: speedtest: fix integer overflow mtd: mxc_nand: fix read past buffer end mtd: omap3: nand: report corrected ecc errors jffs2: remove a trailing white space in commentaries ...	2011-03-27 19:40:56 -07:00
Randy Dunlap	b6d0ad686d	fs: fix inode.c kernel-doc warning Fix inode.c kernel-doc fatal error: 2 comment sections have the same name: Error(fs/inode.c:1171): duplicate section name 'Note' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-27 19:30:19 -07:00
Linus Torvalds	76597cd314	proc: fix oops on invalid /proc/<pid>/maps access When m_start returns an error, the seq_file logic will still call m_stop with that error entry, so we'd better make sure that we check it before using it as a vma. Introduced by commit `ec6fd8a435` ("report errors in /proc//map* sanely"), which replaced NULL with various ERR_PTR() cases. (On ia64, you happen to get a unaligned fault instead of a page fault, since the address used is generally some random error code like -EPERM) Reported-by: Anca Emanuel <anca.emanuel@gmail.com> Reported-by: Tony Luck <tony.luck@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Américo Wang <xiyou.wangcong@gmail.com> Cc: Stephen Wilson <wilsons@start.ca> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-27 19:09:29 -07:00
Trond Myklebust	a0e7e3cf79	NFS: Don't leak RPC clients in NFSv4 secinfo negotiation Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-27 17:48:17 +02:00
Trond Myklebust	4d65c520fb	NFS: Fix a hang in the writeback path Now that the inode scalability patches have been merged, it is no longer safe to call igrab() under the inode->i_lock. Now that we no longer call nfs_clear_request() until the nfs_page is being freed, we know that we are always holding a reference to the nfs_open_context, which again holds a reference to the path, and so the inode cannot be freed until the last nfs_page has been removed from the radix tree and freed. We can therefore skip the igrab()/iput() altogether. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-27 17:48:07 +02:00
Rakib Mullick	c03e3126e4	codafs: fix build break when CONFIG_PROC_SYSCTL=n Commit `0bc825d240` ("codafs: fix compile warning when CONFIG_SYSCTL=n") introduces build breakage, when CONFIG_PROC_SYSCTL=n and CONFIG_CODA_FS=y: fs/built-in.o: In function `init_coda': psdev.c:(.init.text+0xc02): undefined reference to `coda_sysctl_init' psdev.c:(.init.text+0xc7c): undefined reference to `coda_sysctl_clean' fs/built-in.o: In function `exit_coda': psdev.c:(.exit.text+0xa9): undefined reference to `coda_sysctl_clean' make: *** [.tmp_vmlinux1] Error 1 Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com> Reported-by: Ingo Molnar <mingo@elte.hu> Acked-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-25 17:45:16 -07:00
Josef Bacik	c0da7aa1a2	Btrfs: mark the bio with an error if we have a failure in dio I noticed that dio_end_io calls the appropriate endio function with an error, but the endio functions don't actually do anything with that error, they assume that if there was an error then the bio will not be uptodate. So if we had checksum failures we would never pass back EIO. So if there is an error in our endio functions make sure to clear the uptodate flag on the bio. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-03-25 19:08:19 -04:00
Josef Bacik	98bc3149fa	Btrfs: don't allocate dip->csums when doing writes When doing direct writes we store the checksums in the ordered sum stuff in the ordered extent for writing them when the write completes, so we don't even use the dip->csums array. So if we're writing, don't bother allocating dip->csums since we won't use it anyway. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-03-25 19:08:18 -04:00
Josef Bacik	4e69b598f6	Btrfs: cleanup how we setup free space clusters This patch makes the free space cluster refilling code a little easier to understand, and fixes some things with the bitmap part of it. Currently we either want to refill a cluster with 1) All normal extent entries (those without bitmaps) 2) A bitmap entry with enough space The current code has this ugly jump around logic that will first try and fill up the cluster with extent entries and then if it can't do that it will try and find a bitmap to use. So instead split this out into two functions, one that tries to find only normal entries, and one that tries to find bitmaps. This also fixes a suboptimal thing we would do with bitmaps. If we used a bitmap we would just tell the cluster that we were pointing at a bitmap and it would do the tree search in the block group for that entry every time we tried to make an allocation. Instead of doing that now we just add it to the clusters group. I tested this with my ENOSPC tests and xfstests and it survived. Signed-off-by: Josef Bacik <josef@redhat.com>	2011-03-25 19:08:08 -04:00
Dave Chinner	0e6e847ffe	xfs: stop using the page cache to back the buffer cache Now that the buffer cache has it's own LRU, we do not need to use the page cache to provide persistent caching and reclaim infrastructure. Convert the buffer cache to use alloc_pages() instead of the page cache. This will remove all the overhead of page cache management from setup and teardown of the buffers, as well as needing to mark pages accessed as we find buffers in the buffer cache. By avoiding the page cache, we also remove the need to keep state in the page_private(page) field for persistant storage across buffer free/buffer rebuild and so all that code can be removed. This also fixes the long-standing problem of not having enough bits in the page_private field to track all the state needed for a 512 sector/64k page setup. It also removes the need for page locking during reads as the pages are unique to the buffer and nobody else will be attempting to access them. Finally, it removes the buftarg address space lock as a point of global contention on workloads that allocate and free buffers quickly such as when creating or removing large numbers of inodes in parallel. This remove the 16TB limit on filesystem size on 32 bit machines as the page index (32 bit) is no longer used for lookups of metadata buffers - the buffer cache is now solely indexed by disk address which is stored in a 64 bit field in the buffer. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com>	2011-03-26 09:16:45 +11:00
Dave Chinner	704b2907c2	xfs: register the inode cache shrinker before quotachecks During mount, we can do a quotacheck that involves a bulkstat pass on all inodes. If there are more inodes in the filesystem than can be held in memory, we require the inode cache shrinker to run to ensure that we don't run out of memory. Unfortunately, the inode cache shrinker is not registered until we get to the end of the superblock setup process, which is after a quotacheck is run if it is needed. Hence we need to register the inode cache shrinker earlier in the mount process so that we don't OOM during mount. This requires that we also initialise the syncd work before we register the shrinker, so we nee dto juggle that around as well. While there, make sure that we have set up the block sizes in the VFS superblock correctly before the quotacheck is run so that any inodes that are cached as a result of the quotacheck have their block size fields set up correctly. Cc: stable@kernel.org Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com>	2011-03-26 09:14:57 +11:00
Dave Chinner	7401aafd50	xfs: xfs_trans_read_buf() should return an error on failure When inside a transaction and we fail to read a buffer, xfs_trans_read_buf returns a null buffer pointer and no error. xfs_do_da_buf() checks the error return, but not the buffer, and as a result this read failure condition causes a panic when it attempts to dereference the non-existant buffer. Make xfs_trans_read_buf() return the same error for this situation regardless of whether it is in a transaction or not. This means every caller does not need to check both the error return and the buffer before proceeding to use the buffer. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com>	2011-03-26 09:14:44 +11:00
Dave Chinner	1bfd8d0419	xfs: introduce inode cluster buffer trylocks for xfs_iflush There is an ABBA deadlock between synchronous inode flushing in xfs_reclaim_inode and xfs_icluster_free. xfs_icluster_free locks the buffer, then takes inode ilocks, whilst synchronous reclaim takes the ilock followed by the buffer lock in xfs_iflush(). To avoid this deadlock, separate the inode cluster buffer locking semantics from the synchronous inode flush semantics, allowing callers to attempt to lock the buffer but still issue synchronous IO if it can get the buffer. This requires xfs_iflush() calls that currently use non-blocking semantics to pass SYNC_TRYLOCK rather than 0 as the flags parameter. This allows xfs_reclaim_inode to avoid the deadlock on the buffer lock and detect the failure so that it can drop the inode ilock and restart the reclaim attempt on the inode. This allows xfs_ifree_cluster to obtain the inode lock, mark the inode stale and release it and hence defuse the deadlock situation. It also has the pleasant side effect of avoiding IO in xfs_reclaim_inode when it tries to next reclaim the inode as it is now marked stale. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>	2011-03-26 09:13:55 +11:00
Dave Chinner	a19fb38096	vmap: flush vmap aliases when mapping fails On 32 bit systems, vmalloc space is limited and XFS can chew through it quickly as the vmalloc space is lazily freed. This can result in failure to map buffers, even when there is apparently large amounts of vmalloc space available. Hence, if we fail to map a buffer, purge the aliases that have not yet been freed to hopefuly free up enough vmalloc space to allow a retry to succeed. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>	2011-03-26 09:13:42 +11:00
Dave Chinner	8287889742	xfs: preallocation transactions do not need to be synchronous Preallocation and hole punch transactions are currently synchronous and this is causing performance problems in some cases. The transactions don't need to be synchronous as we don't need to guarantee the preallocation is persistent on disk until a fdatasync, fsync, sync operation occurs. If the file is opened O_SYNC or O_DATASYNC, only then should the transaction be issued synchronously. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>	2011-03-26 09:13:08 +11:00
Sage Weil	ef550f6f4f	ceph: flush msgr_wq during mds_client shutdown The release method for mds connections uses a backpointer to the mds_client, so we need to flush the workqueue of any pending work (and ceph_connection references) prior to freeing the mds_client. This fixes an oops easily triggered under UML by while true ; do mount ... ; umount ... ; done Also fix an outdated comment: the flush in ceph_destroy_client only flushes OSD connections out. This bug is basically an artifact of the ceph -> ceph+libceph conversion. Signed-off-by: Sage Weil <sage@newdream.net>	2011-03-25 13:27:48 -07:00
Linus Torvalds	40471856f2	Merge branch 'nfs-for-2.6.39' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6 * 'nfs-for-2.6.39' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: (28 commits) Cleanup XDR parsing for LAYOUTGET, GETDEVICEINFO NFSv4.1 convert layoutcommit sync to boolean NFSv4.1 pnfs_layoutcommit_inode fixes NFS: Determine initial mount security NFS: use secinfo when crossing mountpoints NFS: Add secinfo procedure NFS: lookup supports alternate client NFS: convert call_sync() to a function NFSv4.1 remove temp code that prevented ds commits NFSv4.1: layoutcommit NFSv4.1: filelayout driver specific code for COMMIT NFSv4.1: remove GETATTR from ds commits NFSv4.1: add generic layer hooks for pnfs COMMIT NFSv4.1: alloc and free commit_buckets NFSv4.1: shift filelayout_free_lseg NFSv4.1: pull out code from nfs_commit_release NFSv4.1: pull error handling out of nfs_commit_list NFSv4.1: add callback to nfs4_commit_done NFSv4.1: rearrange nfs_commit_rpcsetup NFSv4.1: don't send COMMIT to ds for data sync writes ...	2011-03-25 10:03:28 -07:00
Linus Torvalds	ae005cbed1	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (43 commits) ext4: fix a BUG in mb_mark_used during trim. ext4: unused variables cleanup in fs/ext4/extents.c ext4: remove redundant set_buffer_mapped() in ext4_da_get_block_prep() ext4: add more tracepoints and use dev_t in the trace buffer ext4: don't kfree uninitialized s_group_info members ext4: add missing space in printk's in __ext4_grp_locked_error() ext4: add FITRIM to compat_ioctl. ext4: handle errors in ext4_clear_blocks() ext4: unify the ext4_handle_release_buffer() api ext4: handle errors in ext4_rename jbd2: add COW fields to struct jbd2_journal_handle jbd2: add the b_cow_tid field to journal_head struct ext4: Initialize fsync transaction ids in ext4_new_inode() ext4: Use single thread to perform DIO unwritten convertion ext4: optimize ext4_bio_write_page() when no extent conversion is needed ext4: skip orphan cleanup if fs has unknown ROCOMPAT features ext4: use the nblocks arg to ext4_truncate_restart_trans() ext4: fix missing iput of root inode for some mount error paths ext4: make FIEMAP and delayed allocation play well together ext4: suppress verbose debugging information if malloc-debug is off ... Fi up conflicts in fs/ext4/super.c due to workqueue changes	2011-03-25 09:57:41 -07:00
Artem Bityutskiy	7bf7e370d5	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6 into for-linus-1 * 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6: (9356 commits) [media] rc: update for bitop name changes fs: simplify iget & friends fs: pull inode->i_lock up out of writeback_single_inode fs: rename inode_lock to inode_hash_lock fs: move i_wb_list out from under inode_lock fs: move i_sb_list out from under inode_lock fs: remove inode_lock from iput_final and prune_icache fs: Lock the inode LRU list separately fs: factor inode disposal fs: protect inode->i_state with inode->i_lock lib, arch: add filter argument to show_mem and fix private implementations SLUB: Write to per cpu data when allocating it slub: Fix debugobjects with lockless fastpath autofs4: Do not potentially dereference NULL pointer returned by fget() in autofs_dev_ioctl_setpipefd() autofs4 - remove autofs4_lock autofs4 - fix d_manage() return on rcu-walk autofs4 - fix autofs4_expire_indirect() traversal autofs4 - fix dentry leak in autofs4_expire_direct() autofs4 - reinstate last used update on access vfs - check non-mountpoint dentry might block in __follow_mount_rcu() ... NOTE! This merge commit was created to fix compilation error. The block tree was merged upstream and removed the 'elv_queue_empty()' function which the new 'mtdswap' driver is using. So a simple merge of the mtd tree with upstream does not compile. And the mtd tree has already be published, so re-basing it is not an option. To fix this unfortunate situation, I had to merge upstream into the mtd-2.6.git tree without committing, put the fixup patch on top of this, and then commit this. The result is that we do not have commits which do not compile. In other words, this merge commit "merges" 3 things: the MTD tree, the upstream tree, and the fixup patch.	2011-03-25 17:41:20 +02:00
J. Bruce Fields	954032d252	nfsd: fix auth_domain reference leak on nlm operations This was noticed by users who performed more than 2^32 lock operations and hence made this counter overflow (eventually leading to use-after-free's). Setting rq_client to NULL here means that it won't later get auth_domain_put() when it should be. Appears to have been introduced in 2.5.42 by "[PATCH] kNFSd: Move auth domain lookup into svcauth" which moved most of the rq_client handling to common svcauth code, but left behind this one line. Cc: Neil Brown <neilb@suse.de> Cc: stable@kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-03-24 23:11:27 -04:00
Linus Torvalds	d39dd11c3e	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: fs: simplify iget & friends fs: pull inode->i_lock up out of writeback_single_inode fs: rename inode_lock to inode_hash_lock fs: move i_wb_list out from under inode_lock fs: move i_sb_list out from under inode_lock fs: remove inode_lock from iput_final and prune_icache fs: Lock the inode LRU list separately fs: factor inode disposal fs: protect inode->i_state with inode->i_lock autofs4: Do not potentially dereference NULL pointer returned by fget() in autofs_dev_ioctl_setpipefd() autofs4 - remove autofs4_lock autofs4 - fix d_manage() return on rcu-walk autofs4 - fix autofs4_expire_indirect() traversal autofs4 - fix dentry leak in autofs4_expire_direct() autofs4 - reinstate last used update on access vfs - check non-mountpoint dentry might block in __follow_mount_rcu()	2011-03-24 19:01:30 -07:00
Christoph Hellwig	0b2d0724e2	fs: simplify iget & friends Merge get_new_inode/get_new_inode_fast into iget5_locked/iget_locked as those were the only callers. Remove the internal ifind/ifind_fast helpers - ifind_fast only had a single caller, and ifind had two callers wanting it to do different things. Also clean up the comments in this area to focus on information important to a developer trying to use it, instead of overloading them with implementation details. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-24 21:17:52 -04:00
Dave Chinner	0f1b1fd86f	fs: pull inode->i_lock up out of writeback_single_inode First thing we do in writeback_single_inode() is take the i_lock and the last thing we do is drop it. A caller already holds the i_lock, so pull the i_lock out of writeback_single_inode() to reduce the round trips on this lock during inode writeback. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-24 21:17:51 -04:00
Dave Chinner	67a23c4946	fs: rename inode_lock to inode_hash_lock All that remains of the inode_lock is protecting the inode hash list manipulation and traversals. Rename the inode_lock to inode_hash_lock to reflect it's actual function. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-24 21:17:51 -04:00
Dave Chinner	a66979abad	fs: move i_wb_list out from under inode_lock Protect the inode writeback list with a new global lock inode_wb_list_lock and use it to protect the list manipulations and traversals. This lock replaces the inode_lock as the inodes on the list can be validity checked while holding the inode->i_lock and hence the inode_lock is no longer needed to protect the list. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-24 21:17:51 -04:00
Dave Chinner	55fa6091d8	fs: move i_sb_list out from under inode_lock Protect the per-sb inode list with a new global lock inode_sb_list_lock and use it to protect the list manipulations and traversals. This lock replaces the inode_lock as the inodes on the list can be validity checked while holding the inode->i_lock and hence the inode_lock is no longer needed to protect the list. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-24 21:16:32 -04:00
Dave Chinner	f283c86afe	fs: remove inode_lock from iput_final and prune_icache Now that inode state changes are protected by the inode->i_lock and the inode LRU manipulations by the inode_lru_lock, we can remove the inode_lock from prune_icache and the initial part of iput_final(). instead of using the inode_lock to protect the inode during iput_final, use the inode->i_lock instead. This protects the inode against new references being taken while we change the inode state to I_FREEING, as well as preventing prune_icache from grabbing the inode while we are manipulating it. Hence we no longer need the inode_lock in iput_final prior to setting I_FREEING on the inode. For prune_icache, we no longer need the inode_lock to protect the LRU list, and the inodes themselves are protected against freeing races by the inode->i_lock. Hence we can lift the inode_lock from prune_icache as well. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-24 21:16:32 -04:00
Dave Chinner	02afc410f3	fs: Lock the inode LRU list separately Introduce the inode_lru_lock to protect the inode_lru list. This lock is nested inside the inode->i_lock to allow the inode to be added to the LRU list in iput_final without needing to deal with lock inversions. This keeps iput_final() clean and neat. Further, where marking the inode I_FREEING and removing it from the LRU, move the LRU list manipulation within the inode->i_lock to keep the list manipulation consistent with iput_final. This also means that most of the open coded LRU list removal + unused inode accounting can now use the inode_lru_list_del() wrappers which cleans the code up further. However, this locking change means what the LRU traversal in prune_icache() inverts this lock ordering and needs to use trylock semantics on the inode->i_lock to avoid deadlocking. In these cases, if we fail to lock the inode we move it to the back of the LRU to prevent spinning on it. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-24 21:16:31 -04:00
Dave Chinner	b2b2af8e61	fs: factor inode disposal We have a couple of places that dispose of inodes. factor the disposal into evict() to isolate this code and make it simpler to peel away the inode_lock from the code. While doing this, change the logic flow in iput_final() to separate the different cases that need to be handled to make the transitions the inode goes through more obvious. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-24 21:16:31 -04:00
Dave Chinner	250df6ed27	fs: protect inode->i_state with inode->i_lock Protect inode state transitions and validity checks with the inode->i_lock. This enables us to make inode state transitions independently of the inode_lock and is the first step to peeling away the inode_lock from the code. This requires that __iget() is done atomically with i_state checks during list traversals so that we don't race with another thread marking the inode I_FREEING between the state check and grabbing the reference. Also remove the unlock_new_inode() memory barrier optimisation required to avoid taking the inode_lock when clearing I_NEW. Simplify the code by simply taking the inode->i_lock around the state change and wakeup. Because the wakeup is no longer tricky, remove the wake_up_inode() function and open code the wakeup where necessary. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-24 21:16:31 -04:00
Trond Myklebust	0acd220192	Merge branch 'nfs-for-2.6.39' into nfs-for-next	2011-03-24 17:03:14 -04:00
Weston Andros Adamson	35124a0994	Cleanup XDR parsing for LAYOUTGET, GETDEVICEINFO changes LAYOUTGET and GETDEVICEINFO XDR parsing to: - not use vmap, which doesn't work on incoherent archs - use xdr_stream parsing for all xdr Signed-off-by: Weston Andros Adamson <dros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-24 17:01:41 -04:00
Andy Adamson	ef31153786	NFSv4.1 convert layoutcommit sync to boolean Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-24 15:49:48 -04:00
Andy Adamson	de4b15c7e9	NFSv4.1 pnfs_layoutcommit_inode fixes Test NFS_INO_LAYOUTCOMMIT before kzalloc Mark inode dirty to retry LAYOUTCOMMIT on kzalloc failure. Add comments. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-24 15:49:48 -04:00
Jesper Juhl	3dc8fe4dca	autofs4: Do not potentially dereference NULL pointer returned by fget() in autofs_dev_ioctl_setpipefd() In fs/autofs4/dev-ioctl.c::autofs_dev_ioctl_setpipefd() we call fget(), which may return NULL, but we do not explicitly test for that NULL return so we may end up dereferencing a NULL pointer - bad. When I originally submitted this patch I had chosen EBUSY as the return value to use if this happens. Ian Kent was kind enough to explain why that would most likely be wrong and why EBADF should most likely be used instead. This version of the patch uses EBADF. Signed-off-by: Jesper Juhl <jj@chaosbits.net> Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-24 14:54:35 -04:00
Ian Kent	e7854723d0	autofs4 - remove autofs4_lock The autofs4_lock introduced by the rcu-walk changes has unnecessarily broad scope. The locking is better handled by the per-autofs super block lookup_lock. Signed-off-by: Ian Kent <raven@themaw.net> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-24 14:54:35 -04:00
Ian Kent	83fb96bfc7	autofs4 - fix d_manage() return on rcu-walk The daemon never needs to block and, in the rcu-walk case an error return isn't used, so always return zero. Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-24 14:54:34 -04:00
Ian Kent	d4a85e35d1	autofs4 - fix autofs4_expire_indirect() traversal The vfs-scale changes changed the traversal used in autofs4_expire_indirect() from a list to a depth first tree traversal which isn't right. Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-24 14:54:34 -04:00
Ian Kent	f9398c233e	autofs4 - fix dentry leak in autofs4_expire_direct() There is a missing dput() when returning from autofs4_expire_direct() when we see that the dentry is already a pending mount. Signed-off-by: Ian Kent <raven@themaw.net> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-24 14:54:34 -04:00
Ian Kent	3c31998529	autofs4 - reinstate last used update on access When direct (and offset) mounts were introduced the the last used timeout could no longer be updated in ->d_revalidate(). This is because covered direct mounts would be followed over without calling the autofs file system. As a result the definition of the busyness check for all entries was changed to be "actually busy" being an open file or working directory within the automount. But now we have a call back in the follow so the last used update on any access can be re-instated. This requires DCACHE_MANAGE_TRANSIT to always be set. Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-24 14:54:34 -04:00
Ian Kent	62a7375e5d	vfs - check non-mountpoint dentry might block in __follow_mount_rcu() When following a mount in rcu-walk mode we must check if the incoming dentry is telling us it may need to block, even if it isn't actually a mountpoint. Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-24 14:24:32 -04:00
Bryan Schumaker	8f70e95f9f	NFS: Determine initial mount security When sec=<something> is not presented as a mount option, we should attempt to determine what security flavor the server is using. Signed-off-by: Bryan Schumaker <bjschuma@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-24 13:52:42 -04:00
Bryan Schumaker	7ebb931598	NFS: use secinfo when crossing mountpoints A submount may use different security than the parent mount does. We should figure out what sec flavor the submount uses at mount time. Signed-off-by: Bryan Schumaker <bjschuma@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-24 13:52:42 -04:00
Bryan Schumaker	5a5ea0d485	NFS: Add secinfo procedure This patch adds the nfs4 operation secinfo as a valid nfs rpc operation. Signed-off-by: Bryan Schumaker <bjschuma@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-24 13:52:41 -04:00
Bryan Schumaker	7c5130588d	NFS: lookup supports alternate client A later patch will need to perform a lookup using an alternate client with a different security flavor. This patch adds support for doing that on NFS v4. Signed-off-by: Bryan Schumaker <bjschuma@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-24 13:52:41 -04:00
Bryan Schumaker	e73b83f270	NFS: convert call_sync() to a function This patch changes nfs4_call_sync() from a macro into a static inline function. As a macro, the call_sync() function will not do any type checking and depends on the sequence arguments always having the same name. As a function, we get to have type checking and can rename the arguments if we so choose. Signed-off-by: Bryan Schumaker <bjschuma@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-24 13:52:41 -04:00
Linus Torvalds	6c51038900	Merge branch 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block * 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits) Documentation/iostats.txt: bit-size reference etc. cfq-iosched: removing unnecessary think time checking cfq-iosched: Don't clear queue stats when preempt. blk-throttle: Reset group slice when limits are changed blk-cgroup: Only give unaccounted_time under debug cfq-iosched: Don't set active queue in preempt block: fix non-atomic access to genhd inflight structures block: attempt to merge with existing requests on plug flush block: NULL dereference on error path in __blkdev_get() cfq-iosched: Don't update group weights when on service tree fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away block: Require subsystems to explicitly allocate bio_set integrity mempool jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging fs: make fsync_buffers_list() plug mm: make generic_writepages() use plugging blk-cgroup: Add unaccounted time to timeslice_used. block: fixup plugging stubs for !CONFIG_BLOCK block: remove obsolete comments for blkdev_issue_zeroout. blktrace: Use rq->cmd_flags directly in blk_add_trace_rq. ... Fix up conflicts in fs/{aio.c,super.c}	2011-03-24 10:16:26 -07:00
Linus Torvalds	6d1e9a42e7	Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6 * 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6: pstore: cleanups to pstore_dump() [IA64] New syscalls for 2.6.39	2011-03-24 10:05:23 -07:00
Linus Torvalds	fdc0ad80a4	Merge branch 'for-linus' of git://git.infradead.org/ubi-2.6 * 'for-linus' of git://git.infradead.org/ubi-2.6: UBIFS: fix assertion warning and refine comments UBIFS: kill CONFIG_UBIFS_FS_DEBUG_CHKS UBIFS: use GFP_NOFS properly UBI: use GFP_NOFS properly	2011-03-24 08:22:34 -07:00
Linus Torvalds	dc87c55120	Merge branch 'for-2.6.39' of git://linux-nfs.org/~bfields/linux * 'for-2.6.39' of git://linux-nfs.org/~bfields/linux: SUNRPC: Remove resource leak in svc_rdma_send_error() nfsd: wrong index used in inner loop nfsd4: fix comment and remove unused nfsd4_file fields nfs41: make sure nfs server return right ca_maxresponsesize_cached nfsd: fix compile error svcrpc: fix bad argument in unix_domain_find nfsd4: fix struct file leak nfsd4: minor nfs4state.c reshuffling svcrpc: fix rare race on unix_domain creation nfsd41: modify the members value of nfsd4_op_flags nfsd: add proc file listing kernel's gss_krb5 enctypes gss:krb5 only include enctype numbers in gm_upcall_enctypes NFSD, VFS: Remove dead code in nfsd_rename() nfsd: kill unused macro definition locks: use assign_type()	2011-03-24 08:20:39 -07:00
Linus Torvalds	5818fcc8bd	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus * git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus: Squashfs: Use vmalloc rather than kmalloc for zlib workspace Squashfs: handle corruption of directory structure Squashfs: wrap squashfs_mount() definition Squashfs: xz_wrapper doesn't need to include squashfs_fs_i.h anymore Squashfs: Update documentation to include compression options Squashfs: Update Kconfig help text to include xz compression Squashfs: add compression options support to xz decompressor Squashfs: extend decompressor framework to handle compression options	2011-03-24 08:02:21 -07:00
Linus Torvalds	1b506cfb6a	Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd * 'for-linus' of git://git.open-osd.org/linux-open-osd: exofs: deprecate the commands pending counter exofs: Write sbi->s_nextid as part of the Create command exofs: Add option to mount by osdname exofs: Override read-ahead to align on stripe_size exofs: simple fsync race fix exofs: Optimize read_4_write exofs: Trivial: fix some indentation and debug prints exofs: Remove redundant unlikely()	2011-03-24 07:57:38 -07:00
Artem Bityutskiy	6ed09c34b7	UBIFS: fix assertion warning and refine comments This patch fixes the following UBIFS assertion warning: UBIFS assert failed in do_readpage at 115 (pid 199) [<b00321b8>] (unwind_backtrace+0x0/0xdc) from [<af025118>] (do_readpage+0x108/0x594 [ubifs]) [<af025118>] (do_readpage+0x108/0x594 [ubifs]) from [<af025764>] (ubifs_write_end+0x1c0/0x2e8 [ubifs]) [<af025764>] (ubifs_write_end+0x1c0/0x2e8 [ubifs]) from [<b00a0164>] (generic_file_buffered_write+0x18c/0x270) [<b00a0164>] (generic_file_buffered_write+0x18c/0x270) from [<b00a08d4>] (__generic_file_aio_write+0x478/0x4c0) [<b00a08d4>] (__generic_file_aio_write+0x478/0x4c0) from [<b00a0984>] (generic_file_aio_write+0x68/0xc8) [<b00a0984>] (generic_file_aio_write+0x68/0xc8) from [<af024a78>] (ubifs_aio_write+0x178/0x1d8 [ubifs]) [<af024a78>] (ubifs_aio_write+0x178/0x1d8 [ubifs]) from [<b00d104c>] (do_sync_write+0xb0/0x100) [<b00d104c>] (do_sync_write+0xb0/0x100) from [<b00d1abc>] (vfs_write+0xac/0x154) [<b00d1abc>] (vfs_write+0xac/0x154) from [<b00d1c10>] (sys_write+0x3c/0x68) [<b00d1c10>] (sys_write+0x3c/0x68) from [<b002d9a0>] (ret_fast_syscall+0x0/0x2c) The 'PG_checked' flag is used to indicate that the page does not supposedly exist on the media (e.g., a hole or a page beyond the inode size), so it requires slightly bigger budget, because we have to account the indexing size increase. And this flag basically tells that the budget for this page has to be "new page budget". The "new page budget" is slightly bigger than the "existing page budget". The 'do_readpage()' function has the following assertion which sometimes is hit: 'ubifs_assert(!PageChecked(page))'. Obviously, the meaning of this assertion is: "I should not be asked to read a page which does not exist on the media". However, in 'ubifs_write_begin()' we have a small "trick". Notice, that VFS may write pages which were not read yet, so the page data were not loaded from the media to the page cache yet. If VFS tells that it is going to change only some part of the page, we obviously have to load it from the media. However, if VFS tells that it is going to change whole page, we do not read it from the media for optimization purposes. However, since we do not read it, we do not know if it exists on the media or not (a hole, etc). So we set the 'PG_checked' flag to this page to force bigger budget, just in case. So 'ubifs_write_begin()' sets 'PG_checked'. Then we are in 'ubifs_write_end()'. And VFS tells us: "hey, for some reasons I changed my mind and did not change whole page". Frankly, I do not know why this happens, but I hit this somehow on an ARM platform. And this is extremely rare. So in this case UBIFS does the following: 1. Cancels allocated budget. 2. Loads the page from the media by calling 'do_readpage()'. 3. Asks VFS to repeat the whole write operation from the very beginning (call '->write_begin() again, etc). And the assertion warning is hit at the step 2 - remember we have the 'PG_checked' set for this page, and 'do_readpage()' does not like this. So this patch fixes the problem by adding step 1.5 and cleaning the 'PG_checked' before calling 'do_readpage()'. All in all, this patch does not fix any functionality issue, but it silences UBIFS false positive warning which may happen in very very rare cases. And while on it, this patch also improves a commentary which explains the reasons of setting the 'PG_checked' flag for the page. The old commentary was a bit difficult to understand. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-03-24 16:16:18 +02:00
Artem Bityutskiy	9d523cafbe	UBIFS: kill CONFIG_UBIFS_FS_DEBUG_CHKS Simplify UBIFS configuration menu and kill the option to enable self-check compile-time. We do not really need this because we can do this run-time using the module parameters or the corresponding sysfs interfaces. And there is a value in simplifying the kernel configuration menu which becomes increasingly large. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-03-24 16:16:08 +02:00
Artem Bityutskiy	fc5e58c0c4	UBIFS: use GFP_NOFS properly This patch fixes a brown-paperbag bug which was introduced by me: I used incorrect "GFP_KERNEL \| GFP_NOFS" allocation flags to make sure my allocations do not cause write-back. But the correct form is "GFP_NOFS". Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-03-24 16:14:26 +02:00
Yongqiang Yang	523334ba50	ext3: Fix writepage credits computation for ordered mode Original computation forgets to count writes of indirect block themselves (it only counts with blocks necessary for their allocation) in ordered mode. Acked-by: Amir Goldstein <amir73il@users.sf.net> Signed-off-by:Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>	2011-03-24 12:33:51 +01:00
Linus Torvalds	b81a618dcd	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: deal with races in /proc//{syscall,stack,personality} proc: enable writing to /proc/pid/mem proc: make check_mem_permission() return an mm_struct on success proc: hold cred_guard_mutex in check_mem_permission() proc: disable mem_write after exec mm: implement access_remote_vm mm: factor out main logic of access_process_vm mm: use mm_struct to resolve gate vma's in __get_user_pages mm: arch: rename in_gate_area_no_task to in_gate_area_no_mm mm: arch: make in_gate_area take an mm_struct instead of a task_struct mm: arch: make get_gate_vma take an mm_struct instead of a task_struct x86: mark associated mm when running a task in 32 bit compatibility mode x86: add context tag to mark mm when running a task in 32-bit compatibility mode auxv: require the target to be tracable (or yourself) close race in /proc//environ report errors in /proc//map* sanely pagemap: close races with suid execve make sessionid permissions in /proc//task/ match those in /proc/* fix leaks in path_lookupat() Fix up trivial conflicts in fs/proc/base.c	2011-03-23 20:51:42 -07:00
Serge E. Hallyn	2e14967075	userns: rename is_owner_or_cap to inode_owner_or_capable And give it a kernel-doc comment. [akpm@linux-foundation.org: btrfs changed in linux-next] Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Daniel Lezcano <daniel.lezcano@free.fr> Acked-by: David Howells <dhowells@redhat.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:47:13 -07:00
Serge E. Hallyn	e795b71799	userns: userns: check user namespace for task->file uid equivalence checks Cheat for now and say all files belong to init_user_ns. Next step will be to let superblocks belong to a user_ns, and derive inode_userns(inode) from inode->i_sb->s_user_ns. Finally we'll introduce more flexible arrangements. Changelog: Feb 15: make is_owner_or_cap take const struct inode Feb 23: make is_owner_or_cap bool [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Daniel Lezcano <daniel.lezcano@free.fr> Acked-by: David Howells <dhowells@redhat.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:47:08 -07:00
Oleg Nesterov	52e9fc76d0	procfs: kill the global proc_mnt variable After the previous cleanup in proc_get_sb() the global proc_mnt has no reasons to exists, kill it. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr> Cc: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Serge E. Hallyn <serge@hallyn.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:46:58 -07:00
Eric W. Biederman	4308eebbeb	pidns: call pid_ns_prepare_proc() from create_pid_namespace() Reorganize proc_get_sb() so it can be called before the struct pid of the first process is allocated. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Serge E. Hallyn <serge@hallyn.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:46:58 -07:00
Petr Holasek	cb16e95fa2	sysctl: add some missing input constraint checks Add boundaries of allowed input ranges for: dirty_expire_centisecs, drop_caches, overcommit_memory, page-cluster and panic_on_oom. Signed-off-by: Petr Holasek <pholasek@redhat.com> Acked-by: Dave Young <hidave.darkstar@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:46:51 -07:00
Kees Cook	5883f57ca0	proc: protect mm start_code/end_code in /proc/pid/stat While mm->start_stack was protected from cross-uid viewing (commit `f83ce3e6b0` ("proc: avoid information leaks to non-privileged processes")), the start_code and end_code values were not. This would allow the text location of a PIE binary to leak, defeating ASLR. Note that the value "1" is used instead of "0" for a protected value since "ps", "killall", and likely other readers of /proc/pid/stat, take start_code of "0" to mean a kernel thread and will misbehave. Thanks to Brad Spengler for pointing this out. Addresses CVE-2011-0726 Signed-off-by: Kees Cook <kees.cook@canonical.com> Cc: <stable@kernel.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: David Howells <dhowells@redhat.com> Cc: Eugene Teo <eugeneteo@kernel.sg> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Brad Spengler <spender@grsecurity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:46:37 -07:00
Alexey Dobriyan	312ec7e50c	proc: make struct proc_dir_entry::namelen unsigned int 1. namelen is declared "unsigned short" which hints for "maybe space savings". Indeed in 2.4 struct proc_dir_entry looked like: struct proc_dir_entry { unsigned short low_ino; unsigned short namelen; Now, low_ino is "unsigned int", all savings were gone for a long time. "struct proc_dir_entry" is not that countless to worry about it's size, anyway. 2. converting from unsigned short to int/unsigned int can only create problems, we better play it safe. Space is not really conserved, because of natural alignment for the next field. sizeof(struct proc_dir_entry) remains the same. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:46:37 -07:00
Jovi Zhang	fc3d8767b2	procfs: fix some wrong error code usage [root@wei 1]# cat /proc/1/mem cat: /proc/1/mem: No such process error code -ESRCH is wrong in this situation. Return -EPERM instead. Signed-off-by: Jovi Zhang <bookjovi@gmail.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:46:36 -07:00
Aaro Koskinen	0db0c01b53	procfs: fix /proc/<pid>/maps heap check The current code fails to print the "[heap]" marking if the heap is split into multiple mappings. Fix the check so that the marking is displayed in all possible cases: 1. vma matches exactly the heap 2. the heap vma is merged e.g. with bss 3. the heap vma is splitted e.g. due to locked pages Test cases. In all cases, the process should have mapping(s) with [heap] marking: (1) vma matches exactly the heap #include <stdio.h> #include <unistd.h> #include <sys/types.h> int main (void) { if (sbrk(4096) != (void )-1) { printf("check /proc/%d/maps\n", (int)getpid()); while (1) sleep(1); } return 0; } # ./test1 check /proc/553/maps [1] + Stopped ./test1 # cat /proc/553/maps \| head -4 00008000-00009000 r-xp 00000000 01:00 3113640 /test1 00010000-00011000 rw-p 00000000 01:00 3113640 /test1 00011000-00012000 rw-p 00000000 00:00 0 [heap] 4006f000-40070000 rw-p 00000000 00:00 0 (2) the heap vma is merged #include <stdio.h> #include <unistd.h> #include <sys/types.h> char foo[4096] = "foo"; char bar[4096]; int main (void) { if (sbrk(4096) != (void )-1) { printf("check /proc/%d/maps\n", (int)getpid()); while (1) sleep(1); } return 0; } # ./test2 check /proc/556/maps [2] + Stopped ./test2 # cat /proc/556/maps \| head -4 00008000-00009000 r-xp 00000000 01:00 3116312 /test2 00010000-00012000 rw-p 00000000 01:00 3116312 /test2 00012000-00014000 rw-p 00000000 00:00 0 [heap] 4004a000-4004b000 rw-p 00000000 00:00 0 (3) the heap vma is splitted (this fails without the patch) #include <stdio.h> #include <unistd.h> #include <sys/mman.h> #include <sys/types.h> int main (void) { if ((sbrk(4096) != (void )-1) && !mlockall(MCL_FUTURE) && (sbrk(4096) != (void )-1)) { printf("check /proc/%d/maps\n", (int)getpid()); while (1) sleep(1); } return 0; } # ./test3 check /proc/559/maps [1] + Stopped ./test3 # cat /proc/559/maps\|head -4 00008000-00009000 r-xp 00000000 01:00 3119108 /test3 00010000-00011000 rw-p 00000000 01:00 3119108 /test3 00011000-00012000 rw-p 00000000 00:00 0 [heap] 00012000-00013000 rw-p 00000000 00:00 0 [heap] It looks like the bug has been there forever, and since it only results in some information missing from a procfile, it does not fulfil the -stable "critical issue" criteria. Signed-off-by: Aaro Koskinen <aaro.koskinen@nokia.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:46:36 -07:00
Konstantin Khlebnikov	51e031496d	proc: hide kernel addresses via %pK in /proc/<pid>/stack This file is readable for the task owner. Hide kernel addresses from unprivileged users, leave them function names and offsets. Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Acked-by: Kees Cook <kees.cook@canonical.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:46:36 -07:00
Akinobu Mita	61f2e7b0f4	bitops: remove minix bitops from asm/bitops.h minix bit operations are only used by minix filesystem and useless by other modules. Because byte order of inode and block bitmaps is different on each architecture like below: m68k: big-endian 16bit indexed bitmaps h8300, microblaze, s390, sparc, m68knommu: big-endian 32 or 64bit indexed bitmaps m32r, mips, sh, xtensa: big-endian 32 or 64bit indexed bitmaps for big-endian mode little-endian bitmaps for little-endian mode Others: little-endian bitmaps In order to move minix bit operations from asm/bitops.h to architecture independent code in minix filesystem, this provides two config options. CONFIG_MINIX_FS_BIG_ENDIAN_16BIT_INDEXED is only selected by m68k. CONFIG_MINIX_FS_NATIVE_ENDIAN is selected by the architectures which use native byte order bitmaps (h8300, microblaze, s390, sparc, m68knommu, m32r, mips, sh, xtensa). The architectures which always use little-endian bitmaps do not select these options. Finally, we can remove minix bit operations from asm/bitops.h for all architectures. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Greg Ungerer <gerg@uclinux.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Roman Zippel <zippel@linux-m68k.org> Cc: Andreas Schwab <schwab@linux-m68k.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Michal Simek <monstr@monstr.eu> Cc: "David S. Miller" <davem@davemloft.net> Cc: Hirokazu Takata <takata@linux-m32r.org> Acked-by: Ralf Baechle <ralf@linux-mips.org> Acked-by: Paul Mundt <lethal@linux-sh.org> Cc: Chris Zankel <chris@zankel.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:46:22 -07:00
Akinobu Mita	f312eff816	bitops: remove ext2 non-atomic bitops from asm/bitops.h As the result of conversions, there are no users of ext2 non-atomic bit operations except for ext2 filesystem itself. Now we can put them into architecture independent code in ext2 filesystem, and remove from asm/bitops.h for all architectures. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:46:21 -07:00
Akinobu Mita	3cdc7125c3	ufs: use little-endian bitops As a preparation for removing ext2 non-atomic bit operations from asm/bitops.h. This converts ext2 non-atomic bit operations to little-endian bit operations. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Evgeniy Dushistov <dushistov@mail.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:46:19 -07:00
Akinobu Mita	9ad1e1e405	udf: use little-endian bitops As a preparation for removing ext2 non-atomic bit operations from asm/bitops.h. This converts ext2 non-atomic bit operations to little-endian bit operations. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:46:19 -07:00
Akinobu Mita	a49ebbabb0	nilfs2: use little-endian bitops As a preparation for removing ext2 non-atomic bit operations from asm/bitops.h. This converts ext2 non-atomic bit operations to little-endian bit operations. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:46:18 -07:00
Akinobu Mita	c4354d0d68	ocfs2: use little-endian bitops As a preparation for removing ext2 non-atomic bit operations from asm/bitops.h. This converts ext2 non-atomic bit operations to little-endian bit operations. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Acked-by: Joel Becker <joel.becker@oracle.com> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:46:17 -07:00
Akinobu Mita	50e0168cc3	ext4: use little-endian bitops As a preparation for removing ext2 non-atomic bit operations from asm/bitops.h. This converts ext2 non-atomic bit operations to little-endian bit operations. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Acked-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:46:17 -07:00
Andrew Morton	135a9fcf45	fs/adfs/adfs.h: fix unsigned comparison fs/adfs/adfs.h: In function 'append_filetype_suffix': fs/adfs/adfs.h:115: warning: comparison is always false due to limited range of data type Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Stuart Swales <stuart.swales.croftnuisk@gmail.com> Cc: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:46:09 -07:00
Al Viro	a9712bc12c	deal with races in /proc/*/{syscall,stack,personality} All of those are rw-r--r-- and all are broken for suid - if you open a file before the target does suid-root exec, you'll be still able to access it. For personality it's not a big deal, but for syscall and stack it's a real problem. Fix: check that task is tracable for you at the time of read(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-23 17:01:18 -04:00
Stephen Wilson	198214a7ee	proc: enable writing to /proc/pid/mem With recent changes there is no longer a security hazard with writing to /proc/pid/mem. Remove the #ifdef. Signed-off-by: Stephen Wilson <wilsons@start.ca> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-23 16:36:59 -04:00
Stephen Wilson	8b0db9db19	proc: make check_mem_permission() return an mm_struct on success This change allows us to take advantage of access_remote_vm(), which in turn eliminates a security issue with the mem_write() implementation. The previous implementation of mem_write() was insecure since the target task could exec a setuid-root binary between the permission check and the actual write. Holding a reference to the target mm_struct eliminates this vulnerability. Signed-off-by: Stephen Wilson <wilsons@start.ca> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-23 16:36:59 -04:00
Stephen Wilson	18f661bcf8	proc: hold cred_guard_mutex in check_mem_permission() Avoid a potential race when task exec's and we get a new ->mm but check against the old credentials in ptrace_may_access(). Holding of the mutex is implemented by factoring out the body of the code into a helper function __check_mem_permission(). Performing this factorization now simplifies upcoming changes and minimizes churn in the diff's. Signed-off-by: Stephen Wilson <wilsons@start.ca> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-23 16:36:58 -04:00
Stephen Wilson	26947f8c8f	proc: disable mem_write after exec This change makes mem_write() observe the same constraints as mem_read(). This is particularly important for mem_write as an accidental leak of the fd across an exec could result in arbitrary modification of the target process' memory. IOW, /proc/pid/mem is implicitly close-on-exec. Signed-off-by: Stephen Wilson <wilsons@start.ca> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-23 16:36:58 -04:00
Stephen Wilson	31db58b3ab	mm: arch: make get_gate_vma take an mm_struct instead of a task_struct Morally, the presence of a gate vma is more an attribute of a particular mm than a particular task. Moreover, dropping the dependency on task_struct will help make both existing and future operations on mm's more flexible and convenient. Signed-off-by: Stephen Wilson <wilsons@start.ca> Reviewed-by: Michel Lespinasse <walken@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-23 16:36:54 -04:00
Al Viro	2fadaef412	auxv: require the target to be tracable (or yourself) same as for environ, except that we didn't do any checks to prevent access after suid execve Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-23 16:36:52 -04:00
Al Viro	d6f64b89d7	close race in /proc/*/environ Switch to mm_for_maps(). Maybe we ought to make it r--r--r--, since we do checks on IO anyway... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-23 16:36:51 -04:00
Al Viro	ec6fd8a435	report errors in /proc//map* sanely Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-23 16:36:50 -04:00
Al Viro	ca6b0bf0e0	pagemap: close races with suid execve just use mm_for_maps() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-23 16:36:50 -04:00
Al Viro	26ec3c646e	make sessionid permissions in /proc//task/ match those in /proc/* Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-23 16:36:49 -04:00
Tao Ma	0ba0851714	ext4: fix a BUG in mb_mark_used during trim. In a bs=4096 volume, if we call FITRIM with the following parameter as fstrim_range(start = 102400, len = 134144000, minlen = 10240), we will trigger this BUG_ON: BUG_ON(start + len > (e4b->bd_sb->s_blocksize << 3)); Mar 4 00:55:52 boyu-tm kernel: ------------[ cut here ]------------ Mar 4 00:55:52 boyu-tm kernel: kernel BUG at fs/ext4/mballoc.c:1506! Mar 4 01:21:09 boyu-tm kernel: Code: d4 00 00 00 00 49 89 fe 8b 56 0c 44 8b 7e 04 89 55 c4 48 8b 4f 28 89 d6 44 01 fe 48 63 d6 48 8b 41 18 48 c1 e0 03 48 39 c2 76 04 <0f> 0b eb fe 48 8b 55 b0 8b 47 34 3b 42 08 74 04 0f 0b eb fe 48 Mar 4 01:21:09 boyu-tm kernel: RIP [<ffffffffa053eb42>] mb_mark_used+0x47/0x26c [ext4] Mar 4 01:21:09 boyu-tm kernel: RSP <ffff880121e45c38> Mar 4 01:21:09 boyu-tm kernel: ---[ end trace 9f461696f6a9dcf2 ]--- Fix this bug by doing the accounting correctly. Cc: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-03-23 15:48:11 -04:00
Fred Isaman	cccb4d063b	NFSv4.1 remove temp code that prevented ds commits Now that all the infrastructure is in place, we will do the right thing if we remove this special casing. Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-23 15:29:04 -04:00
Andy Adamson	863a3c6c68	NFSv4.1: layoutcommit The filelayout driver sends LAYOUTCOMMIT only when COMMIT goes to the data server (as opposed to the MDS) and the data server WRITE is not NFS_FILE_SYNC. Only whole file layout support means that there is only one IOMODE_RW layout segment. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Alexandros Batsakis <batsakis@netapp.com> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com> Signed-off-by: Fred Isaman <iisaman@citi.umich.edu> Signed-off-by: Mingyang Guo <guomingyang@nrchpc.ac.cn> Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn> Signed-off-by: Zhang Jingwang <zhangjingwang@nrchpc.ac.cn> Tested-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-23 15:29:04 -04:00
Fred Isaman	e0c2b38018	NFSv4.1: filelayout driver specific code for COMMIT Implement all the hooks created in the previous patches. This requires exporting quite a few functions and adding a few structure fields. Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-23 15:29:04 -04:00
Fred Isaman	988b6dceb0	NFSv4.1: remove GETATTR from ds commits Any COMMIT compound directed to a data server needs to have the GETATTR calls suppressed. We here, make sure the field we are testing (data->lseg) is set and refcounted correctly. Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-23 15:29:03 -04:00
Fred Isaman	a861a1e1c3	NFSv4.1: add generic layer hooks for pnfs COMMIT We create three major hooks for the pnfs code. pnfs_mark_request_commit() is called during writeback_done from nfs_mark_request_commit, which gives the driver an opportunity to claim it wants control over commiting a particular req. pnfs_choose_commit_list() is called from nfs_scan_list to choose which list a given req should be added to, based on where we intend to send it for COMMIT. It is up to the driver to have preallocated list headers for each destination it may need. pnfs_commit_list() is how the driver actually takes control, it is used instead of nfs_commit_list(). In order to pass information between the above functions, we create a union in nfs_page to hold a lseg (which is possible because the req is not on any list while in transition), and add some flags to indicate if we need to use the pnfs code. Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-23 15:29:03 -04:00
Fred Isaman	425eb736cd	NFSv4.1: alloc and free commit_buckets Create a preallocated list header to hold nfs_pages for each non-MDS COMMIT destination. Note this is not necessarily each DS, but is basically each <DS, fh> pair. Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-23 15:29:03 -04:00
Fred Isaman	c879513e91	NFSv4.1: shift filelayout_free_lseg Move it up to avoid forward declaration in later patch. Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-23 15:29:03 -04:00

... 2 3 4 5 6 ...

22364 commits