linux

Commit Graph

Author	SHA1	Message	Date
Boaz Harrosh	96218556b0	pnfs-obj: Get rid of objlayout_{alloc,free}_io_state This is part of moving objio_osd to use the ORE. objlayout_io_state had two functions: 1. It was used in the error reporting mechanism at layout_return. This function is kept intact. (Later patch will rename objlayout_io_state => objlayout_io_res) 2. Carrier of rw io members into the objio_read/write_paglist API. This is removed in this patch. The {r,w}data received from NFS are passed directly to the objio_{read,write}_paglist API. The io_engine is now allocating it's own IO state as part of the read/write. The minimal functionality that was part of the generic allocation is passed to the io_engine. So part of this patch is rename of: ios->ol_state.foo => ios->foo At objlayout_{read,write}_done an objlayout_io_state is passed that denotes the result of the IO. (Hence the later name change). If the IO is successful objlayout calls an objio_free_result() API immediately (Which for objio_osd causes the release of the io_state). If the IO ended in an error it is hanged onto until reported in layout_return and is released later through the objio_free_result() API. (All this is not new just renamed and cleaned) Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-11-02 23:56:05 -04:00
Boaz Harrosh	e6c40fe3f4	pnfs-obj: Return PNFS_NOT_ATTEMPTED in case of read/write_pagelist objlayout driver was always returning PNFS_ATTEMPTED from it's read/write_pagelist operations. Even on error. Fix that. Start by establishing an error return API from io-engine, by not returning ssize_t (length-or-error) but returning "int" 0=OK, 0>Error. And clean up all return types in io-engine. Then if io-engine returned error return PNFS_NOT_ATTEMPTED to generic layer. (With a dprint) Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-11-02 23:56:03 -04:00
Boaz Harrosh	4cdc685c7d	pnfs-obj: Remove redundant EOF from objlayout_io_state The EOF calculation was done on .read_pagelist(), cached in objlayout_io_state->eof, and set in objlayout_read_done() into nfs_read_data->res.eof. So set it directly into nfs_read_data->res.eof and avoid the extra member. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-11-02 23:56:00 -04:00
Rakib Mullick	2b72c9ccd2	nfs: Remove unused variable from write.c When CONFIG_NFS=y and CONFIG_NFS_V3_{,V4}=n we get the following warning. fs/nfs/write.c: In function ‘nfs_writeback_done’: fs/nfs/write.c:1246:21: warning: unused variable ‘server’ Remove the variable 'server' to fix the above warning. Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-11-02 23:50:24 -04:00
Rakib Mullick	6f276e49fd	nfs: Fix unused variable warning from file.c Fix the following unused variable warning. fs/nfs/file.c: In function ‘nfs_file_release’: fs/nfs/file.c:140:17: warning: unused variable ‘dentry’ fs/nfs/file.c: In function ‘nfs_file_read’: fs/nfs/file.c:237:9: warning: unused variable ‘count’ Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-11-02 23:49:09 -04:00
Linus Torvalds	06ef93e1b8	Merge branch 'for-3.2' of git://linux-nfs.org/~bfields/linux * 'for-3.2' of git://linux-nfs.org/~bfields/linux: nfsd4: typo logical vs bitwise negate in nfsd4_decode_share_access	2011-11-02 16:54:36 -07:00
Linus Torvalds	092f4c56c1	Merge branch 'akpm' (Andrew's incoming - part two) Says Andrew: "60 patches. That's good enough for -rc1 I guess. I have quite a lot of detritus to be rechecked, work through maintainers, etc. - most of the remains of MM - rtc - various misc - cgroups - memcg - cpusets - procfs - ipc - rapidio - sysctl - pps - w1 - drivers/misc - aio" * akpm: (60 commits) memcg: replace ss->id_lock with a rwlock aio: allocate kiocbs in batches drivers/misc/vmw_balloon.c: fix typo in code comment drivers/misc/vmw_balloon.c: determine page allocation flag can_sleep outside loop w1: disable irqs in critical section drivers/w1/w1_int.c: multiple masters used same init_name drivers/power/ds2780_battery.c: fix deadlock upon insertion and removal drivers/power/ds2780_battery.c: add a nolock function to w1 interface drivers/power/ds2780_battery.c: create central point for calling w1 interface w1: ds2760 and ds2780, use ida for id and ida_simple_get() to get it pps gpio client: add missing dependency pps: new client driver using GPIO pps: default echo function include/linux/dma-mapping.h: add dma_zalloc_coherent() sysctl: make CONFIG_SYSCTL_SYSCALL default to n sysctl: add support for poll() RapidIO: documentation update drivers/net/rionet.c: fix ethernet address macros for LE platforms RapidIO: fix potential null deref in rio_setup_device() RapidIO: add mport driver for Tsi721 bridge ...	2011-11-02 16:07:27 -07:00
Jeff Moyer	080d676de0	aio: allocate kiocbs in batches In testing aio on a fast storage device, I found that the context lock takes up a fair amount of cpu time in the I/O submission path. The reason is that we take it for every I/O submitted (see __aio_get_req). Since we know how many I/Os are passed to io_submit, we can preallocate the kiocbs in batches, reducing the number of times we take and release the lock. In my testing, I was able to reduce the amount of time spent in _raw_spin_lock_irq by .56% (average of 3 runs). The command I used to test this was: aio-stress -O -o 2 -o 3 -r 8 -d 128 -b 32 -i 32 -s 16384 <dev> I also tested the patch with various numbers of events passed to io_submit, and I ran the xfstests aio group of tests to ensure I didn't break anything. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Cc: Daniel Ehrenberg <dehrenberg@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-11-02 16:07:03 -07:00
Lucas De Marchi	f1ecf06854	sysctl: add support for poll() Adding support for poll() in sysctl fs allows userspace to receive notifications of changes in sysctl entries. This adds a infrastructure to allow files in sysctl fs to be pollable and implements it for hostname and domainname. [akpm@linux-foundation.org: s/declare/define/ for definitions] Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi> Cc: Greg KH <gregkh@suse.de> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-11-02 16:07:02 -07:00
Vasiliy Kulikov	aa6afca5bc	proc: fix races against execve() of /proc/PID/fd** fd* files are restricted to the task's owner, and other users may not get direct access to them. But one may open any of these files and run any setuid program, keeping opened file descriptors. As there are permission checks on open(), but not on readdir() and read(), operations on the kept file descriptors will not be checked. It makes it possible to violate procfs permission model. Reading fdinfo/* may disclosure current fds' position and flags, reading directory contents of fdinfo/ and fd/ may disclosure the number of opened files by the target task. This information is not sensible per se, but it can reveal some private information (like length of a password stored in a file) under certain conditions. Used existing (un)lock_trace functions to check for ptrace_may_access(), but instead of using EPERM return code from it use EACCES to be consistent with existing proc_pid_follow_link()/proc_pid_readlink() return code. If they differ, attacker can guess what fds exist by analyzing stat() return code. Patched handlers: stat() for fd/, stat() and read() for fdindo/, readdir() and lookup() for fd/ and fdinfo/. Signed-off-by: Vasiliy Kulikov <segoon@openwall.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: <stable@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-11-02 16:07:00 -07:00
Pavel Emelyanov	887df07891	procfs: report EISDIR when reading sysctl dirs in proc On reading sysctl dirs we should return -EISDIR instead of -EINVAL. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-11-02 16:07:00 -07:00
Phillip Lougher	434a964daa	hfs: fix hfs_find_init() sb->ext_tree NULL ptr oops Clement Lecigne reports a filesystem which causes a kernel oops in hfs_find_init() trying to dereference sb->ext_tree which is NULL. This proves to be because the filesystem has a corrupted MDB extent record, where the extents file does not fit into the first three extents in the file record (the first blocks). In hfs_get_block() when looking up the blocks for the extent file (HFS_EXT_CNID), it fails the first blocks special case, and falls through to the extent code (which ultimately calls hfs_find_init()) which is in the process of being initialised. Hfs avoids this scenario by always having the extents b-tree fitting into the first blocks (the extents B-tree can't have overflow extents). The fix is to check at mount time that the B-tree fits into first blocks, i.e. fail if HFS_I(inode)->alloc_blocks >= HFS_I(inode)->first_blocks Note, the existing commit `47f365eb57` ("hfs: fix oops on mount with corrupted btree extent records") becomes subsumed into this as a special case, but only for the extents B-tree (HFS_EXT_CNID), it is perfectly acceptable for the catalog B-Tree file to grow beyond three extents, with the remaining extent descriptors in the extents overfow. This fixes CVE-2011-2203 Reported-by: Clement LECIGNE <clement.lecigne@netasq.com> Signed-off-by: Phillip Lougher <plougher@redhat.com> Cc: Jeff Mahoney <jeffm@suse.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-11-02 16:06:59 -07:00
Namjae Jeon	3069083cc8	isofs: add readpages support Use mpage_readpages() instead of multiple calls to isofs_readpage() to reduce the CPU utilization and make performance higher. Signed-off-by: Namjae Jeon <linkinjeon@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-11-02 16:06:59 -07:00
Richard Weinberger	0620d9193c	ramfs: remove module leftovers Since ramfs is hard-selected to "y", the module leftovers make no sense. Signed-off-by: Richard Weinberger <richard@nod.at> Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-11-02 16:06:58 -07:00
Jiri Kosina	a3defbe5c3	binfmt_elf: fix PIE execution with randomization disabled The case of address space randomization being disabled in runtime through randomize_va_space sysctl is not treated properly in load_elf_binary(), resulting in SIGKILL coming at exec() time for certain PIE-linked binaries in case the randomization has been disabled at runtime prior to calling exec(). Handle the randomize_va_space == 0 case the same way as if we were not supporting .text randomization at all. Based on original patch by H.J. Lu and Josh Boyer. Signed-off-by: Jiri Kosina <jkosina@suse.cz> Cc: Ingo Molnar <mingo@elte.hu> Cc: Russell King <rmk@arm.linux.org.uk> Cc: H.J. Lu <hongjiu.lu@intel.com> Cc: <stable@kernel.org> Tested-by: Josh Boyer <jwboyer@redhat.com> Acked-by: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-11-02 16:06:58 -07:00
Linus Torvalds	d211858837	Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/vfs-queue * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/vfs-queue: vfs: add d_prune dentry operation vfs: protect i_nlink filesystems: add set_nlink() filesystems: add missing nlink wrappers logfs: remove unnecessary nlink setting ocfs2: remove unnecessary nlink setting jfs: remove unnecessary nlink setting hypfs: remove unnecessary nlink setting vfs: ignore error on forced remount readlinkat: ensure we return ENOENT for the empty pathname for normal lookups vfs: fix dentry leak in simple_fill_super()	2011-11-02 11:41:01 -07:00
Phillip Lougher	7657cacf47	Squashfs: Add an option to set dev block size to 4K This commit adds an option to set the device block size used to 4K. By default Squashfs sets the device block size (sb_min_blocksize) to 1K or the smallest block size supported by the block device (if larger). This, because blocks are packed together and unaligned in Squashfs, should reduce latency. This, however, gives poor performance on MTD NAND devices where the optimal I/O size is 4K (even though the devices can support smaller block sizes). Using a 4K device block size may also improve overall I/O performance for some file access patterns (e.g. sequential accesses of files in filesystem order) on all media. Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>	2011-11-02 17:25:50 +00:00
Linus Torvalds	f1f8935a5c	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (97 commits) jbd2: Unify log messages in jbd2 code jbd/jbd2: validate sb->s_first in journal_get_superblock() ext4: let ext4_ext_rm_leaf work with EXT_DEBUG defined ext4: fix a syntax error in ext4_ext_insert_extent when debugging enabled ext4: fix a typo in struct ext4_allocation_context ext4: Don't normalize an falloc request if it can fit in 1 extent. ext4: remove comments about extent mount option in ext4_new_inode() ext4: let ext4_discard_partial_buffers handle unaligned range correctly ext4: return ENOMEM if find_or_create_pages fails ext4: move vars to local scope in ext4_discard_partial_page_buffers_no_lock() ext4: Create helper function for EXT4_IO_END_UNWRITTEN and i_aiodio_unwritten ext4: optimize locking for end_io extent conversion ext4: remove unnecessary call to waitqueue_active() ext4: Use correct locking for ext4_end_io_nolock() ext4: fix race in xattr block allocation path ext4: trace punch_hole correctly in ext4_ext_map_blocks ext4: clean up AGGRESSIVE_TEST code ext4: move variables to their scope ext4: fix quota accounting during migration ext4: migrate cleanup ...	2011-11-02 10:06:20 -07:00
Linus Torvalds	34116645d9	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: udf: Cleanup metadata flags handling udf: Skip mirror metadata FE loading when metadata FE is ok ext3: Allow quota file use root reservation udf: Remove web reference from UDF MAINTAINERS entry quota: Drop path reference on error exit from quotactl udf: Neaten udf_debug uses udf: Neaten logging output, use vsprintf extension %pV udf: Convert printks to pr_<level> udf: Rename udf_warning to udf_warn udf: Rename udf_error to udf_err udf: Promote some debugging messages to udf_error ext3: Remove the obsolete broken EXT3_IOC32_WAIT_FOR_READONLY. udf: Add readpages support for udf. ext3/balloc.c: local functions should be static ext2: fix the outdated comment in ext2_nfs_get_inode() ext3: remove deprecated oldalloc fs/ext3/balloc.c: delete useless initialization fs/ext2/balloc.c: delete useless initialization ext3: fix message in ext3_remount for rw-remount case ext3: Remove i_mutex from ext3_sync_file() Fix up trivial (printf format cleanup) conflicts in fs/udf/udfdecl.h	2011-11-02 10:05:22 -07:00
Linus Torvalds	de0a5345a5	Merge branch 'for-linus' of git://github.com/richardweinberger/linux * 'for-linus' of git://github.com/richardweinberger/linux: (90 commits) um: fix ubd cow size um: Fix kmalloc argument order in um/vdso/vma.c um: switch to use of drivers/Kconfig UserModeLinux-HOWTO.txt: fix a typo UserModeLinux-HOWTO.txt: remove ^H characters um: we need sys/user.h only on i386 um: merge delay_{32,64}.c um: distribute exports to where exported stuff is defined um: kill system-um.h um: generic ftrace.h will do... um: segment.h is x86-only and needed only there um: asm/pda.h is not needed anymore um: hw_irq.h can go generic as well um: switch to generic-y um: clean Kconfig up a bit um: a couple of missing dependencies... um: kill useless argument of free_chan() and free_one_chan() um: unify ptrace_user.h um: unify KSTK_... um: fix gcov build breakage ...	2011-11-02 09:45:39 -07:00
Al Viro	548fd1e8db	um: kill useless include of user.h everything in USER_OBJ gets it via -include user.h Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Richard Weinberger <richard@nod.at>	2011-11-02 14:15:13 +01:00
Sage Weil	f0023bc617	vfs: add d_prune dentry operation This adds a d_prune dentry operation that is called by the VFS prior to pruning (i.e. unhashing and killing) a hashed dentry from the dcache. Wrap dentry_lru_del() and use the new _prune() helper in the cases where we are about to unhash and kill the dentry. This will be used by Ceph to maintain a flag indicating whether the complete contents of a directory are contained in the dcache, allowing it to satisfy lookups and readdir without addition server communication. Renumber a few DCACHE_* #defines to group DCACHE_OP_PRUNE with the other DCACHE_OP_ bits. Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-11-02 12:53:43 +01:00
Miklos Szeredi	a78ef704a8	vfs: protect i_nlink Prevent direct modification of i_nlink by making it const and adding a non-const __i_nlink alias. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Tested-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-11-02 12:53:43 +01:00
Miklos Szeredi	bfe8684869	filesystems: add set_nlink() Replace remaining direct i_nlink updates with a new set_nlink() updater function. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Tested-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-11-02 12:53:43 +01:00
Miklos Szeredi	6d6b77f163	filesystems: add missing nlink wrappers Replace direct i_nlink updates with the respective updater function (inc_nlink, drop_nlink, clear_nlink, inode_dec_link_count). Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2011-11-02 12:53:43 +01:00
Miklos Szeredi	dd2a981f46	logfs: remove unnecessary nlink setting alloc_inode() initializes i_nlink to 1. Remove unnecessary re-initialization. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> CC: Joern Engel <joern@logfs.org> CC: Prasad Joshi <prasadjoshi.linux@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-11-02 12:53:42 +01:00
Miklos Szeredi	a7732b05f7	ocfs2: remove unnecessary nlink setting alloc_inode() initializes i_nlink to 1. Remove unnecessary re-initialization. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> CC: Joel Becker <jlbec@evilplan.org> CC: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-11-02 12:53:42 +01:00
Miklos Szeredi	c818b682c9	jfs: remove unnecessary nlink setting alloc_inode() initializes i_nlink to 1. Remove unnecessary re-initialization. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-11-02 12:53:42 +01:00
Miklos Szeredi	2833eb2b46	vfs: ignore error on forced remount On emergency remount we want to force MS_RDONLY on the super block even if ->remount_fs() failed for some reason. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Tested-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-11-02 12:53:42 +01:00
Andy Whitcroft	1fa1e7f615	readlinkat: ensure we return ENOENT for the empty pathname for normal lookups Since the commit below which added O_PATH support to the *at() calls, the error return for readlink/readlinkat for the empty pathname has switched from ENOENT to EINVAL: commit `65cfc67223` Author: Al Viro <viro@zeniv.linux.org.uk> Date: Sun Mar 13 15:56:26 2011 -0400 readlinkat(), fchownat() and fstatat() with empty relative pathnames This is both unexpected for userspace and makes readlink/readlinkat inconsistant with all other interfaces; and inconsistant with our stated return for these pathnames. As the readlinkat call does not have a flags parameter we cannot use the AT_EMPTY_PATH approach used in the other calls. Therefore expose whether the original path is infact entry via a new user_path_at_empty() path lookup function. Use this to determine whether to default to EINVAL or ENOENT for failures. Addresses http://bugs.launchpad.net/bugs/817187 [akpm@linux-foundation.org: remove unused getname_flags()] Signed-off-by: Andy Whitcroft <apw@canonical.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-11-02 12:53:42 +01:00
Konstantin Khlebnikov	32096ea1aa	vfs: fix dentry leak in simple_fill_super() put dentry if inode allocation failed, d_genocide() cannot release it Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-11-02 12:53:42 +01:00
Eryu Guan	f2a44523b2	jbd2: Unify log messages in jbd2 code Some jbd2 code prints out kernel messages with "JBD2: " prefix, at the same time other jbd2 code prints with "JBD: " prefix. Unify the prefix to "JBD2: ". Signed-off-by: Eryu Guan <guaneryu@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-11-01 19:09:18 -04:00
Eryu Guan	8762202dd0	jbd/jbd2: validate sb->s_first in journal_get_superblock() I hit a J_ASSERT(blocknr != 0) failure in cleanup_journal_tail() when mounting a fsfuzzed ext3 image. It turns out that the corrupted ext3 image has s_first = 0 in journal superblock, and the 0 is passed to journal->j_head in journal_reset(), then to blocknr in cleanup_journal_tail(), in the end the J_ASSERT failed. So validate s_first after reading journal superblock from disk in journal_get_superblock() to ensure s_first is valid. The following script could reproduce it: fstype=ext3 blocksize=1024 img=$fstype.img offset=0 found=0 magic="c0 3b 39 98" dd if=/dev/zero of=$img bs=1M count=8 mkfs -t $fstype -b $blocksize -F $img filesize=`stat -c %s $img` while [ $offset -lt $filesize ] do if od -j $offset -N 4 -t x1 $img \| grep -i "$magic";then echo "Found journal: $offset" found=1 break fi offset=`echo "$offset+$blocksize" \| bc` done if [ $found -ne 1 ];then echo "Magic \"$magic\" not found" exit 1 fi dd if=/dev/zero of=$img seek=$(($offset+23)) conv=notrunc bs=1 count=1 mkdir -p ./mnt mount -o loop $img ./mnt Cc: Jan Kara <jack@suse.cz> Signed-off-by: Eryu Guan <guaneryu@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-11-01 19:04:59 -04:00
Yongqiang Yang	bf52c6f7af	ext4: let ext4_ext_rm_leaf work with EXT_DEBUG defined The variable 'block' is removed by commit `750c9c47`, so use the replacement ex_ee_block instead. Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-11-01 18:59:26 -04:00
Yongqiang Yang	32de675690	ext4: fix a syntax error in ext4_ext_insert_extent when debugging enabled This patch fixes a syntax error which omits a comma. Besides this, logical block number is unsigend 32 bits, so printk should use %u instead %d. Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-11-01 18:56:41 -04:00
Benny Halevy	fc0d14fe2d	nfsd4: typo logical vs bitwise negate in nfsd4_decode_share_access Signed-off-by: Benny Halevy <bhalevy@tonian.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-11-01 18:06:43 -04:00
Linus Torvalds	1c39865151	Merge branch 'pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux * 'pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux: pstore: make pstore write function return normal success/fail value pstore: change mutex locking to spin_locks pstore: defer inserting OOPS entries into pstore	2011-11-01 10:52:29 -07:00
Eric W. Biederman	f6d90b4f9c	sysfs: Make sysfs_rename safe with sysfs_dirents in rbtrees. In sysfs_rename we need to remove the optimization of not calling sysfs_unlink_sibling and sysfs_link_sibling if the renamed parent directory is not changing. This optimization is no longer valid now that sysfs dirents are stored in an rbtree sorted by name. Move the assignment of s_ns before the call of sysfs_link_sibling. With no sysfs_dirent fields changing after the call of sysfs_link_sibling this allows sysfs_link_sibling to take any of the directory entries into account when it builds the rbtrees, and s_ns looks like a prime canidate to be used in the rbtree in the future. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Jiri Slaby <jirislaby@gmail.com> Cc: Greg KH <gregkh@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-11-01 09:16:14 -07:00
Nelson Elhage	d8805e633e	epoll: fix spurious lockdep warnings epoll can acquire recursively acquire ep->mtx on multiple "struct eventpoll"s at once in the case where one epoll fd is monitoring another epoll fd. This is perfectly OK, since we're careful about the lock ordering, but it causes spurious lockdep warnings. Annotate the recursion using mutex_lock_nested, and add a comment explaining the nesting rules for good measure. Recent versions of systemd are triggering this, and it can also be demonstrated with the following trivial test program: --------------------8<-------------------- int main(void) { int e1, e2; struct epoll_event evt = { .events = EPOLLIN }; e1 = epoll_create1(0); e2 = epoll_create1(0); epoll_ctl(e1, EPOLL_CTL_ADD, e2, &evt); return 0; } --------------------8<-------------------- Reported-by: Paul Bolle <pebolle@tiscali.nl> Tested-by: Paul Bolle <pebolle@tiscali.nl> Signed-off-by: Nelson Elhage <nelhage@nelhage.com> Acked-by: Jason Baron <jbaron@redhat.com> Cc: Dave Jones <davej@redhat.com> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: <stable@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-10-31 17:30:57 -07:00
Andy Shevchenko	0a90e0f101	fat: follow rename pack_hex_byte() to hex_byte_pack() There is no functional change. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-10-31 17:30:57 -07:00
Joe Perches	b9075fa968	treewide: use __printf not __attribute__((format(printf,...))) Standardize the style for compiler based printf format verification. Standardized the location of __printf too. Done via script and a little typing. $ grep -rPl --include=.[ch] -w "__attribute__" \| \ grep -vP "^(tools\|scripts\|include/linux/compiler-gcc.h)" \| \ xargs perl -n -i -e 'local $/; while (<>) { s/\b__attribute__\s$\s\(\sformat\s\(\sprintf\s,\s(.+)\s,\s(.+)\s$\s\)\s\)/__printf($1, $2)/g ; print; }' [akpm@linux-foundation.org: revert arch bits] Signed-off-by: Joe Perches <joe@perches.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-10-31 17:30:54 -07:00
Pavel Emelyanov	d70ef97baf	fs/pipe.c: add ->statfs callback for pipefs Currently a statfs on a pipe's /proc/<pid>/fd/ link returns -ENOSYS. Wire pipfs up so that the statfs succeeds. This is required by checkpoint-restart in the userspace to make it possible to distinguish pipes from fifos. When we dump information about task's open files we use the /proc/pid/fd directoy's symlinks and the fact that opening any of them gives us exactly the same dentry->inode pair as the original process has. Now if a task we're dumping has opened pipe and fifo we need to detect this and act accordingly. Knowing that an fd with type S_ISFIFO resides on a pipefs is the most precise way. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Reviewed-by: Tejun Heo <tj@kernel.org> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-10-31 17:30:51 -07:00
Tao Ma	72a2ebd8bc	fs/buffer.c: add device information for error output in __find_get_block_slow() On the ext4 mailing list[1], we got some report about errors in __find_get_block_slow(), but the information is very limited. If the device information is given, we can know the name of the sick volume. Futhermore, we can get the corresponding status of that block(group, inode block etc) by analyzing the disk layout. [1] http://marc.info/?l=linux-ext4&m=131379831421147&w=2 Signed-off-by: Tao Ma <boyu.mt@taobao.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-10-31 17:30:49 -07:00
Mikulas Patocka	09f363c736	vmscan: fix shrinker callback bug in fs/super.c The callback must not return -1 when nr_to_scan is zero. Fix the bug in fs/super.c and add this requirement to the callback specification. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-10-31 17:30:49 -07:00
Akinobu Mita	798248206b	lib/string.c: introduce memchr_inv() memchr_inv() is mainly used to check whether the whole buffer is filled with just a specified byte. The function name and prototype are stolen from logfs and the implementation is from SLUB. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Acked-by: Christoph Lameter <cl@linux-foundation.org> Acked-by: Pekka Enberg <penberg@kernel.org> Cc: Matt Mackall <mpm@selenic.com> Acked-by: Joern Engel <joern@logfs.org> Cc: Marcin Slusarz <marcin.slusarz@gmail.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-10-31 17:30:47 -07:00
Mel Gorman	966dbde2c2	ext4: warn if direct reclaim tries to writeback pages Direct reclaim should never writeback pages. Warn if an attempt is made. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Alex Elder <aelder@sgi.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-10-31 17:30:46 -07:00
Mel Gorman	94054fa3fc	xfs: warn if direct reclaim tries to writeback pages Direct reclaim should never writeback pages. For now, handle the situation and warn about it. Ultimately, this will be a BUG_ON. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Alex Elder <aelder@sgi.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-10-31 17:30:46 -07:00
Christoph Lameter	bc3e53f682	mm: distinguish between mlocked and pinned pages Some kernel components pin user space memory (infiniband and perf) (by increasing the page count) and account that memory as "mlocked". The difference between mlocking and pinning is: A. mlocked pages are marked with PG_mlocked and are exempt from swapping. Page migration may move them around though. They are kept on a special LRU list. B. Pinned pages cannot be moved because something needs to directly access physical memory. They may not be on any LRU list. I recently saw an mlockalled process where mm->locked_vm became bigger than the virtual size of the process (!) because some memory was accounted for twice: Once when the page was mlocked and once when the Infiniband layer increased the refcount because it needt to pin the RDMA memory. This patch introduces a separate counter for pinned pages and accounts them seperately. Signed-off-by: Christoph Lameter <cl@linux.com> Cc: Mike Marciniszyn <infinipath@qlogic.com> Cc: Roland Dreier <roland@kernel.org> Cc: Sean Hefty <sean.hefty@intel.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-10-31 17:30:46 -07:00
Robert P. J. Day	f5fc870da2	tmpfs: add "tmpfs" to the Kconfig prompt to make it obvious. Add the leading word "tmpfs" to the Kconfig string to make it blindingly obvious that this selection refers to tmpfs. Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-10-31 17:30:45 -07:00
David Rientjes	c9f01245b6	oom: remove oom_disable_count This removes mm->oom_disable_count entirely since it's unnecessary and currently buggy. The counter was intended to be per-process but it's currently decremented in the exit path for each thread that exits, causing it to underflow. The count was originally intended to prevent oom killing threads that share memory with threads that cannot be killed since it doesn't lead to future memory freeing. The counter could be fixed to represent all threads sharing the same mm, but it's better to remove the count since: - it is possible that the OOM_DISABLE thread sharing memory with the victim is waiting on that thread to exit and will actually cause future memory freeing, and - there is no guarantee that a thread is disabled from oom killing just because another thread sharing its mm is oom disabled. Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Ying Han <yinghan@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-10-31 17:30:45 -07:00
Christopher Yeoh	fcf634098c	Cross Memory Attach The basic idea behind cross memory attach is to allow MPI programs doing intra-node communication to do a single copy of the message rather than a double copy of the message via shared memory. The following patch attempts to achieve this by allowing a destination process, given an address and size from a source process, to copy memory directly from the source process into its own address space via a system call. There is also a symmetrical ability to copy from the current process's address space into a destination process's address space. - Use of /proc/pid/mem has been considered, but there are issues with using it: - Does not allow for specifying iovecs for both src and dest, assuming preadv or pwritev was implemented either the area read from or written to would need to be contiguous. - Currently mem_read allows only processes who are currently ptrace'ing the target and are still able to ptrace the target to read from the target. This check could possibly be moved to the open call, but its not clear exactly what race this restriction is stopping (reason appears to have been lost) - Having to send the fd of /proc/self/mem via SCM_RIGHTS on unix domain socket is a bit ugly from a userspace point of view, especially when you may have hundreds if not (eventually) thousands of processes that all need to do this with each other - Doesn't allow for some future use of the interface we would like to consider adding in the future (see below) - Interestingly reading from /proc/pid/mem currently actually involves two copies! (But this could be fixed pretty easily) As mentioned previously use of vmsplice instead was considered, but has problems. Since you need the reader and writer working co-operatively if the pipe is not drained then you block. Which requires some wrapping to do non blocking on the send side or polling on the receive. In all to all communication it requires ordering otherwise you can deadlock. And in the example of many MPI tasks writing to one MPI task vmsplice serialises the copying. There are some cases of MPI collectives where even a single copy interface does not get us the performance gain we could. For example in an MPI_Reduce rather than copy the data from the source we would like to instead use it directly in a mathops (say the reduce is doing a sum) as this would save us doing a copy. We don't need to keep a copy of the data from the source. I haven't implemented this, but I think this interface could in the future do all this through the use of the flags - eg could specify the math operation and type and the kernel rather than just copying the data would apply the specified operation between the source and destination and store it in the destination. Although we don't have a "second user" of the interface (though I've had some nibbles from people who may be interested in using it for intra process messaging which is not MPI). This interface is something which hardware vendors are already doing for their custom drivers to implement fast local communication. And so in addition to this being useful for OpenMPI it would mean the driver maintainers don't have to fix things up when the mm changes. There was some discussion about how much faster a true zero copy would go. Here's a link back to the email with some testing I did on that: http://marc.info/?l=linux-mm&m=130105930902915&w=2 There is a basic man page for the proposed interface here: http://ozlabs.org/~cyeoh/cma/process_vm_readv.txt This has been implemented for x86 and powerpc, other architecture should mainly (I think) just need to add syscall numbers for the process_vm_readv and process_vm_writev. There are 32 bit compatibility versions for 64-bit kernels. For arch maintainers there are some simple tests to be able to quickly verify that the syscalls are working correctly here: http://ozlabs.org/~cyeoh/cma/cma-test-20110718.tgz Signed-off-by: Chris Yeoh <yeohc@au1.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Howells <dhowells@redhat.com> Cc: James Morris <jmorris@namei.org> Cc: <linux-man@vger.kernel.org> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-10-31 17:30:44 -07:00
Andrew Morton	fc360bd9cd	/proc/self/numa_maps: restore "huge" tag for hugetlb vmas The display of the "huge" tag was accidentally removed in `29ea2f698` ("mm: use walk_page_range() instead of custom page table walking code"). Reported-by: Stephen Hemminger <shemminger@vyatta.com> Tested-by: Stephen Hemminger <shemminger@vyatta.com> Reviewed-by: Stephen Wilson <wilsons@start.ca> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: <stable@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-10-31 17:30:44 -07:00
Paul Gortmaker	143cb494cb	fs: add module.h to files that were implicitly using it Some files were using the complete module.h infrastructure without actually including the header at all. Fix them up in advance so once the implicit presence is removed, we won't get failures like this: CC [M] fs/nfsd/nfssvc.o fs/nfsd/nfssvc.c: In function 'nfsd_create_serv': fs/nfsd/nfssvc.c:335: error: 'THIS_MODULE' undeclared (first use in this function) fs/nfsd/nfssvc.c:335: error: (Each undeclared identifier is reported only once fs/nfsd/nfssvc.c:335: error: for each function it appears in.) fs/nfsd/nfssvc.c: In function 'nfsd': fs/nfsd/nfssvc.c:555: error: implicit declaration of function 'module_put_and_exit' make[3]: *** [fs/nfsd/nfssvc.o] Error 1 Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>	2011-10-31 19:30:31 -04:00
Paul Gortmaker	afeacc8c1f	fs: add export.h to files using EXPORT_SYMBOL/THIS_MODULE macros These files were getting <linux/module.h> via an implicit include path, but we want to crush those out of existence since they cost time during compiles of processing thousands of lines of headers for no reason. Give them the lightweight header that just contains the EXPORT_SYMBOL infrastructure. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>	2011-10-31 19:30:31 -04:00
Robin Dong	ff3fc1736f	ext4: fix a typo in struct ext4_allocation_context This patch changes "bext" to "best". Signed-off-by: Robin Dong <sanbai@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-10-31 18:55:50 -04:00
Jan Kara	ed47a7d00c	udf: Cleanup metadata flags handling Use simple ->s_flags variable instead of u8 variable for each flag. Signed-off-by: Jan Kara <jack@suse.cz>	2011-10-31 23:49:48 +01:00
Namjae Jeon	3080a74ea3	udf: Skip mirror metadata FE loading when metadata FE is ok It is not necessary to load mirror metadata FE when metadata FE is OK. So try to read it only the first time udf_get_pblock_meta25() fails to map the block from metadata FE. Signed-off-by: Ashish Sangwan <ashishsangwan2@gmail.com> Signed-off-by: Namjae Jeon <linkinjeon@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>	2011-10-31 23:49:30 +01:00
Dmitry Monakhov	6360e21f94	ext3: Allow quota file use root reservation Quota file is fs's metadata, so it is reasonable to permit use root resevation if necessary. This patch fix 265'th xfstest failure Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Jan Kara <jack@suse.cz>	2011-10-31 23:43:59 +01:00
Jan Kara	0aaa618863	quota: Drop path reference on error exit from quotactl One error exit from quotactl forgot to do path_put(). Fix that. Reported-by: Valerie Aurora <val@vaaconsulting.com> Signed-off-by: Jan Kara <jack@suse.cz>	2011-10-31 23:43:59 +01:00
Joe Perches	a983f368f8	udf: Neaten udf_debug uses Just whitespace and argument alignment. Introduce some checkpatch warnings that deserve to be ignored. Reviewed-by: NamJae Jeon <linkinjeon@gmail.com> Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Jan Kara <jack@suse.cz>	2011-10-31 23:43:58 +01:00
Joe Perches	c2bff36c29	udf: Neaten logging output, use vsprintf extension %pV Use %pV and remove a static buffer to save some text space and fix possible issues when several processes call error reporting function in parallel. Also change error level from KERN_CRIT to KERN_ERR. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Jan Kara <jack@suse.cz>	2011-10-31 23:43:58 +01:00
Joe Perches	78ace70c41	udf: Convert printks to pr_<level> Use the current logging styles. Convert a few printks that should have been udf_warn and udf_err. Coalesce formats. Add #define pr_fmt. Move an #include "udfdecls.h" above other includes in udftime.c so pr_fmt works correctly. Strip prefixes from conversions as appropriate. Reorder logging definitions in udfdecl.h Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Jan Kara <jack@suse.cz>	2011-10-31 23:43:52 +01:00
Greg Harm	3c6fe77017	ext4: Don't normalize an falloc request if it can fit in 1 extent. If an fallocate request fits in EXT_UNINIT_MAX_LEN, then set the EXT4_GET_BLOCKS_NO_NORMALIZE flag. For larger fallocate requests, let mballoc.c normalize the request. This fixes a problem where large requests were being split into non-contiguous extents due to commit 556b27abf73: ext4: do not normalize block requests from fallocate. Testing: ) Checked that 8.x MB falloc'ed files are still laid down next to each other (contiguously). ) Checked that the maximum size extent (127.9MB) is allocated as 1 extent. ) Checked that a 1GB file is somewhat contiguous (often 5-6 non-contiguous extents now). ) Checked that a 120MB file can still be falloc'ed even if there are no single extents large enough to hold it. Signed-off-by: Greg Harm <gharm@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-10-31 18:41:47 -04:00
Eryu Guan	4af8350899	ext4: remove comments about extent mount option in ext4_new_inode() Remove comments about 'extent' mount option in ext4_new_inode(), since it's no longer exists. Signed-off-by: Eryu Guan <guaneryu@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-10-31 18:21:29 -04:00
Yongqiang Yang	edb5ac8993	ext4: let ext4_discard_partial_buffers handle unaligned range correctly As comment says, we should handle unaligned range rather than aligned one. This fixes a bug found by running xfstests #91. Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>	2011-10-31 18:04:38 -04:00
Yongqiang Yang	5129d05fda	ext4: return ENOMEM if find_or_create_pages fails Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-10-31 17:56:10 -04:00
Yongqiang Yang	e260daf279	ext4: move vars to local scope in ext4_discard_partial_page_buffers_no_lock() Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-10-31 17:54:36 -04:00
Tao Ma	0edeb71dc9	ext4: Create helper function for EXT4_IO_END_UNWRITTEN and i_aiodio_unwritten EXT4_IO_END_UNWRITTEN flag set and the increase of i_aiodio_unwritten should be done simultaneously since ext4_end_io_nolock always clear the flag and decrease the counter in the same time. We have found some bugs that the flag is set while leaving i_aiodio_unwritten unchanged(commit `32c80b32c0`). So this patch just tries to create a helper function to wrap them to avoid any future bug. The idea is inspired by Eric. Cc: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-10-31 17:30:44 -04:00
Chuck Lever	e414966b81	NFS: Remove no-op less-than-zero checks on unsigned variables. Introduced by commit `16b374ca` "NFSv4.1: pnfs: filelayout: add driver's LAYOUTGET and GETDEVICEINFO infrastructure" (October 20, 2010). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-10-31 11:52:47 -04:00
Chuck Lever	c6e6966602	NFS: Clean up nfs4_xdr_dec_secinfo() Clean up: Remove superfluous logic at the tail of nfs4_xdr_dec_secinfo() . Introduced by commit `5a5ea0d4` "NFS: Add secinfo procedure" (March 24, 2011). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-10-31 11:52:47 -04:00
Chuck Lever	c02f557dd0	NFS: Fix documenting comment for nfs_create_request() Clean up: the first parameter of nfs_create_request() has been incorrectly documented since time immemorial (OK, since before 2.6.12). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-10-31 11:52:47 -04:00
Peng Tao	d743c3c9c2	NFS4: fix cb_recallany decode error craa_type_mask is bitmap4 per RFC5661. We need to expect a length before extracting bitmap value. Cc: Alexandros Batsakis <batsakis@netapp.com> Signed-off-by: Peng Tao <peng_tao@emc.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-10-31 11:51:28 -04:00
Peng Tao	92407e75ce	nfs4: serialize layoutcommit Current pnfs_layoutcommit_inode can not handle parallel layoutcommit. And as Trond suggested , there is no need for client to optimize for parallel layoutcommit. So add NFS_INO_LAYOUTCOMMITTING flag to mark inflight layoutcommit and serialize lalyoutcommit with it. Also mark_inode_dirty_sync if pnfs_layoutcommit_inode fails to issue layoutcommit. Reported-by: Vitaliy Gusev <gusev.vitaliy@nexenta.com> Signed-off-by: Peng Tao <peng_tao@emc.com> Signed-off-by: Jim Rees <rees@umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-10-31 11:51:28 -04:00
Theodore Ts'o	b82e384c7b	ext4: optimize locking for end_io extent conversion Now that we are doing the locking correctly, we need to grab the i_completed_io_lock() twice per end_io. We can clean this up by removing the structure from the i_complted_io_list, and use this as the locking mechanism to prevent ext4_flush_completed_IO() racing against ext4_end_io_work(), instead of clearing the EXT4_IO_END_UNWRITTEN in io->flag. In addition, if the ext4_convert_unwritten_extents() returns an error, we no longer keep the end_io structure on the linked list. This doesn't help, because it tends to lock up the file system and wedges the system. That's one way to call attention to the problem, but it doesn't help the overall robustness of the system. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-10-31 10:56:32 -04:00
Theodore Ts'o	4e29802121	ext4: remove unnecessary call to waitqueue_active() The usage of waitqueue_active() is not necessary, and introduces (I believe) a hard-to-hit race. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-10-30 18:41:19 -04:00
Tao Ma	d73d5046a7	ext4: Use correct locking for ext4_end_io_nolock() We must hold i_completed_io_lock when manipulating anything on the i_completed_io_list linked list. This includes io->lock, which we were checking in ext4_end_io_nolock(). So move this check to ext4_end_io_work(). This also has the bonus of avoiding extra work if it is already done without needing to take the mutex. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-10-30 18:26:08 -04:00
Curt Wohlgemuth	0e175a1835	writeback: Add a 'reason' to wb_writeback_work This creates a new 'reason' field in a wb_writeback_work structure, which unambiguously identifies who initiates writeback activity. A 'wb_reason' enumeration has been added to writeback.h, to enumerate the possible reasons. The 'writeback_work_class' and tracepoint event class and 'writeback_queue_io' tracepoints are updated to include the symbolic 'reason' in all trace events. And the 'writeback_inodes_sbXXX' family of routines has had a wb_stats parameter added to them, so callers can specify why writeback is being started. Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Curt Wohlgemuth <curtw@google.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>	2011-10-31 00:33:36 +08:00
Curt Wohlgemuth	ad4e38dd6a	writeback: send work item to queue_io, move_expired_inodes Instead of sending ->older_than_this to queue_io() and move_expired_inodes(), send the entire wb_writeback_work structure. There are other fields of a work item that are useful in these routines and in tracepoints. Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Curt Wohlgemuth <curtw@google.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>	2011-10-31 00:33:27 +08:00
Shirish Pargaonkar	9ef5992e44	cifs: Assume passwords are encoded according to iocharset (try #2 ) Re-posting a patch originally posted by Oskar Liljeblad after rebasing on 3.2. Modify cifs to assume that the supplied password is encoded according to iocharset. Before this patch passwords would be treated as raw 8-bit data, which made authentication with Unicode passwords impossible (at least passwords with characters > 0xFF). The previous code would as a side effect accept passwords encoded with ISO 8859-1, since Unicode < 0x100 basically is ISO 8859-1. Software which relies on that will no longer support password chars > 0x7F unless it also uses iocharset=iso8859-1. (mount.cifs does not care about the encoding so it will work as expected.) Signed-off-by: Oskar Liljeblad <oskar@osk.mine.nu> Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com> Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru> Tested-by: A <nimbus1_03087@yahoo.com> Signed-off-by: Steve French <smfrench@gmail.com>	2011-10-29 22:06:54 -05:00
Pavel Shilovsky	5079276066	CIFS: Fix the VFS brlock cache usage in posix locking case Request to the cache in FL_POSIX case only. Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Steve French <smfrench@gmail.com>	2011-10-29 22:03:14 -05:00
Eric Sandeen	6d6a435190	ext4: fix race in xattr block allocation path Ceph users reported that when using Ceph on ext4, the filesystem would often become corrupted, containing inodes with incorrect i_blocks counters. I managed to reproduce this with a very hacked-up "streamtest" binary from the Ceph tree. Ceph is doing a lot of xattr writes, to out-of-inode blocks. There is also another thread which does sync_file_range and close, of the same files. The problem appears to happen due to this race: sync/flush thread xattr-set thread ----------------- ---------------- do_writepages ext4_xattr_set ext4_da_writepages ext4_xattr_set_handle mpage_da_map_blocks ext4_xattr_block_set set DELALLOC_RESERVE ext4_new_meta_blocks ext4_mb_new_blocks if (!i_delalloc_reserved_flag) vfs_dq_alloc_block ext4_get_blocks down_write(i_data_sem) set i_delalloc_reserved_flag ... up_write(i_data_sem) if (i_delalloc_reserved_flag) vfs_dq_alloc_block_nofail In other words, the sync/flush thread pops in and sets i_delalloc_reserved_flag on the inode, which makes the xattr thread think that it's in a delalloc path in ext4_new_meta_blocks(), and add the block for a second time, after already having added it once in the !i_delalloc_reserved_flag case in ext4_mb_new_blocks The real problem is that we shouldn't be using the DELALLOC_RESERVED state flag, and instead we should be passing EXT4_GET_BLOCKS_DELALLOC_RESERVE down to ext4_map_blocks() instead of using an inode state flag. We'll fix this for now with using i_data_sem to prevent this race, but this is really not the right way to fix things. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org	2011-10-29 10:15:35 -04:00
Yongqiang Yang	e7b319e397	ext4: trace punch_hole correctly in ext4_ext_map_blocks When ext4_ext_map_blocks() is called by punch_hole, trace should trace blocks punched out. Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-10-29 09:39:51 -04:00
Yongqiang Yang	02dc62fba8	ext4: clean up AGGRESSIVE_TEST code Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-10-29 09:29:11 -04:00
Yongqiang Yang	81fdbb4a8d	ext4: move variables to their scope Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-10-29 09:23:38 -04:00
Dmitry Monakhov	5cb81dabcc	ext4: fix quota accounting during migration The tmp_inode should have same uid/gid as the original inode. Otherwise new metadata blocks will be accounted to wrong quota-id, which will result in a quota leak after the inode migration is completed. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-10-29 09:05:00 -04:00
Dmitry Monakhov	fba90ffee8	ext4: migrate cleanup This patch cleanup code a bit, actual logic not changed - Move current block pointer to migrate_structure, let's all walk info will be in one structure. - Get rid of usless null ind-block ptr checks, caller already does that check. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-10-29 09:03:00 -04:00
Linus Torvalds	97d2eb13a0	Merge branch 'for-linus' of git://ceph.newdream.net/git/ceph-client * 'for-linus' of git://ceph.newdream.net/git/ceph-client: libceph: fix double-free of page vector ceph: fix 32-bit ino numbers libceph: force resend of osd requests if we skip an osdmap ceph: use kernel DNS resolver ceph: fix ceph_monc_init memory leak ceph: let the set_layout ioctl set single traits Revert "ceph: don't truncate dirty pages in invalidate work thread" ceph: replace leading spaces with tabs libceph: warn on msg allocation failures libceph: don't complain on msgpool alloc failures libceph: always preallocate mon connection libceph: create messenger with client ceph: document ioctls ceph: implement (optional) max read size ceph: rename rsize -> rasize ceph: make readpages fully async	2011-10-28 16:42:18 -07:00
Steve French	8ea00c6977	[CIFS] Update cifs version to 1.76 Update cifs version to 1.76 now that async read, lock caching, and changes to oplock enabled interface are in. Thanks to Pavel for reminding me. Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Steve French <smfrench@gmail.com>	2011-10-28 14:49:46 -05:00
Pavel Shilovsky	d12799b4c3	CIFS: Remove extra mutex_unlock in cifs_lock_add_if to prevent the mutex being unlocked twice if we interrupt a blocked lock. Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Steve French <smfrench@gmail.com>	2011-10-28 14:09:23 -05:00
Linus Torvalds	f362f98e7c	Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/vfs-queue * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/vfs-queue: (21 commits) leases: fix write-open/read-lease race nfs: drop unnecessary locking in llseek ext4: replace cut'n'pasted llseek code with generic_file_llseek_size vfs: add generic_file_llseek_size vfs: do (nearly) lockless generic_file_llseek direct-io: merge direct_io_walker into __blockdev_direct_IO direct-io: inline the complete submission path direct-io: separate map_bh from dio direct-io: use a slab cache for struct dio direct-io: rearrange fields in dio/dio_submit to avoid holes direct-io: fix a wrong comment direct-io: separate fields only used in the submission path from struct dio vfs: fix spinning prevention in prune_icache_sb vfs: add a comment to inode_permission() vfs: pass all mask flags check_acl and posix_acl_permission vfs: add hex format for MAY_* flag values vfs: indicate that the permission functions take all the MAY_* flags compat: sync compat_stats with statfs. vfs: add "device" tag to /proc/self/mountstats cleanup: vfs: small comment fix for block_invalidatepage ... Fix up trivial conflict in fs/gfs2/file.c (llseek changes)	2011-10-28 10:49:34 -07:00
Linus Torvalds	f793f29611	Merge http://sucs.org/~rohan/git/gfs2-3.0-nmw * http://sucs.org/~rohan/git/gfs2-3.0-nmw: (24 commits) GFS2: Move readahead of metadata during deallocation into its own function GFS2: Remove two unused variables GFS2: Misc fixes GFS2: rewrite fallocate code to write blocks directly GFS2: speed up delete/unlink performance for large files GFS2: Fix off-by-one in gfs2_blk2rgrpd GFS2: Clean up ->page_mkwrite GFS2: Correctly set goal block after allocation GFS2: Fix AIL flush issue during fsync GFS2: Use cached rgrp in gfs2_rlist_add() GFS2: Call do_strip() directly from recursive_scan() GFS2: Remove obsolete assert GFS2: Cache the most recently used resource group in the inode GFS2: Make resource groups "append only" during life of fs GFS2: Use rbtree for resource groups and clean up bitmap buffer ref count scheme GFS2: Fix lseek after SEEK_DATA, SEEK_HOLE have been added GFS2: Clean up gfs2_create GFS2: Use ->dirty_inode() GFS2: Fix bug trap and journaled data fsync GFS2: Fix inode allocation error path ...	2011-10-28 10:44:50 -07:00
Linus Torvalds	dabcbb1bae	Merge branch '3.2-without-smb2' of git://git.samba.org/sfrench/cifs-2.6 * '3.2-without-smb2' of git://git.samba.org/sfrench/cifs-2.6: (52 commits) Fix build break when freezer not configured Add definition for share encryption CIFS: Make cifs_push_locks send as many locks at once as possible CIFS: Send as many mandatory unlock ranges at once as possible CIFS: Implement caching mechanism for posix brlocks CIFS: Implement caching mechanism for mandatory brlocks CIFS: Fix DFS handling in cifs_get_file_info CIFS: Fix error handling in cifs_readv_complete [CIFS] Fixup trivial checkpatch warning [CIFS] Show nostrictsync and noperm mount options in /proc/mounts cifs, freezer: add wait_event_freezekillable and have cifs use it cifs: allow cifs_max_pending to be readable under /sys/module/cifs/parameters cifs: tune bdi.ra_pages in accordance with the rsize cifs: allow for larger rsize= options and change defaults cifs: convert cifs_readpages to use async reads cifs: add cifs_async_readv cifs: fix protocol definition for READ_RSP cifs: add a callback function to receive the rest of the frame cifs: break out 3rd receive phase into separate function cifs: find mid earlier in receive codepath ...	2011-10-28 10:43:32 -07:00
Linus Torvalds	5619a69396	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: (69 commits) xfs: add AIL pushing tracepoints xfs: put in missed fix for merge problem xfs: do not flush data workqueues in xfs_flush_buftarg xfs: remove XFS_bflush xfs: remove xfs_buf_target_name xfs: use xfs_ioerror_alert in xfs_buf_iodone_callbacks xfs: clean up xfs_ioerror_alert xfs: clean up buffer allocation xfs: remove buffers from the delwri list in xfs_buf_stale xfs: remove XFS_BUF_STALE and XFS_BUF_SUPER_STALE xfs: remove XFS_BUF_SET_VTYPE and XFS_BUF_SET_VTYPE_REF xfs: remove XFS_BUF_FINISH_IOWAIT xfs: remove xfs_get_buftarg_list xfs: fix buffer flushing during unmount xfs: optimize fsync on directories xfs: reduce the number of log forces from tail pushing xfs: Don't allocate new buffers on every call to _xfs_buf_find xfs: simplify xfs_trans_ijoin* again xfs: unlock the inode before log force in xfs_change_file_space xfs: unlock the inode before log force in xfs_fs_nfs_commit_metadata ...	2011-10-28 10:31:42 -07:00
J. Bruce Fields	f3c7691e8d	leases: fix write-open/read-lease race In setlease, we use i_writecount to decide whether we can give out a read lease. In open, we break leases before incrementing i_writecount. There is therefore a window between the break lease and the i_writecount increment when setlease could add a new read lease. This would leave us with a simultaneous write open and read lease, which shouldn't happen. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-10-28 14:59:00 +02:00
Andi Kleen	79835a710d	nfs: drop unnecessary locking in llseek This makes NFS follow the standard generic_file_llseek locking scheme. Cc: Trond.Myklebust@netapp.com Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-10-28 14:59:00 +02:00
Andi Kleen	4cce0e28b9	ext4: replace cut'n'pasted llseek code with generic_file_llseek_size This gives ext4 the benefits of unlocked llseek. Cc: tytso@mit.edu Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-10-28 14:58:59 +02:00
Andi Kleen	5760495a87	vfs: add generic_file_llseek_size Add a generic_file_llseek variant to the VFS that allows passing in the maximum file size of the file system, instead of always using maxbytes from the superblock. This can be used to eliminate some cut'n'paste seek code in ext4. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-10-28 14:58:59 +02:00
Andi Kleen	ef3d0fd27e	vfs: do (nearly) lockless generic_file_llseek The i_mutex lock use of generic _file_llseek hurts. Independent processes accessing the same file synchronize over a single lock, even though they have no need for synchronization at all. Under high utilization this can cause llseek to scale very poorly on larger systems. This patch does some rethinking of the llseek locking model: First the 64bit f_pos is not necessarily atomic without locks on 32bit systems. This can already cause races with read() today. This was discussed on linux-kernel in the past and deemed acceptable. The patch does not change that. Let's look at the different seek variants: SEEK_SET: Doesn't really need any locking. If there's a race one writer wins, the other loses. For 32bit the non atomic update races against read() stay the same. Without a lock they can also happen against write() now. The read() race was deemed acceptable in past discussions, and I think if it's ok for read it's ok for write too. => Don't need a lock. SEEK_END: This behaves like SEEK_SET plus it reads the maximum size too. Reading the maximum size would have the 32bit atomic problem. But luckily we already have a way to read the maximum size without locking (i_size_read), so we can just use that instead. Without i_mutex there is no synchronization with write() anymore, however since the write() update is atomic on 64bit it just behaves like another racy SEEK_SET. On non atomic 32bit it's the same as SEEK_SET. => Don't need a lock, but need to use i_size_read() SEEK_CUR: This has a read-modify-write race window on the same file. One could argue that any application doing unsynchronized seeks on the same file is already broken. But for the sake of not adding a regression here I'm using the file->f_lock to synchronize this. Using this lock is much better than the inode mutex because it doesn't synchronize between processes. => So still need a lock, but can use a f_lock. This patch implements this new scheme in generic_file_llseek. I dropped generic_file_llseek_unlocked and changed all callers. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-10-28 14:58:58 +02:00
Andi Kleen	847cc6371b	direct-io: merge direct_io_walker into __blockdev_direct_IO This doesn't change anything for the compiler, but hch thought it would make the code clearer. I moved the reference counting into its own little inline. Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-10-28 14:58:58 +02:00
Andi Kleen	ba253fbf6d	direct-io: inline the complete submission path Add inlines to all the submission path functions. While this increases code size it also gives gcc a lot of optimization opportunities in this critical hotpath. In particular -- together with some other changes -- this allows gcc to get rid of the unnecessary clearing of sdio at the beginning and optimize the messy parameter passing. Any non inlining of a function which takes a sdio parameter would break this optimization because they cannot be done if the address of a structure is taken. Note that benefits are only seen with CONFIG_OPTIMIZE_INLINING and CONFIG_CC_OPTIMIZE_FOR_SIZE both set to off. This gives about 2.2% improvement on a large database benchmark with a high IOPS rate. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-10-28 14:58:58 +02:00

1 2 3 4 5 ...

24773 Commits (171c02048f50d7187991f251ddeed2d7e5de104f)