linux

Commit Graph

Author	SHA1	Message	Date
Tao Ma	28b8ca0b7f	ocfs2: bug-fix for journal extend in xattr. In ocfs2_extend_trans, when we can't extend the current transaction, it will commit current transaction and restart a new one. So if the previous credits we have allocated aren't used(the block isn't dirtied before our extend), we will not have enough credits for any future operation(it will cause jbd complain and bug out). So check this and re-extend it. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:06 -07:00
Joel Becker	8d6220d6a7	ocfs2: Change ocfs2_get__extent_tree() to ocfs2_init__extent_tree() The original get/put_extent_tree() functions held a reference on et_root_bh. However, every single caller already has a safe reference, making the get/put cycle irrelevant. We change ocfs2_get__extent_tree() to ocfs2_init__extent_tree(). It no longer gets a reference on et_root_bh. ocfs2_put_extent_tree() is removed. Callers now have a simpler init+use pattern. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:05 -07:00
Joel Becker	1625f8ac15	ocfs2: Comment struct ocfs2_extent_tree_operations. struct ocfs2_extent_tree_operations provides methods for the different on-disk btrees in ocfs2. Describing what those methods do is probably a good idea. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:05 -07:00
Joel Becker	f99b9b7ccf	ocfs2: Make ocfs2_extent_tree the first-class representation of a tree. We now have three different kinds of extent trees in ocfs2: inode data (dinode), extended attributes (xattr_tree), and extended attribute values (xattr_value). There is a nice abstraction for them, ocfs2_extent_tree, but it is hidden in alloc.c. All the calling functions have to pick amongst a varied API and pass in type bits and often extraneous pointers. A better way is to make ocfs2_extent_tree a first-class object. Everyone converts their object to an ocfs2_extent_tree() via the ocfs2_get_*_extent_tree() calls, then uses the ocfs2_extent_tree for all tree calls to alloc.c. This simplifies a lot of callers, making for readability. It also provides an easy way to add additional extent tree types, as they only need to be defined in alloc.c with a ocfs2_get_<new>_extent_tree() function. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:05 -07:00
Joel Becker	1e61ee79e2	ocfs2: Add an insertion check to ocfs2_extent_tree_operations. A couple places check an extent_tree for a valid inode. We move that out to add an eo_insert_check() operation. It can be called from ocfs2_insert_extent() and elsewhere. We also have the wrapper calls ocfs2_et_insert_check() and ocfs2_et_sanity_check() ignore NULL ops. That way we don't have to provide useless operations for xattr types. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:05 -07:00
Joel Becker	1a09f556e5	ocfs2: Create specific get_extent_tree functions. A caller knows what kind of extent tree they have. There's no reason they have to call ocfs2_get_extent_tree() with a NULL when they could just as easily call a specific function to their type of extent tree. Introduce ocfs2_dinode_get_extent_tree(), ocfs2_xattr_tree_get_extent_tree(), and ocfs2_xattr_value_get_extent_tree(). They only take the necessary arguments, calling into the underlying __ocfs2_get_extent_tree() to do the real work. __ocfs2_get_extent_tree() is the old ocfs2_get_extent_tree(), but without needing any switch-by-type logic. ocfs2_get_extent_tree() is now a wrapper around the specific calls. It exists because a couple alloc.c functions can take et_type. This will go later. Another benefit is that ocfs2_xattr_value_get_extent_tree() can take a struct ocfs2_xattr_value_root* instead of void*. This gives us typechecking where we didn't have it before. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:05 -07:00
Joel Becker	943cced39e	ocfs2: Determine an extent tree's max_leaf_clusters in an et_op. Provide an optional extent_tree_operation to specify the max_leaf_clusters of an ocfs2_extent_tree. If not provided, the value is 0 (unlimited). Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:04 -07:00
Joel Becker	1c25d93a4a	ocfs2: Use struct ocfs2_extent_tree in ocfs2_num_free_extents(). ocfs2_num_free_extents() re-implements the logic of ocfs2_get_extent_tree(). Now that ocfs2_get_extent_tree() does not allocate, let's use it in ocfs2_num_free_extents() to simplify the code. The inode validation code in ocfs2_num_free_extents() is not needed. All callers are passing in pre-validated inodes. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:04 -07:00
Joel Becker	0ce1010f1a	ocfs2: Provide the get_root_el() method to ocfs2_extent_tree_operations. The root_el of an ocfs2_extent_tree needs to be calculated from et->et_object. Make it an operation on et->et_ops. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:04 -07:00
Joel Becker	ea5efa1512	ocfs2: Make 'private' into 'object' on ocfs2_extent_tree. The 'private' pointer was a way to store off xattr values, which don't live at a set place in the bh. But the concept of "the object containing the extent tree" is much more generic. For an inode it's the struct ocfs2_dinode, for an xattr value its the value. Let's save off the 'object' at all times. If NULL is passed to ocfs2_get_extent_tree(), 'object' is set to bh->b_data; Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:04 -07:00
Joel Becker	dc0ce61af4	ocfs2: Make ocfs2_extent_tree get/put instead of alloc. Rather than allocating a struct ocfs2_extent_tree, just put it on the stack. Fill it with ocfs2_get_extent_tree() and drop it with ocfs2_put_extent_tree(). Now the callers don't have to ENOMEM, yet still safely ref the root_bh. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:04 -07:00
Joel Becker	ce1d9ea621	ocfs2: Prefix the ocfs2_extent_tree structure. The members of the ocfs2_extent_tree structure gain a prefix of 'et_'. All users are updated. Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:04 -07:00
Joel Becker	35dc0aa3c5	ocfs2: Prefix the extent tree operations structure. The ocfs2_extent_tree_operations structure gains a field prefix on its members. The ->eo_sanity_check() operation gains a wrapper function for completeness. All of the extent tree operation wrappers gain a consistent name (ocfs2_et_*()). Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:04 -07:00
Mark Fasheh	ff1ec20ef6	ocfs2: fix printk format warnings This patch fixes the following build warnings: fs/ocfs2/xattr.c: In function 'ocfs2_half_xattr_bucket': fs/ocfs2/xattr.c:3282: warning: format '%d' expects type 'int', but argument 7 has type 'long int' fs/ocfs2/xattr.c:3282: warning: format '%d' expects type 'int', but argument 8 has type 'long int' fs/ocfs2/xattr.c:3282: warning: format '%d' expects type 'int', but argument 7 has type 'long int' fs/ocfs2/xattr.c:3282: warning: format '%d' expects type 'int', but argument 8 has type 'long int' fs/ocfs2/xattr.c:3282: warning: format '%d' expects type 'int', but argument 7 has type 'long int' fs/ocfs2/xattr.c:3282: warning: format '%d' expects type 'int', but argument 8 has type 'long int' fs/ocfs2/xattr.c: In function 'ocfs2_xattr_set_entry_in_bucket': fs/ocfs2/xattr.c:4092: warning: format '%d' expects type 'int', but argument 6 has type 'size_t' fs/ocfs2/xattr.c:4092: warning: format '%d' expects type 'int', but argument 6 has type 'size_t' fs/ocfs2/xattr.c:4092: warning: format '%d' expects type 'int', but argument 6 has type 'size_t' Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:03 -07:00
Tiger Yang	8154da3d21	ocfs2: Add incompatible flag for extended attribute This patch adds the s_incompat flag for extended attribute support. This helps us ensure that older versions of Ocfs2 or ocfs2-tools will not be able to mount a volume with xattr support. Signed-off-by: Tiger Yang <tiger.yang@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:03 -07:00
Tao Ma	a394425643	ocfs2: Delete all xattr buckets during inode removal In inode removal, we need to iterate all the buckets, remove any externally-stored EA values and delete the xattr buckets. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:03 -07:00
Tao Ma	012255961c	ocfs2: Enable xattr set in index btree Where the previous patches added the ability of list/get xattr in buckets for ocfs2, this patch enables ocfs2 to store large numbers of EAs. The original design doc is written by Mark Fasheh, and it can be found in http://oss.oracle.com/osswiki/OCFS2/DesignDocs/IndexedEATrees. I only had to make small modifications to it. First, because the bucket size is 4K, a new field named xh_free_start is added in ocfs2_xattr_header to indicate the next valid name/value offset in a bucket. It is used when we store new EA name/value. With this field, we can find the place more quickly and what's more, we don't need to sort the name/value every time to let the last entry indicate the next unused space. This makes the insert operation more efficient for blocksizes smaller than 4k. Because of the new xh_free_start, another field named as xh_name_value_len is also added in ocfs2_xattr_header. It records the total length of all the name/values in the bucket. We need this so that we can check it and defragment the bucket if there is not enough contiguous free space. An xattr insertion looks like this: 1. xattr_index_block_find: find the right bucket by the name_hash, say bucketA. 2. check whether there is enough space in bucketA. If yes, insert it directly and modify xh_free_start and xh_name_value_len accordingly. If not, check xh_name_value_len to see whether we can store this by defragment the bucket. If yes, defragment it and go on insertion. 3. If defragement doesn't work, check whether there is new empty bucket in the clusters within this extent record. If yes, init the new bucket and move all the buckets after bucketA one by one to the next bucket. Move half of the entries in bucketA to the next bucket and go on insertion. 4. If there is no new bucket, grow the extent tree. As for xattr deletion, we will delete an xattr bucket when all it's xattrs are removed and move all the buckets after it to the previous one. When all the xattr buckets in an extend record are freed, free this extend records from ocfs2_xattr_tree. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:03 -07:00
Tao Ma	ca12b7c489	ocfs2: Optionally limit extent size in ocfs2_insert_extent() In xattr bucket, we want to limit the maximum size of a btree leaf, otherwise we'll lose the benefits of hashing because we'll have to search large leaves. So add a new field in ocfs2_extent_tree which indicates the maximum leaf cluster size we want so that we can prevent ocfs2_insert_extent() from merging the leaf record even if it is contiguous with an adjacent record. Other btree types are not affected by this change. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:03 -07:00
Tao Ma	589dc2602f	ocfs2: Add xattr lookup code xattr btrees Add code to lookup a given extended attribute in the xattr btree. Lookup follows this general scheme: 1. Use ocfs2_xattr_get_rec to find the xattr extent record 2. Find the xattr bucket within the extent which may contain this xattr 3. Iterate the bucket to find the xattr. In ocfs2_xattr_block_get(), we need to recalcuate the block offset and name offset for the right position of name/value. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:03 -07:00
Tao Ma	0c044f0b24	ocfs2: Add xattr bucket iteration for large numbers of EAs Ocfs2 breaks up xattr index tree leaves into 4k regions, called buckets. Attributes are stored within a given bucket, depending on hash value. After a discussion with Mark, we decided that the per-bucket index (xe_entry[]) would only exist in the 1st block of a bucket. Likewise, name/value pairs will not straddle more than one block. This allows the majority of operations to work directly on the buffer heads in a leaf block. This patch adds code to iterate the buckets in an EA. A new abstration of ocfs2_xattr_bucket is added. It records the bhs in this bucket and ocfs2_xattr_header. This keeps the code neat, improving readibility. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:03 -07:00
Tao Ma	ba492615f0	ocfs2: Add xattr index tree operations When necessary, an ocfs2_xattr_block will embed an ocfs2_extent_list to store large numbers of EAs. This patch adds a new type in ocfs2_extent_tree_type and adds the implementation so that we can re-use the b-tree code to handle the storage of many EAs. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:02 -07:00
Tiger Yang	cf1d6c763f	ocfs2: Add extended attribute support This patch implements storing extended attributes both in inode or a single external block. We only store EA's in-inode when blocksize > 512 or that inode block has free space for it. When an EA's value is larger than 80 bytes, we will store the value via b-tree outside inode or block. Signed-off-by: Tiger Yang <tiger.yang@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:02 -07:00
Tiger Yang	fdd77704a8	ocfs2: reserve inline space for extended attribute Add the structures and helper functions we want for handling inline extended attributes. We also update the inline-data handlers so that they properly function in the event that we have both inline data and inline attributes sharing an inode block. Signed-off-by: Tiger Yang <tiger.yang@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:02 -07:00
Tao Ma	f56654c435	ocfs2: Add extent tree operation for xattr value btrees Add some thin wrappers around ocfs2_insert_extent() for each of the 3 different btree types, ocfs2_inode_insert_extent(), ocfs2_xattr_value_insert_extent() and ocfs2_xattr_tree_insert_extent(). The last is for the xattr index btree, which will be used in a followup patch. All the old callers in file.c etc will call ocfs2_dinode_insert_extent(), while the other two handle the xattr issue. And the init of extent tree are handled by these functions. When storing xattr value which is too large, we will allocate some clusters for it and here ocfs2_extent_list and ocfs2_extent_rec will also be used. In order to re-use the b-tree operation code, a new parameter named "private" is added into ocfs2_extent_tree and it is used to indicate the root of ocfs2_exent_list. The reason is that we can't deduce the root from the buffer_head now. It may be in an inode, an ocfs2_xattr_block or even worse, in any place in an ocfs2_xattr_bucket. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 16:57:01 -07:00
Tao Ma	ac11c82719	ocfs2: Add helper function in uptodate.c for removing xattr clusters The old uptodate only handles the issue of removing one buffer_head from ocfs2 inode's buffer cache. With xattr clusters, we may need to remove multiple buffer_head's at a time. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 13:57:59 -07:00
Tao Ma	5a7bc8eb29	ocfs2: Add the basic xattr disk layout in ocfs2_fs.h Ocfs2 uses a very flexible structure for storing extended attributes on disk. Small amount of attributes are stored directly in the inode block - up to 256 bytes worth. If that fills up, attributes are also stored in an external block, linked to from the inode block. That block can in turn expand to a btree, capable of storing large numbers of attributes. Individual attribute values are stored inline if they're small enough (currently about 80 bytes, this can be changed though), and otherwise are expanded to a btree. The theoretical limit to the size of an individual attribute is about the same as an inode, though the kernel's upper bound on the size of an attributes data is far smaller. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 13:57:59 -07:00
Tao Ma	0eb8d47e69	ocfs2: Make high level btree extend code generic Factor out the non-inode specifics of ocfs2_do_extend_allocation() into a more generic function, ocfs2_do_cluster_allocation(). ocfs2_do_extend_allocation calls ocfs2_do_cluster_allocation() now, but the latter can be used for other btree types as well. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 13:57:59 -07:00
Tao Ma	e7d4cb6bc1	ocfs2: Abstract ocfs2_extent_tree in b-tree operations. In the old extent tree operation, we take the hypothesis that we are using the ocfs2_extent_list in ocfs2_dinode as the tree root. As xattr will also use ocfs2_extent_list to store large value for a xattr entry, we refactor the tree operation so that xattr can use it directly. The refactoring includes 4 steps: 1. Abstract set/get of last_eb_blk and update_clusters since they may be stored in different location for dinode and xattr. 2. Add a new structure named ocfs2_extent_tree to indicate the extent tree the operation will work on. 3. Remove all the use of fe_bh and di, use root_bh and root_el in extent tree instead. So now all the fe_bh is replaced with et->root_bh, el with root_el accordingly. 4. Make ocfs2_lock_allocators generic. Now it is limited to be only used in file extend allocation. But the whole function is useful when we want to store large EAs. Note: This patch doesn't touch ocfs2_commit_truncate() since it is not used for anything other than truncate inode data btrees. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 13:57:58 -07:00
Tao Ma	811f933df1	ocfs2: Use ocfs2_extent_list instead of ocfs2_dinode. ocfs2_extend_meta_needed(), ocfs2_calc_extend_credits() and ocfs2_reserve_new_metadata() are all useful for extent tree operations. But they are all limited to an inode btree because they use a struct ocfs2_dinode parameter. Change their parameter to struct ocfs2_extent_list (the part of an ocfs2_dinode they actually use) so that the xattr btree code can use these functions. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 13:57:58 -07:00
Tao Ma	231b87d109	ocfs2: Modify ocfs2_num_free_extents for future xattr usage. ocfs2_num_free_extents() is used to find the number of free extent records in an inode btree. Hence, it takes an "ocfs2_dinode" parameter. We want to use this for extended attribute trees in the future, so genericize the interface the take a buffer head. A future patch will allow that buffer_head to contain any structure rooting an ocfs2 btree. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 13:57:58 -07:00
Mark Fasheh	9a8ff578fb	ocfs2: track local alloc state via debugfs A per-mount debugfs file, "local_alloc" is created which when read will expose live state of the nodes local alloc file. Performance impact is minimal, only a bit of memory overhead per mount point. Still, the code is hidden behind CONFIG_OCFS2_FS_STATS. This feature will help us debug local alloc performance problems on a live system. Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 13:57:58 -07:00
Mark Fasheh	9c7af40b21	ocfs2: throttle back local alloc when low on disk space Ocfs2's local allocator disables itself for the duration of a mount point when it has trouble allocating a large enough area from the primary bitmap. That can cause performance problems, especially for disks which were only temporarily full or fragmented. This patch allows for the allocator to shrink it's window first, before being disabled. Later, it can also be re-enabled so that any performance drop is minimized. To do this, we allow the value of osb->local_alloc_bits to be shrunk when needed. The default value is recorded in a mostly read-only variable so that we can re-initialize when required. Locking had to be updated so that we could protect changes to local_alloc_bits. Mostly this involves protecting various local alloc values with the osb spinlock. A new state is also added, OCFS2_LA_THROTTLED, which is used when the local allocator is has shrunk, but is not disabled. If the available space dips below 1 megabyte, the local alloc file is disabled. In either case, local alloc is re-enabled 30 seconds after the event, or when an appropriate amount of bits is seen in the primary bitmap. Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 13:57:57 -07:00
Mark Fasheh	ebcee4b5c9	ocfs2: Track local alloc bits internally Do this instead of tracking absolute local alloc size. This avoids needless re-calculatiion of bits from bytes in localalloc.c. Additionally, the value is now in a more natural unit for internal file system bitmap work. Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 13:57:57 -07:00
Mark Fasheh	53da4939f3	ocfs2: POSIX file locks support This is actually pretty easy since fs/dlm already handles the bulk of the work. The Ocfs2 userspace cluster stack module already uses fs/dlm as the underlying lock manager, so I only had to add the right calls. Cluster-aware POSIX locks ("plocks") can be turned off by the same means at UNIX locks - mount with 'noflocks', or create a local-only Ocfs2 volume. Internally, the file system uses two sets of file_operations, depending on whether cluster aware plocks is required. This turns out to be easier than implementing local-only versions of ->lock. Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-10-13 13:57:57 -07:00
Steven Whitehouse	a447c09324	vfs: Use const for kernel parser table This is a much better version of a previous patch to make the parser tables constant. Rather than changing the typedef, we put the "const" in all the various places where its required, allowing the __initconst exception for nfsroot which was the cause of the previous trouble. This was posted for review some time ago and I believe its been in -mm since then. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Cc: Alexander Viro <aviro@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-13 10:10:37 -07:00
Linus Torvalds	20272c8994	Merge branch 'proc' of git://git.kernel.org/pub/scm/linux/kernel/git/adobriyan/proc * 'proc' of git://git.kernel.org/pub/scm/linux/kernel/git/adobriyan/proc: proc: remove kernel.maps_protect proc: remove now unneeded ADDBUF macro [PATCH] proc: show personality via /proc/pid/personality [PATCH] signal, procfs: some lock_task_sighand() users do not need rcu_read_lock() proc: move PROC_PAGE_MONITOR to fs/proc/Kconfig proc: make grab_header() static proc: remove unused get_dma_list() proc: remove dummy vmcore_open() proc: proc_sys_root tweak proc: fix return value of proc_reg_open() in "too late" case Fixed up trivial conflict in removed file arch/sparc/include/asm/dma_32.h	2008-10-13 10:04:04 -07:00
Linus Torvalds	8d71ff0bef	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (24 commits) integrity: special fs magic As pointed out by Jonathan Corbet, the timer must be deleted before ERROR: code indent should use tabs where possible The tpm_dev_release function is only called for platform devices, not pnp Protect tpm_chip_list when transversing it. Renames num_open to is_open, as only one process can open the file at a time. Remove the BKL calls from the TPM driver, which were added in the overall netlabel: Add configuration support for local labeling cipso: Add support for native local labeling and fixup mapping names netlabel: Changes to the NetLabel security attributes to allow LSMs to pass full contexts selinux: Cache NetLabel secattrs in the socket's security struct selinux: Set socket NetLabel based on connection endpoint netlabel: Add functionality to set the security attributes of a packet netlabel: Add network address selectors to the NetLabel/LSM domain mapping netlabel: Add a generic way to create ordered linked lists of network addrs netlabel: Replace protocol/NetLabel linking with refrerence counts smack: Fix missing calls to netlbl_skbuff_err() selinux: Fix missing calls to netlbl_skbuff_err() selinux: Fix a problem in security_netlbl_sid_to_secattr() selinux: Better local/forward check in selinux_ip_postroute() ...	2008-10-13 10:00:44 -07:00
Linus Torvalds	244dc4e54b	Merge git://git.infradead.org/users/dwmw2/random-2.6 * git://git.infradead.org/users/dwmw2/random-2.6: Fix autoloading of MacBook Pro backlight driver. Automatic MODULE_ALIAS() for DMI match tables. Remove asm/a.out.h files for all architectures without a.out support. Introduce HAVE_AOUT symbol to remove hard-coded arch list for BINFMT_AOUT Remove redundant CONFIG_ARCH_SUPPORTS_AOUT S390: Update comments about why we don't use <asm-generic/statfs.h> SPARC: Use <asm-generic/statfs.h> PowerPC: Use <asm-generic/statfs.h> PARISC: Use <asm-generic/statfs.h> x86_64: Use <asm-generic/statfs.h> IA64: Use <asm-generic/statfs.h> ARM: Use <asm-generic/statfs.h> Make <asm-generic/statfs.h> suitable for 64-bit platforms. Define and use PCI_DEVICE_ID_MARVELL_88ALP01_CCIC for CAFÉ camera driver [MTD] [NAND] Define and use PCI_DEVICE_ID_MARVELL_88ALP01_NAND for CAFÉ Use PCI_DEVICE_ID_88ALP01 for CAFÉ chip, rather than PCI_DEVICE_ID_CAFE. EFS: Don't set f_fsid in statfs().	2008-10-13 09:59:14 -07:00
Sukadev Bhattiprolu	a6f37daa8b	Simplify devpts_pty_kill When creating a new pty, save the pty's inode in the tty->driver_data. Use this inode in pty_kill() to identify the devpts instance. Since we now have the inode for the pty, we can skip get_node() lookup and remove the unused get_node(). TODO: - check if the mutex_lock is needed in pty_kill(). Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-13 09:51:43 -07:00
Sukadev Bhattiprolu	89a52e109e	Simplify devpts_pty_new() devpts_pty_new() is called when setting up a new pty and would not will not have an existing dentry or inode for the pty. So don't bother looking for an existing dentry - just create a new one. Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-13 09:51:43 -07:00
Sukadev Bhattiprolu	527b3e4773	Simplify devpts_get_tty() As pointed out by H. Peter Anvin, since the inode for the pty is known, we don't need to look it up. Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-13 09:51:43 -07:00
Sukadev Bhattiprolu	15f1a6338d	Add an instance parameter devpts interfaces Pass-in 'inode' or 'tty' parameter to devpts interfaces. With multiple devpts instances, these parameters will be used in subsequent patches to identify the instance of devpts mounted. The parameters also help simplify devpts implementation. Changelog[v3]: - minor changes due to merge with ttydev updates - rename parameters to emphasize they are ptmx or pts inodes - pass-in tty_struct * to devpts_pty_kill() (this will help cleanup the get_node() call in a subsequent patch) Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-13 09:51:43 -07:00
Alan Cox	934e6ebf96	tty: Redo current tty locking Currently it is sometimes locked by the tty mutex and sometimes by the sighand lock. The latter is in fact correct and now we can hand back referenced objects we can fix this up without problems around sleeping functions. Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-13 09:51:41 -07:00
Alan Cox	2cb5998b5f	tty: the vhangup syscall is racy We now have the infrastructure to sort this out but rather than teaching the syscall tty lock rules we move the hard work into a tty helper Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-13 09:51:41 -07:00
Alan Cox	452a00d2ee	tty: Make get_current_tty use a kref We now return a kref covered tty reference. That ensures the tty structure doesn't go away when you have a return from get_current_tty. This is not enough to protect you from most of the resources being freed behind your back - yet. [Updated to include fixes for SELinux problems found by Andrew Morton and an s390 leak found while debugging the former] Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-13 09:51:41 -07:00
David Woodhouse	e758936e02	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6 Conflicts: include/asm-x86/statfs.h	2008-10-13 17:13:56 +01:00
Linus Torvalds	3280fb3139	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: fix kconfig typo and extra whitespace ext4: fix build failure without procfs ext4: add an option to control error handling on file data jbd2: don't dirty original metadata buffer on abort ext4: add checks for errors from jbd2 jbd2: fix error handling for checkpoint io jbd2: abort when failed to log metadata buffers	2008-10-12 16:10:29 -07:00
Mimi Zohar	9256292782	integrity: special fs magic Discussion on the mailing list questioned the use of these magic values in userspace, concluding these values are already exported to userspace via statfs and their correct/incorrect usage is left up to the userspace application. - Move special fs magic number definitions to magic.h - Add magic.h include Signed-off-by: Mimi Zohar <zohar@us.ibm.com> Reviewed-by: James Morris <jmorris@namei.org> Signed-off-by: James Morris <jmorris@namei.org>	2008-10-13 09:47:43 +11:00
Jan Engelhardt	f319fb8bf6	ext4: fix kconfig typo and extra whitespace Signed-off-by: Jan Engelhardt <jengelh@medozas.de> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-10-12 15:53:01 -04:00
Alexander Beregalov	3244fcb1ae	ext4: fix build failure without procfs fs/ext4/super.c: In function 'ext4_fill_super': fs/ext4/super.c:2226: error: 'ext4_ui_proc_fops' undeclared (first use in this function) fs/ext4/super.c:2226: error: (Each undeclared identifier is reported only once fs/ext4/super.c:2226: error: for each function it appears in.) Signed-off-by: Alexander Beregalov <a.beregalov@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-10-12 17:27:49 -04:00
Linus Torvalds	f1b2a5ace9	Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6: [CIFS] cifs: remove pointless lock and unlock of GlobalMid_Lock in header_assemble	2008-10-12 12:42:36 -07:00
Adrian Bunk	06270d5d6a	provide generic_block_fiemap() only with BLOCK=y This fixes the following compile error with CONFIG_BLOCK=n caused by commit `68c9d702bb` ("generic block based fiemap implementation"): CC fs/ioctl.o fs/ioctl.c: In function 'generic_block_fiemap': fs/ioctl.c:249: error: storage size of 'tmp' isn't known fs/ioctl.c:272: error: invalid application of 'sizeof' to incomplete type 'struct buffer_head' fs/ioctl.c:280: error: implicit declaration of function 'buffer_mapped' fs/ioctl.c:249: warning: unused variable 'tmp' make[2]: *** [fs/ioctl.o] Error 1 Signed-off-by: Adrian Bunk <bunk@kernel.org> Acked-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-12 11:44:37 -07:00
Jeff Layton	14835a3325	[CIFS] cifs: remove pointless lock and unlock of GlobalMid_Lock in header_assemble We lock GlobalMid_Lock in header_assemble and then immediately unlock it again without doing anything. Not sure what this was intended to do, but remove it. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2008-10-12 13:34:11 +00:00
Linus Torvalds	fd04808830	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (43 commits) ext4: Rename ext4dev to ext4 ext4: Avoid double dirtying of super block in ext4_put_super() Update ext4 MAINTAINERS file Hook ext4 to the vfs fiemap interface. generic block based fiemap implementation ocfs2: fiemap support vfs: vfs-level fiemap interface ext4: fix xattr deadlock jbd2: Fix buffer head leak when writing the commit block ext4: Add debugging markers that can be used by systemtap jbd2: abort instead of waiting for nonexistent transaction ext4: fix initialization of UNINIT bitmap blocks ext4: Remove old legacy block allocator ext4: Use readahead when reading an inode from the inode table ext4: Improve the documentation for ext4's /proc tunables ext4: Combine proc file handling into a single set of functions ext4: move /proc setup and teardown out of mballoc.c ext4: Don't use 'struct dentry' for internal lookups ext4/jbd2: Avoid WARN() messages when failing to write to the superblock ext4: use percpu data structures for lg_prealloc_list ...	2008-10-11 13:23:48 -07:00
Linus Torvalds	86ed5a93b8	Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6: [CIFS] Check that last search entry resume key is valid [CIFS] make sure we have the right resume info before calling CIFSFindNext [CIFS] clean up error handling in cifs_unlink [CIFS] fix some settings of cifsAttrs after calling SetFileInfo and SetPathInfo cifs: explicitly revoke SPNEGO key after session setup cifs: Convert cifs to new aops. [CIFS] update DOS attributes in cifsInode if we successfully changed them cifs: remove NULL termination from rename target in CIFSSMBRenameOpenFIle cifs: work around samba returning -ENOENT on SetFileDisposition call cifs: fix inverted NULL check after kmalloc [CIFS] clean up upcall handling for dns_resolver keys [CIFS] fix busy-file renames and refactor cifs_rename logic cifs: add function to set file disposition [CIFS] add constants for string lengths of keynames in SPNEGO upcall string cifs: move rename and delete-on-close logic into helper function cifs: have find_writeable_file prefer filehandles opened by same task cifs: don't use GFP_KERNEL with GFP_NOFS [CIFS] use common code for turning off ATTR_READONLY in cifs_unlink cifs: clean up variables in cifs_unlink	2008-10-11 09:31:53 -07:00
Hidehiro Kawai	5bf5683a33	ext4: add an option to control error handling on file data If the journal doesn't abort when it gets an IO error in file data blocks, the file data corruption will spread silently. Because most of applications and commands do buffered writes without fsync(), they don't notice the IO error. It's scary for mission critical systems. On the other hand, if the journal aborts whenever it gets an IO error in file data blocks, the system will easily become inoperable. So this patch introduces a filesystem option to determine whether it aborts the journal or just call printk() when it gets an IO error in file data. If you mount an ext4 fs with data_err=abort option, it aborts on file data write error. If you mount it with data_err=ignore, it doesn't abort, just call printk(). data_err=ignore is the default. Here is the corresponding patch of the ext3 version: http://kerneltrap.org/mailarchive/linux-kernel/2008/9/9/3239374 Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-10-10 22:12:43 -04:00
Hidehiro Kawai	7ad7445f60	jbd2: don't dirty original metadata buffer on abort Currently, original metadata buffers are dirtied when they are unfiled whether the journal has aborted or not. Eventually these buffers will be written-back to the filesystem by pdflush. This means some metadata buffers are written to the filesystem without journaling if the journal aborts. So if both journal abort and system crash happen at the same time, the filesystem would become inconsistent state. Additionally, replaying journaled metadata can overwrite the latest metadata on the filesystem partly. Because, if the journal gets aborted, journaled metadata are preserved and replayed during the next mount not to lose uncheckpointed metadata. This would also break the consistency of the filesystem. This patch prevents original metadata buffers from being dirtied on abort by clearing BH_JBDDirty flag from those buffers. Thus, no metadata buffers are written to the filesystem without journaling. Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-10-10 20:29:31 -04:00
Hidehiro Kawai	7ffe1ea894	ext4: add checks for errors from jbd2 If the journal has aborted due to a checkpointing failure, we have to keep the contents of the journal space. Otherwise, the filesystem will lose uncheckpointed metadata completely and become inconsistent. To avoid this, we need to keep needs_recovery flag if checkpoint has failed. With this patch, ext4_put_super() detects a checkpointing failure from the return value of journal_destroy(), then it invokes ext4_abort() to make the filesystem read only and keep needs_recovery flag. Errors from jbd2_journal_flush() are also handled by this patch in some places. Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-10-10 20:29:21 -04:00
Hidehiro Kawai	44519faf22	jbd2: fix error handling for checkpoint io When a checkpointing IO fails, current JBD2 code doesn't check the error and continue journaling. This means latest metadata can be lost from both the journal and filesystem. This patch leaves the failed metadata blocks in the journal space and aborts journaling in the case of jbd2_log_do_checkpoint(). To achieve this, we need to do: 1. don't remove the failed buffer from the checkpoint list where in the case of __try_to_free_cp_buf() because it may be released or overwritten by a later transaction 2. jbd2_log_do_checkpoint() is the last chance, remove the failed buffer from the checkpoint list and abort the journal 3. when checkpointing fails, don't update the journal super block to prevent the journaled contents from being cleaned. For safety, don't update j_tail and j_tail_sequence either 4. when checkpointing fails, notify this error to the ext4 layer so that ext4 don't clear the needs_recovery flag, otherwise the journaled contents are ignored and cleaned in the recovery phase 5. if the recovery fails, keep the needs_recovery flag 6. prevent jbd2_cleanup_journal_tail() from being called between __jbd2_journal_drop_transaction() and jbd2_journal_abort() (a possible race issue between jbd2_log_do_checkpoint()s called by jbd2_journal_flush() and __jbd2_log_wait_for_space()) Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-10-10 20:29:13 -04:00
Hidehiro Kawai	77e841de8a	jbd2: abort when failed to log metadata buffers If we failed to write metadata buffers to the journal space and succeeded to write the commit record, stale data can be written back to the filesystem as metadata in the recovery phase. To avoid this, when we failed to write out metadata buffers, abort the journal before writing the commit record. We can also avoid this kind of corruption by using the journal checksum feature because it can detect invalid metadata blocks in the journal and avoid them from being replayed. So we don't need to care about asynchronous commit record writeout with a checksum. Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-10-12 16:39:16 -04:00
Theodore Ts'o	03010a3350	ext4: Rename ext4dev to ext4 The ext4 filesystem is getting stable enough that it's time to drop the "dev" prefix. Also remove the requirement for the TEST_FILESYS flag. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-10-10 20:02:48 -04:00
Linus Torvalds	13dd7f876d	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm: dlm: choose better identifiers dlm: remove bkl dlm: fix address compare dlm: fix locking of lockspace list in dlm_scand dlm: detect available userspace daemon dlm: allow multiple lockspace creates	2008-10-10 11:13:55 -07:00
Christoph Hellwig	73f6aa4d44	Fix barrier fail detection in XFS Currently we disable barriers as soon as we get a buffer in xlog_iodone that has the XBF_ORDERED flag cleared. But this can be the case not only for buffers where the barrier failed, but also the first buffer of a split log write in case of a log wraparound. Due to the disabled barriers we can easily get directory corruption on unclean shutdowns. So instead of using this check add a new buffer flag for failed barrier writes. This is a regression vs 2.6.26 caused by patch to use the right macro to check for the ORDERED flag, as we previously got true returned for every buffer. Thanks to Toei Rei for reporting the bug. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: David Chinner <david@fromorbit.com> Signed-off-by: Tim Shimmin <tes@sgi.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-10 11:08:07 -07:00
Linus Torvalds	445e1ceda3	Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw: GFS2: Support for I/O barriers GFS2: Add UUID to GFS2 sb GFS2: high time to take some time over atime GFS2: The war on bloat GFS2: GFS2 will panic if you misspell any mount options GFS2: Direct IO write at end of file error GFS2: Use an IS_ERR test rather than a NULL test GFS2: Fix race relating to glock min-hold time GFS2: Fix & clean up GFS2 rename GFS2: rm on multiple nodes causes panic GFS2: Fix metafs mounts GFS2: Fix debugfs glock file iterator	2008-10-10 11:02:22 -07:00
Linus Torvalds	e26feff647	Merge branch 'for-2.6.28' of git://git.kernel.dk/linux-2.6-block * 'for-2.6.28' of git://git.kernel.dk/linux-2.6-block: (132 commits) doc/cdrom: Trvial documentation error, file not present block_dev: fix kernel-doc in new functions block: add some comments around the bio read-write flags block: mark bio_split_pool static block: Find bio sector offset given idx and offset block: gendisk integrity wrapper block: Switch blk_integrity_compare from bdev to gendisk block: Fix double put in blk_integrity_unregister block: Introduce integrity data ownership flag block: revert part of d7533ad0e132f92e75c1b2eb7c26387b25a583c1 bio.h: Remove unused conditional code block: remove end_{queued\|dequeued}_request() block: change elevator to use __blk_end_request() gdrom: change to use __blk_end_request() memstick: change to use __blk_end_request() virtio_blk: change to use __blk_end_request() blktrace: use BLKTRACE_BDEV_SIZE as the name size for setup structure block: add lld busy state exporting interface block: Fix blk_start_queueing() to not kick a stopped queue include blktrace_api.h in headers_install ...	2008-10-10 10:52:45 -07:00
Alexey Dobriyan	3bbfe05967	proc: remove kernel.maps_protect After commit `831830b5a2` aka "restrict reading from /proc/<pid>/maps to those who share ->mm or can ptrace" sysctl stopped being relevant because commit moved security checks from ->show time to ->start time (mm_for_maps()). Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Kees Cook <kees.cook@canonical.com>	2008-10-10 04:24:51 +04:00
Alexey Dobriyan	45acb8db06	proc: remove now unneeded ADDBUF macro After local seq_file conversion it was forgotten. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2008-10-10 04:18:58 +04:00
Kees Cook	4783072308	[PATCH] proc: show personality via /proc/pid/personality Make process personality flags visible in /proc. Since a process's personality is potentially sensitive (e.g. READ_IMPLIES_EXEC), make this file only readable by the process owner. Signed-off-by: Kees Cook <kees.cook@canonical.com> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2008-10-10 04:18:57 +04:00
Lai Jiangshan	a6bebbc87a	[PATCH] signal, procfs: some lock_task_sighand() users do not need rcu_read_lock() lock_task_sighand() make sure task->sighand is being protected, so we do not need rcu_read_lock(). [ exec() will get task->sighand->siglock before change task->sighand! ] But code using rcu_read_lock() _just_ to protect lock_task_sighand() only appear in procfs. (and some code in procfs use lock_task_sighand() without such redundant protection.) Other subsystem may put lock_task_sighand() into rcu_read_lock() critical region, but these rcu_read_lock() are used for protecting "for_each_process()", "find_task_by_vpid()" etc. , not for protecting lock_task_sighand(). Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> [ok from Oleg] Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2008-10-10 04:18:57 +04:00
Alexey Dobriyan	53167a3ef2	proc: move PROC_PAGE_MONITOR to fs/proc/Kconfig Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2008-10-10 04:18:57 +04:00
Adrian Bunk	81324364b7	proc: make grab_header() static Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2008-10-10 04:18:56 +04:00
Alexey Dobriyan	a70973c214	proc: remove unused get_dma_list() Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2008-10-10 04:18:56 +04:00
Alexey Dobriyan	a04f4de641	proc: remove dummy vmcore_open() Empty ->open is equivalent to always succeeding ->open. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2008-10-10 04:18:55 +04:00
Alexey Dobriyan	e1675231ce	proc: proc_sys_root tweak Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2008-10-10 04:18:55 +04:00
Alexey Dobriyan	300b994b74	proc: fix return value of proc_reg_open() in "too late" case If ->open() wasn't called, returning 0 is misleading and, theoretically, oopsable: 1) remove_proc_entry clears ->proc_fops, drops lock, 2) ->open "succeeds", 3) ->release oopses, because it assumes ->open was called (single_release()). Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2008-10-10 04:18:54 +04:00
Linus Torvalds	efc968d450	Don't allow splice() to files opened with O_APPEND This is debatable, but while we're debating it, let's disallow the combination of splice and an O_APPEND destination. It's not entirely clear what the semantics of O_APPEND should be, and POSIX apparently expects pwrite() to ignore O_APPEND, for example. So we could make up any semantics we want, including the old ones. But Miklos convinced me that we should at least give it some thought, and that accepting writes at arbitrary offsets is wrong at least for IS_APPEND() files (which always have O_APPEND set, even if the reverse isn't true: you can obviously have O_APPEND set on a regular file). So disallow O_APPEND entirely for now. I doubt anybody cares, and this way we have one less gray area to worry about. Reported-and-argued-for-by: Miklos Szeredi <miklos@szeredi.hu> Acked-by: Jens Axboe <ens.axboe@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-09 14:26:38 -07:00
Randy Dunlap	57d1b5366f	block_dev: fix kernel-doc in new functions Fix kernel-doc in new functions: Error(mmotm-2008-1002-1617//fs/block_dev.c:895): duplicate section name 'Description' Error(mmotm-2008-1002-1617//fs/block_dev.c:924): duplicate section name 'Description' Warning(mmotm-2008-1002-1617//fs/block_dev.c:1282): No description found for parameter 'pathname' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> cc: Andrew Patterson <andrew.patterson@hp.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 10:42:38 +02:00
Denis ChengRq	6feef531f5	block: mark bio_split_pool static Since all bio_split calls refer the same single bio_split_pool, the bio_split function can use bio_split_pool directly instead of the mempool_t parameter; then the mempool_t parameter can be removed from bio_split param list, and bio_split_pool is only referred in fs/bio.c file, can be marked static. Signed-off-by: Denis ChengRq <crquan@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:57:05 +02:00
Martin K. Petersen	ad3316bf4e	block: Find bio sector offset given idx and offset Helper function to find the sector offset in a bio given bvec index and page offset. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:22 +02:00
Martin K. Petersen	74aa8c2cc0	block: Introduce integrity data ownership flag A filesystem might supply its own integrity metadata. Introduce a flag that indicates whether the filesystem or the block layer owns the integrity buffer. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:21 +02:00
Jens Axboe	b04accc425	block: revert part of d7533ad0e132f92e75c1b2eb7c26387b25a583c1 We need bdev_get_integrity() to support the pending md/dm patches. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:21 +02:00
Jens Axboe	9c02f2b02e	block: cleanup some of the integrity stuff in blkdev.h Don't put functions that are only used in fs/bio-integrity.c in blkdev.h, it's much cleaner to just keep it in there. Also kill completely unused bdev_get_tag_size() Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:17 +02:00
Jens Axboe	0a0d96b03a	block: add bio_kmalloc() Not all callers need (or want!) the mempool backing guarentee, it essentially means that you can only use bio_alloc() for short allocations and not for preallocating some bio's at setup or init time. So add bio_kmalloc() which does the same thing as bio_alloc(), except it just uses kmalloc() as the backing instead of the bio mempools. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:17 +02:00
Andrew Patterson	608aeef17a	Call flush_disk() after detecting an online resize. We call flush_disk() to make sure the buffer cache for the disk is flushed after a disk resize. There are two resize cases, growing and shrinking. Given that users can shrink/then grow a disk before revalidate_disk() is called, we treat the grow case identically to shrinking. We need to flush the buffer cache after an online shrink because, as James Bottomley puts it, The two use cases for shrinking I can see are 1. planned: the fs is already shrunk to within the new boundaries and all data is relocated, so invalidate is fine (any dirty buffers that might exist in the shrunk region are there only because they were relocated but not yet written to their original location). 2. unplanned: In this case, the fs is probably toast, so whether we invalidate or not isn't going to make a whole lot of difference; it's still going to try to read or write from sectors beyond the new size and get I/O errors. Immediately invalidating shrunk disks will cause errors for outstanding I/Os for reads/write beyond the new end of the disk to be generated earlier then if we waited for the normal buffer cache operation. It also removes a potential security hole where we might keep old data around from beyond the end of the shrunk disk if the disk was not invalidated. Signed-off-by: Andrew Patterson <andrew.patterson@hp.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:13 +02:00
Andrew Patterson	56ade44b46	Added flush_disk to factor out common buffer cache flushing code. We need to be able to flush the buffer cache for for more than just when a disk is changed, so we factor out common cache flush code in check_disk_change() to an internal flush_disk() routine. This routine will then be used for both disk changes and disk resizes (in a later patch). Include the disk name in the text indicating that there are busy inodes on the device and increase the KERN severity of the message. Signed-off-by: Andrew Patterson <andrew.patterson@hp.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:13 +02:00
Andrew Patterson	9bc3ffbfbd	Check for device resize when rescanning partitions Check for device resize in the rescan_partitions() routine. If the device has been resized, the bdev size is set to match. The rescan_partitions() routine is called when opening the device and when calling the BLKRRPART ioctl. Signed-off-by: Andrew Patterson <andrew.patterson@hp.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:12 +02:00
Andrew Patterson	c3279d1454	Adjust block device size after an online resize of a disk. The revalidate_disk routine now checks if a disk has been resized by comparing the gendisk capacity to the bdev inode size. If they are different (usually because the disk has been resized underneath the kernel) the bdev inode size is adjusted to match the capacity. Signed-off-by: Andrew Patterson <andrew.patterson@hp.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:12 +02:00
Andrew Patterson	0c002c2f74	Wrapper for lower-level revalidate_disk routines. This is a wrapper for the lower-level revalidate_disk call-backs such as sd_revalidate_disk(). It allows us to perform pre and post operations when calling them. We will use this wrapper in a later patch to adjust block device sizes after an online resize (a _post_ operation). Signed-off-by: Andrew Patterson <andrew.patterson@hp.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:12 +02:00
FUJITA Tomonori	818827669d	block: make blk_rq_map_user take a NULL user-space buffer This patch changes blk_rq_map_user to accept a NULL user-space buffer with a READ command if rq_map_data is not NULL. Thus a caller can pass page frames to lk_rq_map_user to just set up a request and bios with page frames propely. bio_uncopy_user (called via blk_rq_unmap_user) doesn't copy data to user space with such request. Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:11 +02:00
FUJITA Tomonori	4d8ab62e08	bio: convert bio_copy_kern to use bio_copy_user bio_copy_kern and bio_copy_user are very similar. This converts bio_copy_kern to use bio_copy_user. Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:10 +02:00
FUJITA Tomonori	152e283fdf	block: introduce struct rq_map_data to use reserved pages This patch introduces struct rq_map_data to enable bio_copy_use_iov() use reserved pages. Currently, bio_copy_user_iov allocates bounce pages but drivers/scsi/sg.c wants to allocate pages by itself and use them. struct rq_map_data can be used to pass allocated pages to bio_copy_user_iov. The current users of bio_copy_user_iov simply passes NULL (they don't want to use pre-allocated pages). Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Douglas Gilbert <dougg@torque.net> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:10 +02:00
FUJITA Tomonori	a3bce90edd	block: add gfp_mask argument to blk_rq_map_user and blk_rq_map_user_iov Currently, blk_rq_map_user and blk_rq_map_user_iov always do GFP_KERNEL allocation. This adds gfp_mask argument to blk_rq_map_user and blk_rq_map_user_iov so sg can use it (sg always does GFP_ATOMIC allocation). Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Signed-off-by: Douglas Gilbert <dougg@torque.net> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:10 +02:00
Jens Axboe	c7c22e4d5c	block: add support for IO CPU affinity This patch adds support for controlling the IO completion CPU of either all requests on a queue, or on a per-request basis. We export a sysfs variable (rq_affinity) which, if set, migrates completions of requests to the CPU that originally submitted it. A bio helper (bio_set_completion_cpu()) is also added, so that queuers can ask for completion on that specific CPU. In testing, this has been show to cut the system time by as much as 20-40% on synthetic workloads where CPU affinity is desired. This requires a little help from the architecture, so it'll only work as designed for archs that are using the new generic smp helper infrastructure. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:09 +02:00
Tejun Heo	3e1a7ff8a0	block: allow disk to have extended device number Now that disk and partition handlings are mostly unified, it's easy to allow disk to have extended device number. This patch makes add_disk() use extended device number if disk->minors is zero. Both sd and ide-disk are updated to use this. * sd_format_disk_name() is implemented which can generically determine the drive name. This removes disk number restriction stemming from limited device names. * If sd index goes over SD_MAX_DISKS (which can be increased now BTW), sd simply doesn't initialize minors letting block layer choose extended device number. * If CONFIG_DEBUG_EXT_DEVT is set, both sd and ide-disk always set minors to 0 and use extended device numbers. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:08 +02:00
Tejun Heo	689d6fac40	block: replace @ext_minors with GENHD_FL_EXT_DEVT With previous changes, it's meaningless to limit the number of partitions. Replace @ext_minors with GENHD_FL_EXT_DEVT such that setting the flag allows the disk to have maximum number of allowed partitions (only limited by the number of entries in parsed_partitions as determined by MAX_PART constant). This kills not-too-pretty alloc_disk_ext[_node]() functions and makes @minors parameter to alloc_disk[_node]() unnecessary. The parameter is left alone to avoid disturbing the users. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:08 +02:00
Tejun Heo	540eed5637	block: make partition array dynamic disk->__part used to be statically allocated to the maximum possible number of partitions. This patch makes partition array allocation dynamic. The added overhead is minimal as only real change is one memory dereference changed to RCU one. This saves both a bit of memory and cpu cycles iterating through unoccupied slots and makes increasing partition limit easier. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:08 +02:00
Tejun Heo	074a7aca7a	block: move stats from disk to part0 Move stats related fields - stamp, in_flight, dkstats - from disk to part0 and unify stat handling such that... * part_stat_() now updates part0 together if the specified partition is not part0. ie. part_stat_() are now essentially all_stat_(). {disk\|all}_stat_() are gone. part_round_stats() is updated similary. It handles part0 stats automatically and disk_round_stats() is killed. * part_{inc\|dec}_in_fligh() is implemented which automatically updates part0 stats for parts other than part0. * disk_map_sector_rcu() is updated to return part0 if no part matches. Combined with the above changes, this makes NULL special case handling in callers unnecessary. * Separate stats show code paths for disk are collapsed into part stats show code paths. * Rename disk_stat_lock/unlock() to part_stat_lock/unlock() While at it, reposition stat handling macros a bit and add missing parentheses around macro parameters. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:08 +02:00
Tejun Heo	eddb2e26b5	block: kill GENHD_FL_FAIL and use part0->make_it_fail GENHD_FL_FAIL for disk is what make_it_fail is for parts. Kill it and use part0->make_it_fail. Sysfs node handling is unified too. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:08 +02:00
Tejun Heo	0762b8bde9	block: always set bdev->bd_part Till now, bdev->bd_part is set only if the bdev was for parts other than part0. This patch makes bdev->bd_part always set so that code paths don't have to differenciate common handling. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:08 +02:00
Tejun Heo	4c46501d16	block: move holder_dir from disk to part0 Move disk->holder_dir to part0->holder_dir. Kill now mostly superflous bdev_get_holder(). While at it, kill superflous kobject_get/put() around holder_dir, slave_dir and cmd_filter creation and collapse disk_sysfs_add_subdirs() into register_disk(). These serve no purpose but obfuscating the code. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:08 +02:00

1 2 3 4 5 ...

10277 Commits (ec23847d6cfe445ba9a1a5ec513297f4cc0ada53)