linux

Commit Graph

Author	SHA1	Message	Date
NeilBrown	2ca68f5ed7	md/raid1: store behind-write pages in bi_vecs. When performing write-behind we allocate pages to store the data during write. Previously we just keep a list of pages. Now we keep a list of bi_vec which includes offset and size. This means that the r1bio has complete information to create a new bio which will be needed for retrying after write errors. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>	2011-07-28 11:32:10 +10:00
NeilBrown	4367af5561	md/raid1: clear bad-block record when write succeeds. If we succeed in writing to a block that was recorded as being bad, we clear the bad-block record. This requires some delayed handling as the bad-block-list update has to happen in process-context. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>	2011-07-28 11:31:49 +10:00
NeilBrown	1f68f0c4b6	md/raid1: avoid writing to known-bad blocks on known-bad drives. If we have seen any write error on a drive, then don't write to any known-bad blocks on that drive. If necessary, we divide the write request up into pieces just like we do for reads, so each piece is either all written or all not written to any given drive. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>	2011-07-28 11:31:48 +10:00
NeilBrown	de393cdea6	md: make it easier to wait for bad blocks to be acknowledged. It is only safe to choose not to write to a bad block if that bad block is safely recorded in metadata - i.e. if it has been 'acknowledged'. If it hasn't we need to wait for the acknowledgement. We support that using rdev->blocked wait and md_wait_for_blocked_rdev by introducing a new device flag 'BlockedBadBlocks'. This flag is only advisory. It is cleared whenever we acknowledge a bad block, so that a waiter can re-check the particular bad blocks that it is interested in. It should be set by a caller when they find they need to wait. This (set after test) is inherently racy, but as md_wait_for_blocked_rdev already has a timeout, losing the race will have minimal impact. When we clear "Blocked" we also clear "BlockedBadBlocks" in case it was set incorrectly (see above race). We also modify the way we manage 'Blocked' to fit better with the new handling of 'BlockedBadBlocks' and to make it consistent between externally managed and internally managed metadata. This requires that each raidXd loop checks if the metadata needs to be written and triggers a write (md_check_recovery) if needed. Otherwise a queued write request might cause raidXd to wait for the metadata to write, and only that thread can write it. Before writing metadata, we set FaultRecorded for all devices that are Faulty, then after writing the metadata we clear Blocked for any device for which the Fault was certainly Recorded. The 'faulty' device flag now appears in sysfs if the device is faulty or it has unacknowledged bad blocks. So user-space which does not understand bad blocks can continue to function correctly. User space which does should not assume a device is faulty until it sees the 'faulty' flag, and then sees the list of unacknowledged bad blocks is empty. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:31:48 +10:00
NeilBrown	06f603851f	md/raid1: avoid reading known bad blocks during resync When performing resync/etc, keep the size of the request small enough that it doesn't overlap any known bad blocks. Devices with badblocks at the start of the request are completely excluded. If there is nowhere to read from due to bad blocks, record a bad block on each target device. Now that we never read from known-bad-blocks we can allow devices with known-bad-blocks into a RAID1. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:31:48 +10:00
NeilBrown	d2eb35acfd	md/raid1: avoid reading from known bad blocks. Now that we have a bad block list, we should not read from those blocks. There are several main parts to this: 1/ read_balance needs to check for bad blocks, and return not only the chosen device, but also how many good blocks are available there. 2/ fix_read_error needs to avoid trying to read from bad blocks. 3/ read submission must be ready to issue multiple reads to different devices as different bad blocks on different devices could mean that a single large read cannot be served by any one device, but can still be served by the array. This requires keeping count of the number of outstanding requests per bio. This count is stored in 'bi_phys_segments' 4/ retrying a read needs to also be ready to submit a smaller read and queue another request for the rest. This does not yet handle bad blocks when reading to perform resync, recovery, or check. 'md_trim_bio' will also be used for RAID10, so put it in md.c and export it. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:31:48 +10:00
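The first point above (read_balance reporting how many good blocks are available on the chosen device) can be sketched roughly as follows. This is a minimal illustration and not the kernel's implementation: `struct demo_bad_range` and `demo_good_sectors` are hypothetical names, and the bad-block list is assumed to be a plain array of sector ranges.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bad-block record: 'len' sectors starting at 'start'. */
struct demo_bad_range {
    uint64_t start;
    uint64_t len;
};

/*
 * Return how many sectors, counted from 'start', can be read from a device
 * before hitting a known-bad range. Returns 0 if 'start' itself is bad and
 * 'len' if the whole request is clear. A caller could use the result to
 * issue a shorter read and queue another request for the remainder.
 */
static uint64_t demo_good_sectors(const struct demo_bad_range *bad, size_t n,
                                  uint64_t start, uint64_t len)
{
    uint64_t good = len;

    for (size_t i = 0; i < n; i++) {
        uint64_t bstart = bad[i].start;
        uint64_t bend = bstart + bad[i].len;

        if (bend <= start || bstart >= start + len)
            continue;               /* no overlap with this request */
        if (bstart <= start)
            return 0;               /* request begins inside a bad range */
        if (bstart - start < good)
            good = bstart - start;  /* trim to just before the bad range */
    }
    return good;
}
```

With a single bad range covering sectors 100 to 107, a 32-sector read starting at sector 90 would be trimmed to 10 sectors, and the remaining 22 sectors would have to be served by another request, possibly from a different device.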
NeilBrown	34b343cff4	md: don't allow arrays to contain devices with bad blocks. As no personality understands bad block lists yet, we must reject any device that is known to contain bad blocks. As the personalities get taught, these tests can be removed. This only applies to raid1/raid5/raid10. For linear/raid0/multipath/faulty the whole concept of bad blocks doesn't mean anything so there is no point adding the checks. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>	2011-07-28 11:31:47 +10:00
Jonathan Brassow	654e8b5abc	MD: raid1 s/sysfs_notify_dirent/sysfs_notify_dirent_safe If device-mapper creates a RAID1 array that includes devices to be rebuilt, it will deref a NULL pointer when finished because sysfs is not used by device-mapper instantiated RAID devices. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-27 11:00:36 +10:00
Namhyung Kim	9d3d80113d	md/raid1: move rdev->corrected_errors counting Read errors are considered to be corrected if the write-back and re-read cycle finishes without further problems. Thus moving the rdev->corrected_errors counting after the re-reading looks more reasonable IMHO. Also included a couple of whitespace fixes on sync_page_io(). Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-27 11:00:36 +10:00
NeilBrown	5389042ffa	md: change managed of recovery_disabled. If we hit a read error while recovering a mirror, we want to abort the recovery without necessarily failing the disk - as having a disk with a read error is better than not having an array at all. Currently this is managed with a per-array flag "recovery_disabled" and is only implemented for RAID1. For RAID10 we will need finer grained control as we might want to disable recovery for individual devices separately. So push more of the decision making into the personality. 'recovery_disabled' is now a 'cookie' which is copied when the personality wants to disable recovery and is changed when a device is added to the array as this is used as a trigger to 'try recovery again'. This will allow RAID10 to get the control that it needs. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-27 11:00:36 +10:00
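The 'cookie' scheme described above can be sketched as follows. All struct and function names here are hypothetical, and initialising the personality's copy to a value that differs from the array's is an assumption made so that recovery starts out enabled.

```c
/* Hypothetical stand-ins for the array and the personality's private data. */
struct demo_array { unsigned int recovery_disabled; };
struct demo_conf  { unsigned int recovery_disabled; };

/* Start with a cookie that cannot match, so recovery is initially allowed. */
static void demo_conf_init(struct demo_conf *conf, const struct demo_array *array)
{
    conf->recovery_disabled = array->recovery_disabled - 1;
}

/* The personality disables recovery by copying the array's current cookie. */
static void demo_disable_recovery(struct demo_conf *conf,
                                  const struct demo_array *array)
{
    conf->recovery_disabled = array->recovery_disabled;
}

/* Recovery is disabled only while the two cookies match. */
static int demo_recovery_is_disabled(const struct demo_conf *conf,
                                     const struct demo_array *array)
{
    return conf->recovery_disabled == array->recovery_disabled;
}

/* Adding a device changes the array cookie, which re-enables recovery. */
static void demo_add_device(struct demo_array *array)
{
    array->recovery_disabled++;
}
```

Because each personality (or, for RAID10, potentially each device) holds its own copy, one comparison decides whether recovery may be retried, and no explicit reset is needed when a device is added.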
Namhyung Kim	36fad858a7	md: introduce link/unlink_rdev() helpers There are places where sysfs links to rdev are handled in the same way. Add the helper functions to consolidate them. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-27 11:00:36 +10:00
Christian Dietrich	8bda470e8e	md/raid: use printk_ratelimited instead of printk_ratelimit As per printk_ratelimit comment, it should not be used. Signed-off-by: Christian Dietrich <christian.dietrich@informatik.uni-erlangen.de> Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-27 11:00:36 +10:00
Jonathan Brassow	1ed7242e59	MD: raid1 changes to allow use by device mapper MD RAID1: Changes to allow RAID1 to be used by device-mapper (dm-raid.c) Added the necessary congestion function and conditionalize calls requiring an array 'queue' or 'gendisk'. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-06-08 15:11:31 +10:00
NeilBrown	b098636cf0	md: allow resync_start to be set while an array is active. The sysfs attribute 'resync_start' (known internally as recovery_cp), records where a resync is up to. A value of 0 means the array is not known to be in-sync at all. A value of MaxSector means the array is believed to be fully in-sync. When the size of member devices of an array (RAID1,RAID4/5/6) is increased, the array can be increased to match. This process sets resync_start to the old end-of-device offset so that the new part of the array gets resynced. However with RAID1 (and RAID6) a resync is not technically necessary and may be undesirable. So it would be good if the implied resync after the array is resized could be avoided. So: change 'resync_start' so the value can be changed while the array is active, and as a precaution only allow it to be changed while resync/recovery is 'frozen'. Changing it once resync has started is not going to be useful anyway. This allows the array to be resized without a resync by: write 'frozen' to 'sync_action' write new size to 'component_size' (this will set resync_start) write 'none' to 'resync_start' write 'idle' to 'sync_action'. Also slightly improve some tests on recovery_cp when resizing raid1/raid5. Now that an arbitrary value could be set we should be more careful in our tests. Signed-off-by: NeilBrown <neilb@suse.de>	2011-05-11 15:52:21 +10:00
NeilBrown	af6d7b760c	md/raid1: improve handling of pages allocated for write-behind. The current handling and freeing of these pages is a bit fragile. We only keep the list of allocated pages in each bio, so we need to still have a valid bio when freeing the pages, which is a bit clumsy. So simply store the allocated page list in the r1_bio so it can easily be found and freed when we are finished with the r1_bio. Signed-off-by: NeilBrown <neilb@suse.de>	2011-05-11 14:51:19 +10:00
NeilBrown	7ca78d57d1	md/raid1: try fix_sync_read_error before process_checks. If we get a read error during resync/recovery we currently repeat with single-page reads to find out just where the error is, and possibly read each page from a different device. With check/repair we don't currently do that, we just fail. However it is possible that while all devices fail on the large 64K read, we might be able to satisfy each 4K from one device or another. So call fix_sync_read_error before process_checks to maximise the chance of finding good data and writing it out to the devices with read errors. For this to work, we need to set the 'uptodate' flags properly after fix_sync_read_error has succeeded. Signed-off-by: NeilBrown <neilb@suse.de>	2011-05-11 14:50:37 +10:00
NeilBrown	78d7f5f726	md/raid1: tidy up new functions: process_checks and fix_sync_read_error. These changes are mostly cosmetic: 1/ change mddev->raid_disks to conf->raid_disks because the latter is technically safer, though in current practice it doesn't matter in this particular context. 2/ Rearrange two for / if loops to have an early 'continue' so the body of the 'if' doesn't need to be indented so much. Signed-off-by: NeilBrown <neilb@suse.de>	2011-05-11 14:48:56 +10:00
NeilBrown	a68e587035	md/raid1: split out two sub-functions from sync_request_write sync_request_write is too big and too deep. So split out two self-contained bits of functionality into separate functions. Signed-off-by: NeilBrown <neilb@suse.de>	2011-05-11 14:40:44 +10:00
NeilBrown	76073054c9	md/raid1: clean up read_balance. read_balance has two loops which both look for a 'best' device based on slightly different criteria. This is clumsy and makes it hard to add extra criteria. So replace it all with a single loop that combines everything. Signed-off-by: NeilBrown <neilb@suse.de>	2011-05-11 14:34:56 +10:00
NeilBrown	c3b328ac84	md: fix up raid1/raid10 unplugging. We just need to make sure that an unplug event wakes up the md thread, which is exactly what mddev_check_plugged does. Also remove some plug-related code that is no longer needed. Signed-off-by: NeilBrown <neilb@suse.de>	2011-04-18 18:25:43 +10:00
NeilBrown	e1dfa0a297	md: use new plugging interface for RAID IO. md/raid submits a lot of IO from the various raid threads. So adding start/finish plug calls to those so that some plugging happens. Signed-off-by: NeilBrown <neilb@suse.de>	2011-04-18 18:25:41 +10:00
Martin K. Petersen	a91a2785b2	block: Require subsystems to explicitly allocate bio_set integrity mempool MD and DM create a new bio_set for every metadevice. Each bio_set has an integrity mempool attached regardless of whether the metadevice is capable of passing integrity metadata. This is a waste of memory. Instead we defer the allocation decision to MD and DM since we know at metadevice creation time whether integrity passthrough is needed or not. Automatic integrity mempool allocation can then be removed from bioset_create() and we make an explicit integrity allocation for the fs_bio_set. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reported-by: Zdenek Kabelac <zkabelac@redhat.com> Acked-by: Mike Snitzer <snizer@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-17 11:11:05 +01:00
Jens Axboe	4c63f5646e	Merge branch 'for-2.6.39/stack-plug' into for-2.6.39/core Conflicts: block/blk-core.c block/blk-flush.c drivers/md/raid1.c drivers/md/raid10.c drivers/md/raid5.c fs/nilfs2/btnode.c fs/nilfs2/mdt.c Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-10 08:58:35 +01:00
Jens Axboe	7eaceaccab	block: remove per-queue plugging Code has been converted over to the new explicit on-stack plugging, and delay users have been converted to use the new API for that. So lets kill off the old plugging along with aops->sync_page(). Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-10 08:52:07 +01:00
NeilBrown	da9cf5050a	md: avoid spinlock problem in blk_throtl_exit blk_throtl_exit assumes that ->queue_lock still exists, so make sure that it does. To do this, we stop redirecting ->queue_lock to conf->device_lock and leave it pointing where it is initialised - __queue_lock. As the blk_plug functions check the ->queue_lock is held, we now take that spin_lock explicitly around the plug functions. We don't need the locking, just the warning removal. This is needed for any kernel with the blk_throtl code, which is 2.6.37 and later. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2011-02-21 18:25:57 +11:00
Jonathan Brassow	ccebd4c415	md-new-param-to_sync_page_io Add new parameter to 'sync_page_io'. The new parameter allows us to distinguish between metadata and data operations. This becomes important later when we add the ability to use separate devices for data and metadata. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>	2011-01-14 09:14:33 +11:00
Joe Perches	067032bc62	md: Fix single printks with multiple KERN_<level>s Noticed-by: Russell King <linux@arm.linux.org.uk> Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-01-14 09:14:33 +11:00
NeilBrown	8f9e0ee38f	md/raid1: really fix recovery looping when single good device fails. Commit `4044ba58dd` supposedly fixed a problem where if a raid1 with just one good device gets a read-error during recovery, the recovery would abort and immediately restart in an infinite loop. However it depended on raid1_remove_disk removing the spare device from the array. But that does not happen in this case. So add a test so that in the 'recovery_disabled' case, the device will be removed. This is suitable for any kernel since 2.6.29 which is when recovery_disabled was introduced. Cc: stable@kernel.org Reported-by: Sebastian Färber <faerber@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2010-11-24 16:39:46 +11:00
NeilBrown	f3ac8bf7ce	md: tidy up device searches in read_balance. The code for searching through the device list to read-balance in raid1 is rather clumsy and hard to follow. Try to simplify it a bit. No important functionality change here. Signed-off-by: NeilBrown <neilb@suse.de>	2010-10-29 16:40:33 +11:00
NeilBrown	046abeede7	md/raid1: fix some typos in comments. Signed-off-by: NeilBrown <neilb@suse.de>	2010-10-29 16:40:33 +11:00
NeilBrown	9b19553e0b	md/raid1: discard unused variable. This structure field (flushing_bio_list) is never used, so remove it. Signed-off-by: NeilBrown <neilb@suse.de>	2010-10-29 16:40:33 +11:00
NeilBrown	a167f66324	md: use separate bio pool for each md device. bio_clone and bio_alloc allocate from a common bio pool. If an md device is stacked with other devices that use this pool, or under something like swap which uses the pool, then the multiple calls on the pool can cause deadlocks. So allocate a local bio pool for each md array and use that rather than the common pool. This pool is used both for regular IO and metadata updates. Signed-off-by: NeilBrown <neilb@suse.de>	2010-10-28 17:36:15 +11:00
NeilBrown	2b193363ef	md: change type of first arg to sync_page_io. Currently sync_page_io takes a 'bdev'. Every caller passes 'rdev->bdev'. We will soon want another field out of the rdev in sync_page_io, So just pass the rdev instead of the bdev out of it. Signed-off-by: NeilBrown <neilb@suse.de>	2010-10-28 17:36:11 +11:00
NeilBrown	1c4588e9c1	md/raid1: perform mem allocation before disabling writes during resync. Though this mem alloc is GFP_NOIO and so will not deadlock, it seems better to do the allocation before 'raise_barrier' which stops any IO requests while the resync proceeds. raid10 always uses this order, so it is at least consistent to do the same in raid1. Signed-off-by: NeilBrown <neilb@suse.de>	2010-10-28 17:36:09 +11:00
NeilBrown	6746557f03	md: use bio_kmalloc rather than bio_alloc when failure is acceptable. bio_alloc can never fail (as it uses a mempool) but can block indefinitely, especially if the caller is holding a reference to a previously allocated bio. So these two places which both handle failure and hold multiple bios should not use bio_alloc, they should use bio_kmalloc. Signed-off-by: NeilBrown <neilb@suse.de>	2010-10-28 17:36:06 +11:00
NeilBrown	4e78064f42	md: Fix possible deadlock with multiple mempool allocations. It is not safe to allocate from a mempool while holding an item previously allocated from that mempool as that can deadlock when the mempool is close to exhaustion. So don't use a bio list to collect the bios to write to multiple devices in raid1 and raid10. Instead queue each bio as it becomes available so an unplug will activate all previously allocated bios and so a new bio has a chance of being allocated. This means we must set the 'remaining' count to '1' before submitting any requests, then when all are submitted, decrement 'remaining' and possibly handle the write completion at that point. Reported-by: Torsten Kaiser <just.for.lkml@googlemail.com> Tested-by: Torsten Kaiser <just.for.lkml@googlemail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2010-10-28 17:34:07 +11:00
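The "start 'remaining' at 1" idea in the commit above is a common biased-reference-count pattern. The following is a single-threaded sketch with hypothetical names; real kernel code would use an atomic counter, and completions would arrive asynchronously.

```c
/* Hypothetical write request tracked across several member devices. */
struct demo_request {
    int remaining;    /* outstanding sub-writes, plus one bias reference */
    int completions;  /* how many times the whole request was completed */
};

/* Drop one reference; the request completes when the count reaches zero. */
static void demo_put(struct demo_request *r)
{
    if (--r->remaining == 0)
        r->completions++;
}

/*
 * Submit one sub-write per device as soon as it is ready. The initial count
 * of 1 is a bias held by the submitter, so a sub-write that finishes before
 * the loop ends cannot drive the count to zero and complete the request early.
 */
static void demo_submit_all(struct demo_request *r, int ndevices,
                            int complete_immediately)
{
    r->remaining = 1;
    r->completions = 0;

    for (int i = 0; i < ndevices; i++) {
        r->remaining++;
        if (complete_immediately)
            demo_put(r);   /* models a device that finishes instantly */
    }

    demo_put(r);           /* drop the bias; may complete the request here */
}
```

If every sub-write finishes during submission, the final `demo_put` by the submitter is the one that completes the request, which matches the commit's note that write completion may have to be handled at that point.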
NeilBrown	57dab0bdf6	md: use sector_t in bitmap_get_counter bitmap_get_counter returns the number of sectors covered by the counter in a pass-by-reference variable. In some cases this can be very large, so make it a sector_t for safety. Signed-off-by: NeilBrown <neilb@suse.de>	2010-10-28 17:32:26 +11:00
Jens Axboe	fa251f8990	Merge branch 'v2.6.36-rc8' into for-2.6.37/barrier Conflicts: block/blk-core.c drivers/block/loop.c mm/swapfile.c Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-10-19 09:13:04 +02:00
NeilBrown	db8d9d3591	md/raid1: minor bio initialisation improvements. When performing a resync we pre-allocate some bios and repeatedly use them. This requires us to re-initialise them each time. One field (bi_comp_cpu) and some flags weren't being initialised reliably. Signed-off-by: NeilBrown <neilb@suse.de>	2010-10-07 12:00:50 +11:00
NeilBrown	7571ae887d	md/raid1: avoid overflow in raid1 resync when bitmap is in use. bitmap_start_sync returns - via a pass-by-reference variable - the number of sectors before we need to check with the bitmap again. Since commit `ef42567335` this number can be substantially larger, 2^27 is a common value. Unfortunately it is an 'int' and so when raid1.c:sync_request shifts it 9 places to the left it becomes 0. This results in a zero-length read which the scsi layer justifiably complains about. This patch just removes the shift so the common case becomes safe with a trivially-correct patch. In the next merge window we will convert this 'int' to a 'sector_t' Reported-by: "George Spelvin" <linux@horizon.com> Signed-off-by: NeilBrown <neilb@suse.de>	2010-10-07 11:54:46 +11:00
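The arithmetic behind this overflow can be reproduced in a few lines. This is an illustrative sketch with hypothetical function names; it uses unsigned 32-bit arithmetic so that the wraparound is well defined in C, whereas the commit describes a plain 'int'.

```c
#include <stdint.h>

/* 32-bit sectors-to-bytes conversion: the shift discards the high bits. */
static uint32_t demo_sectors_to_bytes_32(uint32_t sectors)
{
    return sectors << 9;   /* 2^27 << 9 = 2^36, which wraps to 0 in 32 bits */
}

/* The same conversion in 64 bits keeps the full value. */
static uint64_t demo_sectors_to_bytes_64(uint64_t sectors)
{
    return sectors << 9;
}
```

For the value 2^27 mentioned in the commit, the 32-bit version yields 0 (hence the zero-length read), while the 64-bit version yields 2^36.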
Tejun Heo	e9c7469bb4	md: implment REQ_FLUSH/FUA support This patch converts md to support REQ_FLUSH/FUA instead of now deprecated REQ_HARDBARRIER. In the core part (md.c), the following changes are notable. * Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA don't interfere with processing of other requests and thus there is no reason to mark the queue congested while FLUSH/FUA is in progress. * REQ_FLUSH/FUA failures are final and its users don't need retry logic. Retry logic is removed. * Preflush needs to be issued to all member devices but FUA writes can be handled the same way as other writes - their processing can be deferred to request_queue of member devices. md_barrier_request() is renamed to md_flush_request() and simplified accordingly. For linear, raid0 and multipath, the core changes are enough. raid1, 5 and 10 need the following conversions. * raid1: Handling of FLUSH/FUA bio's can simply be deferred to request_queues of member devices. Barrier related logic removed. * raid5: Queue draining logic dropped. FUA bit is propagated through biodrain and stripe resconstruction such that all the updated parts of the stripe are written out with FUA writes if any of the dirtying writes was FUA. preread_active_stripes handling in make_request() is updated as suggested by Neil Brown. * raid10: FUA bit needs to be propagated to write clones. linear, raid0, 1, 5 and 10 tested. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Neil Brown <neilb@suse.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-09-10 12:35:38 +02:00
NeilBrown	2c7d46ec19	md raid-1/10 Fix bio_rw bit manipulations again commit `7b6d91daee` changed the behaviour of a few variables in raid1 and raid10 from flags to bit-sets, but left them as type 'bool' so they did not work. Change them (back) to unsigned long. (historical note: see `1ef04fefe2`) Signed-off-by: NeilBrown <neilb@suse.de> Reported-by: Jiri Slaby <jslaby@suse.cz> and many others	2010-08-18 16:16:05 +10:00
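Why a 'bool' cannot carry a bit-set can be shown with a small sketch. `DEMO_REQ_SYNC` is a hypothetical flag at an arbitrary bit position, not the kernel's definition.

```c
#include <stdbool.h>

#define DEMO_REQ_SYNC (1UL << 4)   /* hypothetical flag bit */

/* Storing the masked value in a bool collapses it to 0 or 1. */
static unsigned long demo_flag_via_bool(unsigned long rw)
{
    bool do_sync = rw & DEMO_REQ_SYNC;
    return (unsigned long)do_sync;   /* bit 0, not DEMO_REQ_SYNC */
}

/* Storing it in an unsigned long preserves the actual bit. */
static unsigned long demo_flag_via_ulong(unsigned long rw)
{
    unsigned long do_sync = rw & DEMO_REQ_SYNC;
    return do_sync;
}
```

When the saved value is later ORed into the flags of a cloned bio, the bool version sets bit 0 instead of the intended flag, so the flag is silently lost or replaced by an unrelated one.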
NeilBrown	6b96562054	md: provide appropriate return value for spare_active functions. md_check_recovery expects ->spare_active to return 'true' if any spares were activated, but none of them do, so the consequent change in 'degraded' is not notified through sysfs. So count the number of spares activated, subtract it from 'degraded' just once, and return it. Reported-by: Adrian Drzewiecki <adriand@vmware.com> Signed-off-by: NeilBrown <neilb@suse.de>	2010-08-18 12:04:32 +10:00
Adrian Drzewiecki	e6ffbcb6cd	md: Notify sysfs when RAID1/5/10 disk is In_sync. When RAID1 is done syncing disks, it'll update the state of synced rdevs to In_sync. But it neglected to notify sysfs that the attribute changed. So any programs that are waiting for an rdev's state to change will not be woken. (raid5/raid10 added by neilb) Signed-off-by: Adrian Drzewiecki <adriand@vmware.com> Signed-off-by: NeilBrown <neilb@suse.de>	2010-08-18 11:49:02 +10:00
Christoph Hellwig	7b6d91daee	block: unify flags for struct bio and struct request Remove the current bio flags and reuse the request flags for the bio, too. This makes it easier to trace the type of I/O from the filesystem down to the block driver. There were two flags in the bio that were missing in the requests: BIO_RW_UNPLUG and BIO_RW_AHEAD. Also I've renamed two request flags that had a superfluous RW in them. Note that the flags are in bio.h despite having the REQ_ name - as blkdev.h includes bio.h that is the only way to go for now. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-08-07 18:20:39 +02:00
NeilBrown	19fdb9eefb	Merge commit '3ff195b011d7decf501a4d55aeed312731094796' into for-linus Conflicts: drivers/md/md.c - Resolved conflict in md_update_sb - Added extra 'NULL' arg to new instance of sysfs_get_dirent. Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-22 08:31:36 +10:00
NeilBrown	af3a2cd6b8	md: Fix read balancing in RAID1 and RAID10 on drives > 2TB read_balance uses a "unsigned long" for a sector number which will get truncated beyond 2TB. This will cause read-balancing to be non-optimal, and can cause data to be read from the 'wrong' branch during a resync. This has a very small chance of returning wrong data. Reported-by: Jordan Russell <jr-list-2010@quo.to> Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:28:00 +10:00
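The 2TB limit follows from 512-byte sectors and a 32-bit sector variable: 2^32 sectors of 512 bytes is 2^41 bytes, or 2TB. The sketch below assumes 'unsigned long' is 32 bits wide, as on 32-bit Linux targets, and models it with `uint32_t`; the function name is hypothetical.

```c
#include <stdint.h>

/* Model of storing a 64-bit sector number in a 32-bit variable. */
static uint32_t demo_truncated_sector(uint64_t sector)
{
    return (uint32_t)sector;   /* sectors at or beyond 2^32 lose high bits */
}
```

A sector just past the 2TB boundary, such as 2^32 + 8, is seen as sector 8. Any head-distance estimate or resync-position comparison based on the truncated value then refers to the wrong part of the device.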
NeilBrown	9dd1e2faf7	md/raid1: improve printk messages Make sure the array name is included in a uniform way in all printk messages. Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:27:59 +10:00
NeilBrown	e555190d82	md/raid1: delay reads that could overtake behind-writes. When a raid1 array is configured to support write-behind on some devices, it normally only reads from other devices. If all devices are write-behind (because the rest have failed) it is possible for a read request to be serviced before a behind-write request, which would appear as data corruption. So when forced to read from a WriteMostly device, wait for any write-behind to complete, and don't start any more behind-writes. Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:27:57 +10:00
NeilBrown	d754c5ae1f	md/raid1: fix confusing 'redirect sector' message. This message seems to suggest the named device is the one on which a read failed, however it is actually the device that the read will be redirected to. So make the message a little clearer. Reported-by: Tim Burgess <ozburgess@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:27:56 +10:00
NeilBrown	21a52c6d05	md: pass mddev to make_request functions rather than request_queue We used to pass the personality make_request function direct to the block layer so the first argument had to be a queue. But now we have the intermediary md_make_request so it makes a lot more sense to pass a struct mddev_s. It makes it possible to have an mddev without its own queue too. Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:27:55 +10:00
NeilBrown	b821eaa572	md: remove ->changed and related code. We set ->changed to 1 and call check_disk_change at the end of md_open so that bd_invalidated would be set and thus partition rescan would happen appropriately. Now that we call revalidate_disk directly, which sets bd_invalidated, that indirection is no longer needed and can be removed. Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:27:53 +10:00
NeilBrown	490773268c	md: move io accounting out of personalities into md_make_request While I generally prefer letting personalities do as much as possible, given that we have a central md_make_request anyway we may as well use it to simplify code. Also this centralises knowledge of ->gendisk which will help later. Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:27:52 +10:00
H Hartley Sweeten	7b92813c3c	drivers/md: Remove unnecessary casts of void * void pointers do not need to be cast to other pointer types. Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:27:46 +10:00
NeilBrown	964147d5c8	md/raid1: fix counting of write targets. There is a very small race window when writing to a RAID1 such that if a device is marked faulty at exactly the wrong time, the write-in-progress will not be sent to the device, but the bitmap (if present) will be updated to say that the write was sent. Then if the device turned out to still be usable and was re-added to the array, the bitmap-based-resync would skip resyncing that block, possibly leading to corruption. This would only be a problem if no further writes were issued to that area of the device (i.e. that bitmap chunk). Suitable for any pending -stable kernel. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:27:13 +10:00
Tejun Heo	5a0e3ad6af	include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. 
This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>	2010-03-30 22:02:32 +09:00
NeilBrown	627a2d3c29	md: deal with merge_bvec_fn in component devices better. If a component device has a merge_bvec_fn then as we never call it we must ensure we never need to. Currently this is done by setting max_sector to 1 PAGE, however this does not stop a bio being created with several sub-page iovecs that would violate the merge_bvec_fn. So instead set max_segments to 1 and set the segment boundary to the same as a page boundary to ensure there is only ever one single-page segment of IO requested at a time. This can particularly be an issue when 'xen' is used as it is known to submit multiple small buffers in a single bio. Signed-off-by: NeilBrown <neilb@suse.de> Cc: stable@kernel.org	2010-03-16 17:04:24 +11:00
Martin K. Petersen	086fa5ff08	block: Rename blk_queue_max_sectors to blk_queue_max_hw_sectors The block layer calling convention is blk_queue_<limit name>. blk_queue_max_sectors predates this practice, leading to some confusion. Rename the function to appropriately reflect that its intended use is to set max_hw_sectors. Also introduce a temporary wrapper for backwards compatibility. This can be removed after the merge window is closed. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-02-26 13:58:08 +01:00
NeilBrown	0efb9e6191	md: add MODULE_DESCRIPTION for all md related modules. Suggested by Oren Held <orenhe@il.ibm.com> Signed-off-by: NeilBrown <neilb@suse.de>	2009-12-14 12:51:41 +11:00
NeilBrown	42a04b5078	md: move offset, daemon_sleep and chunksize out of bitmap structure ... and into bitmap_info. These are all configuration parameters that need to be set before the bitmap is created. Signed-off-by: NeilBrown <neilb@suse.de>	2009-12-14 12:51:41 +11:00
NeilBrown	709ae4879a	md/raid1: add takeover support for raid5->raid1 A 2-device raid5 array can now be converted to raid1. Signed-off-by: NeilBrown <neilb@suse.de>	2009-12-14 12:51:41 +11:00
NeilBrown	6eef4b21ff	md: add honouring of suspend_{lo,hi} to raid1. This will allow us to stop writeout to portions of the array while they are resynced by someone else - e.g. another node in a cluster. Signed-off-by: NeilBrown <neilb@suse.de>	2009-12-14 12:51:40 +11:00
NeilBrown	d0e260782c	md: revert incorrect fix for read error handling in raid1. commit `4706b349f` was a forward port of a fix that was needed for SLES10. But in fact it is not needed in mainline because the earlier commit `dd00a99e7a` fixes the same problem in a better way. Further, this commit introduces a bug in the way it interacts with the automatic read-error-correction. If, after a read error is successfully corrected, the same disk is chosen to re-read - the re-read won't be attempted but an error will be returned instead. After reverting that commit, there is the possibility that a read error on a read-only array (where read errors cannot be corrected as that requires a write) will repeatedly read the same device and continue to get an error. So in the "Array is readonly" case, fail the drive immediately on a read error. Signed-off-by: NeilBrown <neilb@suse.de> Cc: stable@kernel.org	2009-12-01 17:30:59 +11:00
NeilBrown	ed9bfdf1a4	md: raid1/raid10: handle allocation errors during array setup. Both raid1 and raid10 create a mempool during startup. If the 'alloc' function for this mempool fails, unplug_slaves is called. If that happens when the pool is being initialised, unplug_slaves will try to use the 'conf' structure that isn't filled in yet, and badness will happen. So ensure that unplug_slaves doesn't get called unless we know that the conf structure is fully initialised. Signed-off-by: NeilBrown <neilb@suse.de>	2009-10-16 15:55:44 +11:00
NeilBrown	1d9d52416c	md/raid1/raid10: add a cond_resched During 'check' of a raid1 or raid10 it is possible for the management thread to spend a lot of time running 'memcmp' on blocks from different devices, so make sure the thread has a chance to schedule. raid5d already has a cond_resched (in process_stripe). Reported-By: Lee Howard <faxguy@howardsilvan.com> Signed-off-by: NeilBrown <neilb@suse.de>	2009-10-16 15:55:32 +11:00
Dmitry Monakhov	1ef04fefe2	md: raid-1/10: fix RW bits manipulation Recently Jens changed the bio_rw_flagged() logic in commit `1f98a13f62`. It now returns bool instead of int. This broke the raid1/raid10 RW bits manipulation logic. One visible result is a BUG_ON triggering due to an empty barrier at scsi_lib.c:1108 scsi_setup_fs_cmnd() Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-09-23 18:20:15 +10:00
NeilBrown	3fa841d7e7	md: report device as congested when suspended This should stop writeback from coming when the device is temporarily suspended. Signed-off-by: NeilBrown <neilb@suse.de>	2009-09-23 18:10:29 +10:00
NeilBrown	0da3c6194e	md: Improve name of threads created by md_register_thread The management threads for raid4,5,6 arrays are all called mdX_raid5, independent of the actual raid level, which is wrong and can be confusing. So change md_register_thread to use the name from the personality unless an alternate name (like 'resync' or 'reshape') is given. This is simpler and more correct. Cc: Jinzc <zhenchengjin@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2009-09-23 18:09:45 +10:00
Jens Axboe	1f98a13f62	bio: first step in sanitizing the bio->bi_rw flag testing Get rid of any functions that test for these bits and make callers use bio_rw_flagged() directly. Then it is at least directly apparent what variable and flag they check. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-11 14:33:31 +02:00
NeilBrown	449aad3e25	md: Use revalidate_disk to effect changes in size of device. As revalidate_disk calls check_disk_size_change, it will cause any capacity change of a gendisk to be propagated to the blockdev inode. So use that instead of mucking about with locks and i_size_write. Also add a call to revalidate_disk in do_md_run and a few other places where the gendisk capacity is changed. Signed-off-by: NeilBrown <neilb@suse.de>	2009-08-03 10:59:58 +10:00
Andre Noll	ac5e7113e7	md: Push down data integrity code to personalities. This patch replaces md_integrity_check() by two new public functions: md_integrity_register() and md_integrity_add_rdev() which are both personality-independent. md_integrity_register() is called from the ->run and ->hot_remove methods of all personalities that support data integrity. The function iterates over the component devices of the array and determines if all active devices are integrity capable and if their profiles match. If this is the case, the common profile is registered for the mddev via blk_integrity_register(). The second new function, md_integrity_add_rdev() is called from the ->hot_add_disk methods, i.e. whenever a new device is being added to a raid array. If the new device does not support data integrity, or has a profile different from the one already registered, data integrity for the mddev is disabled. For raid0 and linear, only the call to md_integrity_register() from the ->run method is necessary. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-08-03 10:59:47 +10:00
Martin K. Petersen	8f6c2e4b32	md: Use new topology calls to indicate alignment and I/O sizes Switch MD over to the new disk_stack_limits() function which checks for alignment and adjusts preferred I/O sizes when stacking. Also indicate preferred I/O sizes where applicable. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2009-07-01 11:13:45 +10:00
Andre Noll	8c6ac868b1	md: Push down reconstruction log message to personality code. Currently, the md layer checks in analyze_sbs() if the raid level supports reconstruction (mddev->level >= 1) and if reconstruction is in progress (mddev->recovery_cp != MaxSector). Move that printk into the personality code of those raid levels that care (levels 1, 4, 5, 6, 10). Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-18 08:48:06 +10:00
Andre Noll	664e7c413f	md: Convert mddev->new_chunk to sectors. A straight-forward conversion which gets rid of some multiplications/divisions/shifts. The patch also introduces a couple of new ones, most of which are due to conf->chunk_size still being represented in bytes. This will be cleaned up in subsequent patches. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-18 08:45:27 +10:00
Andre Noll	9d8f036362	md: Make mddev->chunk_size sector-based. This patch renames the chunk_size field to chunk_sectors with the implied change of semantics. Since is_power_of_2(chunk_size) = is_power_of_2(chunk_sectors << 9) = is_power_of_2(chunk_sectors) these bits don't need an adjustment for the shift. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-18 08:45:01 +10:00
NeilBrown	070ec55d07	md: remove mddev_to_conf "helper" macro Having a macro just to cast a void* isn't really helpful. I would much rather see that we are simply de-referencing ->private, than have to know what the macro does. So open code the macro everywhere and remove the pointless cast. Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-16 16:54:21 +10:00
Martin K. Petersen	ae03bf639a	block: Use accessor functions for queue limits Convert all external users of queue limits to using wrapper functions instead of poking the request queue variables directly. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-05-22 23:22:54 +02:00
Christoph Hellwig	8f3d8ba20e	block: move bio list helpers into bio.h It's used by DM and MD and generally useful, so move the bio list helpers into bio.h. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-04-15 08:28:09 +02:00
Alexander Beregalov	91a9e99d76	md/raid1: fix build breakage Fix this build error: drivers/md/raid1.c: In function 'raid1_congested': drivers/md/raid1.c:589: error: 'BDI_write_congested' undeclared BDI_write_congested was changed in commit `1faa16d228` ("block: change the request allocation/congestion logic to be sync/async based") Signed-off-by: Alexander Beregalov <a.beregalov@gmail.com> Cc: Neil Brown <neilb@suse.de> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-04-06 14:40:07 -07:00
NeilBrown	303a0e11d0	md/raid1 - don't assume newly allocated bvecs are initialised. Since commit `d3f761104b` newly allocated bvecs aren't initialised to NULL, so we have to be more careful about freeing a bio which only managed to get a few pages allocated to it. Otherwise the resync process crashes. This patch is appropriate for 2.6.29-stable. Cc: stable@kernel.org Cc: "Jens Axboe" <jens.axboe@oracle.com> Reported-by: Gabriele Tozzi <gabriele@tozzi.eu> Signed-off-by: NeilBrown <neilb@suse.de>	2009-04-06 14:40:38 +10:00
Dan Williams	b522adcde9	md: 'array_size' sysfs attribute Allow userspace to set the size of the array according to the following semantics: 1/ size must be <= to the size returned by mddev->pers->size(mddev, 0, 0) a) If size is set before the array is running, do_md_run will fail if size is greater than the default size b) A reshape attempt that reduces the default size to less than the set array size should be blocked 2/ once userspace sets the size the kernel will not change it 3/ writing 'default' to this attribute returns control of the size to the kernel and reverts to the size reported by the personality Also, convert locations that need to know the default size from directly reading ->array_sectors to <pers>_size. Resync/reshape operations always follow the default size. Finally, fixup other locations that read a number of 1k-blocks from userspace to use strict_blocks_to_sectors() which checks for unsigned long long to sector_t overflow and blocks to sectors overflow. Reviewed-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2009-03-31 15:00:31 +11:00
Dan Williams	1f403624bd	md: centralize ->array_sectors modifications Get personalities out of the business of directly modifying ->array_sectors. Lays groundwork to introduce policy on when ->array_sectors can be modified. Reviewed-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2009-03-31 14:59:03 +11:00
Dan Williams	80c3a6ce4b	md: add 'size' as a personality method In preparation for giving userspace control over ->array_sectors we need to be able to retrieve the 'default' size, and the 'anticipated' size when a reshape is requested. For personalities that do not reshape emit a warning if anything but the default size is requested. In the raid5 case we need to update ->previous_raid_disks to make the new 'default' size available. Reviewed-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2009-03-31 14:57:49 +11:00
NeilBrown	409c57f380	md: enable suspend/resume of md devices. To be able to change the 'level' of an md/raid array, we need to suspend the device so that no requests are active - then move some pointers around etc. The code already keeps counts of active requests and the ->quiesce function can be used to wait until those counts hit zero. However the quiesce function blocks new requests once they are all ready 'inside' the personality module, and that is too late if we want to replace the personality modules. So make all md requests come in through a common md_make_request function that keeps track of how many requests have entered the modules but may not yet be on the internal reference counts. Allow md_make_request to be blocked when we want to suspend the device, and make it possible to wait for all those in-transit requests to be added to internal lists so that ->quiesce can wait for them. There is still a problem that when a request completes, we drop the ref count inside the personality code so there is a short time between when the refcount hits zero, and when the personality code is no longer being used. The personality code never blocks (schedule or spinlock) between dropping the refcount and exiting the routine, so this should be safe (as put_module calls synchronize_sched() before unmapping the module code). Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:39:39 +11:00
Andre Noll	58c0fed400	md: Make mddev->size sector-based. This patch renames the "size" field of struct mddev_s to "dev_sectors" and stores the number of 512-byte sectors instead of the number of 1K-blocks in it. All users of that field, including raid levels 1,4-6,10, are adjusted accordingly. This simplifies the code a bit because it allows to get rid of a couple of divisions/multiplications by two. In order to make checkpatch happy, some minor coding style issues have also been addressed. In particular, size_store() now uses strict_strtoull() instead of simple_strtoull(). Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:33:13 +11:00
NeilBrown	43b2e5d86d	md: move md_k.h from include/linux/raid/ to drivers/md/ It really is nicer to keep related code together.. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:33:13 +11:00
NeilBrown	bff61975b3	md: move lots of #include lines out of .h files and into .c This makes the includes more explicit, and is preparation for moving md_k.h to drivers/md/md.h Remove include/raid/md.h as its only remaining use was to #include other files. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:33:13 +11:00
Christoph Hellwig	ef740c372d	md: move headers out of include/linux/raid/ Move the headers with the local structures for the disciplines and bitmap.h into drivers/md/ so that they are more easily grepable for hacking and not far away. md.h is left where it is for now as there are some uses from the outside. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:27:03 +11:00
NeilBrown	73d5c38a95	md: avoid races when stopping resync. There has been a race in raid10 and raid1 for a long time which has only recently started showing up due to a scheduler change. When a sync_read request finishes, as soon as reschedule_retry is called, another thread can mark the resync request as having completed, so md_do_sync can finish, ->stop can be called, and ->conf can be freed. So using conf after reschedule_retry is not safe. Similarly, when finishing a sync_write, calling md_done_sync must be the last thing we do, as it allows a chain of events which will free conf and other data structures. The first of these requires action in raid10.c. The second requires action in raid1.c and raid10.c. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2009-02-25 13:18:47 +11:00
NeilBrown	4706b349f4	md: Allow read error in a single drive raid1 to be passed up. If a raid1 only has a single working device and gets a read error, we choose to simply return that error up to the filesystem (or whatever) rather than failing the whole array. However the code doesn't quite do that. We attempt a readbalance which allocates the same drive, so we retry the read - indefinitely. Instead: If read_balance in the error case chooses the same drive that just failed, treat it as a failure and don't retry. Signed-off-by: NeilBrown <neilb@suse.de>	2009-02-06 15:06:47 +11:00
NeilBrown	4044ba58dd	md: don't retry recovery of raid1 that fails due to error on source drive. If a raid1 has only one working drive and it has a sector which gives an error on read, then an attempt to recover onto a spare will fail, but as the single remaining drive is not removed from the array, the recovery will be immediately re-attempted, resulting in an infinite recovery loop. So detect this situation and don't retry recovery once an error on the lone remaining drive is detected. Allow recovery to be retried once every time a spare is added in case the problem wasn't actually a media error. Signed-off-by: NeilBrown <neilb@suse.de>	2009-01-09 08:31:11 +11:00
Cheng Renquan	159ec1fc06	md: use list_for_each_entry macro directly The rdev_for_each macro defined in <linux/raid/md_k.h> is identical to list_for_each_entry_safe, from <linux/list.h>, it should be defined to use list_for_each_entry_safe, instead of reinventing the wheel. But some calls to each_entry_safe don't really need a safe version, just a direct list_for_each_entry is enough, this could save a temp variable (tmp) in every function that used rdev_for_each. In this patch, most rdev_for_each loops are replaced by list_for_each_entry, totally save many tmp vars; and only in the other situations that will call list_del to delete an entry, the safe version is used. Signed-off-by: Cheng Renquan <crquan@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2009-01-09 08:31:08 +11:00
Stephen Rothwell	255707274e	md: build failure due to missing delay.h Today's linux-next build (powerpc ppc64_defconfig) failed like this: drivers/md/raid1.c: In function 'sync_request': drivers/md/raid1.c:1759: error: implicit declaration of function 'msleep_interruptible' make[3]: *** [drivers/md/raid1.o] Error 1 make[3]: *** Waiting for unfinished jobs.... drivers/md/raid10.c: In function 'sync_request': drivers/md/raid10.c:1749: error: implicit declaration of function 'msleep_interruptible' make[3]: *** [drivers/md/raid10.o] Error 1 drivers/md/md.c: In function 'md_do_sync': drivers/md/md.c:5915: error: implicit declaration of function 'msleep' Caused by commit 6caa3b0bbdb474647f6bdd8a958ffc46f78d8d58 ("md: Remove unnecessary #includes, #defines, and function declarations"). I added the following patch. Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NeilBrown <neilb@suse.de>	2008-10-15 21:57:05 +11:00
Tejun Heo	074a7aca7a	block: move stats from disk to part0 Move stats related fields - stamp, in_flight, dkstats - from disk to part0 and unify stat handling such that... * part_stat_*() now updates part0 together if the specified partition is not part0. ie. part_stat_*() are now essentially all_stat_*(). {disk|all}_stat_*() are gone. part_round_stats() is updated similarly. It handles part0 stats automatically and disk_round_stats() is killed. * part_{inc|dec}_in_flight() is implemented which automatically updates part0 stats for parts other than part0. * disk_map_sector_rcu() is updated to return part0 if no part matches. Combined with the above changes, this makes NULL special case handling in callers unnecessary. * Separate stats show code paths for disk are collapsed into part stats show code paths. * Rename disk_stat_lock/unlock() to part_stat_lock/unlock() While at it, reposition stat handling macros a bit and add missing parentheses around macro parameters. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:08 +02:00
Tejun Heo	c995905916	block: fix diskstats access There are two variants of stat functions - ones prefixed with double underbars which don't care about preemption and ones without which disable preemption before manipulating per-cpu counters. It's unclear whether the underbarred ones assume that preemption is disabled on entry as some callers don't do that. This patch unifies diskstats access by implementing disk_stat_lock() and disk_stat_unlock() which take care of both RCU (for partition access) and preemption (for per-cpu counter access). diskstats access should always be enclosed between the two functions. As such, there's no need for the versions which disable preemption. They're removed and double underbars ones are renamed to drop the underbars. As an extra argument is added, there's no danger of using the old version unconverted. disk_stat_lock() uses get_cpu() and returns the cpu index and all diskstat functions which access per-cpu counters now have a @cpu argument to help RT. This change adds RCU or preemption operations at some places but also collapses several preemption ops into one at others. Overall, the performance difference should be negligible as all involved ops are very lightweight per-cpu ones. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:06 +02:00
Jens Axboe	960e739d9e	block: raid fixups for removal of bi_hw_segments Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:03 +02:00
Mikulas Patocka	5df97b91b5	drop vmerge accounting Remove hw_segments field from struct bio and struct request. Without virtual merge accounting they have no purpose. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:03 +02:00
Andre Noll	f233ea5c9e	md: Make mddev->array_size sector-based. This patch renames the array_size field of struct mddev_s to array_sectors and converts all instances to use units of 512 byte sectors instead of 1k blocks. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2008-07-21 17:05:22 +10:00
Dan Williams	b5470dc5fc	md: resolve external metadata handling deadlock in md_allow_write md_allow_write() marks the metadata dirty while holding mddev->lock and then waits for the write to complete. For externally managed metadata this causes a deadlock as userspace needs to take the lock to communicate that the metadata update has completed. Change md_allow_write() in the 'external' case to start the 'mark active' operation and then return -EAGAIN. The expected side effects while waiting for userspace to write 'active' to 'array_state' are holding off reshape (code currently handles -ENOMEM), cause some 'stripe_cache_size' change requests to fail, cause some GET_BITMAP_FILE ioctl requests to fall back to GFP_NOIO, and cause updates to 'raid_disks' to fail. Except for 'stripe_cache_size' changes these failures can be mitigated by coordinating with mdmon. md_write_start() still prevents writes from occurring until the metadata handler has had a chance to take action as it unconditionally waits for MD_CHANGE_CLEAN to be cleared. [neilb@suse.de: return -EAGAIN, try GFP_NOIO] Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2008-06-30 17:18:19 -07:00
Neil Brown	199050ea1f	rationalise return value for ->hot_add_disk method. For all array types but linear, ->hot_add_disk returns 1 on success, 0 on failure. For linear, it returns 0 on success and -errno on failure. This doesn't cause a functional problem because the ->hot_add_disk function of linear is used quite differently to the others. However it is confusing. So convert all to return 0 for success or -errno on failure and fix call sites to match. Signed-off-by: Neil Brown <neilb@suse.de>	2008-06-28 08:31:33 +10:00

249 Commits (8fc229a51b0e10f4ceb794e8b99fa0a427a7ba41)