linux

q3k/linux

Author	SHA1	Message	Date
Dan Williams	ac6b53b6e6	md/raid6: asynchronous raid6 operations [ Based on an original patch by Yuri Tikhonov ] The raid_run_ops routine uses the asynchronous offload api and the stripe_operations member of a stripe_head to carry out xor+pq+copy operations asynchronously, outside the lock. The operations performed by RAID-6 are the same as in the RAID-5 case except for no support of STRIPE_OP_PREXOR operations. All the others are supported: STRIPE_OP_BIOFILL - copy data into request buffers to satisfy a read request STRIPE_OP_COMPUTE_BLK - generate missing blocks (1 or 2) in the cache from the other blocks STRIPE_OP_BIODRAIN - copy data out of request buffers to satisfy a write request STRIPE_OP_RECONSTRUCT - recalculate parity for new data that has entered the cache STRIPE_OP_CHECK - verify that the parity is correct The flow is the same as in the RAID-5 case, and reuses some routines, namely: 1/ ops_complete_postxor (renamed to ops_complete_reconstruct) 2/ ops_complete_compute (updated to set up to 2 targets uptodate) 3/ ops_run_check (renamed to ops_run_check_p for xor parity checks) [neilb@suse.de: fixes to get it to pass mdadm regression suite] Reviewed-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Yuri Tikhonov <yur@emcraft.com> Signed-off-by: Ilya Yanok <yanok@emcraft.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2009-08-29 19:13:12 -07:00
Dan Williams	4e7d2c0aef	md/raid5: factor out mark_uptodate from ops_complete_compute5 ops_complete_compute5 can be reused in the raid6 path if it is updated to generically handle a second target. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2009-08-29 19:13:11 -07:00
Dan Williams	cb3c82992f	async_tx: raid6 recovery self test Port drivers/md/raid6test/test.c to use the async raid6 recovery routines. This is meant as a unit test for raid6 acceleration drivers. In addition to the 16-drive test case this implements tests for the 4-disk and 5-disk special cases (dma devices can not generically handle less than 2 sources), and adds a test for the D+Q case. Reviewed-by: Andre Noll <maan@systemlinux.org> Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2009-08-29 19:09:28 -07:00
Dan Williams	ad283ea4a3	async_tx: add sum check flags Replace the flat zero_sum_result with a collection of flags to contain the P (xor) zero-sum result, and the soon to be utilized Q (raid6 reed solomon syndrome) zero-sum result. Use the SUM_CHECK_ namespace instead of DMA_ since these flags will be used on non-dma-zero-sum enabled platforms. Reviewed-by: Andre Noll <maan@systemlinux.org> Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2009-08-29 19:09:26 -07:00
Dan Williams	d6f38f31f3	md/raid5,6: add percpu scribble region for buffer lists Use percpu memory rather than stack for storing the buffer lists used in parity calculations. Include space for dma address conversions and pass that to async_tx via the async_submit_ctl.scribble pointer. [ Impact: move memory pressure from stack to heap ] Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2009-08-29 19:09:26 -07:00
Dan Williams	36d1c6476b	md/raid6: move the spare page to a percpu allocation In preparation for asynchronous handling of raid6 operations move the spare page to a percpu allocation to allow multiple simultaneous synchronous raid6 recovery operations. Make this allocation cpu hotplug aware to maximize allocation efficiency. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2009-08-29 19:09:26 -07:00
Chandra Seetharaman	2bfd2e1337	[SCSI] scsi_dh: Use scsi_dh_set_params() in multipath. Use scsi_dh_set_params() set parameters provided. Save the parameters in parse_hw_handler() and use it in parse_path(). Reported-by: Eddie Williams <Eddie.Williams@steeleye.com> Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Tested-by: Eddie Williams <Eddie.Williams@steeleye.com> Cc: Alasdair G Kergon <agk@redhat.com> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: James Bottomley <James.Bottomley@suse.de>	2009-08-22 17:52:15 -05:00
Linus Torvalds	435a71d9ef	Merge branch 'for-linus' of git://neil.brown.name/md * 'for-linus' of git://neil.brown.name/md: Fix new incorrect error return from do_md_stop.	2009-08-18 13:54:08 -07:00
NeilBrown	80ffb3ccea	Fix new incorrect error return from do_md_stop. Recent commit `c8c00a6915` changed the exit paths in do_md_stop and was not quite careful enough. There is one path were 'err' now needs to be cleared but it isn't. So setting an array to readonly (with mdadm --readonly) will work, but will incorrectly report and error: ENXIO. Signed-off-by: NeilBrown <neilb@suse.de>	2009-08-18 10:35:26 +10:00
Randy Dunlap	894ef820b1	dm-log-userspace: fix printk format warning drivers/md/dm-log-userspace-transfer.c:110: warning: format '%lu' expects type 'long unsigned int', but argument 4 has type 'size_t' Previously posted and acked, but apparently lost. http://lkml.indiana.edu/hypermail/linux/kernel/0906.2/02074.html Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: dm-devel@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-08-16 08:35:58 -07:00
NeilBrown	4d484a4a7a	md: allow upper limit for resync/reshape to be set when array is read-only Normally we only allow the upper limit for a reshape to be decreased when the array not performing a sync/recovery/reshape, otherwise there could be races. But if an array is part-way through a reshape when it is assembled the reshape is started immediately leaving no window to set an upper bound. If the array is started read-only, the reshape will be suspended until the array becomes writable, so that provides a window during which it is perfectly safe to reduce the upper limit of a reshape. So: allow the upper limit (sync_max) to be reduced even if the reshape thread is running, as long as the array is still read-only. Signed-off-by: NeilBrown <neilb@suse.de>	2009-08-13 10:41:50 +10:00
NeilBrown	1a67dde0ab	md/raid5: Properly remove excess drives after shrinking a raid5/6 We were removing the drives, from the array, but not removing symlinks from /sys/.... and not marking the device as having been removed. Signed-off-by: NeilBrown <neilb@suse.de>	2009-08-13 10:41:49 +10:00
NeilBrown	a639755cf8	md/raid5: make sure a reshape restarts at the correct address. This "if" don't allow for the possibility that the number of devices doesn't change, and so sector_nr isn't set correctly in that case. So change '>' to '>='. Signed-off-by: NeilBrown <neilb@suse.de>	2009-08-13 10:13:00 +10:00
NeilBrown	67ac6011db	md/raid5: allow new reshape modes to be restarted in the middle. md/raid5 doesn't allow a reshape to restart if it involves writing over the same part of disk that it would be reading from. This happens at the beginning of a reshape that increases the number of devices, at the end of a reshape that decreases the number of devices, and continuously for a reshape that does not change the number of devices. The current code is correct for the "increase number of devices" case as the critical section at the start is handled by userspace performing a backup. It does not work for reducing the number of devices, or the no-change case. For 'reducing', we need to invert the test. For no-change we cannot really be sure things will be safe, so simply require the array to be read-only, which is how the user-space code which carefully starts such arrays works. Signed-off-by: NeilBrown <neilb@suse.de>	2009-08-13 10:06:24 +10:00
NeilBrown	51d5668cb2	md: never advance 'events' counter by more than 1. When assembling arrays, md allows two devices to have different event counts as long as the difference is only '1'. This is to cope with a system failure between updating the metadata on two difference devices. However there are currently times when we update the event count by 2. This was done to keep the event count even when the array is clean and odd when it is dirty, which allows us to avoid writing common update to spare devices and so allow those spares to go to sleep. This is bad for the above reason. So change it to never increase by two. This means that the alignment between 'odd/even' and 'clean/dirty' might take a little longer to attain, but that is only a small cost. The spares will get a few more updates but that will still be spared (;-) most updates and can still go to sleep. Prior to this patch there was a small chance that after a crash an array would fail to assemble due to the overly large event count mismatch. Signed-off-by: NeilBrown <neilb@suse.de>	2009-08-13 09:54:02 +10:00
NeilBrown	c8c00a6915	Remove deadlock potential in md_open A recent commit: commit `449aad3e25` introduced the possibility of an A-B/B-A deadlock between bd_mutex and reconfig_mutex. __blkdev_get holds bd_mutex while calling md_open which takes reconfig_mutex, do_md_run is always called with reconfig_mutex held, and it now takes bd_mutex in the call the revalidate_disk. This potential deadlock was not caught by lockdep due to the use of mutex_lock_interruptible_nexted which was introduced by commit `d63a5a74de` do avoid a warning of an impossible deadlock. It is quite possible to split reconfig_mutex in to two locks. One protects the array data structures while it is being reconfigured, the other ensures that an array is never even partially open while it is being deactivated. In particular, the second lock prevents an open from completing between the time when do_md_stop checks if there are any active opens, and the time when the array is either set read-only, or when ->pers is set to NULL. So we can be certain that no IO is in flight as the array is being destroyed. So create a new lock, open_mutex, just to ensure exclusion between 'open' and 'stop'. This avoids the deadlock and also avoids the lockdep warning mentioned in commit `d63a5a74d` Reported-by: "Mike Snitzer" <snitzer@gmail.com> Reported-by: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: NeilBrown <neilb@suse.de>	2009-08-10 12:50:52 +10:00
NeilBrown	449aad3e25	md: Use revalidate_disk to effect changes in size of device. As revalidate_disk calls check_disk_size_change, it will cause any capacity change of a gendisk to be propagated to the blockdev inode. So use that instead of mucking about with locks and i_size_write. Also add a call to revalidate_disk in do_md_run and a few other places where the gendisk capacity is changed. Signed-off-by: NeilBrown <neilb@suse.de>	2009-08-03 10:59:58 +10:00
NeilBrown	64bd660b51	md: allow raid5_quiesce to work properly when reshape is happening. The ->quiesce method is not supposed to stop resync/recovery/reshape, just normal IO. But in raid5 we don't have a way to know which stripes are being used for normal IO and which for resync etc, so we need to wait for all stripes to be idle to be sure that all writes have completed. However reshape keeps at least some stripe busy for an extended period of time, so a call to raid5_quiesce can block for several seconds needlessly. So arrange for reshape etc to pause briefly while raid5_quiesce is trying to quiesce the array so that the active_stripes count can drop to zero. Signed-off-by: NeilBrown <neilb@suse.de>	2009-08-03 10:59:58 +10:00
NeilBrown	e516402c0d	md/raid5: set reshape_position correctly when reshape starts. As the internal reshape_progress counter is the main driver for reshape, the fact that reshape_position sometimes starts with the wrong value has minimal effect. It is visible in sysfs and that is all. Signed-off-by: NeilBrown <neilb@suse.de>	2009-08-03 10:59:57 +10:00
NeilBrown	70471dafe3	md: Handle growth of v1.x metadata correctly. The v1.x metadata does not have a fixed size and can grow when devices are added. If it grows enough to require an extra sector of storage, we need to update the 'sb_size' to match. Without this, md can write out an incomplete superblock with a bad checksum, which will be rejected when trying to re-assemble the array. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2009-08-03 10:59:57 +10:00
NeilBrown	3673f305fa	md: avoid array overflow with bad v1.x metadata We trust the 'desc_nr' field in v1.x metadata enough to use it as an index in an array. This isn't really safe. So range-check the value first. Signed-off-by: NeilBrown <neilb@suse.de>	2009-08-03 10:59:56 +10:00
NeilBrown	3a981b03f3	md: when a level change reduces the number of devices, remove the excess. When an array is changed from RAID6 to RAID5, fewer drives are needed. So any device that is made superfluous by the level conversion must be marked as not-active. For the RAID6->RAID5 conversion, this will be a drive which only has 'Q' blocks on it. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2009-08-03 10:59:55 +10:00
Andre Noll	ac5e7113e7	md: Push down data integrity code to personalities. This patch replaces md_integrity_check() by two new public functions: md_integrity_register() and md_integrity_add_rdev() which are both personality-independent. md_integrity_register() is called from the ->run and ->hot_remove methods of all personalities that support data integrity. The function iterates over the component devices of the array and determines if all active devices are integrity capable and if their profiles match. If this is the case, the common profile is registered for the mddev via blk_integrity_register(). The second new function, md_integrity_add_rdev() is called from the ->hot_add_disk methods, i.e. whenever a new device is being added to a raid array. If the new device does not support data integrity, or has a profile different from the one already registered, data integrity for the mddev is disabled. For raid0 and linear, only the call to md_integrity_register() from the ->run method is necessary. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-08-03 10:59:47 +10:00
Dan Williams	95fc17aac4	md/raid6: release spare page at ->stop() Add missing call to safe_put_page from stop() by unifying open coded raid5_conf_t de-allocation under free_conf(). Cc: <stable@kernel.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2009-07-31 12:39:15 +10:00
Mike Snitzer	5dea271b6d	dm table: pass correct dev area size to device_area_is_valid Incorrect device area lengths are being passed to device_area_is_valid(). The regression appeared in 2.6.31-rc1 through commit `754c5fc7eb`. With the dm-stripe target, the size of the target (ti->len) was used instead of the stripe_width (ti->len/#stripes). An example of a consequent incorrect error message is: device-mapper: table: 254:0: sdb too small for target Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-07-23 20:30:42 +01:00
Mike Snitzer	a732c207d1	dm: remove queue next_ordered workaround for barriers This patch removes DM's bio-based vs request-based conditional setting of next_ordered. For bio-based DM the next_ordered check is no longer a concern (as that check is now in the __make_request path). For request-based DM the default of QUEUE_ORDERED_NONE is now appropriate. bio-based DM was changed to work-around the previously misplaced next_ordered check with this commit: `99360b4c18` request-based DM does not yet support barriers but reacted to the above bio-based DM change with this commit: `5d67aa2366` The above changes are no longer needed given Neil Brown's recent fix to put the next_ordered check in the __make_request path: `db64f680ba` Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: NeilBrown <neilb@suse.de> Acked-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Acked-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-07-23 20:30:40 +01:00
Mikulas Patocka	69885683d2	dm raid1: wake kmirrord when requeueing delayed bios after remote recovery The recent commit `7513c2a761` (dm raid1: add is_remote_recovering hook for clusters) changed do_writes() to update the ms->writes list but forgot to wake up kmirrord to process it. The rule is that when anything is being added on ms->reads, ms->writes or ms->failures and the list was empty before we must call wakeup_mirrord (for immediate processing) or delayed_wake (for delayed processing). Otherwise the bios could sit on the list indefinitely. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> CC: stable@kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-07-23 20:30:37 +01:00
Dan Williams	a11034b428	md/raid6: release spare page at ->stop() Add missing call to safe_put_page from stop() by unifying open coded raid5_conf_t de-allocation under free_conf(). Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2009-07-14 11:48:16 -07:00
Jens Axboe	8aa7e847d8	Fix congestion_wait() sync/async vs read/write confusion Commit `1faa16d228` accidentally broke the bdi congestion wait queue logic, causing us to wait on congestion for WRITE (== 1) when we really wanted BLK_RW_ASYNC (== 0) instead. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-07-10 20:31:53 +02:00
Joe Perches	ad361c9884	Remove multiple KERN_ prefixes from printk formats Commit `5fd29d6ccb` ("printk: clean up handling of log-levels and newlines") changed printk semantics. printk lines with multiple KERN_<level> prefixes are no longer emitted as before the patch. <level> is now included in the output on each additional use. Remove all uses of multiple KERN_<level>s in formats. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-07-08 10:30:03 -07:00
Linus Torvalds	2027bd9f92	Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block * 'for-linus' of git://git.kernel.dk/linux-2.6-block: cfq-iosched: remove redundant check for NULL cfqq in cfq_set_request() blocK: Restore barrier support for md and probably other virtual devices. block: get rid of queue-private command filter block: Create bip slabs with embedded integrity vectors cfq-iosched: get rid of the need for __GFP_NOFAIL in cfq_find_alloc_queue() cfq-iosched: move cfqq initialization out of cfq_find_alloc_queue() Trivial typo fixes in Documentation/block/data-integrity.txt.	2009-07-01 10:41:09 -07:00
Linus Torvalds	544ae5f96e	Merge branch 'for-linus' of git://neil.brown.name/md * 'for-linus' of git://neil.brown.name/md: md: use interruptible wait when duration is controlled by userspace. md/raid5: suspend shouldn't affect read requests. md: tidy up error paths in md_alloc md: fix error path when duplicate name is found on md device creation. md: avoid dereferencing NULL pointer when accessing suspend_* sysfs attributes. md: Use new topology calls to indicate alignment and I/O sizes	2009-07-01 10:31:26 -07:00
Martin K. Petersen	7878cba9f0	block: Create bip slabs with embedded integrity vectors This patch restores stacking ability to the block layer integrity infrastructure by creating a set of dedicated bip slabs. Each bip slab has an embedded bio_vec array at the end. This cuts down on memory allocations and also simplifies the code compared to the original bvec version. Only the largest bip slab is backed by a mempool. The pool is contained in the bio_set so stacking drivers can ensure forward progress. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@carl.(none)>	2009-07-01 10:56:25 +02:00
NeilBrown	e62e58a5ff	md: use interruptible wait when duration is controlled by userspace. User space can set various limits on an md array so that resync waits when it gets to a certain point, or so that I/O is blocked for a short while. When md is waiting against one of these limit, it should use an interruptible wait so as not to add to the load average, and so are not to trigger a warning if the wait goes on for too long. Signed-off-by: NeilBrown <neilb@suse.de>	2009-07-01 13:15:35 +10:00
NeilBrown	a5c308d4d1	md/raid5: suspend shouldn't affect read requests. md allows write to regions on an array to be suspended temporarily. This allows user-space to participate is aspects of reshape. In particular, data can be copied with not risk of a race. We should not be blocking read requests though, so don't. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2009-07-01 13:15:35 +10:00
NeilBrown	0909dc448c	md: tidy up error paths in md_alloc As the recent bug in md_alloc showed, having a single exit path for unlocking and putting is a good idea. So restructure md_alloc to have a single mutex_unlock and mddev_put, and use gotos where necessary. Found-by: Jiri Slaby <jirislaby@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2009-07-01 12:27:21 +10:00
NeilBrown	1ec22eb2b4	md: fix error path when duplicate name is found on md device creation. When an md device is created by name (rather than number) we need to check that the name is not already in use. If this check finds a duplicate, we return an error without dropping the lock or freeing the newly create mddev. This patch fixes that. Cc: stable@kernel.org Found-by: Jiri Slaby <jirislaby@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2009-07-01 12:27:21 +10:00
NeilBrown	b8d966efd9	md: avoid dereferencing NULL pointer when accessing suspend_* sysfs attributes. If we try to modify one of the md/ sysfs files suspend_lo or suspend_hi when the array is not active, we dereference a NULL. Protect against that. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2009-07-01 11:14:04 +10:00
Martin K. Petersen	8f6c2e4b32	md: Use new topology calls to indicate alignment and I/O sizes Switch MD over to the new disk_stack_limits() function which checks for aligment and adjusts preferred I/O sizes when stacking. Also indicate preferred I/O sizes where applicable. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2009-07-01 11:13:45 +10:00
Mike Snitzer	ea9df47cc9	dm table: fix blk_stack_limits arg to use bytes not sectors The offset passed to blk_stack_limits() must be in bytes not sectors. Fixes false warnings like the following: device-mapper: table: 254:1: target device sda6 is misaligned Signed-off-by: Mike Snitzer <snitzer@redhat.com> Reported-by: Frans Pop <elendil@planet.nl> Tested-by: Frans Pop <elendil@planet.nl> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-30 15:18:17 +01:00
Milan Broz	874d2f61d3	dm exception store: really fix type lookup Fix exception store name handling. We need to reference exception store by zero terminated string. Fixes regression introduced in commit `f6bd4eb73c` Cc: Yi Yang <yi.y.yang@intel.com> Cc: Jonathan Brassow <jbrassow@redhat.com> Cc: stable@kernel.org Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-30 15:18:14 +01:00
Kiyoshi Ueda	f40c67f0f7	dm mpath: change to be request based This patch converts dm-multipath target to request-based from bio-based. Basically, the patch just converts the I/O unit from struct bio to struct request. In the course of the conversion, it also changes the I/O queueing mechanism. The change in the I/O queueing is described in details as follows. I/O queueing mechanism change ----------------------------- In I/O submission, map_io(), there is no mechanism change from bio-based, since the clone request is ready for retry as it is. However, in I/O complition, do_end_io(), there is a mechanism change from bio-based, since the clone request is not ready for retry. In do_end_io() of bio-based, the clone bio has all needed memory for resubmission. So the target driver can queue it and resubmit it later without memory allocations. The mechanism has almost no overhead. On the other hand, in do_end_io() of request-based, the clone request doesn't have clone bios, so the target driver can't resubmit it as it is. To resubmit the clone request, memory allocation for clone bios is needed, and it takes some overheads. To avoid the overheads just for queueing, the target driver doesn't queue the clone request inside itself. Instead, the target driver asks dm core for queueing and remapping the original request of the clone request, since the overhead for queueing is just a freeing memory for the clone request. As a result, the target driver doesn't need to record/restore the information of the original request for resubmitting the clone request. So dm_bio_details in dm_mpath_io is removed. multipath_busy() --------------------- The target driver returns "busy", only when the following case: o The target driver will map I/Os, if map() function is called and o The mapped I/Os will wait on underlying device's queue due to their congestions, if map() function is called now. In other cases, the target driver doesn't return "busy". Otherwise, dm core will keep the I/Os and the target driver can't do what it wants. (e.g. the target driver can't map I/Os now, so wants to kill I/Os.) Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Acked-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:37 +01:00
Kiyoshi Ueda	523d9297d4	dm: disable interrupt when taking map_lock This patch disables interrupt when taking map_lock to avoid lockdep warnings in request-based dm. request-based dm takes map_lock after taking queue_lock with disabling interrupt: spin_lock_irqsave(queue_lock) q->request_fn() == dm_request_fn() => dm_get_table() => read_lock(map_lock) while queue_lock could be (but isn't) taken in interrupt context. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Acked-by: Christof Schmitt <christof.schmitt@de.ibm.com> Acked-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:37 +01:00
Kiyoshi Ueda	5d67aa2366	dm: do not set QUEUE_ORDERED_DRAIN if request based Request-based dm doesn't have barrier support yet. So we need to set QUEUE_ORDERED_DRAIN only for bio-based dm. Since the device type is decided at the first table loading time, the flag set is deferred until then. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Acked-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:36 +01:00
Kiyoshi Ueda	e6ee8c0b76	dm: enable request based option This patch enables request-based dm. o Request-based dm and bio-based dm coexist, since there are some target drivers which are more fitting to bio-based dm. Also, there are other bio-based devices in the kernel (e.g. md, loop). Since bio-based device can't receive struct request, there are some limitations on device stacking between bio-based and request-based. type of underlying device bio-based request-based ---------------------------------------------- bio-based OK OK request-based -- OK The device type is recognized by the queue flag in the kernel, so dm follows that. o The type of a dm device is decided at the first table binding time. Once the type of a dm device is decided, the type can't be changed. o Mempool allocations are deferred to at the table loading time, since mempools for request-based dm are different from those for bio-based dm and needed mempool type is fixed by the type of table. o Currently, request-based dm supports only tables that have a single target. To support multiple targets, we need to support request splitting or prevent bio/request from spanning multiple targets. The former needs lots of changes in the block layer, and the latter needs that all target drivers support merge() function. Both will take a time. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:36 +01:00
Kiyoshi Ueda	cec47e3d4a	dm: prepare for request based option This patch adds core functions for request-based dm. When struct mapped device (md) is initialized, md->queue has an I/O scheduler and the following functions are used for request-based dm as the queue functions: make_request_fn: dm_make_request() pref_fn: dm_prep_fn() request_fn: dm_request_fn() softirq_done_fn: dm_softirq_done() lld_busy_fn: dm_lld_busy() Actual initializations are done in another patch (PATCH 2). Below is a brief summary of how request-based dm behaves, including: - making request from bio - cloning, mapping and dispatching request - completing request and bio - suspending md - resuming md bio to request ============== md->queue->make_request_fn() (dm_make_request()) calls __make_request() for a bio submitted to the md. Then, the bio is kept in the queue as a new request or merged into another request in the queue if possible. Cloning and Mapping =================== Cloning and mapping are done in md->queue->request_fn() (dm_request_fn()), when requests are dispatched after they are sorted by the I/O scheduler. dm_request_fn() checks busy state of underlying devices using target's busy() function and stops dispatching requests to keep them on the dm device's queue if busy. It helps better I/O merging, since no merge is done for a request once it is dispatched to underlying devices. Actual cloning and mapping are done in dm_prep_fn() and map_request() called from dm_request_fn(). dm_prep_fn() clones not only request but also bios of the request so that dm can hold bio completion in error cases and prevent the bio submitter from noticing the error. (See the "Completion" section below for details.) After the cloning, the clone is mapped by target's map_rq() function and inserted to underlying device's queue using blk_insert_cloned_request(). Completion ========== Request completion can be hooked by rq->end_io(), but then, all bios in the request will have been completed even error cases, and the bio submitter will have noticed the error. To prevent the bio completion in error cases, request-based dm clones both bio and request and hooks both bio->bi_end_io() and rq->end_io(): bio->bi_end_io(): end_clone_bio() rq->end_io(): end_clone_request() Summary of the request completion flow is below: blk_end_request() for a clone request => blk_update_request() => bio->bi_end_io() == end_clone_bio() for each clone bio => Free the clone bio => Success: Complete the original bio (blk_update_request()) Error: Don't complete the original bio => blk_finish_request() => rq->end_io() == end_clone_request() => blk_complete_request() => dm_softirq_done() => Free the clone request => Success: Complete the original request (blk_end_request()) Error: Requeue the original request end_clone_bio() completes the original request on the size of the original bio in successful cases. Even if all bios in the original request are completed by that completion, the original request must not be completed yet to keep the ordering of request completion for the stacking. So end_clone_bio() uses blk_update_request() instead of blk_end_request(). In error cases, end_clone_bio() doesn't complete the original bio. It just frees the cloned bio and gives over the error handling to end_clone_request(). end_clone_request(), which is called with queue lock held, completes the clone request and the original request in a softirq context (dm_softirq_done()), which has no queue lock, to avoid a deadlock issue on submission of another request during the completion: - The submitted request may be mapped to the same device - Request submission requires queue lock, but the queue lock has been held by itself and it doesn't know that The clone request has no clone bio when dm_softirq_done() is called. So target drivers can't resubmit it again even error cases. Instead, they can ask dm core for requeueing and remapping the original request in that cases. suspend ======= Request-based dm uses stopping md->queue as suspend of the md. For noflush suspend, just stops md->queue. For flush suspend, inserts a marker request to the tail of md->queue. And dispatches all requests in md->queue until the marker comes to the front of md->queue. Then, stops dispatching request and waits for the all dispatched requests to complete. After that, completes the marker request, stops md->queue and wake up the waiter on the suspend queue, md->wait. resume ====== Starts md->queue. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:35 +01:00
Jonthan Brassow	f5db4af466	dm raid1: add userspace log This patch contains a device-mapper mirror log module that forwards requests to userspace for processing. The structures used for communication between kernel and userspace are located in include/linux/dm-log-userspace.h. Due to the frequency, diversity, and 2-way communication nature of the exchanges between kernel and userspace, 'connector' was chosen as the interface for communication. The first log implementations written in userspace - "clustered-disk" and "clustered-core" - support clustered shared storage. A userspace daemon (in the LVM2 source code repository) uses openAIS/corosync to process requests in an ordered fashion with the rest of the nodes in the cluster so as to prevent log state corruption. Other implementations with no association to LVM or openAIS/corosync, are certainly possible. (Imagine if two machines are writing to the same region of a mirror. They would both mark the region dirty, but you need a cluster-aware entity that can handle properly marking the region clean when they are done. Otherwise, you might clear the region when the first machine is done, not the second.) Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Cc: Evgeniy Polyakov <johnpol@2ka.mipt.ru> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:35 +01:00
Mike Snitzer	754c5fc7eb	dm: calculate queue limits during resume not load Currently, device-mapper maintains a separate instance of 'struct queue_limits' for each table of each device. When the configuration of a device is to be changed, first its table is loaded and this structure is populated, then the device is 'resumed' and the calculated queue_limits are applied. This places restrictions on how userspace may process related devices, where it is often advantageous to 'load' tables for several devices at once before 'resuming' them together. As the new queue_limits only take effect after the 'resume', if they are changing and one device uses another, the latter must be 'resumed' before the former may be 'loaded'. This patch moves the calculation of these queue_limits out of the 'load' operation into 'resume'. Since we are no longer pre-calculating this struct, we no longer need to maintain copies within our dm structs. dm_set_device_limits() now passes the 'start' of the device's data area (aka pe_start) as the 'offset' to blk_stack_limits(). init_valid_queue_limits() is replaced by blk_set_default_limits(). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: martin.petersen@oracle.com Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:34 +01:00
Mike Snitzer	18d8594dd9	dm log: fix create_log_context to use logical_block_size of log device create_log_context() must use the logical_block_size from the log disk, where the I/O happens, not the target's logical_block_size. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:33 +01:00
Mike Snitzer	af4874e03e	dm target:s introduce iterate devices fn Add .iterate_devices to 'struct target_type' to allow a function to be called for all devices in a DM target. Implemented it for all targets except those in dm-snap.c (origin and snapshot). (The raid1 version number jumps to 1.12 because we originally reserved 1.1 to 1.11 for 'block_on_error' but ended up using 'handle_errors' instead.) Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: martin.petersen@oracle.com	2009-06-22 10:12:33 +01:00
Mike Snitzer	1197764e40	dm table: establish queue limits by copying table limits Copy the table's queue_limits to the DM device's request_queue. This properly initializes the queue's topology limits and also avoids having to track the evolution of 'struct queue_limits' in dm_table_set_restrictions() Also fixes a bug that was introduced in dm_table_set_restrictions() via commit `ae03bf639a`. In addition to establishing 'bounce_pfn' in the queue's limits blk_queue_bounce_limit() also performs an allocation to setup the ISA DMA pool. This allocation resulted in "sleeping function called from invalid context" when called from dm_table_set_restrictions(). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:32 +01:00
Mike Snitzer	5ab97588fb	dm table: replace struct io_restrictions with struct queue_limits Use blk_stack_limits() to stack block limits (including topology) rather than duplicate the equivalent within Device Mapper. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:32 +01:00
Mike Snitzer	be6d4305db	dm table: validate device logical_block_size Impose necessary and sufficient conditions on a devices's table such that any incoming bio which respects its logical_block_size can be processed successfully. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:31 +01:00
Mike Snitzer	02acc3a4fa	dm table: ensure targets are aligned to logical_block_size Ensure I/O is aligned to the logical block size of target devices. Rename check_device_area() to device_area_is_valid() for clarity and establish the device limits including the logical block size prior to calling it. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:30 +01:00
Milan Broz	60935eb21d	dm ioctl: support cookies for udev Add support for passing a 32 bit "cookie" into the kernel with the DM_SUSPEND, DM_DEV_RENAME and DM_DEV_REMOVE ioctls. The (unsigned) value of this cookie is returned to userspace alongside the uevents issued by these ioctls in the variable DM_COOKIE. This means the userspace process issuing these ioctls can be notified by udev after udev has completed any actions triggered. To minimise the interface extension, we pass the cookie into the kernel in the event_nr field which is otherwise unused when calling these ioctls. Incrementing the version number allows userspace to determine in advance whether or not the kernel supports the cookie. If the kernel does support this but userspace does not, there should be no impact as the new variable will just get ignored. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:30 +01:00
Peter Rajnoha	486d220fe4	dm: sysfs add suspended attribute Add a file named 'suspended' to each device-mapper device directory in sysfs. It holds the value 1 while the device is suspended. Otherwise it holds 0. Signed-off-by: Peter Rajnoha <prajnoha@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:29 +01:00
Jonthan Brassow	1b6da75459	dm table: improve warning message when devices not freed before destruction Report any devices forgotten to be freed before a table is destroyed. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:29 +01:00
Kiyoshi Ueda	f392ba889b	dm mpath: add service time load balancer This patch adds a service time oriented dynamic load balancer, dm-service-time, which selects the path with the shortest estimated service time for the incoming I/O. The service time is estimated by dividing the in-flight I/O size by a performance value of each path. The performance value can be given as a table argument at the table loading time. If no performance value is given, all paths are considered equal. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:28 +01:00
Kiyoshi Ueda	fd5e033908	dm mpath: add queue length load balancer This patch adds a dynamic load balancer, dm-queue-length, which balances the number of in-flight I/Os across the paths. The code is based on the patch posted by Stefan Bader: https://www.redhat.com/archives/dm-devel/2005-October/msg00050.html Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:27 +01:00
Kiyoshi Ueda	02ab823fd1	dm mpath: add start_io and nr_bytes to path selectors This patch makes two additions to the dm path selector interface for dynamic load balancers: o a new hook, start_io() o a new parameter 'nr_bytes' to select_path()/start_io()/end_io() to pass the size of the I/O start_io() is called when a target driver actually submits I/O to the selected path. Path selectors can use it to start accounting of the I/O. (e.g. counting the number of in-flight I/Os.) The start_io hook is based on the patch posted by Stefan Bader: https://www.redhat.com/archives/dm-devel/2005-October/msg00050.html nr_bytes, the size of the I/O, is so path selectors can take the size of the I/O into account when deciding which path to use. dm-service-time uses it to estimate service time, for example. (Added the nr_bytes member to dm_mpath_io instead of using existing details.bi_size, since request-based dm patch deletes it.) Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:27 +01:00
Mikulas Patocka	2bd0234525	dm snapshot: use barrier when writing exception store Send barrier requests when updating the exception area. Exception area updates need to be ordered w.r.t. data writes, so that the writes are not reordered in hardware disk cache. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:26 +01:00
Mikulas Patocka	51aa322849	dm io: retry after barrier error If -EOPNOTSUPP was returned and the request was a barrier request, retry it without barrier. Retry all regions for now. Barriers are submitted only for one-region requests, so it doesn't matter. (In the future, retries can be limited to the actual regions that failed.) Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:26 +01:00
Mikulas Patocka	5af443a7e1	dm io: record eopnotsupp Add another field, eopnotsupp_bits. It is subset of error_bits, representing regions that returned -EOPNOTSUPP. (The bit is set in both error_bits and eopnotsupp_bits). This value will be used in further patches. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:25 +01:00
Mikulas Patocka	494b3ee7d4	dm snapshot: support barriers Flush support for dm-snapshot target. This patch just forwards the flush request to either the origin or the snapshot device. (It doesn't flush exception store metadata.) Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:25 +01:00
Mikulas Patocka	8627921fa2	dm mpath: support barriers Flush support for dm-multipath target. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:24 +01:00
Mikulas Patocka	c927259e34	dm delay: support barriers Flush support for dm-delay target. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:23 +01:00
Mikulas Patocka	647c7db14e	dm crypt: support flush Flush support for dm-crypt target. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:23 +01:00
Mikulas Patocka	374bf7e7f6	dm: stripe support flush Flush support for the stripe target. This sets ti->num_flush_requests to the number of stripes and remaps individual flush requests to the appropriate stripe devices. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:22 +01:00
Mikulas Patocka	433bcac564	dm: linear support flush Flush support for the linear target. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:22 +01:00
Mikulas Patocka	52b1fd5a27	dm: send empty barriers to targets in dm_flush Pass empty barrier flushes to the targets in dm_flush(). Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:21 +01:00
Alasdair G Kergon	9015df24a8	dm: initialise tio in alloc_tio Move repeated dm_target_io initialisation inside alloc_tio(). Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:21 +01:00
Mikulas Patocka	f9ab94cee3	dm: introduce num_flush_requests Introduce num_flush_requests for a target to set to say how many flush instructions (empty barriers) it wants to receive. These are sent by __clone_and_map_empty_barrier with map_info->flush_request going from 0 to (num_flush_requests - 1). Old targets without flush support won't receive any flush requests. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:20 +01:00
Mikulas Patocka	27eaa14975	dm: remove check that prevents mapping empty bios Remove the check that the size of the cloned bio is not zero because a subsequent patch needs to send zero-sized barriers down this path. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:20 +01:00
Mikulas Patocka	fdb9572b73	dm: remove EOPNOTSUPP for barriers If the underlying device doesn't support barriers and dm receives a barrier, it waits until all requests on that device drain so it no longer needs to report -EOPNOTSUPP to the caller. This patch deals with the confusing situation when moving a volume from one physical device to another triggers an EOPNOTSUPP on a volume that didn't report it before. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:19 +01:00
Mikulas Patocka	5aa2781d96	dm: store only first barrier error With the following patches, more than one error can occur during processing. Change md->barrier_error so that only the first one is recorded and returned to the caller. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:18 +01:00
Mikulas Patocka	2761e95fe4	dm: process requeue in dm_wq_work If barrier request was returned with DM_ENDIO_REQUEUE, requeue it in dm_wq_work instead of dec_pending. This allows us to correctly handle a situation when some targets are asking for a requeue and other targets signal an error. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:18 +01:00
Mikulas Patocka	531fe96364	dm: make dm_flush return void Make dm_flush return void. The first error during flush is stored in md->barrier_error instead. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:17 +01:00
Mikulas Patocka	32a926da5a	dm: always hold bdev reference Fix a potential deadlock when creating multiple snapshots by holding a reference to struct block_device for the whole lifecycle of every dm device instead of obtaining it independently at each point it is needed. bdget_disk() was called while the device was being suspended, in dm_suspend(). However there could be other devices already suspended, for example when creating additional snapshots of a device. bdget_disk() can wait for IO and allocate memory resulting in waiting for the already-suspended device - deadlock. This patch changes the code so that it gets the reference to struct block_device when struct mapped_device is allocated and initialized in alloc_dev() where it is always OK to allocate memory or wait for I/O. It drops the reference when it is destroyed in free_dev(). Thus there is no call to bdget_disk() while any device is suspended. Previously unlock_fs() was called only if bdev was held. Now it is called unconditionally, but the superfluous calls are harmless because it returns immediately if the filesystem was not previously frozen. This patch also now allows the device size to be changed in a noflush suspend because the bdev is held. This has no adverse effect. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:17 +01:00
Mikulas Patocka	db8fef4fab	dm: rename suspended_bdev to bdev Rename suspended_bdev to bdev. This patch doesn't change any functionality, just renames the variable. In the next patch, the variable will be used even for non-suspended device. (Pre-requisite for the per-target barrier support patches.) Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:15 +01:00
Jonathan Brassow	f6bd4eb73c	dm exception store: fix exstore lookup to be case insensitive When snapshots are created using 'p' instead of 'P' as the exception store type, the device-mapper table loading fails. This patch makes the code case insensitive as intended and fixes some regressions reported with device-mapper snapshots. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Cc: stable@kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:15 +01:00
Mikulas Patocka	5657e8fa45	dm: use i_size_read Use i_size_read() instead of reading i_size. If someone changes the size of the device simultaneously, i_size_read is guaranteed to return a valid value (either the old one or the new one). i_size can return some intermediate invalid value (on 32-bit computers with 64-bit i_size, the reads to both halves of i_size can be interleaved with updates to i_size, resulting in garbage being returned). Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:14 +01:00
Mikulas Patocka	8cbeb67ad5	dm: avoid unsupported spanning of md stripe boundaries A bio that has two or more vector entries, size less than or equal to page size, that crosses a stripe boundary of an underlying md device is accepted by device mapper (it conforms to all its limits) but not by the underlying device. The fix is: If device mapper selects the one-page maximum request size, it also needs to set its own q->merge_bvec_fn to reject any bios with multiple vector entries that span more pages. The problem was discovered in the following scenario: * MD - RAID-0 * LV on the top of it (raid1, snapshot or striped with chunk size/stripe larger than RAID-0 stripe) * one of the logical volumes is exported to xen domU * inside xen domU it is partitioned, the key point is that the partition must be unaligned on page boundary (fdisk normally aligns the partition to 63 sectors which will trigger it) * install the system on the partitioned disk in domU This causes I/O failures in dom0. Reference: https://bugzilla.redhat.com/show_bug.cgi?id=223947 Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:14 +01:00
Mikulas Patocka	53b351f972	dm mpath: flush keventd queue in destructor The commit `fe9cf30eb8` moves dm table event submission from kmultipath queue to kernel kevent queue to avoid a deadlock. There is a possibility of race condition because kevent queue is not flushed in the multipath destructor. The scenario is: - some event happens and is queued to keventd - keventd thread is delayed due to scheuling latency or some other work - multipath device is destroyed - keventd now attempts to process work_struct that is residing in already released memory. The patch flushes the keventd queue in multipath constructor. I've already fixed similar bug in dm-raid1. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: stable@kernel.org	2009-06-22 10:12:13 +01:00
Mikulas Patocka	a72986c562	dm raid1: keep retrying alloc if mempool_alloc failed If the code can't handle allocation failures, use __GFP_NOFAIL so that in case of memory pressure the allocator will retry indefinitely and won't return NULL which would cause a crash in the function. This is still not a correct fix, it may cause a classic deadlock when memory manager waits for I/O being done and I/O waits for some free memory. I/O code shouldn't allocate any memory. But in this case it probably doesn't matter much in practice, people usually do not swap on RAID. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:13 +01:00
Chandra Seetharaman	e54f77ddda	dm mpath: call activate fn for each path in pg_init Fixed a problem affecting reinstatement of passive paths. Before we moved the hardware handler from dm to SCSI, it performed a pg_init for a path group and didn't maintain any state about each path in hardware handler code. But in SCSI dh, such state is now maintained, as we want to fail I/O early on a path if it is not the active path. All the hardware handlers have a state now and set to active or some form of inactive. They have prep_fn() which uses this state to fail the I/O without it ever being sent to the device. So in effect when dm-multipath calls scsi_dh_activate(), activate is sent to only one path and the "state" of that path is changed appropriately to "active" while other paths in the same path group are never changed as they never got an "activate". In order make sure all the paths in a path group gets their state set properly when a pg_init happens, we need to call scsi_dh_activate() on all paths in a path group. Doing this at the hardware handler layer is not a good option as we want the multipath layer to define the relationship between path and path groups and not the hardware handler. Attached patch sends an "activate" on each path in a path group when a path group is switched. It also sends an activate when a path is reinstated. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:12 +01:00
Hannes Reinecke	a0cf7ea954	dm mpath: change attached scsi_dh When specifying a different hardware handler via multipath features we should be able to override the built-in defaults. The problem here is the hardware table from scsi_dh is compiled in and cannot be changed from userland. The multipath.conf OTOH is purely user-defined and, what's more, the user might have a valid reason for modifying it. (EG EMC Clariion can well be run in PNR mode even though ALUA is active, or the user might want to try ALUA on any as-of-yet unknown devices) So _not_ allowing multipath to override the device handler setting will just add to the confusion and makes error tracking even more difficult. Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:11 +01:00
Milan Broz	4d89b7b4e4	dm: sysfs skip output when device is being destroyed Do not process sysfs attributes when device is being destroyed. Otherwise code can cause BUG_ON(test_bit(DMF_FREEING, &md->flags)); in dm_put() call. Cc: stable@kernel.org Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:11 +01:00
Mikulas Patocka	e094f4f15f	dm mpath: validate hw_handler argument count Fix arg count parsing error in hw handlers. Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:12:10 +01:00
Mikulas Patocka	0e0497c0c0	dm mpath: validate table argument count The parser reads the argument count as a number but doesn't check that sufficient arguments are supplied. This command triggers the bug: dmsetup create mpath --table "0 `blockdev --getsize /dev/mapper/cr0` multipath 0 0 2 1 round-robin 1000 0 1 1 /dev/mapper/cr0 round-robin 0 1 1 /dev/mapper/cr1 1000" kernel BUG at drivers/md/dm-mpath.c:530! Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-06-22 10:08:02 +01:00
Linus Torvalds	31583d6acf	Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block * 'for-linus' of git://git.kernel.dk/linux-2.6-block: Fix kernel-doc parameter name typo in blk-settings.c: block: rename CONFIG_LBD to CONFIG_LBDAF block: Fix bounce_pfn setting hd: stop defining MAJOR_NR	2009-06-19 17:43:04 -07:00
Bartlomiej Zolnierkiewicz	90c699a9ee	block: rename CONFIG_LBD to CONFIG_LBDAF Follow-up to "block: enable by default support for large devices and files on 32-bit archs". Rename CONFIG_LBD to CONFIG_LBDAF to: - allow update of existing [def]configs for "default y" change - reflect that it is used also for large files support nowadays Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-06-19 08:08:50 +02:00
Linus Torvalds	9729a6eb58	Merge branch 'for-linus' of git://neil.brown.name/md * 'for-linus' of git://neil.brown.name/md: (39 commits) md/raid5: correctly update sync_completed when we reach max_resync md/raid5: add missing call to schedule() after prepare_to_wait() md/linear: use call_rcu to free obsolete 'conf' structures. md linear: Protecting mddev with rcu locks to avoid races md: Move check for bitmap presence to personality code. md: remove chunksize rounding from common code. md: raid0/linear: ensure device sizes are rounded to chunk size. md: move assignment of ->utime so that it never gets skipped. md: Push down reconstruction log message to personality code. md: merge reconfig and check_reshape methods. md: remove unnecessary arguments from ->reconfig method. md: raid5: check stripe cache is large enough in start_reshape md: raid0: chunk_sectors cleanups. md: fix some comments. md/raid5: Use is_power_of_2() in raid5_reconfig()/raid6_reconfig(). md: convert conf->chunk_size and conf->prev_chunk to sectors. md: Convert mddev->new_chunk to sectors. md: Make mddev->chunk_size sector-based. md: raid0 :Enables chunk size other than powers of 2. md: prepare for non-power-of-two chunk sizes ...	2009-06-18 13:11:50 -07:00
NeilBrown	48606a9f2f	md/raid5: correctly update sync_completed when we reach max_resync At the end of reshape_request we update cyrr_resync_completed if we are about to pause due to reaching resync_max. However we update it to the wrong value. We need to add the "reshape_sectors" that have just been reshaped. Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-18 09:14:12 +10:00
Dan Williams	7a3ab90894	md/raid5: add missing call to schedule() after prepare_to_wait() In the unlikely event that reshape progresses past the current request while it is waiting for a stripe we need to schedule() before retrying for 2 reasons: 1/ Prevent list corruption from duplicated list_add() calls without intervening list_del(). 2/ Give the reshape code a chance to make some progress to resolve the conflict. Cc: <stable@kernel.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-18 08:50:18 +10:00
NeilBrown	495d357301	md/linear: use call_rcu to free obsolete 'conf' structures. Current, when we update the 'conf' structure, when adding a drive to a linear array, we keep the old version around until the array is finally stopped, as it is not safe to free it immediately. Now that we have rcu protection on all accesses to 'conf', we can use call_rcu to free it more promptly. Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-18 08:49:42 +10:00
SandeepKsinha	af11c397fd	md linear: Protecting mddev with rcu locks to avoid races Due to the lack of memory ordering guarantees, we may have races around mddev->conf. In particular, the correct contents of the structure we get from dereferencing ->private might not be visible to this CPU yet, and they might not be correct w.r.t mddev->raid_disks. This patch addresses the problem using rcu protection to avoid such race conditions. Signed-off-by: SandeepKsinha <sandeepksinha@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-18 08:49:35 +10:00
Andre Noll	0894cc3066	md: Move check for bitmap presence to personality code. If the superblock of a component device indicates the presence of a bitmap but the corresponding raid personality does not support bitmaps (raid0, linear, multipath, faulty), then something is seriously wrong and we'd better refuse to run such an array. Currently, this check is performed while the superblocks are examined, i.e. before entering personality code. Therefore the generic md layer must know which raid levels support bitmaps and which do not. This patch avoids this layer violation without adding identical code to various personalities. This is accomplished by introducing a new public function to md.c, md_check_no_bitmap(), which replaces the hard-coded checks in the superblock loading functions. A call to md_check_no_bitmap() is added to the ->run method of each personality which does not support bitmaps and assembly is aborted if at least one component device contains a bitmap. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-18 08:49:23 +10:00
NeilBrown	8190e754e0	md: remove chunksize rounding from common code. It is easiest to round sizes to multiples of chunk size in the personality code for those personalities which care. Those personalities now do the rounding, so we can remove that function from common code. Also remove the upper bound on the size of a chunk, and the lower bound on the size of a device (1 chunk), neither of which really buy us anything. Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-18 08:48:58 +10:00
NeilBrown	13f2682b72	md: raid0/linear: ensure device sizes are rounded to chunk size. This is currently ensured by common code, but it is more reliable to ensure it where it is needed in personality code. All the other personalities that care already round the size to the chunk_size. raid0 and linear are the only hold-outs. Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-18 08:48:55 +10:00
NeilBrown	1b57f13223	md: move assignment of ->utime so that it never gets skipped. Currently the assignment to utime gets skipped for 'external' metadata. So move it to the top of the function so that it always gets effected. This is of largely cosmetic interest. Nothing actually depends on ->utime being right for external arrays. "mdadm --monitor" does use it for 0.90 and 1.x arrays, but with mdadm-3.0, this is not important for external metadata. Signed-off-by: NeilBrown <neilb@suse.de>	2009-06-18 08:48:19 +10:00

1 2 3 4 5 ...

1425 commits