linux

Commit Graph

Author	SHA1	Message	Date
Linus Torvalds	2943c83322	md update for 3.3 Big change is new hot-replacement. A slot in an array can hold 2 devices - one that wants-replacement and one that is the replacement. Once the replacement is built - either from the original or (in the case of errors) from elsewhere, the wants-replacement device will be removed. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.18 (GNU/Linux) iQIVAwUATwUdyDnsnt1WYoG5AQJBExAAuFQrzt7CN32nmiqaLlJ1snBC/ixOSJf7 0X88YB+utO0qNIhiRTI1AulslRGms9pChyKmZZaxU2Hvzk1pTrFspsc8RlJTIv9S Si43due99Np08/Kf5rjYulyH0fcug0qIqlrCiBvc36pOIHZzam6IzrEvKFf0dqbe 4vtLWuylEUiSEbN5gBordfTFZcxSPI/huMUgx6hEpdA4NtaPmN57/03d49k4RYgp f5l9htuIqdMgHwNJRUDGMooifuvILC5HXINUbsCFT6KUF/bxA75nW7W3C2BUq0V/ CVb/ZoHFYtuaGBEcMUzN34k0DbHghPeSmlvXT4XWq7+gVNqGe33nbSJsx1oLXWr1 m/b3j4Ublv+VVd7L1Rr40vTRB6wN5/2uN7SiD6d83ppPD0TuAY8YvMHgoLZQmQvh Ak4fEz07re+tueKhHwbi+1qMIw/ciQ9O/tI7r+AsVklkAxJNAfFKW6FOEFL2cjEW h4rbg1z9sU+xD8G01LBnJ0to/ajrJ4ch4wV/raLgi4+dJ4Tt3+/tas6WJlxHbyFF IiHCFM0+KcxLcwNkalZYg/zr5qu7EcwpPKLnc68m+LjXlYVWcnNHwv5WnOCHHQTw 6yurGGlKqBfvsb8zJJGSRWqEtRSYjDlZiMt10/u+H40HiCiLX//vSvVbZoFiwWu1 VzgEhzhvaYw= =j00/ -----END PGP SIGNATURE----- Merge tag 'md-3.3' of git://neil.brown.name/md md update for 3.3 Big change is new hot-replacement. A slot in an array can hold 2 devices - one that wants-replacement and one that is the replacement. Once the replacement is built - either from the original or (in the case of errors) from elsewhere, the wants-replacement device will be removed. * tag 'md-3.3' of git://neil.brown.name/md: (36 commits) md/raid1: Mark device want_replacement when we see a write error. md/raid1: If there is a spare and a want_replacement device, start replacement. md/raid1: recognise replacements when assembling arrays. md/raid1: handle activation of replacement device when recovery completes. md/raid1: Allow a failed replacement device to be removed. md/raid1: Allocate spare to store replacement devices and their bios. md/raid1: Replace use of mddev->raid_disks with conf->raid_disks. md/raid10: If there is a spare and a want_replacement device, start replacement. md/raid10: recognise replacements when assembling array. md/raid10: Allow replacement device to be replace old drive. md/raid10: handle recovery of replacement devices. md/raid10: Handle replacement devices during resync. md/raid10: writes should get directed to replacement as well as original. md/raid10: allow removal of failed replacement devices. md/raid10: preferentially read from replacement device if possible. md/raid10: change read_balance to return an rdev md/raid10: prepare data structures for handling replacement. md/raid5: Mark device want_replacement when we see a write error. md/raid5: If there is a spare and a want_replacement device, start replacement. md/raid5: recognise replacements when assembling array. ...	2012-01-08 13:28:33 -08:00
Al Viro	ff01bb4832	fs: move code out of buffer.c Move invalidate_bdev, block_sync_page into fs/block_dev.c. Export kill_bdev as well, so brd doesn't have to open code it. Reduce buffer_head.h requirement accordingly. Removed a rather large comment from invalidate_bdev, as it looked a bit obsolete to bother moving. The small comment replacing it says enough. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-01-03 22:54:07 -05:00
NeilBrown	7bfec5f35c	md/raid5: If there is a spare and a want_replacement device, start replacement. When attempting to add a spare to a RAID[456] array, also consider adding it as a replacement for a want_replacement device. This requires that common md code attempt hot_add even when the array is not formally degraded. Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-12-23 10:17:53 +11:00
NeilBrown	2d78f8c451	md: create externally visible flags for supporting hot-replace. hot-replace is a feature being added to md which will allow a device to be replaced without removing it from the array first. With hot-replace a spare can be activated and recovery can start while the original device is still in place, thus allowing a transition from an unreliable device to a reliable device without leaving the array degraded during the transition. It can also be use when the original device is still reliable but it not wanted for some reason. This will eventually be supported in RAID4/5/6 and RAID10. This patch adds a super-block flag to distinguish the replacement device. If an old kernel sees this flag it will reject the device. It also adds two per-device flags which are viewable and settable via sysfs. "want_replacement" can be set to request that a device be replaced. "replacement" is set to show that this device is replacing another device. The "rd%d" links in /sys/block/mdXx/md only apply to the original device, not the replacement. We currently don't make links for the replacement - there doesn't seem to be a need. Signed-off-by: NeilBrown <neilb@suse.de>	2011-12-23 10:17:51 +11:00
NeilBrown	b8321b68d1	md: change hot_remove_disk to take an rdev rather than a number. Soon an array will be able to have multiple devices with the same raid_disk number (an original and a replacement). So removing a device based on the number won't work. So pass the actual device handle instead. Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-12-23 10:17:51 +11:00
NeilBrown	476a7abb9b	md: remove test for duplicate device when setting slot number. When setting the slot number on a device in an active array we currently check that the number is not already in use. We then call into the personality's hot_add_disk function which performs the same test and returns the same error. Thus the common test is not needed. As we will shortly be changing some personalities to allow duplicates in some cases (to support hot-replace), the common test will become inconvenient. So remove the common test. Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-12-23 10:17:51 +11:00
NeilBrown	506c9e44a8	md: allow non-privileged uses to GET_*_INFO about raid arrays. The info is already available in /proc/mdstat and /sys/block in an accessible form so there is no point in putting a road-block in the ioctl for information gathering. Signed-off-by: NeilBrown <neilb@suse.de>	2011-12-23 10:17:26 +11:00
NeilBrown	60fc13702a	md: don't give up looking for spares on first failure-to-add Before performing a recovery we try to remove any spares that might not be working, then add any that might have become relevant. Currently we abort on the first spare that cannot be added. This is a false optimisation. It is conceivable that - depending on rules in the personality - a subsequent spare might be accepted. Also the loop does other things like count the available spares and reset the 'recovery_offset' value. If we abort early these might not happen properly. So remove the early abort. In particular if you have an array what is undergoing recovery and which has extra spares, then the recovery may not restart after as reboot as the could of 'spares' might end up as zero. Reported-by: Anssi Hannula <anssi.hannula@iki.fi> Signed-off-by: NeilBrown <neilb@suse.de>	2011-12-23 09:57:19 +11:00
NeilBrown	8bd2f0a05b	md: ensure new badblocks are handled promptly. When we mark blocks as bad we need them to be acknowledged by the metadata handler promptly. For an in-kernel metadata handler that was already being done. But for an external metadata handler we need to alert it of the change by sending a notification through the sysfs file. This adds that notification. Signed-off-by: NeilBrown <neilb@suse.de>	2011-12-08 16:26:08 +11:00
NeilBrown	52c64152a9	md: bad blocks shouldn't cause a Blocked status on a Faulty device. Once a device is marked Faulty the badblocks - whether acknowledged or not - become irrelevant. So they shouldn't cause the device to be marked as Blocked. Without this patch, a process might write "-blocked" to clear the Blocked status, but while that will correctly fail the device, it won't remove the apparent 'blocked' status. Signed-off-by: NeilBrown <neilb@suse.de>	2011-12-08 16:22:48 +11:00
NeilBrown	af8a24347f	md: take a reference to mddev during sysfs access. When we are accessing an mddev via sysfs we know that the mddev cannot disappear because it has an embedded kobj which is refcounted by sysfs. And we also take the mddev_lock. However this is not enough. The final mddev_put could have been called and the mddev_delayed_delete is waiting for sysfs to let go so it can destroy the kobj and mddev. In this state there are a lot of changes that should not be attempted. To to guard against this we: - initialise mddev->all_mddevs in on last put so the state can be easily detected. - in md_attr_show and md_attr_store, check ->all_mddevs under all_mddevs_lock and mddev_get the mddev if it still appears to be active. This means that if we get to sysfs as the mddev is being deleted we will get -EBUSY. rdev_attr_store and rdev_attr_show are similar but already have sufficient protection. They check that rdev->mddev still points to mddev after taking mddev_lock. As this is cleared before delayed removal which can only be requested under the mddev_lock, this ensure the rdev and mddev are still alive. Signed-off-by: NeilBrown <neilb@suse.de>	2011-12-08 15:49:46 +11:00
NeilBrown	1d23f178d5	md: refine interpretation of "hold_active == UNTIL_IOCTL". We like md devices to disappear when they really are not needed. However it is not possible to tell from the current state whether it is needed or not. We can only tell from recent history of changes. In particular immediately after we create an md device it looks very similar to immediately after we have finished with it. So we always preserve a newly created md device until something significant happens. This state is stored in 'hold_active'. The normal case is to keep it until an ioctl happens, as that will normally either activate it, or explicitly de-activate it. If it doesn't then it was probably created by mistake and it is now time to get rid of it. We can also modify an array via sysfs (instead of via ioctl) and we currently treat any change via sysfs like an ioctl as a sign that if it now isn't more active, it should be destroyed. However this is not appropriate as changes made via sysfs are more gradual so we should look for a more definitive change. So this patch only clears 'hold_active' from UNTIL_IOCTL to clear when the array_state is changed via sysfs. Other changes via sysfs are ignored. Signed-off-by: NeilBrown <neilb@suse.de>	2011-12-08 15:49:12 +11:00
Linus Torvalds	32aaeffbd4	Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits) Revert "tracing: Include module.h in define_trace.h" irq: don't put module.h into irq.h for tracking irqgen modules. bluetooth: macroize two small inlines to avoid module.h ip_vs.h: fix implicit use of module_get/module_put from module.h nf_conntrack.h: fix up fallout from implicit moduleparam.h presence include: replace linux/module.h with "struct module" wherever possible include: convert various register fcns to macros to avoid include chaining crypto.h: remove unused crypto_tfm_alg_modname() inline uwb.h: fix implicit use of asm/page.h for PAGE_SIZE pm_runtime.h: explicitly requires notifier.h linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h miscdevice.h: fix up implicit use of lists and types stop_machine.h: fix implicit use of smp.h for smp_processor_id of: fix implicit use of errno.h in include/linux/of.h of_platform.h: delete needless include <linux/module.h> acpi: remove module.h include from platform/aclinux.h miscdevice.h: delete unnecessary inclusion of module.h device_cgroup.h: delete needless include <linux/module.h> net: sch_generic remove redundant use of <linux/module.h> net: inet_timewait_sock doesnt need <linux/module.h> ... Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in - drivers/media/dvb/frontends/dibx000_common.c - drivers/media/video/{mt9m111.c,ov6650.c} - drivers/mfd/ab3550-core.c - include/linux/dmaengine.h	2011-11-06 19:44:47 -08:00
Linus Torvalds	b4fdcb02f1	Merge branch 'for-3.2/core' of git://git.kernel.dk/linux-block * 'for-3.2/core' of git://git.kernel.dk/linux-block: (29 commits) block: don't call blk_drain_queue() if elevator is not up blk-throttle: use queue_is_locked() instead of lockdep_is_held() blk-throttle: Take blkcg->lock while traversing blkcg->policy_list blk-throttle: Free up policy node associated with deleted rule block: warn if tag is greater than real_max_depth. block: make gendisk hold a reference to its queue blk-flush: move the queue kick into blk-flush: fix invalid BUG_ON in blk_insert_flush block: Remove the control of complete cpu from bio. block: fix a typo in the blk-cgroup.h file block: initialize the bounce pool if high memory may be added later block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown block: drop @tsk from attempt_plug_merge() and explain sync rules block: make get_request[_wait]() fail if queue is dead block: reorganize throtl_get_tg() and blk_throtl_bio() block: reorganize queue draining block: drop unnecessary blk_get/put_queue() in scsi_cmd_ioctl() and blk_get_tg() block: pass around REQ_* flags instead of broken down booleans during request alloc/free block: move blk_throtl prototypes to block/blk.h block: fix genhd refcounting in blkio_policy_parse_and_set() ... Fix up trivial conflicts due to "mddev_t" -> "struct mddev" conversion and making the request functions be of type "void" instead of "int" in - drivers/md/{faulty.c,linear.c,md.c,md.h,multipath.c,raid0.c,raid1.c,raid10.c,raid5.c} - drivers/staging/zram/zram_drv.c	2011-11-04 17:06:58 -07:00
Paul Gortmaker	056075c764	md: Add module.h to all files using it implicitly A pending cleanup will mean that module.h won't be implicitly everywhere anymore. Make sure the modular drivers in md dir are actually calling out for <module.h> explicitly in advance. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>	2011-10-31 19:31:18 -04:00
Jens Axboe	5c04b426f2	Merge branch 'v3.1-rc10' into for-3.2/core Conflicts: block/blk-core.c include/linux/blkdev.h Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-10-19 14:30:42 +02:00
Chris Dunlop	751e67ca2e	md.c: trivial comment fix Trivial comment fix Signed-off-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: NeilBrown <neilb@suse.de>	2011-10-19 17:15:15 +11:00
Andrei Warkentin	d70ed2e4fa	MD: Allow restarting an interrupted incremental recovery. If an incremental recovery was interrupted, a subsequent re-add will result in a full recovery, even though an incremental should be possible (seen with raid1). Solve this problem by not updating the superblock on the recovering device until array is not degraded any longer. Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrei Warkentin <andreiw@vmware.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-10-18 12:16:48 +11:00
NeilBrown	d30519fc59	md: clear In_sync bit on devices added to an active array. When we add a device to an active array it can be meaningful to set the 'insync' flag. This indicates that the device is in-sync with the array except for locations recorded in the bitmap. A bitmap-based recovery can then bring it completely in-sync. Internally we move that flag to 'saved_raid_disk' but forgot to clear In_sync like we do in add_new_disk. So clear In_sync after moving its value to saved_raid_disk. Reported-by: Andrei Warkentin <andreiw@vmware.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-10-18 12:13:47 +11:00
NeilBrown	84fc4b56db	md: rename "mdk_personality" to "md_personality" "mdk" doesn't mean anything any more. Signed-off-by: NeilBrown <neilb@suse.de>	2011-10-11 16:49:58 +11:00
NeilBrown	2b8bf3451d	md: remove typedefs: mdk_thread_t -> struct md_thread Signed-off-by: NeilBrown <neilb@suse.de>	2011-10-11 16:48:23 +11:00
NeilBrown	fd01b88c75	md: remove typedefs: mddev_t -> struct mddev Having mddev_t and 'struct mddev_s' is ugly and not preferred Signed-off-by: NeilBrown <neilb@suse.de>	2011-10-11 16:47:53 +11:00
NeilBrown	3cb0300200	md: removing typedefs: mdk_rdev_t -> struct md_rdev The typedefs are just annoying. 'mdk' probably refers to 'md_k.h' which used to be an include file that defined this thing. Signed-off-by: NeilBrown <neilb@suse.de>	2011-10-11 16:45:26 +11:00
NeilBrown	36a4e1fe0f	md: remove PRINTK and dprintk debugging and use pr_debug Being able to dynamically enable these make them much more useful. Signed-off-by: NeilBrown <neilb@suse.de>	2011-10-07 14:23:17 +11:00
Daniel P. Berrange	2dba6a911c	md: don't delay reboot by 1 second if no MD devices exist The md_notify_reboot() method includes a call to mdelay(1000), to deal with "exotic SCSI devices" which are too volatile on reboot. The delay is unconditional. Even if the machine does not have any block devices, let alone MD devices, the kernel shutdown sequence is slowed down. 1 second does not matter much with physical hardware, but with certain virtualization use cases any wasted time in the bootup & shutdown sequence counts for alot. * drivers/md/md.c: md_notify_reboot() - only impose a delay if there was at least one MD device to be stopped during reboot Signed-off-by: Daniel P. Berrange <berrange@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-09-23 19:54:04 +10:00
NeilBrown	01f96c0a99	md: Avoid waking up a thread after it has been freed. Two related problems: 1/ some error paths call "md_unregister_thread(mddev->thread)" without subsequently clearing ->thread. A subsequent call to mddev_unlock will try to wake the thread, and crash. 2/ Most calls to md_wakeup_thread are protected against the thread disappeared either by: - holding the ->mutex - having an active request, so something else must be keeping the array active. However mddev_unlock calls md_wakeup_thread after dropping the mutex and without any certainty of an active request, so the ->thread could theoretically disappear. So we need a spinlock to provide some protections. So change md_unregister_thread to take a pointer to the thread pointer, and ensure that it always does the required locking, and clears the pointer properly. Reported-by: "Moshe Melnikov" <moshe@zadarastorage.com> Signed-off-by: NeilBrown <neilb@suse.de> cc: stable@kernel.org	2011-09-21 15:30:20 +10:00
Christoph Hellwig	5a7bbad27a	block: remove support for bio remapping from ->make_request There is very little benefit in allowing to let a ->make_request instance update the bios device and sector and loop around it in __generic_make_request when we can archive the same through calling generic_make_request from the driver and letting the loop in generic_make_request handle it. Note that various drivers got the return value from ->make_request and returned non-zero values for errors. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: NeilBrown <neilb@suse.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-09-12 12:12:01 +02:00
NeilBrown	27a7b260f7	md: Fix handling for devices from 2TB to 4TB in 0.90 metadata. 0.90 metadata uses an unsigned 32bit number to count the number of kilobytes used from each device. This should allow up to 4TB per device. However we multiply this by 2 (to get sectors) before casting to a larger type, so sizes above 2TB get truncated. Also we allow rdev->sectors to be larger than 4TB, so it is possible for the array to be resized larger than the metadata can handle. So make sure rdev->sectors never exceeds 4TB when 0.90 metadata is in used. Also the sanity check at the end of super_90_load should include level 1 as it used ->size too. (RAID0 and Linear don't use ->size at all). Reported-by: Pim Zandbergen <P.Zandbergen@macroscoop.nl> Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2011-09-10 17:21:28 +10:00
NeilBrown	7da64a0abc	md: fix clearing of 'blocked' flag in the presence of bad blocks. When the 'blocked' flag on a device is cleared while there are unacknowledged bad blocks we must fail the device. This is needed for backwards compatability of the interface. The code currently uses the wrong test for "unacknowledged bad blocks exist". Change it to the right test. Signed-off-by: NeilBrown <neilb@suse.de>	2011-08-30 16:20:17 +10:00
Namhyung Kim	a5bf4df0c8	md: use REQ_NOIDLE flag in md_super_write() Queue idling is used for the anticipation of immediate sequencial I/O's but md_super_write() is a kind of one- shot operation, coupled with md_super_wait(), so the idling in this case will be just a waste of time. Specifying REQ_NOIDLE prevents it. Instead of adding the flag to submit_bio() directly, use pre-defined macro WRITE_FLUSH_FUA. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-08-25 14:43:34 +10:00
NeilBrown	aeb9b21184	md: ensure changes to 'write-mostly' are reflected in metadata. The 'write-mostly' flag can be changed through sysfs. With 0.90 metadata, those changes are reflected in the metadata. For 1.x metadata, they aren't. So fix super_1_sync to record 'write-mostly' status. Signed-off-by: NeilBrown <neilb@suse.de>	2011-08-25 14:43:08 +10:00
NeilBrown	5ef56c8fec	md: report failure if a 'set faulty' request doesn't. Sometimes a device will refuse to be set faulty. e.g. RAID1 will never let the last working device become faulty. So check if "md_error()" did manage to set the faulty flag and fail with EBUSY if it didn't. Resolves-Debian-Bug: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=601198 Reported-by: Mike Hommey <mh+reportbug@glandium.org> Signed-off-by: NeilBrown <neilb@suse.de>	2011-08-25 14:42:51 +10:00
Linus Torvalds	6140333d36	Merge branch 'for-linus' of git://neil.brown.name/md * 'for-linus' of git://neil.brown.name/md: (75 commits) md/raid10: handle further errors during fix_read_error better. md/raid10: Handle read errors during recovery better. md/raid10: simplify read error handling during recovery. md/raid10: record bad blocks due to write errors during resync/recovery. md/raid10: attempt to fix read errors during resync/check md/raid10: Handle write errors by updating badblock log. md/raid10: clear bad-block record when write succeeds. md/raid10: avoid writing to known bad blocks on known bad drives. md/raid10 record bad blocks as needed during recovery. md/raid10: avoid reading known bad blocks during resync/recovery. md/raid10 - avoid reading from known bad blocks - part 3 md/raid10: avoid reading from known bad blocks - part 2 md/raid10: avoid reading from known bad blocks - part 1 md/raid10: Split handle_read_error out from raid10d. md/raid10: simplify/reindent some loops. md/raid5: Clear bad blocks on successful write. md/raid5. Don't write to known bad block on doubtful devices. md/raid5: write errors should be recorded as bad blocks if possible. md/raid5: use bad-block log to improve handling of uncorrectable read errors. md/raid5: avoid reading from known bad blocks. ...	2011-07-28 05:50:27 -07:00
NeilBrown	e875ecea26	md/raid10 record bad blocks as needed during recovery. When recovering one or more devices, if all the good devices have bad blocks we should record a bad block on the device being rebuilt. If this fails, we need to abort the recovery. To ensure we don't think that we aborted later than we actually did, we need to move the check for MD_RECOVERY_INTR earlier in md_do_sync, in particular before mddev->curr_resync is updated. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:39:24 +10:00
NeilBrown	de393cdea6	md: make it easier to wait for bad blocks to be acknowledged. It is only safe to choose not to write to a bad block if that bad block is safely recorded in metadata - i.e. if it has been 'acknowledged'. If it hasn't we need to wait for the acknowledgement. We support that using rdev->blocked wait and md_wait_for_blocked_rdev by introducing a new device flag 'BlockedBadBlock'. This flag is only advisory. It is cleared whenever we acknowledge a bad block, so that a waiter can re-check the particular bad blocks that it is interested it. It should be set by a caller when they find they need to wait. This (set after test) is inherently racy, but as md_wait_for_blocked_rdev already has a timeout, losing the race will have minimal impact. When we clear "Blocked" was also clear "BlockedBadBlocks" incase it was set incorrectly (see above race). We also modify the way we manage 'Blocked' to fit better with the new handling of 'BlockedBadBlocks' and to make it consistent between externally managed and internally managed metadata. This requires that each raidXd loop checks if the metadata needs to be written and triggers a write (md_check_recovery) if needed. Otherwise a queued write request might cause raidXd to wait for the metadata to write, and only that thread can write it. Before writing metadata, we set FaultRecorded for all devices that are Faulty, then after writing the metadata we clear Blocked for any device for which the Fault was certainly Recorded. The 'faulty' device flag now appears in sysfs if the device is faulty or it has unacknowledged bad blocks. So user-space which does not understand bad blocks can continue to function correctly. User space which does, should not assume a device is faulty until it sees the 'faulty' flag, and then sees the list of unacknowledged bad blocks is empty. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:31:48 +10:00
NeilBrown	d7a9d443bc	md: add 'write_error' flag to component devices. If a device has ever seen a write error, we will want to handle known-bad-blocks differently. So create an appropriate state flag and export it via sysfs. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>	2011-07-28 11:31:48 +10:00
NeilBrown	d2eb35acfd	md/raid1: avoid reading from known bad blocks. Now that we have a bad block list, we should not read from those blocks. There are several main parts to this: 1/ read_balance needs to check for bad blocks, and return not only the chosen device, but also how many good blocks are available there. 2/ fix_read_error needs to avoid trying to read from bad blocks. 3/ read submission must be ready to issue multiple reads to different devices as different bad blocks on different devices could mean that a single large read cannot be served by any one device, but can still be served by the array. This requires keeping count of the number of outstanding requests per bio. This count is stored in 'bi_phys_segments' 4/ retrying a read needs to also be ready to submit a smaller read and queue another request for the rest. This does not yet handle bad blocks when reading to perform resync, recovery, or check. 'md_trim_bio' will also be used for RAID10, so put it in md.c and export it. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:31:48 +10:00
NeilBrown	9f2f383078	md: Disable bad blocks and v0.90 metadata. v0.90 metadata cannot record bad blocks, so when loading metadata for such a device, set shift to -1. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:31:47 +10:00
NeilBrown	2699b67223	md: load/store badblock list from v1.x metadata Space must have been allocated when array was created. A feature flag is set when the badblock list is non-empty, to ensure old kernels don't load and trust the whole device. We only update the on-disk badblocklist when it has changed. If the badblocklist (or other metadata) is stored on a bad block, we don't cope very well. If metadata has no room for bad block, flag bad-blocks as disabled, and do the same for 0.90 metadata. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:31:47 +10:00
NeilBrown	16c791a5af	md/bad-block-log: add sysfs interface for accessing bad-block-log. This can show the log (providing it fits in one page) and allows bad blocks to be 'acknowledged' meaning that they have safely been recorded in metadata. Clearing bad blocks is not allowed via sysfs (except for code testing). A bad block can only be cleared when a write to the block succeeds. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>	2011-07-28 11:31:47 +10:00
NeilBrown	2230dfe4cc	md: beginnings of bad block management. This the first step in allowing md to track bad-blocks per-device so that we can fail individual blocks rather than the whole device. This patch just adds a data structure for recording bad blocks, with routines to add, remove, search the list. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>	2011-07-28 11:31:46 +10:00
NeilBrown	a519b26dbe	md: remove suspicious size_of() When calling bioset_create we pass the size of the front_pad as sizeof(mddev) which looks suspicious as mddev is a pointer and so it looks like a common mistake where sizeof(mddev) was intended. The size is actually correct as we want to store a pointer in the front padding of the bios created by the bioset, so make the intent more explicit by using sizeof(mddev_t ) Reported-by: Zdenek Kabelac <zdenek.kabelac@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 07:56:24 +10:00
Jonathan Brassow	768e587e18	MD: generate an event when array sync is complete This patch causes MD to generate an event (for device-mapper) when the synchronization thread is reaped. This is expected behavior for device-mapper. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-27 11:00:37 +10:00
Namhyung Kim	65a06f0674	md: get rid of unnecessary casts on page_address() page_address() returns void pointer, so the casts can be removed. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-27 11:00:36 +10:00
NeilBrown	5389042ffa	md: change managed of recovery_disabled. If we hit a read error while recovering a mirror, we want to abort the recovery without necessarily failing the disk - as having a disk this a read error is better than not having an array at all. Currently this is managed with a per-array flag "recovery_disabled" and is only implemented for RAID1. For RAID10 we will need finer grained control as we might want to disable recovery for individual devices separately. So push more of the decision making into the personality. 'recovery_disabled' is now a 'cookie' which is copied when the personality want to disable recovery and is changed when a device is added to the array as this is used as a trigger to 'try recovery again'. This will allow RAID10 to get the control that it needs. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-27 11:00:36 +10:00
Namhyung Kim	a478a069b6	md: remove ro check in md_check_recovery() Commit `c89a8eee61` ("Allow faulty devices to be removed from a readonly array.") added some work on ro array in the function, but it couldn't be done since we didn't allow the ro array to be handled from the beginning. Fix it. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-27 11:00:36 +10:00
Namhyung Kim	36fad858a7	md: introduce link/unlink_rdev() helpers There are places where sysfs links to rdev are handled in a same way. Add the helper functions to consolidate them. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-27 11:00:36 +10:00
Kay Sievers	f15146380d	fs: seq_file - add event counter to simplify poll() support Moving the event counter into the dynamically allocated 'struc seq_file' allows poll() support without the need to allocate its own tracking structure. All current users are switched over to use the new counter. Requested-by: Andrew Morton akpm@linux-foundation.org Acked-by: NeilBrown <neilb@suse.de> Tested-by: Lucas De Marchi lucas.demarchi@profusion.mobi Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:47:50 -04:00
NeilBrown	4274215d24	md: avoid endless recovery loop when waiting for fail device to complete. If a device fails in a way that causes pending request to take a while to complete, md will not be able to immediately remove it from the array in remove_and_add_spares. It will then incorrectly look like a spare device and md will try to recover it even though it is failed. This leads to a recovery process starting and instantly aborting over and over again. We should check if the device is faulty before considering it to be a spare. This will avoid trying to start a recovery that cannot proceed. This bug was introduced in 2.6.26 so that patch is suitable for any kernel since then. Cc: stable@kernel.org Reported-by: Jim Paradis <james.paradis@stratus.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-06-28 16:59:42 +10:00
Namhyung Kim	01393f3d58	md: check ->hot_remove_disk when removing disk Check pers->hot_remove_disk instead of pers->hot_add_disk in slot_store() during disk removal. The linear personality only has ->hot_add_disk and no ->hot_remove_disk, so that removing disk in the array resulted to following kernel bug: $ sudo mdadm --create /dev/md0 --level=linear --raid-devices=4 /dev/loop[0-3] $ echo none \| sudo tee /sys/block/md0/md/dev-loop2/slot BUG: unable to handle kernel NULL pointer dereference at (null) IP: [< (null)>] (null) PGD c9f5d067 PUD 8575a067 PMD 0 Oops: 0010 [#1] SMP CPU 2 Modules linked in: linear loop bridge stp llc kvm_intel kvm asus_atk0110 sr_mod cdrom sg Pid: 10450, comm: tee Not tainted 3.0.0-rc1-leonard+ #173 System manufacturer System Product Name/P5G41TD-M PRO RIP: 0010:[<0000000000000000>] [< (null)>] (null) RSP: 0018:ffff880085757df0 EFLAGS: 00010282 RAX: ffffffffa00168e0 RBX: ffff8800d1431800 RCX: 000000000000006e RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff88008543c000 RBP: ffff880085757e48 R08: 0000000000000002 R09: 000000000000000a R10: 0000000000000000 R11: ffff88008543c2e0 R12: 00000000ffffffff R13: ffff8800b4641000 R14: 0000000000000005 R15: 0000000000000000 FS: 00007fe8c9e05700(0000) GS:ffff88011fa00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000000 CR3: 00000000b4502000 CR4: 00000000000406e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process tee (pid: 10450, threadinfo ffff880085756000, task ffff8800c9f08000) Stack: ffffffff8138496a ffff8800b4641000 ffff88008543c268 0000000000000000 ffff8800b4641000 ffff88008543c000 ffff8800d1431868 ffffffff81a78a90 ffff8800b4641000 ffff88008543c000 ffff8800d1431800 ffff880085757e98 Call Trace: [<ffffffff8138496a>] ? slot_store+0xaa/0x265 [<ffffffff81384bae>] rdev_attr_store+0x89/0xa8 [<ffffffff8115a96a>] sysfs_write_file+0x108/0x144 [<ffffffff81106b87>] vfs_write+0xb1/0x10d [<ffffffff8106e6c0>] ? trace_hardirqs_on_caller+0x111/0x135 [<ffffffff81106cac>] sys_write+0x4d/0x77 [<ffffffff814fe702>] system_call_fastpath+0x16/0x1b Code: Bad RIP value. RIP [< (null)>] (null) RSP <ffff880085757df0> CR2: 0000000000000000 ---[ end trace ba5fc64319a826fb ]--- Signed-off-by: Namhyung Kim <namhyung@gmail.com> Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2011-06-09 11:42:54 +10:00

1 2 3 4 5 ...

608 Commits (5f85d20d04cdc4c6ed15022a5ed76907ad88d4ae)