summaryrefslogtreecommitdiffstats
path: root/drivers/md
Commit message (Collapse)AuthorAgeFilesLines
* md: manage redundancy group in sysfs when changing level.NeilBrown2010-07-053-13/+38
| | | | | | | | | | | | | | | | | | | | | | | | commit a64c876fd357906a1f7193723866562ad290654c upstream. Some levels expect the 'redundancy group' to be present, others don't. So when we change level of an array we might need to add or remove this group. This requires fixing up the current practice of overloading ->private to indicate (when ->pers == NULL) that something needs to be removed. So create a new ->to_remove to fill that role. When changing levels, we may need to add or remove attributes. When changing RAID5 -> RAID6, we both add and remove the same thing. It is important to catch this and optimise it out as the removal is delayed until a lock is released, so trying to add immediately would cause problems. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* md: set mddev readonly flag on blkdev BLKROSET ioctlDan Williams2010-07-051-0/+29
| | | | | | | | | | | | | | | | | | commit e2218350465e7e0931676b4849b594c978437bce upstream. When the user sets the block device to readwrite then the mddev should follow suit. Otherwise, the BUG_ON in md_write_start() will be set to trigger. The reverse direction, setting mddev->ro to match a set readonly request, can be ignored because the blkdev level readonly flag precludes the need to have mddev->ro set correctly. Nevermind the fact that setting mddev->ro to 1 may fail if the array is in use. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* md: remove unneeded sysfs files more promptlyNeilBrown2010-07-051-10/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | commit b6eb127d274385d81ce8dd45c98190f097bce1b4 upstream. When an array is stopped we need to remove some sysfs files which are dependent on the type of array. We need to delay that deletion as deleting them while holding reconfig_mutex can lead to deadlocks. We currently delay them until the array is completely destroyed. However it is possible to deactivate and then reactivate the array. It is also possible to need to remove sysfs files when changing level, which can potentially happen several times before an array is destroyed. So we need to delete these files more promptly: as soon as reconfig_mutex is dropped. We need to ensure this happens before do_md_run can restart the array, so we use open_mutex for some extra locking. This is not deadlock prone. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* md/linear: avoid possible oops and array stopNeilBrown2010-07-051-0/+1
| | | | | | | | | | | | | | | | | commit ef2f80ff7325b2c1888ff02ead28957b5840bf51 upstream. Since commit ef286f6fa673cd7fb367e1b145069d8dbfcc6081 it has been important that each personality clears ->private in the ->stop() function, or sets it to a attribute group to be removed. linear.c doesn't. This can sometimes lead to an oops, though it doesn't always. Suitable for 2.6.33-stable and 2.6.34. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* md: Fix read balancing in RAID1 and RAID10 on drives > 2TBNeilBrown2010-07-052-3/+3
| | | | | | | | | | | | | | | commit af3a2cd6b8a479345786e7fe5e199ad2f6240e56 upstream. read_balance uses a "unsigned long" for a sector number which will get truncated beyond 2TB. This will cause read-balancing to be non-optimal, and can cause data to be read from the 'wrong' branch during a resync. This has a very small chance of returning wrong data. Reported-by: Jordan Russell <jr-list-2010@quo.to> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* md/raid1: fix counting of write targets.NeilBrown2010-07-051-2/+3
| | | | | | | | | | | | | | | | | | | | | | commit 964147d5c86d63be79b442c30f3783d49860c078 upstream. There is a very small race window when writing to a RAID1 such that if a device is marked faulty at exactly the wrong time, the write-in-progress will not be sent to the device, but the bitmap (if present) will be updated to say that the write was sent. Then if the device turned out to still be usable as was re-added to the array, the bitmap-based-resync would skip resyncing that block, possibly leading to corruption. This would only be a problem if no further writes were issued to that area of the device (i.e. that bitmap chunk). Suitable for any pending -stable kernel. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* md/raid6: Fix raid-6 read-error correction in degraded stateGabriele A. Trombetti2010-05-121-1/+1
| | | | | | | | | | | | | | | commit 87aa63000c484bfb9909989316f615240dfee018 upstream. Fix: Raid-6 was not trying to correct a read-error when in singly-degraded state and was instead dropping one more device, going to doubly-degraded state. This patch fixes this behaviour. Tested-by: Janos Haar <janos.haar@netcenter.hu> Signed-off-by: Gabriele A. Trombetti <g.trombetti.lkrnl1213@logicschema.com> Reported-by: Janos Haar <janos.haar@netcenter.hu> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* md: restore ability of spare drives to spin down.NeilBrown2010-05-121-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 1176568de7e066c0be9e46c37503b9fd4730edcf upstream. Some time ago we stopped the clean/active metadata updates from being written to a 'spare' device in most cases so that it could spin down and say spun down. Device failure/removal etc are still recorded on spares. However commit 51d5668cb2e3fd1827a55 broke this 50% of the time, depending on whether the event count is even or odd. The change log entry said: This means that the alignment between 'odd/even' and 'clean/dirty' might take a little longer to attain, how ever the code makes no attempt to create that alignment, so it could take arbitrarily long. So when we find that clean/dirty is not aligned with odd/even, force a second metadata-update immediately. There are already cases where a second metadata-update is needed immediately (e.g. when a device fails during the metadata update). We just piggy-back on that. Reported-by: Joe Bryant <tenminjoe@yahoo.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* md/raid5: fix previous patch.NeilBrown2010-05-121-17/+18
| | | | | | | | | | | | | | | commit 6e3b96ed610e5a1838e62ddae9fa0c3463f235fa upstream. Previous patch changes stripe and chunk_number to sector_t but mistakenly did not update all of the divisions to use sector_dev(). This patch changes all the those divisions (actually the '%' operator) to sector_div. Signed-off-by: NeilBrown <neilb@suse.de> Tested-by: Stefan Lippers-Hollmann <s.l-h@gmx.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* md/raid5: allow for more than 2^31 chunks.NeilBrown2010-05-121-12/+7
| | | | | | | | | | | | | commit 35f2a591192d0a5d9f7fc696869c76f0b8e49c3d upstream. With many large drives and small chunk sizes it is possible to create a RAID5 with more than 2^31 chunks. Make sure this works. Reported-by: Brett King <king.br@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* md: deal with merge_bvec_fn in component devices better.NeilBrown2010-04-264-30/+43
| | | | | | | | | | | | | | | | | | | | commit 627a2d3c29427637f4c5d31ccc7fcbd8d312cd71 upstream. If a component device has a merge_bvec_fn then as we never call it we must ensure we never need to. Currently this is done by setting max_sector to 1 PAGE, however this does not stop a bio being created with several sub-page iovecs that would violate the merge_bvec_fn. So instead set max_phys_segments to 1 and set the segment boundary to the same as a page boundary to ensure there is only ever one single-page segment of IO requested at a time. This can particularly be an issue when 'xen' is used as it is known to submit multiple small buffers in a single bio. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* dm ioctl: introduce flag indicating uevent was generatedPeter Rajnoha2010-04-263-12/+18
| | | | | | | | | | | | | commit 3abf85b5b5851b5f28d3d8a920ebb844edd08352 upstream. Set a new DM_UEVENT_GENERATED_FLAG when returning from ioctls to indicate that a uevent was actually generated. This tells the userspace caller that it may need to wait for the event to be processed. Signed-off-by: Peter Rajnoha <prajnoha@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* dm ioctl: only issue uevent on resume if state changedMike Snitzer2010-03-151-4/+5
| | | | | | | | | | | | | commit 0f3649a9e305ea22eb196a84a2d7520afcaa6060 upstream. Only issue a uevent on a resume if the state of the device changed, i.e. if it was suspended and/or its table was replaced. Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* dm: free dm_io before bio_endio not afterMikulas Patocka2010-03-151-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit a97f925a32aad2a37971d7bfb657006acf04e42d upstream. Free the dm_io structure before calling bio_endio() instead of after it, to ensure that the io_pool containing it is not referenced after it is freed. This partially fixes a problem described here https://www.redhat.com/archives/dm-devel/2010-February/msg00109.html thread 1: bio_endio(bio, io_error); /* scheduling happens */ thread 2: close the device remove the device thread 1: free_io(md, io); Thread 2, when removing the device, sees non-empty md->io_pool (because the io hasn't been freed by thread 1 yet) and may crash with BUG in mempool_free. Thread 1 may also crash, when freeing into a nonexisting mempool. To fix this we must make sure that bio_endio() is the last call and the md structure is not accessed afterwards. There is another bio_endio in process_barrier, but it is called from the thread and the thread is destroyed prior to freeing the mempools, so this call is not affected by the bug. A similar bug exists with module unloads - the module may be unloaded immediately after bio_endio - but that is more difficult to fix. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* dm: sysfs revert add empty release function to avoid debug warningAlasdair G Kergon2010-02-161-8/+0
| | | | | | | | | | | | | | | | | | | Revert commit d2bb7df8cac647b92f51fb84ae735771e7adbfa7 at Greg's request. Author: Milan Broz <mbroz@redhat.com> Date: Thu Dec 10 23:51:53 2009 +0000 dm: sysfs add empty release function to avoid debug warning This patch just removes an unnecessary warning: kobject: 'dm': does not have a release() function, it is broken and must be fixed. The kobject is embedded in mapped device struct, so code does not need to release memory explicitly here. Cc: Greg KH <gregkh@suse.de> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm mpath: fix stall when requeueing ioKiyoshi Ueda2010-02-161-4/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch fixes the problem that system may stall if target's ->map_rq returns DM_MAPIO_REQUEUE in map_request(). E.g. stall happens on 1 CPU box when a dm-mpath device with queue_if_no_path bounces between all-paths-down and paths-up on I/O load. When target's ->map_rq returns DM_MAPIO_REQUEUE, map_request() requeues the request and returns to dm_request_fn(). Then, dm_request_fn() doesn't exit the I/O dispatching loop and continues processing the requeued request again. This map and requeue loop can be done with interrupt disabled, so 1 CPU system can be stalled if this situation happens. For example, commands below can stall my 1 CPU box within 1 minute or so: # dmsetup table mp mp: 0 2097152 multipath 1 queue_if_no_path 0 1 1 service-time 0 1 2 8:144 1 1 # while true; do dd if=/dev/mapper/mp of=/dev/null bs=1M count=100; done & # while true; do \ > dmsetup message mp 0 "fail_path 8:144" \ > dmsetup suspend --noflush mp \ > dmsetup resume mp \ > dmsetup message mp 0 "reinstate_path 8:144" \ > done To fix the problem above, this patch changes dm_request_fn() to exit the I/O dispatching loop once if a request is requeued in map_request(). Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: stable@kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm raid1: fix null pointer dereference in suspendTakahiro Yasui2010-02-161-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When suspending a failed mirror, bios are completed by mirror_end_io() and __rh_lookup() in dm_rh_dec() returns NULL where a non-NULL return value is required by design. Fix this by not changing the state of the recovery failed region from DM_RH_RECOVERING to DM_RH_NOSYNC in dm_rh_recovery_end(). Issue On 2.6.33-rc1 kernel, I hit the bug when I suspended the failed mirror by dmsetup command. BUG: unable to handle kernel NULL pointer dereference at 00000020 IP: [<f94f38e2>] dm_rh_dec+0x35/0xa1 [dm_region_hash] ... EIP: 0060:[<f94f38e2>] EFLAGS: 00010046 CPU: 0 EIP is at dm_rh_dec+0x35/0xa1 [dm_region_hash] EAX: 00000286 EBX: 00000000 ECX: 00000286 EDX: 00000000 ESI: eff79eac EDI: eff79e80 EBP: f6915cd4 ESP: f6915cc4 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 Process dmsetup (pid: 2849, ti=f6914000 task=eff03e80 task.ti=f6914000) ... Call Trace: [<f9530af6>] ? mirror_end_io+0x53/0x1b1 [dm_mirror] [<f9413104>] ? clone_endio+0x4d/0xa2 [dm_mod] [<f9530aa3>] ? mirror_end_io+0x0/0x1b1 [dm_mirror] [<f94130b7>] ? clone_endio+0x0/0xa2 [dm_mod] [<c02d6bcb>] ? bio_endio+0x28/0x2b [<f952f303>] ? hold_bio+0x2d/0x62 [dm_mirror] [<f952f942>] ? mirror_presuspend+0xeb/0xf7 [dm_mirror] [<c02aa3e2>] ? vmap_page_range+0xb/0xd [<f9414c8d>] ? suspend_targets+0x2d/0x3b [dm_mod] [<f9414ca9>] ? dm_table_presuspend_targets+0xe/0x10 [dm_mod] [<f941456f>] ? dm_suspend+0x4d/0x150 [dm_mod] [<f941767d>] ? dev_suspend+0x55/0x18a [dm_mod] [<c0343762>] ? _copy_from_user+0x42/0x56 [<f9417fb0>] ? dm_ctl_ioctl+0x22c/0x281 [dm_mod] [<f9417628>] ? dev_suspend+0x0/0x18a [dm_mod] [<f9417d84>] ? dm_ctl_ioctl+0x0/0x281 [dm_mod] [<c02c3c4b>] ? vfs_ioctl+0x22/0x85 [<c02c422c>] ? do_vfs_ioctl+0x4cb/0x516 [<c02c42b7>] ? sys_ioctl+0x40/0x5a [<c0202858>] ? sysenter_do_call+0x12/0x28 Analysis When recovery process of a region failed, dm_rh_recovery_end() function changes the state of the region from RM_RH_RECOVERING to DM_RH_NOSYNC. When recovery_complete() is executed between dm_rh_update_states() and dm_writes() in do_mirror(), bios are processed with the region state, DM_RH_NOSYNC. However, the region data is freed without checking its pending count when dm_rh_update_states() is called next time. When bios are finished by mirror_end_io(), __rh_lookup() in dm_rh_dec() returns NULL even though a valid return value are expected. Solution Remove the state change of the recovery failed region from DM_RH_RECOVERING to DM_RH_NOSYNC in dm_rh_recovery_end(). We can remove the state change because: - If the region data has been released by dm_rh_update_states(), a new region data is created with the state of DM_RH_NOSYNC, and bios are processed according to the DM_RH_NOSYNC state. - If the region data has not been released by dm_rh_update_states(), a state of the region is DM_RH_RECOVERING and bios are put in the delayed_bio list. The flag change from DM_RH_RECOVERING to DM_RH_NOSYNC in dm_rh_recovery_end() was added in the following commit: dm raid1: handle resync failures author Jonathan Brassow <jbrassow@redhat.com> Thu, 12 Jul 2007 16:29:04 +0000 (17:29 +0100) http://git.kernel.org/linus/f44db678edcc6f4c2779ac43f63f0b9dfa28b724 Signed-off-by: Takahiro Yasui <tyasui@redhat.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm raid1: fail writes if errors are not handled and log failsMikulas Patocka2010-02-161-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | If the mirror log fails when the handle_errors option was not selected and there is no remaining valid mirror leg, writes return success even though they weren't actually written to any device. This patch completes them with EIO instead. This code path is taken: do_writes: bio_list_merge(&ms->failures, &sync); do_failures: if (!get_valid_mirror(ms)) (false) else if (errors_handled(ms)) (false) else bio_endio(bio, 0); The logic in do_failures is based on presuming that the write was already tried: if it succeeded at least on one leg (without handle_errors) it is reported as success. Reference: https://bugzilla.redhat.com/show_bug.cgi?id=555197 Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm log: userspace fix overhead_size calcuationsJonathan Brassow2010-02-161-3/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | This patch fixes two bugs that revolve around the miscalculation and misuse of the variable 'overhead_size'. 'overhead_size' is the size of the various header structures used during communication. The first bug is the use of 'sizeof' with the pointer of a structure instead of the structure itself - resulting in the wrong size being computed. This is then used in a check to see if the payload (data_size) would be to large for the preallocated structure. Since the bug produces a smaller value for the overhead, it was possible for the structure to be breached. (Although the current users of the code do not currently send enough data to trigger this bug.) The second bug is that the 'overhead_size' value is used to compute how much of the preallocated space should be cleared before populating it with fresh data. This should have simply been 'sizeof(struct cn_msg)' not overhead_size. The fact that 'overhead_size' was computed incorrectly made this problem "less bad" - leaving only a pointer's worth of space at the end uncleared. Thus, this bug was never producing a bad result, but still needs to be fixed - especially now that the value is computed correctly. Cc: stable@kernel.org Signed-off-by: Jonathan Brassow <jbrassow@redhat.com Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm snapshot: persistent annotate work_queue as on stackMike Snitzer2010-02-161-1/+1
| | | | | | | | | | chunk_io() declares its 'struct mdata_req' on the stack and then initializes its 'struct work_struct' member. Annotate the initialization of this workqueue with INIT_WORK_ON_STACK to suppress a debugobjects warning seen when CONFIG_DEBUG_OBJECTS_WORK is enabled. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm stripe: avoid divide by zero with invalid stripe countNikanth Karthikesan2010-02-161-1/+1
| | | | | | | | | | | | | | | | If a table containing zero as stripe count is passed into stripe_ctr the code attempts to divide by zero. This patch changes DM_TABLE_LOAD to return -EINVAL if the stripe count is zero. We now get the following error messages: device-mapper: table: 253:0: striped: Invalid stripe count device-mapper: ioctl: error adding target to table Signed-off-by: Nikanth Karthikesan <knikanth@suse.de> Cc: stable@kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* md: fix some lockdep issues between md and sysfs.NeilBrown2010-02-102-11/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ====== This fix is related to http://bugzilla.kernel.org/show_bug.cgi?id=15142 but does not address that exact issue. ====== sysfs does like attributes being removed while they are being accessed (i.e. read or written) and waits for the access to complete. As accessing some md attributes takes the same lock that is held while removing those attributes a deadlock can occur. This patch addresses 3 issues in md that could lead to this deadlock. Two relate to calling flush_scheduled_work while the lock is held. This is probably a bad idea in general and as we use schedule_work to delete various sysfs objects it is particularly bad. In one case flush_scheduled_work is called from md_alloc (called by md_probe) called from do_md_run which holds the lock. This call is only present to ensure that ->gendisk is set. However we can be sure that gendisk is always set (though possibly we couldn't when that code was originally written. This is because do_md_run is called in three different contexts: 1/ from md_ioctl. This requires that md_open has succeeded, and it fails if ->gendisk is not set. 2/ from writing a sysfs attribute. This can only happen if the mddev has been registered in sysfs which happens in md_alloc after ->gendisk has been set. 3/ from autorun_array which is only called by autorun_devices, which checks for ->gendisk to be set before calling autorun_array. So the call to md_probe in do_md_run can be removed, and the check on ->gendisk can also go. In the other case flush_scheduled_work is being called in do_md_stop, purportedly to wait for all md_delayed_delete calls (which delete the component rdevs) to complete. However there really isn't any need to wait for them - they have already been disconnected in all important ways. The third issue is that raid5->stop() removes some attribute names while the lock is held. There is already some infrastructure in place to delay attribute removal until after the lock is released (using schedule_work). So extend that infrastructure to remove the raid5_attrs_group. This does not address all lockdep issues related to the sysfs "s_active" lock. The rest can be address by splitting that lockdep context between symlinks and non-symlinks which hopefully will happen. Signed-off-by: NeilBrown <neilb@suse.de>
* md: fix 'degraded' calculation when starting a reshape.NeilBrown2010-02-091-4/+7
| | | | | | | | | | | | | | | | | | | | | | | | | This code was written long ago when it was not possible to reshape a degraded array. Now it is so the current level of degraded-ness needs to be taken in to account. Also newly addded devices should only reduce degradedness if they are deemed to be in-sync. In particular, if you convert a RAID5 to a RAID6, and increase the number of devices at the same time, then the 5->6 conversion will make the array degraded so the current code will produce a wrong value for 'degraded' - "-1" to be precise. If the reshape runs to completion end_reshape will calculate a correct new value for 'degraded', but if a device fails during the reshape an incorrect decision might be made based on the incorrect value of "degraded". This patch is suitable for 2.6.32-stable and if they are still open, 2.6.31-stable and 2.6.30-stable as well. Cc: stable@kernel.org Reported-by: Michael Evans <mjevans1983@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
* DM: Fix device mapper topology stackingMartin K. Petersen2010-01-111-15/+5
| | | | | | | | | | | | | | Make DM use bdev_stack_limits() function so that partition offsets get taken into account when calculating alignment. Clarify stacking warnings. Also remove obsolete clearing of final alignment_offset and misalignment flag. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: Alasdair G. Kergon <agk@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* md: allow a resync that is waiting for other resync to complete, to be aborted.NeilBrown2009-12-301-2/+3
| | | | | | | | | | | | | | If two arrays share a device, then they will not both resync at the same time. One will wait for the other to complete. While waiting, the MD_RECOVERY_INTR flag is not checked so a device failure, which would make the resync pointless, does not cause the resync to abort, so the failed device cannot be removed (as it cannot be remove while a resync is happening). So add a test for MD_RECOVERY_INTR. Reported-by: Brett Russ <bruss@netezza.com> Signed-off-by: NeilBrown <neilb@suse.de>
* md: remove unnecessary code from do_md_runNeilBrown2009-12-301-28/+0
| | | | | | | | | | | | | Since commit dfc7064500061677720fa26352963c772d3ebe6b, ->hot_remove_disks has not removed non-failed devices from an array until recovery is no longer possible. So the code in do_md_run to get around the fact that md_check_recovery (which calls ->hot_remove_disks) would remove partially-in-sync devices is no longer needed. So remove it. Signed-off-by: NeilBrown <neilb@suse.de>
* md: make recovery started by do_md_run() visible via sync_actionDan Williams2009-12-301-0/+1
| | | | | | | | | | By default md_do_sync() will perform recovery if no other actions are specified. However, action_show() relies on MD_RECOVERY_RECOVER to be set otherwise it returns 'idle'. So, add a missing set MD_RECOVERY_RECOVER when starting recovery. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
* md: fix small irregularity with start_ro module parameterNeilBrown2009-12-301-1/+1
| | | | | | | | | | | | | | | | | The start_ro modules parameter can be used to force arrays to be started in 'auto-readonly' in which they are read-only until the first write. This ensures that no resync/recovery happens until something else writes to the device. This is important for resume-from-disk off an md array. However if an array is started 'readonly' (by writing 'readonly' to the 'array_state' sysfs attribute) we want it to be really 'readonly', not 'auto-readonly'. So strengthen the condition to only set auto-readonly if the array is not already read-only. Signed-off-by: NeilBrown <neilb@suse.de>
* md: Fix unfortunate interaction with evmsNeilBrown2009-12-301-1/+7
| | | | | | | | | | | | | | | | | | | | | evms configures md arrays by: open device send ioctl close device for each different ioctl needed. Since 2.6.29, the device can disappear after the 'close' unless a significant configuration has happened to the device. The change made by "SET_ARRAY_INFO" can too minor to stop the device from disappearing, but important enough that losing the change is bad. So: make sure SET_ARRAY_INFO sets mddev->ctime, and keep the device active as long as ctime is non-zero (it gets zeroed with lots of other things when the array is stopped). This is suitable for -stable kernels since 2.6.29. Signed-off-by: NeilBrown <neilb@suse.de> Cc: stable@kernel.org
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dmLinus Torvalds2009-12-1518-874/+2274
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm: (80 commits) dm snapshot: use merge origin if snapshot invalid dm snapshot: report merge failure in status dm snapshot: merge consecutive chunks together dm snapshot: trigger exceptions in remaining snapshots during merge dm snapshot: delay merging a chunk until writes to it complete dm snapshot: queue writes to chunks being merged dm snapshot: add merging dm snapshot: permit only one merge at once dm snapshot: support barriers in snapshot merge target dm snapshot: avoid allocating exceptions in merge dm snapshot: rework writing to origin dm snapshot: add merge target dm exception store: add merge specific methods dm snapshot: create function for chunk_is_tracked wait dm snapshot: make bio optional in __origin_write dm mpath: reject messages when device is suspended dm: export suspended state to targets dm: rename dm_suspended to dm_suspended_md dm: swap target postsuspend call and setting suspended flag dm crypt: add plain64 iv ...
| * dm snapshot: use merge origin if snapshot invalidMikulas Patocka2009-12-101-5/+4
| | | | | | | | | | | | | | | | | | If the snapshot we are merging became invalid (e.g. it ran out of space) redirect all I/O directly to the origin device. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm snapshot: report merge failure in statusMike Snitzer2009-12-101-2/+28
| | | | | | | | | | | | | | | | Set 'merge_failed' flag if a snapshot fails to merge. Update snapshot_status() to report "Merge failed" if 'merge_failed' is set. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm snapshot: merge consecutive chunks togetherMike Snitzer2009-12-101-10/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | s->store->type->prepare_merge returns the number of chunks that can be copied linearly working backwards from the returned chunk number. For example, if it returns 3 chunks with old_chunk == 10 and new_chunk == 20, then chunk 20 can be copied to 10, chunk 19 to 9 and 18 to 8. Until now kcopyd only copied one chunk at a time. This patch now copies the full set at once. Consequently, snapshot_merge_process() needs to delay the merging of all chunks if any have writes in progress, not just the first chunk in the region that is to be merged. snapshot-merge's performance is now comparable to the original snapshot-origin target. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm snapshot: trigger exceptions in remaining snapshots during mergeMikulas Patocka2009-12-101-0/+83
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When there is one merging snapshot and other non-merging snapshots, snapshot_merge_process() must make exceptions in the non-merging snapshots. Use a sequence count to resolve the race between I/O to chunks that are about to be merged. The count increases each time an exception reallocation finishes. Use wait_event() to wait until the count changes. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm snapshot: delay merging a chunk until writes to it completeMikulas Patocka2009-12-101-1/+5
| | | | | | | | | | | | | | | | | | Track writes to chunks that are currently being merged and delay merging a chunk until all writes to that chunk finish. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm snapshot: queue writes to chunks being mergedMikulas Patocka2009-12-101-13/+78
| | | | | | | | | | | | | | | | | | While a set of chunks is being merged, any overlapping writes need to be queued. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm snapshot: add mergingMikulas Patocka2009-12-102-6/+244
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merging is started when origin is resumed and it is stopped when origin is suspended or when the merging snapshot is destroyed or errors are detected. Merging is not yet interlocked with writes: this will be handled in subsequent patches. The code relies on callbacks from a private kcopyd thread. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm snapshot: permit only one merge at onceMikulas Patocka2009-12-101-6/+27
| | | | | | | | | | | | | | | | | | Merging more than one snapshot is not supported, so prevent this happening. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm snapshot: support barriers in snapshot merge targetMike Snitzer2009-12-101-3/+18
| | | | | | | | | | | | | | | | | | | | | | Sets num_flush_requests=2 to support flushing both the origin and cow devices used by the snapshot-merge target. Also, snapshot_ctr() now gets the origin device using FMODE_WRITE if the target is snapshot-merge (which writes to the origin device). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm snapshot: avoid allocating exceptions in mergeMikulas Patocka2009-12-101-1/+56
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The snapshot-merge target should not allocate new exceptions because the intent is to merge all of its exceptions as quickly and safely as possible. This patch introduces the snapshot-merge mapping function and updates __origin_write() so that it doesn't allocate exceptions on any snapshots that are being merged. If a write request to a merging snapshot device is to be dispatched directly to the origin (because the chunk is not remapped or was already merged), snapshot_merge_map() must make exceptions in other snapshots so calls do_origin(). Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm snapshot: rework writing to originMikulas Patocka2009-12-101-106/+49
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To track the completion of exceptions relating to the same location on the device, the current code selects one exception as primary_pe, links the other exceptions to it and uses reference counting to wait until all the reallocations are complete. It is considered too complicated to extend this code to handle the new snapshot-merge target, where sets of non-overlapping chunks would also need to become linked. Instead, a simpler (but less efficient) approach is taken. Bios are linked to one exception. When it completes, bios are simply retried, and if other related exceptions are still outstanding, they'll get queued again to wait for another one. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm snapshot: add merge targetMikulas Patocka2009-12-101-12/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The snapshot-merge target allows a snapshot to be merged back into the snapshot's origin device. One anticipated use of snapshot merging is the rollback of filesystems to back out problematic system upgrades. This patch adds snapshot-merge target management to both dm_snapshot_init() and dm_snapshot_exit(). As an initial place-holder, snapshot-merge is identical to the snapshot target. Documentation is provided. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm exception store: add merge specific methodsMikulas Patocka2009-12-102-2/+129
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add functions that decide how many consecutive chunks of snapshot to merge back into the origin next and to update the metadata afterwards. prepare_merge provides a pointer to the most recent still-to-be-merged chunk and returns how many previous ones are consecutive and can be processed together. commit_merge removes the nr_merged most-recent chunks permanently from the exception store. The number must not exceed that returned by prepare_merge. Introduce NUM_SNAPSHOT_HDR_CHUNKS to show where the snapshot header chunk is accounted for. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm snapshot: create function for chunk_is_tracked waitMike Snitzer2009-12-101-6/+12
| | | | | | | | | | | | | | | | | | | | | | | | Move the __chunk_is_tracked() loop into a separate function as we will also need to call it from the write path in the rare case of conflicting writes to the same chunk. Originally introduced in commit a8d41b59f3f5a7ac19452ef442a7fc1b5fa17366 ("dm snapshot: fix race during exception creation"). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm snapshot: make bio optional in __origin_writeMikulas Patocka2009-12-101-5/+18
| | | | | | | | | | | | | | | | | | | | | | | | To support the merging of snapshots back into their origin we need to trigger exceptions in other snapshots not being merged without any incoming bio on the origin device. The bio parameter to __origin_write() becomes optional and the sector needs supplying separately. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm mpath: reject messages when device is suspendedKiyoshi Ueda2009-12-101-0/+5
| | | | | | | | | | | | | | | | | | | | This patch rejects messages that can generate I/O while the device itself is suspended. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Cc: Mike Anderson <andmike@linux.vnet.ibm.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm: export suspended state to targetsKiyoshi Ueda2009-12-101-0/+11
| | | | | | | | | | | | | | | | | | | | This patch adds the exported dm_suspended() function so that targets can check whether or not they are suspended. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: Mike Anderson <andmike@linux.vnet.ibm.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm: rename dm_suspended to dm_suspended_mdKiyoshi Ueda2009-12-104-11/+16
| | | | | | | | | | | | | | | | | | | | | | This patch renames dm_suspended() to dm_suspended_md() and keeps it internal to dm. No functional change. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: Mike Anderson <andmike@linux.vnet.ibm.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm: swap target postsuspend call and setting suspended flagKiyoshi Ueda2009-12-101-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | This patch moves DMF_SUSPENDED flag set before postsuspend. No one should care about the ordering, because the flag set and the postsuspend are protected by a single lock, md->suspend_lock, and all strict flag-checkers take the lock. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: Mike Anderson <andmike@linux.vnet.ibm.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm crypt: add plain64 ivMilan Broz2009-12-101-0/+18
| | | | | | | | | | | | | | | | | | | | The default plain IV is 32-bit only. This plain64 IV provides a compatible mode for encrypted devices bigger than 4TB. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>