summaryrefslogtreecommitdiffstats
path: root/drivers/md
Commit message (Collapse)AuthorAgeFilesLines
...
| * md: use resync_max_sectors for reshape as well as resync.NeilBrown2012-05-211-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Some resync type operations need to act on the address space of the device, others on the address space of the array. This only affects RAID10, so it sets resync_max_sectors to the array size (it defaults to the device size), and that is currently used for resync only. However reshape of a RAID10 must be done against the array size, not device size, so change code to use resync_max_sectors for both the resync and the reshape cases. This does not affect RAID5 or RAID1, just RAID10. Signed-off-by: NeilBrown <neilb@suse.de>
| * md: teach sync_page_io about new_data_offset.NeilBrown2012-05-211-0/+4
| | | | | | | | | | | | | | | | | | | | | | Some code in raid1 and raid10 use sync_page_io to read/write pages when responding to read errors. As we will shortly support changing data_offset for raid10, this function must understand new_data_offset. So add that understanding. Signed-off-by: NeilBrown <neilb@suse.de>
| * md/raid10: collect some geometry fields into a dedicated structure.NeilBrown2012-05-212-108/+115
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We will shortly be adding reshape support for RAID10 which will require it having 2 concurrent geometries (before and after). To make that easier, collect most geometry fields into 'struct geom' and access them from there. Then we will more easily be able to add a second set of fields. Note that 'copies' is not in this struct and so cannot be changed. There is little need to change this number and doing so is a lot more difficult as it requires reallocating more things. So leave it out for now. Signed-off-by: NeilBrown <neilb@suse.de>
| * md/raid5: allow for change in data_offset while managing a reshape.NeilBrown2012-05-212-33/+82
| | | | | | | | | | | | | | | | | | | | | | The important issue here is incorporating the different in data_offset into calculations concerning when we might need to over-write data that is still thought to be valid. To this end we find the minimum offset difference across all devices and add that where appropriate. Signed-off-by: NeilBrown <neilb@suse.de>
| * md/raid5: Use correct data_offset for all IO.NeilBrown2012-05-211-13/+59
| | | | | | | | | | | | | | As there can now be two different data_offsets - an 'old' and a 'new' - we need to carefully choose between them. Signed-off-by: NeilBrown <neilb@suse.de>
| * md: add possibility to change data-offset for devices.NeilBrown2012-05-215-32/+214
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When reshaping we can avoid costly intermediate backup by changing the 'start' address of the array on the device (if there is enough room). So as a first step, allow such a change to be requested through sysfs, and recorded in v1.x metadata. (As we didn't previous check that all 'pad' fields were zero, we need a new FEATURE flag for this. A (belatedly) check that all remaining 'pad' fields are zero to avoid a repeat of this) The new data offset must be requested separately for each device. This allows each to have a different change in the data offset. This is not likely to be used often but as data_offset can be set per-device, new_data_offset should be too. This patch also removes the 'acknowledged' arg to rdev_set_badblocks as it is never used and never will be. At the same time we add a new arg ('in_new') which is currently always zero but will be used more soon. When a reshape finishes we will need to update the data_offset and rdev->sectors. So provide an exported function to do that. Signed-off-by: NeilBrown <neilb@suse.de>
| * md: allow a reshape operation to be reversed.NeilBrown2012-05-213-13/+78
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently a reshape operation always progresses from the start of the array to the end unless the number of devices is being reduced, in which case it progressed in the opposite direction. To reverse a partial reshape which changes the number of devices you can stop the array and re-assemble with the raid-disks numbers reversed and it will undo. However for a reshape that does not change the number of devices it is not possible to reverse the reshape in the middle - you have to wait until it completes. So add a 'reshape_direction' attribute with is either 'forwards' or 'backwards' and can be explicitly set when delta_disks is zero. This will become more important when we allow the data_offset to change in a reshape. Then the explicit statement of what direction is being used will be more useful. This can be enabled in raid5 trivially as it already supports reverse reshape and just needs to use a different trigger to request it. Signed-off-by: NeilBrown <neilb@suse.de>
| * md: using GFP_NOIO to allocate bio for flush requestShaohua Li2012-05-211-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | A flush request is usually issued in transaction commit code path, so using GFP_KERNEL to allocate memory for flush request bio falls into the classic deadlock issue. This is suitable for any -stable kernel to which it applies as it avoids a possible deadlock. Cc: stable@vger.kernel.org Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
* | Merge tag 'dm-3.4-fixes-2' of ↵Linus Torvalds2012-05-181-15/+17
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm Pull a dm fix from Alasdair G Kergon: "A fix to the thin provisioning userspace interface." * tag 'dm-3.4-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm: dm thin: fix table output when pool target disables discard passdown internally
| * | dm thin: fix table output when pool target disables discard passdown internallyMike Snitzer2012-05-191-15/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the thin pool target clears the discard_passdown parameter internally, it incorrectly changes the table line reported to userspace. This breaks dumb string comparisons on these table lines in generic userspace device-mapper library code and leads to tables being reloaded repeatedly when nothing is actually meant to be changing. This patch corrects this by no longer changing the table line when discard passdown was disabled. We can still tell when discard passdown is overridden by looking for the message "Discard unsupported by data device (sdX): Disabling discard passdown." This automatic detection is also moved from the 'load' to the 'resume' so that it is re-evaluated should the properties of underlying devices change. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* | | Merge tag 'md-3.4-fixes' of git://neil.brown.name/mdLinus Torvalds2012-05-181-1/+1
|\ \ \ | | |/ | |/| | | | | | | | | | | | | | | | | | | | | | Pull one more md bugfix from NeilBrown: "Fix bug in recent fix to RAID10. Without this patch, recovery will crash" * tag 'md-3.4-fixes' of git://neil.brown.name/md: md/raid10: fix transcription error in calc_sectors conversion.
| * | md/raid10: fix transcription error in calc_sectors conversion.NeilBrown2012-05-191-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The old code was sector_div(stride, fc); the new code was sector_dir(size, conf->near_copies); 'size' is right (the stride various wasn't really needed), but 'fc' means 'far_copies', and that is an important difference. Signed-off-by: NeilBrown <neilb@suse.de>
* | | Merge tag 'md-3.4-fixes' of git://neil.brown.name/mdLinus Torvalds2012-05-172-24/+34
|\| | | |/ |/| | | | | | | | | | | | | | | | | | | | | Pull two md fixes from NeilBrown: "One fixes a bug in the new raid10 resize code so is relevant to 3.4 only. The other fixes a bug in the use of md by dm-raid, so is relevant to any kernel with dm-raid support" * tag 'md-3.4-fixes' of git://neil.brown.name/md: MD: Add del_timer_sync to mddev_suspend (fix nasty panic) md/raid10: set dev_sectors properly when resizing devices in array.
| * MD: Add del_timer_sync to mddev_suspend (fix nasty panic)Jonathan Brassow2012-05-171-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | Use del_timer_sync to remove timer before mddev_suspend finishes. We don't want a timer going off after an mddev_suspend is called. This is especially true with device-mapper, since it can call the destructor function immediately following a suspend. This results in the removal (kfree) of the structures upon which the timer depends - resulting in a very ugly panic. Therefore, we add a del_timer_sync to mddev_suspend to prevent this. Cc: stable@vger.kernel.org Signed-off-by: NeilBrown <neilb@suse.de>
| * md/raid10: set dev_sectors properly when resizing devices in array.NeilBrown2012-05-171-24/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | raid10 stores dev_sectors in 'conf' separately from the one in 'mddev' because it can have a very significant effect on block addressing and so need to be updated carefully. However raid10_resize isn't updating it at all! To update it correctly, we need to make sure it is a proper multiple of the chunksize taking various details of the layout in to account. This calculation is currently done in setup_conf. So split it out from there and call it from raid10_resize as well. Then set conf->dev_sectors properly. Signed-off-by: NeilBrown <neilb@suse.de>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2012-05-121-1/+1
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking fixes from David S. Miller: 1) Since we do RCU lookups on ipv4 FIB entries, we have to test if the entry is dead before returning it to our caller. 2) openvswitch locking and packet validation fixes from Ansis Atteka, Jesse Gross, and Pravin B Shelar. 3) Fix PM resume locking in IGB driver, from Benjamin Poirier. 4) Fix VLAN header handling in vhost-net and macvtap, from Basil Gor. 5) Revert a bogus network namespace isolation change that was causing regressions on S390 networking devices. 6) If bonding decides to process and handle a LACPDU frame, we shouldn't bump the rx_dropped counter. From Jiri Bohac. 7) Fix mis-calculation of available TX space in r8169 driver when doing TSO, which can lead to crashes and/or hung device. From Julien Ducourthial. 8) SCTP does not validate cached routes properly in all cases, from Nicolas Dichtel. 9) Link status interrupt needs to be handled in ks8851 driver, from Stephen Boyd. 10) Use capable(), not cap_raised(), in connector/userns netlink code. From Eric W. Biederman via Andrew Morton. 11) Fix pktgen OOPS on module unload, from Eric Dumazet. 12) iwlwifi under-estimates SKB truesizes, also from Eric Dumazet. 13) Cure division by zero in SFC driver, from Ben Hutchings. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (26 commits) ks8851: Update link status during link change interrupt macvtap: restore vlan header on user read vhost-net: fix handle_rx buffer size bonding: don't increase rx_dropped after processing LACPDUs connector/userns: replace netlink uses of cap_raised() with capable() sctp: check cached dst before using it pktgen: fix crash at module unload Revert "net: maintain namespace isolation between vlan and real device" ehea: fix losing of NEQ events when one event occurred early igb: fix rtnl race in PM resume path ipv4: Do not use dead fib_info entries. r8169: fix unsigned int wraparound with TSO sfc: Fix division by zero when using one RX channel and no SR-IOV openvswitch: Validation of IPv6 set port action uses IPv4 header net: compare_ether_addr[_64bits]() has no ordering cdc_ether: Ignore bogus union descriptor for RNDIS devices bnx2x: bug fix when loading after SAN boot e1000: Silence sparse warnings by correcting type igb, ixgbe: netdev_tx_reset_queue incorrectly called from tx init path openvswitch: Release rtnl_lock if ovs_vport_cmd_build_info() failed. ...
| * | connector/userns: replace netlink uses of cap_raised() with capable()Eric W. Biederman2012-05-101-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In 2009 Philip Reiser notied that a few users of netlink connector interface needed a capability check and added the idiom cap_raised(nsp->eff_cap, CAP_SYS_ADMIN) to a few of them, on the premise that netlink was asynchronous. In 2011 Patrick McHardy noticed we were being silly because netlink is synchronous and removed eff_cap from the netlink_skb_params and changed the idiom to cap_raised(current_cap(), CAP_SYS_ADMIN). Looking at those spots with a fresh eye we should be calling capable(CAP_SYS_ADMIN). The only reason I can see for not calling capable is that it once appeared we were not in the same task as the caller which would have made calling capable() impossible. In the initial user_namespace the only difference between between cap_raised(current_cap(), CAP_SYS_ADMIN) and capable(CAP_SYS_ADMIN) are a few sanity checks and the fact that capable(CAP_SYS_ADMIN) sets PF_SUPERPRIV if we use the capability. Since we are going to be using root privilege setting PF_SUPERPRIV seems the right thing to do. The motivation for this that patch is that in a child user namespace cap_raised(current_cap(),...) tests your capabilities with respect to that child user namespace not capabilities in the initial user namespace and thus will allow processes that should be unprivielged to use the kernel services that are only protected with cap_raised(current_cap(),..). To fix possible user_namespace issues and to just clean up the code replace cap_raised(current_cap(), CAP_SYS_ADMIN) with capable(CAP_SYS_ADMIN). Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Patrick McHardy <kaber@trash.net> Cc: Philipp Reisner <philipp.reisner@linbit.com> Acked-by: Serge E. Hallyn <serge.hallyn@canonical.com> Acked-by: Andrew G. Morgan <morgan@kernel.org> Cc: Vasiliy Kulikov <segoon@openwall.com> Cc: David Howells <dhowells@redhat.com> Reviewed-by: James Morris <james.l.morris@oracle.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | dm mpath: check if scsi_dh module already loaded before trying to loadMike Snitzer2012-05-121-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the requested scsi_dh module is already loaded then skip request_module(). Multipath table loads can hang in an unnecessary __request_module. Reported-by: Ben Marzinski <bmarzins@redhat.com> Cc: stable@kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* | | dm thin: correct module descriptionAlasdair G Kergon2012-05-121-1/+1
| | | | | | | | | | | | | | | | | | | | | Remove duplicate copy of string "device-mapper" (DM_NAME) from MODULE_DESCRIPTION. Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* | | dm thin: fix unprotected use of prepared_discards listMike Snitzer2012-05-121-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix two places in commit 104655fd4dce ("dm thin: support discards") that didn't use pool->lock to protect against concurrent changes to the prepared_discards list. Without this fix, thin_endio() can race with process_discard(), leading to concurrent list_add()s that result in the processes locking up with an error like the following: WARNING: at lib/list_debug.c:32 __list_add+0x8f/0xa0() ... list_add corruption. next->prev should be prev (ffff880323b96140), but was ffff8801d2c48440. (next=ffff8801d2c485c0). ... Pid: 17205, comm: kworker/u:1 Tainted: G W O 3.4.0-rc3.snitm+ #1 Call Trace: [<ffffffff8103ca1f>] warn_slowpath_common+0x7f/0xc0 [<ffffffff8103cb16>] warn_slowpath_fmt+0x46/0x50 [<ffffffffa04f6ce6>] ? bio_detain+0xc6/0x210 [dm_thin_pool] [<ffffffff8124ff3f>] __list_add+0x8f/0xa0 [<ffffffffa04f70d2>] process_discard+0x2a2/0x2d0 [dm_thin_pool] [<ffffffffa04f6a78>] ? remap_and_issue+0x38/0x50 [dm_thin_pool] [<ffffffffa04f7c3b>] process_deferred_bios+0x7b/0x230 [dm_thin_pool] [<ffffffffa04f7df0>] ? process_deferred_bios+0x230/0x230 [dm_thin_pool] [<ffffffffa04f7e42>] do_worker+0x52/0x60 [dm_thin_pool] [<ffffffff81056fa9>] process_one_work+0x129/0x450 [<ffffffff81059b9c>] worker_thread+0x17c/0x3c0 [<ffffffff81059a20>] ? manage_workers+0x120/0x120 [<ffffffff8105eabe>] kthread+0x9e/0xb0 [<ffffffff814ceda4>] kernel_thread_helper+0x4/0x10 [<ffffffff8105ea20>] ? kthread_freezable_should_stop+0x70/0x70 [<ffffffff814ceda0>] ? gs_change+0x13/0x13 ---[ end trace 7e0a523bc5e52692 ]--- Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* | | dm thin: reinstate missing mempool_free in cell_release_singletonMike Snitzer2012-05-121-3/+6
| |/ |/| | | | | | | | | | | | | | | | | | | | | | | Fix a significant memory leak inadvertently introduced during simplification of cell_release_singleton() in commit 6f94a4c45a6f744383f9f695dde019998db3df55 ("dm thin: fix stacked bi_next usage"). A cell's hlist_del() must be accompanied by a mempool_free(). Use __cell_release() to do this, like before. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* | md/bitmap: fix calculation of 'chunks' - missing shift.NeilBrown2012-05-042-5/+1
|/ | | | | | | | | | | | | | | | commit 61a0d80c "md/bitmap: discard CHUNK_BLOCK_SHIFT macro" replaced CHUNK_BLOCK_RATIO() by the same text that was replacing CHUNK_BLOCK_SHIFT() - which is clearly wrong. The result is that 'chunks' is often too small by 1, which can sometimes result in a crash (not sure how). So use the correct replacement, and get rid of CHUNK_BLOCK_RATIO which is no longe used. Reported-by: Karl Newman <siliconfiend@gmail.com> Tested-by: Karl Newman <siliconfiend@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
* md: fix possible corruption of array metadata on shutdown.NeilBrown2012-04-241-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | commit c744a65c1e2d59acc54333ce8 md: don't set md arrays to readonly on shutdown. removed the possibility of a 'BUG' when data is written to an array that has just been switched to read-only, but also introduced the possibility that the array metadata could be corrupted. If, when md_notify_reboot gets the mddev lock, the array is in a state where it is assembled but hasn't been started (as can happen if the personality module is not available, or in other unusual situations), then incorrect metadata will be written out making it impossible to re-assemble the array. So only call __md_stop_writes() if the array has actually been activated. This patch is needed for any stable kernel which has had the above commit applied. Cc: stable@vger.kernel.org Reported-by: Christoph Nelles <evilazrael@evilazrael.de> Signed-off-by: NeilBrown <neilb@suse.de>
* md: don't call ->add_disk unless there is good reason.NeilBrown2012-04-241-2/+2
| | | | | | | | | | | | | | | | | | | | | | Commit 7bfec5f35c68121e7b18 md/raid5: If there is a spare and a want_replacement device, start replacement. cause md_check_recovery to call ->add_disk much more often. Instead of only when the array is degraded, it is now called whenever md_check_recovery finds anything useful to do, which includes updating the metadata for clean<->dirty transition. This causes unnecessary work, and causes info messages from ->add_disk to be reported much too often. So refine md_check_recovery to only do any actual recovery checking (including ->add_disk) if MD_RECOVERY_NEEDED is set. This fix is suitable for 3.3.y: Cc: stable@vger.kernel.org Reported-by: Jan Ceuleers <jan.ceuleers@computer.org> Signed-off-by: NeilBrown <neilb@suse.de>
* DM RAID: Use safe version of rdev_for_eachJonathan Brassow2012-04-241-2/+2
| | | | | | | | | | Fix segfault caused by using rdev_for_each instead of rdev_for_each_safe Commit dafb20fa34320a472deb7442f25a0c086e0feb33 mistakenly replaced a safe iterator with an unsafe one when making some macro changes. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
* md/bitmap: prevent bitmap_daemon_work running while initialising bitmapNeilBrown2012-04-121-0/+2
| | | | | | | | | | | | | | | | If a bitmap is added while the array is active, it is possible for bitmap_daemon_work to run while the bitmap is being initialised. This is particularly a problem if bitmap_daemon_work sees bitmap->filemap as non-NULL before it has been filled in properly. So hold bitmap_info.mutex while filling in ->filemap to prevent problems. This patch is suitable for any -stable kernel, though it might not apply cleanly before about 3.1. Cc: stable@vger.kernel.org Signed-off-by: NeilBrown <neilb@suse.de>
* md/raid1,raid10: Fix calculation of 'vcnt' when processing error recovery.majianpeng2012-04-122-3/+4
| | | | | | | | | | | | If r1bio->sectors % 8 != 0,then the memcmp and a later memcpy will omit the last bio_vec. This is suitable for any stable kernel since 3.1 when bad-block management was introduced. Cc: stable@vger.kernel.org Signed-off-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
* MD: Bitmap version cleanup.Andrei Warkentin2012-04-121-3/+0
| | | | | | | | | | | | bitmap_new_disk_sb() would still create V3 bitmap superblock with host-endian layout. Perhaps I'm confused, but shouldn't bitmap_new_disk_sb() be creating a V4 bitmap superblock instead, that is portable, as per comment in bitmap.h? Signed-off-by: Andrei Warkentin <andrey.warkentin@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
* md/raid1,raid10: don't compare excess byte during consistency check.NeilBrown2012-04-032-2/+2
| | | | | | | | | | | | | | | | When comparing two pages read from different legs of a mirror, only compare the bytes that were read, not the whole page. In most cases we read a whole page, but in some cases with bad blocks or odd sizes devices we might read fewer than that. This bug has been present "forever" but at worst it might cause a report of two many mismatches and generate a little bit extra resync IO, so there is no need to back-port to -stable kernels. Reported-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
* md/raid5: Fix a bug about judging if the operation is syncing or replacingmajianpeng2012-04-031-1/+3
| | | | | | | | | | | | | | | | | | | | | | When create a raid5 using assume-clean and echo check or repair to sync_action.Then component disks did not operated IO but the raid check/resync faster than normal. Because the judgement in function analyse_stripe(): if (do_recovery || sh->sector >= conf->mddev->recovery_cp) s->syncing = 1; else s->replacing = 1; When check or repair,the recovery_cp == MaxSectore,so syncing equal zero not one. This bug was introduced by commit 9a3e1101b827 md/raid5: detect and handle replacements during recovery. so this patch is suitable for 3.3-stable. Cc: stable@vger.kernel.org Signed-off-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
* md/raid1:Remove unnecessary rcu_dereference(conf->mirrors[i].rdev).majianpeng2012-04-031-2/+1
| | | | | | | | Because rde->nr_pending > 0,so can not remove this disk. And in any case, we aren't holding rcu_read_lock() Signed-off-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
* md: Avoid OOPS when reshaping raid1 to raid0Jes Sorensen2012-04-031-1/+17
| | | | | | | | | raid1 arrays do not have the notion of chunk size. Calculate the largest chunk sector size we can use to avoid a divide by zero OOPS when aligning the size of the new array to the chunk size. Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
* md/raid5: fix handling of bad blocks during recovery.NeilBrown2012-04-031-26/+29
| | | | | | | | | | | | | 1/ We can only treat a known-bad-block like a read-error if we have the data that belongs in that block. So fix that test. 2/ If we cannot recovery a stripe due to insufficient data, don't tell "md_done_sync" that the sync failed unless we really did fail something. If we successfully record bad blocks, that is success. Reported-by: "majianpeng" <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
* md/raid1: If md_integrity_register() failed,run() must free the memmajianpeng2012-04-021-1/+7
| | | | | Signed-off-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
* md/raid0: If md_integrity_register() fails, raid0_run() must free the mem.majianpeng2012-04-021-1/+8
| | | | | Signed-off-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
* md/linear: If md_integrity_register() fails, linear_run() must free the mem.majianpeng2012-04-021-1/+8
| | | | | Signed-off-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
* dm: add verity targetMikulas Patocka2012-03-283-0/+934
| | | | | | | | | | | | | | | | | | | | This device-mapper target creates a read-only device that transparently validates the data on one underlying device against a pre-generated tree of cryptographic checksums stored on a second device. Two checksum device formats are supported: version 0 which is already shipping in Chromium OS and version 1 which incorporates some improvements. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mandeep Singh Baines <msb@chromium.org> Signed-off-by: Will Drewry <wad@chromium.org> Signed-off-by: Elly Jones <ellyjones@chromium.org> Cc: Milan Broz <mbroz@redhat.com> Cc: Olof Johansson <olofj@chromium.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm bufio: prefetchMikulas Patocka2012-03-282-26/+90
| | | | | | | | | This patch introduces a new function dm_bufio_prefetch. It prefetches the specified range of blocks into dm-bufio cache without waiting for i/o completion. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm thin: add pool target flags to control discardJoe Thornber2012-03-281-27/+108
| | | | | | | | | | | | | Add dm thin target arguments to control discard support. ignore_discard: Disables discard support no_discard_passdown: Don't pass discards down to the underlying data device, but just remove the mapping within the thin provisioning target. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm thin: support discardsJoe Thornber2012-03-281-14/+158
| | | | | | | | | | | | | | | | | | | | | | Support discards in the thin target. On discard the corresponding mapping(s) are removed from the thin device. If the associated block(s) are no longer shared the discard is passed to the underlying device. All bios other than discards now have an associated deferred_entry that is saved to the 'all_io_entry' in endio_hook. When non-discard IO completes and associated mappings are quiesced any discards that were deferred, via ds_add_work() in process_discard(), will be queued for processing by the worker thread. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> drivers/md/dm-thin.c | 173 ++++++++++++++++++++++++++++++++++++++++++++++---- drivers/md/dm-thin.c | 172 ++++++++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 158 insertions(+), 14 deletions(-)
* dm thin: prepare to support discardJoe Thornber2012-03-281-53/+72
| | | | | | | | | | | | | | | | This patch contains the ground work needed for dm-thin to support discard. - Adds endio function that replaces shared_read_endio. - Introduce an explicit 'quiesced' flag into the new_mapping structure. Before, this was implicitly indicated by m->list being empty. - The map_info->ptr remains constant for the duration of a bio's trip through the thin target. Make it easier to reason about it. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm thin: use dm_target_offsetAlasdair G Kergon2012-03-281-1/+1
| | | | | | | Use dm_target_offset wrapper instead of referencing the awkward ti->begin explicitly. Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm thin: support read only external snapshot originsJoe Thornber2012-03-281-14/+71
| | | | | | | | | | | | | | | | | Support the use of an external _read only_ device as an origin for a thin device. Any read to an unprovisioned area of the thin device will be passed through to the origin. Writes trigger allocation of new blocks as usual. One possible use case for this would be VM hosts that want to run guests on thinly-provisioned volumes but have the base image on another device (possibly shared between many VMs). Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm thin: relax hard limit on the maximum size of a metadata deviceMike Snitzer2012-03-283-15/+20
| | | | | | | | | | | | | | | | | | The thin metadata format can only make use of a device that is <= THIN_METADATA_MAX_SECTORS (currently 15.9375 GB). Therefore, there is no practical benefit to using a larger device. However, it may be that other factors impose a certain granularity for the space that is allocated to a device (E.g. lvm2 can impose a coarse granularity through the use of large, >= 1 GB, physical extents). Rather than reject a larger metadata device, during thin-pool device construction, switch to allowing it but issue a warning if a device larger than THIN_METADATA_MAX_SECTORS_WARNING (16 GB) is provided. Any space over 15.9375 GB will not be used. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm persistent data: remove space map ref_count entries if redundantJoe Thornber2012-03-281-3/+0
| | | | | | | | | | | | | | | | | Save space by removing entries from the space map ref_count tree if they're no longer needed. Ref counts are stored in two places: a bitmap if the ref_count is below 3, or a btree of uint32_t if 3 or above. When a ref_count that was above 3 drops below we can remove it from the tree and save some metadata space. This removal was commented out before because I was unsure why this was causing under-populated btree nodes. Earlier patches have fixed this issue. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm thin: commit outstanding data every secondJoe Thornber2012-03-281-2/+26
| | | | | | | | | | | | | | | | | Commit unwritten data every second to prevent too much building up. Released blocks don't become available until after the next commit (for crash resilience). Prior to this patch commits were only triggered by a message to the target or a REQ_{FLUSH,FUA} bio. This allowed far too big a position to build up. The interval is hard-coded to 1 second. This is a sensible setting. I'm not making this user configurable, since there isn't much to be gained by tweaking this - and a lot lost by setting it far too high. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm: reject trailing characters in sccanf inputMikulas Patocka2012-03-2813-25/+44
| | | | | | | | | | | | | | | | | | | | | | | | | | Device mapper uses sscanf to convert arguments to numbers. The problem is that the way we use it ignores additional unmatched characters in the scanned string. For example, this `if (sscanf(string, "%d", &number) == 1)' will match a number, but also it will match number with some garbage appended, like "123abc". As a result, device mapper accepts garbage after some numbers. For example the command `dmsetup create vg1-new --table "0 16384 linear 254:1bla 34816bla"' will pass without an error. This patch fixes all sscanf uses in device mapper. It appends "%c" with a pointer to a dummy character variable to every sscanf statement. The construct `if (sscanf(string, "%d%c", &number, &dummy) == 1)' succeeds only if string is a null-terminated number (optionally preceded by some whitespace characters). If there is some character appended after the number, sscanf matches "%c", writes the character to the dummy variable and returns 2. We check the return value for 1 and consequently reject numbers with some garbage appended. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm raid: handle failed devices during start upJonathan E Brassow2012-03-281-2/+51
| | | | | | | | | | | | | | | The dm-raid code currently fails to create a RAID array if any of the superblocks cannot be read. This was an oversight as there is already code to handle this case if the values ('- -') were provided for the failed array position. With this patch, if a superblock cannot be read, the array position's fields are initialized as though '- -' was set in the table. That is, the device is failed and the position should not be used, but if there is sufficient redundancy, the array should still be activated. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm thin metadata: pass correct space map to dm_sm_root_sizeJoe Thornber2012-03-281-1/+1
| | | | | | | | | | | | | | Fix a harmless typo. The root is a chunk of data that gets written to the superblock. This data is used to recreate the space map when opening a metadata area. We have two space maps; one tracking space on the metadata device and one of the data device. Both of these use the same format for their root, so this typo was harmless. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm persistent data: remove redundant value_size arg from value_ptrJoe Thornber2012-03-283-33/+29
| | | | | | | | | | | | | Now that the value_size is held within every node of the btrees we can remove this argument from value_ptr(). For the last few months a BUG_ON has been checking this argument is the same as that held in the node. No issues were reported. So this is a safe change. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>