Fix for bug 596453: multiple mirror image failures cause lvm repair...

The lvm repair issues I believe are the superficial symptoms of this bug - there are worse issues that are not as clearly seen. From my inline comments: * If the mirror was successfully recovered, we want to always * force every machine to write to all devices - otherwise, * corruption will occur. Here's how: * Node1 suffers a failure and marks a region out-of-sync * Node2 attempts a write, gets by is_remote_recovering, * and queries the sync status of the region - finding * it out-of-sync. * Node2 thinks the write should be a nosync write, but it * hasn't suffered the drive failure that Node1 has yet. * It then issues a generic_make_request directly to * the primary image only - which is exactly the device * that has suffered the failure. * Node2 suffers a lost write - which completely bypasses the * mirror layer because it had gone through generic_m_r. * The file system will likely explode at this point due to * I/O errors. If it wasn't the primary that failed, it is * easily possible in this case to issue writes to just one * of the remaining images - also leaving the mirror inconsistent. * * We let in_sync() return 1 in a cluster regardless of what is * in the bitmap once recovery has successfully completed on a * mirror. This ensures the mirroring code will continue to * attempt to write to all mirror images. The worst that can * happen for reads is that additional read attempts may be * taken.
author: Jonathan Earl Brassow <jbrassow@redhat.com> 2010-08-17 23:56:23 +0000
committer: Jonathan Earl Brassow <jbrassow@redhat.com> 2010-08-17 23:56:23 +0000
commit: 53670b18f5f5a8c3f6449bca9a0b996fd4a818b3 (patch)
tree: 503d624ba02a255e75c747407651d213c46c5fdd /daemons/cmirrord/functions.c
parent: c496bc13e294d42830fb75f4b9bbda681cdef4a1 (diff)
download: lvm2-53670b18f5f5a8c3f6449bca9a0b996fd4a818b3.tar.gz
lvm2-53670b18f5f5a8c3f6449bca9a0b996fd4a818b3.tar.xz
lvm2-53670b18f5f5a8c3f6449bca9a0b996fd4a818b3.zip
1 files changed, 42 insertions, 1 deletions
diff --git a/daemons/cmirrord/functions.c b/daemons/cmirrord/functions.c
index 99176259..36910b19 100644
--- a/daemons/cmirrord/functions.c
+++ b/daemons/cmirrord/functions.c
@@ -54,6 +54,7 @@ struct log_c {
 
 	time_t delay; /* limits how fast a resume can happen after suspend */
 	int touched;
+	int in_sync;  /* An in-sync that stays set until suspend/resume */
 	uint32_t region_size;
 	uint32_t region_count;
 	uint64_t sync_count;
@@ -720,6 +721,7 @@ static int clog_resume(struct dm_ulog_request *rq)
 	if (!lc)
 		return -EINVAL;
 
+	lc->in_sync = 0;
 	switch (lc->resume_override) {
 	case 1000:
 		LOG_ERROR("[%s] Additional resume issued before suspend",
@@ -963,6 +965,42 @@ static int clog_in_sync(struct dm_ulog_request *rq)
 		return -EINVAL;
 
 	*rtn = log_test_bit(lc->sync_bits, region);
+
+	/*
+	 * If the mirror was successfully recovered, we want to always
+	 * force every machine to write to all devices - otherwise,
+	 * corruption will occur.  Here's how:
+	 *    Node1 suffers a failure and marks a region out-of-sync
+	 *    Node2 attempts a write, gets by is_remote_recovering,
+   	 *          and queries the sync status of the region - finding
+	 *	    it out-of-sync.
+	 *    Node2 thinks the write should be a nosync write, but it
+	 *          hasn't suffered the drive failure that Node1 has yet.
+	 *          It then issues a generic_make_request directly to
+	 *          the primary image only - which is exactly the device
+	 *          that has suffered the failure.
+	 *    Node2 suffers a lost write - which completely bypasses the
+	 *          mirror layer because it had gone through generic_m_r.
+	 *    The file system will likely explode at this point due to
+	 *    I/O errors.  If it wasn't the primary that failed, it is
+	 *    easily possible in this case to issue writes to just one
+	 *    of the remaining images - also leaving the mirror inconsistent.
+	 *
+	 * We let in_sync() return 1 in a cluster regardless of what is
+	 * in the bitmap once recovery has successfully completed on a
+	 * mirror.  This ensures the mirroring code will continue to
+	 * attempt to write to all mirror images.  The worst that can
+	 * happen for reads is that additional read attempts may be
+	 * taken.
+	 *
+	 * Futher investigation may be required to determine if there are
+	 * similar possible outcomes when the mirror is in the process of
+	 * recovering.  In that case, lc->in_sync would not have been set
+	 * yet.
+	 */
+	if (!*rtn && lc->in_sync)
+		*rtn = 1;
+
 	if (*rtn)
 		LOG_DBG("[%s] Region is in-sync: %llu",
 			SHORT_UUID(lc->uuid), (unsigned long long)region);
@@ -1282,7 +1320,7 @@ static int clog_set_region_sync(struct dm_ulog_request *rq, uint32_t originator)
 				lc->skip_bit_warning = lc->region_count;
 
 			if (pkg->region > (lc->skip_bit_warning + 5)) {
-				LOG_ERROR("*** Region #%llu skipped during recovery ***",
+				LOG_SPRINT(lc, "*** Region #%llu skipped during recovery ***",
 					  (unsigned long long)lc->skip_bit_warning);
 				lc->skip_bit_warning = lc->region_count;
 #ifdef DEBUG
@@ -1324,6 +1362,9 @@ static int clog_set_region_sync(struct dm_ulog_request *rq, uint32_t originator)
 			   "(lc->sync_count > lc->region_count) - this is bad",
 			   rq->seq, SHORT_UUID(lc->uuid), originator);
 
+	if (lc->sync_count == lc->region_count)
+		lc->in_sync = 1;
+
 	rq->data_size = 0;
 	return 0;
 }
author	Jonathan Earl Brassow <jbrassow@redhat.com>	2010-08-17 23:56:23 +0000
committer	Jonathan Earl Brassow <jbrassow@redhat.com>	2010-08-17 23:56:23 +0000
commit	53670b18f5f5a8c3f6449bca9a0b996fd4a818b3 (patch)
tree	503d624ba02a255e75c747407651d213c46c5fdd /daemons/cmirrord/functions.c
parent	c496bc13e294d42830fb75f4b9bbda681cdef4a1 (diff)
download	lvm2-53670b18f5f5a8c3f6449bca9a0b996fd4a818b3.tar.gz lvm2-53670b18f5f5a8c3f6449bca9a0b996fd4a818b3.tar.xz lvm2-53670b18f5f5a8c3f6449bca9a0b996fd4a818b3.zip