Fix inconsistent state between WAL and saved Snapshot #3584

zghh · 2022-08-11T03:15:38Z

Type of change

Bug fix

Description

This is a bug inherited from etcd.

When an orderer has not participated in consensus for some time and then crashes while writing the latest snapshot to disk, it panics after the restart.

The bug has been fixed in etcd, but not in Fabric.

This PR ports the fix to Fabric.

yacovm · 2022-08-11T09:40:32Z

orderer/consensus/etcdraft/storage_test.go

@@ -80,7 +80,7 @@ func TestOpenWAL(t *testing.T) {
 		for i := 0; i < 10; i++ {
 			store.Store(


I don't understand how this test demonstrates the problem you describe.

Can you reproduce the problem in a unit test?

zghh · 2022-08-11T12:10:47Z

orderer/consensus/etcdraft/storage_test.go

@@ -395,3 +395,82 @@ func TestApplyOutOfDateSnapshot(t *testing.T) {
 		assertFileCount(t, 12, 1)
 	})
 }
+
+func TestAbortWhenWritingSnapshot(t *testing.T) {


I added a unit test to reproduce the problem.

Thanks but I meant a unit test that uses an instance of a Fabric Raft chain, in orderer/consensus/etcdraft/chain.go, not a unit test that uses pure etcd.io/raft packages.

We need to be sure that a Fabric Raft chain instance can encounter the problem you describe, so that:

We know we really have a problem. It might be that etcd has this problem but Fabric sidesteps it via some mechanism.

If the problem reappears later due to a code change, a unit test will notify us.

The problem can be reproduced with the following steps:

Orderer A has not participated in consensus for a while.

Other orderers generate new blocks in consensus.

Other orderers generate a new snapshot.

Orderer A returns to normal and receives the new snapshot from the other orderers.

Orderer A persists the new snapshot but crashes before calling rs.saveSnap(snapshot).

Orderer A restarts.
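
The steps above can be sketched as a small simulation. This is only a toy model, not Fabric or etcd code: the `disk` type, `persist` and `restartOK` are hypothetical names, the indexes 3 and 15 are borrowed from the panic log later in this thread, and the recovery rule (a snapshot newer than the persisted commit index is ignored on restart) is an assumption.

```go
package main

import "fmt"

// disk is a hypothetical model of what an orderer has persisted.
type disk struct {
	commit    uint64   // Commit of the last HardState written to the WAL
	lastEntry uint64   // index of the last log entry in the WAL
	snapshots []uint64 // indexes of fully persisted snapshots
}

// persist applies a received snapshot in one of two write orders and
// optionally stops ("crashes") between the two writes.
func persist(d disk, snapIndex uint64, snapshotFirst, crash bool) disk {
	writeHardState := func() { d.commit = snapIndex }
	writeSnapshot := func() { d.snapshots = append(d.snapshots, snapIndex) }

	first, second := writeHardState, writeSnapshot
	if snapshotFirst {
		first, second = writeSnapshot, writeHardState
	}
	first()
	if !crash {
		second()
	}
	return d
}

// restartOK models recovery. ASSUMPTION: a snapshot newer than the
// persisted commit index is ignored when the node restarts.
func restartOK(d disk) bool {
	var snap uint64
	for _, s := range d.snapshots {
		if s <= d.commit && s > snap {
			snap = s
		}
	}
	last := d.lastEntry
	if snap > last {
		last = snap
	}
	// The commit index must be covered by the snapshot plus the log.
	return d.commit >= snap && d.commit <= last
}

func main() {
	lagging := disk{commit: 3, lastEntry: 3} // Orderer A before catching up
	fmt.Println("hard state first, crash:", restartOK(persist(lagging, 15, false, true)))
	fmt.Println("snapshot first, crash:  ", restartOK(persist(lagging, 15, true, true)))
	fmt.Println("snapshot first, no crash:", restartOK(persist(lagging, 15, true, false)))
}
```

In this model only the "hard state first, then crash" case leaves a commit index that the recovered log cannot cover.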

yacovm · 2022-08-11T16:09:07Z

@Param-S @jiangyaoguo what is your opinion on this?

zghh · 2022-08-16T13:41:44Z

What's the progress for this PR?

yacovm · 2022-08-16T13:45:55Z

What's the progress for this PR?

@Param-S and @tock-ibm are looking at it, please stand by :)

Param-S · 2022-08-18T11:52:12Z

orderer/consensus/etcdraft/storage.go

-	if err := rs.wal.Save(hardstate, entries); err != nil {
-		return err
-	}
-
 	if !raft.IsEmptySnap(snapshot) {
 		if err := rs.saveSnap(snapshot); err != nil {


I think we need to swap the order of writing the snapshot entry in the WAL and the snapshot file (this needs a change in saveSnap), as is done in etcdserver: https://github.com/etcd-io/etcd/blob/6c2f5dc78af6b6970d48cecaac515c58a91efca8/server/storage/storage.go#L66
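
A minimal sketch of the suggested ordering, file first and WAL snapshot entry second. The `recorder` type and its methods are hypothetical in-memory stand-ins, not the Fabric or etcd APIs; the point is only what remains on disk if the process stops between the two writes.

```go
package main

import (
	"errors"
	"fmt"
)

// recorder is a hypothetical stand-in for both the snapshot directory and
// the WAL; it records which writes completed, in order.
type recorder struct {
	writes  []string
	failWAL bool // simulate a crash or I/O error before the WAL write lands
}

func (r *recorder) saveSnapFile(index uint64) error {
	r.writes = append(r.writes, fmt.Sprintf("snapfile@%d", index))
	return nil
}

func (r *recorder) saveWALSnapshot(index uint64) error {
	if r.failWAL {
		return errors.New("simulated failure before the WAL snapshot entry is written")
	}
	r.writes = append(r.writes, fmt.Sprintf("walsnap@%d", index))
	return nil
}

// saveSnapFileFirst writes the snapshot file before the WAL snapshot entry.
// A failure in between can leave an orphaned snapshot file, but never a WAL
// snapshot entry that refers to a file that does not exist.
func saveSnapFileFirst(r *recorder, index uint64) error {
	if err := r.saveSnapFile(index); err != nil {
		return fmt.Errorf("failed to save snapshot to disk: %w", err)
	}
	if err := r.saveWALSnapshot(index); err != nil {
		return fmt.Errorf("failed to save snapshot to WAL: %w", err)
	}
	return nil
}

func main() {
	ok := &recorder{}
	fmt.Println(saveSnapFileFirst(ok, 15), ok.writes)

	crashed := &recorder{failWAL: true}
	fmt.Println(saveSnapFileFirst(crashed, 15), crashed.writes)
}
```

With this order the worst case is an unused snapshot file, which recovery can skip, rather than a WAL record with no file behind it.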

Param-S · 2022-08-18T11:58:48Z

I was able to recreate the issue by terminating the orderer process just after saving the walsnap entry and before saving the snap file.

func (rs *RaftStorage) saveSnap(snap raftpb.Snapshot) error {
	// ... (construction of walsnap elided)

	if err := rs.wal.SaveSnapshot(walsnap); err != nil {
		return errors.Errorf("failed to save snapshot to WAL: %s", err)
	}

	// terminate the process here

	if err := rs.snap.SaveSnap(snap); err != nil {
		return errors.Errorf("failed to save snapshot to disk: %s", err)
	}

	// ... (rest of the function elided)
}

2022-08-18 04:04:19.431 PDT 031a PANI [orderer.consensus.etcdraft] loadState -> 5 state.commit 15 is out of range [0, 3] channel=test-system-channel-name node=5
> [unrecovered-panic] runtime.fatalpanic() /usr/local/go/src/runtime/panic.go:1065 (hits goroutine(1):1 total:1) (PC: 0x441b00)
Warning: debugging optimized function
	runtime.curg._panic.arg: interface {}(string) "5 state.commit 15 is out of range [0, 3]"
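
For readers decoding the panic message: as far as I can tell it comes from a range check in the raft library that requires the persisted commit index to lie between the log's committed index and its last index. The sketch below only mirrors the shape of that check; `checkCommit` and its parameters are illustrative names, not the library's API.

```go
package main

import "fmt"

// checkCommit mirrors the shape of the range check behind the panic above:
// the commit index from the persisted HardState must lie within
// [committed, lastIndex] of the log rebuilt from snapshot and WAL.
// The node ID is formatted in hex, as in the log line.
func checkCommit(nodeID, commit, committed, lastIndex uint64) error {
	if commit < committed || commit > lastIndex {
		return fmt.Errorf("%x state.commit %d is out of range [%d, %d]",
			nodeID, commit, committed, lastIndex)
	}
	return nil
}

func main() {
	// Values taken from the log above: node 5, commit 15, log range [0, 3].
	fmt.Println(checkCommit(5, 15, 0, 3))
	// A commit index that the log covers passes the check.
	fmt.Println(checkCommit(5, 3, 0, 3))
}
```

Read this way, the log says the WAL holds a hard state that committed index 15, while the snapshot and entries that could be recovered only reach index 3.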

I will continue to go through the PR changes more and update here.

zghh · 2022-09-01T14:12:41Z

What's the progress? @Param-S

Param-S · 2022-09-02T06:48:20Z

I could not spend time on this last week. I will work on it over the next couple of days and confirm.

zghh · 2022-10-17T14:40:22Z

What's the progress for this PR?

yacovm · 2022-10-24T21:09:18Z

What's the progress for this PR?

I'm sorry, but we're just too busy to review it.
We will get there eventually but unfortunately I cannot commit to a deadline.

denyeart · 2022-10-28T02:59:49Z

@Mergifyio rebase

mergify · 2022-10-28T03:00:32Z

rebase

✅ Branch has been successfully rebased

denyeart · 2022-10-28T03:30:32Z

Trying to get checks re-triggered... let me try to Close and Re-open.

zghh · 2023-02-15T09:55:31Z

What's the progress for this PR? @Param-S

denyeart · 2023-04-10T19:35:15Z

@zghh @Param-S

This one has been stale for some time, what is the status of it?

A few questions that may help to clarify the severity:

Why was this opened against release-2.2 instead of main branch? Is the problem resolved on main branch and release-2.5 already given the upgrade of etcdraft in those branches?

What is the overall impact after the problem occurs? The Description says a "panic error occurs after the restart". Will the panic occur after every subsequent restart? Is there any resolution of the problem, or must the orderer node be abandoned and a new orderer node created to replace it?

Has this problem been observed in practice?

zghh requested a review from a team as a code owner August 11, 2022 03:15

zghh force-pushed the release-2.2 branch from d8be2fc to 4a347c4 Compare August 11, 2022 03:19

yacovm reviewed Aug 11, 2022

zghh force-pushed the release-2.2 branch from 3bafa7b to a24fd77 Compare August 11, 2022 12:05

zghh requested a review from yacovm August 11, 2022 12:21

Param-S reviewed Aug 18, 2022

zghh added 2 commits October 28, 2022 03:00

Fix inconsistent state between WAL and saved Snapshot

909d0fe

Signed-off-by: zghh <1069308575@qq.com>

Fix inconsistent state between WAL and saved Snapshot, and add the unit test to reproduce the problem.

82a897e

Signed-off-by: zghh <1069308575@qq.com>

C0rWin force-pushed the release-2.2 branch from a24fd77 to 82a897e Compare October 28, 2022 03:00

denyeart closed this Oct 28, 2022

denyeart reopened this Oct 28, 2022
