
SyncSubmitBlock #4245

Open
ruseinov opened this issue Apr 19, 2024 · 15 comments

Comments

@ruseinov
Contributor

ruseinov commented Apr 19, 2024

The block submission to the swarm is done.

The sanity checks are not, as this requires introducing some extra abstractions and infra.

@alt-ivan

alt-ivan commented May 8, 2024

Hi!

Unfortunately, it seems the chain head still doesn't get updated, and I'm not sure why. I've set tracing to debug (info for the noisiest modules) and I'm attaching a log; hopefully it's at least a little helpful.

forestlog.txt

The commit hash from which Forest was built for testing is 3a1291d4655fb641a85d25dd288cd3f19d5c2c2d .
The testing was performed on a local 2K devnet, with Forest bootstrapped and synced via a Lotus node.

Testing was done with a custom miner that has been verified against Lotus and works on the same devnet when pointed at the Lotus node.

@ruseinov
Contributor Author

ruseinov commented May 8, 2024

@alt-ivan how do I reproduce this?
I’d love to see the steps both for Lotus and Forest to find the culprit. Thank you in advance!

@alt-ivan

alt-ivan commented May 8, 2024

@ruseinov

Hi Roman,

I don't currently have a great method of reproducing to share. I tried quickly jury-rigging lotus-miner to work with Forest as a means of creating a repro, but I haven't had much success yet.

Here's the RPC sequence used by the custom miner:

  1. ChainHead
  2. MinerGetBaseInfo -> Beacon entries & is eligible for mining
  3. WalletSign -> VRFProof for computing election proof
  4. MinerGetBaseInfo -> Sector info for winning PoST
  5. WalletSign -> Signing computed block ticket hash
  6. MpoolSelect -> Messages to include in block
  7. MinerCreateBlock
  8. SyncSubmitBlock -> With result from MinerCreateBlock
  9. ChainHead

The only change I made is to hard-code block_delay_secs from StateGetNetworkParams as that RPC isn't implemented in Forest yet. I used the same value returned by the Lotus node.
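For reference, one step of the sequence above as a raw JSON-RPC envelope. This is only a sketch: the params shape shown for SyncSubmitBlock is a simplified stand-in for the GossipBlock that MinerCreateBlock returns, and no HTTP call is performed.

```rust
// Build the JSON-RPC 2.0 request envelope a miner would POST to the node.
// `params_json` is passed through verbatim.
fn jsonrpc_request(id: u64, method: &str, params_json: &str) -> String {
    format!(
        r#"{{"jsonrpc":"2.0","id":{},"method":"{}","params":{}}}"#,
        id, method, params_json
    )
}

fn main() {
    // Simplified block payload: the real one carries the full header and messages.
    let params = r#"[{"Header":{},"BlsMessages":[],"SecpkMessages":[]}]"#;
    let req = jsonrpc_request(1, "Filecoin.SyncSubmitBlock", params);
    assert!(req.contains(r#""method":"Filecoin.SyncSubmitBlock""#));
    println!("{}", req);
}
```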

I have also manually run each of these against both Lotus/Forest synced to the same tipset/head and all RPCs listed here return identical responses on both Forest and Lotus nodes for the same inputs.

If you would find it helpful, I can share the request/response for each of these as well.

Edit:

  1. MinerCreateBlock
  2. SyncSubmitBlock -> With result from MinerCreateBlock

These should be enough for manual testing if you have Forest/Lotus synced to the same head. I suppose you could at that point take MinerCreateBlock result from lotus-miner and manually feed it into both nodes' SyncSubmitBlock RPCs

@ruseinov
Contributor Author

@alt-ivan
Thanks for lining that up for me, it's helpful. The requests and responses would be nice to see as well if possible.

With the sequence above - how can we be sure that the head is never set?

Bear in mind that the way this works in Forest is that a block is sent to a message queue for processing, so there are no guarantees as to exactly when it actually gets processed, though it should be quite fast. Have you tried verifying that the given block never makes it into the chain?

@ruseinov
Contributor Author

ruseinov commented May 10, 2024

3a1291d

This does not seem to exist anymore; could you try the latest main? Though as long as it includes my latest SyncSubmitBlock change, it should not matter.

Edit: I have double checked the files there and it seems correct.

@alt-ivan

@ruseinov

The requests and responses would be nice to see as well if possible.

Seems like I didn't save them after all 🤦 ... I'll redo the sequence and send the requests/responses. Sorry for the delay.

With the sequence above - how can we be sure that the head is never set?

Calling ChainHead returns the same tipset as before the SyncSubmitBlock call. Also, SyncSubmitBlock can be called over and over again with the same block and it will return a successful result, but the block seems to get "eaten" somewhere.

Bear in mind that the way this works in forest is that a block is sent to the message queue for processing, so there are no guarantees as to exactly when it actually gets processed, though it should be quite fast. Have you tried verifying that the given block never makes it into the chain?

I figured as much when looking at the codebase. I only stepped through the SyncSubmitBlock impl and there I saw that the created tipset doesn't get a CID assigned to it at that point in time. After that I didn't know where to follow the propagation of the block to see if it gets recorded anywhere using GetChainBlock, because I couldn't find its CID if it did.

@ruseinov
Contributor Author

Seems like I didn't save them after all 🤦 ... I'll redo the sequence and send the requests/responses. Sorry for the delay.

Please do!

@ruseinov
Contributor Author

I figured as much when looking at the codebase. I only stepped through the SyncSubmitBlock impl and there I saw that the created tipset doesn't get a CID assigned to it at that point in time. After that I didn't know where to follow the propagation of the block to see if it gets recorded anywhere using GetChainBlock, because I couldn't find its CID if it did.

That is helpful, let me check..

@ruseinov
Contributor Author


@alt-ivan In fact, the key is lazily initialized on the first fn key() call.

forest/src/blocks/tipset.rs

Lines 301 to 304 in 4a6ad5c

pub fn key(&self) -> &TipsetKey {
    self.key
        .get_or_init(|| TipsetKey::from(self.headers.iter_ne().map(|h| *h.cid()).collect_vec()))
}

So you can basically check the CIDs of headers to see which ones to monitor.
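To illustrate the lazy initialization, here is a simplified std-only sketch with stand-in types (plain strings instead of Forest's actual CID/TipsetKey types): the key does not exist at construction time, which is why stepping through SyncSubmitBlock shows no CID assigned to the freshly created tipset.

```rust
use std::sync::OnceLock;

// Stand-in for a block header identified by its CID string.
struct Header {
    cid: String,
}

// Stand-in tipset: the key is derived lazily from the headers' CIDs.
struct Tipset {
    headers: Vec<Header>,
    key: OnceLock<Vec<String>>, // TipsetKey stand-in: the list of header CIDs
}

impl Tipset {
    // Mirrors the lazy pattern in Tipset::key(): computed and cached on
    // first call, not when the tipset is constructed.
    fn key(&self) -> &Vec<String> {
        self.key
            .get_or_init(|| self.headers.iter().map(|h| h.cid.clone()).collect())
    }
}

fn main() {
    let ts = Tipset {
        headers: vec![
            Header { cid: "header-cid-a".into() }, // illustrative values only
            Header { cid: "header-cid-b".into() },
        ],
        key: OnceLock::new(),
    };
    // Not initialized yet: no key exists right after construction.
    assert!(ts.key.get().is_none());
    let k = ts.key();
    assert_eq!(k.len(), 2);
    // Subsequent calls return the same cached key.
    assert!(std::ptr::eq(k, ts.key()));
}
```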

For further debugging:

  1. Here's where the tipset channel is being consumed:

    match self.tipsets.as_mut().poll_next(cx) {

  2. Here's where the rest of the logic happens:

    match self.state {
        TipsetProcessorState::Idle => {
            // Set the state to FindRange if we have a tipset to sync towards.
            // Consume the tipsets received, start syncing the heaviest tipset
            // group, and discard the rest.
            if let Some(((epoch, parents), heaviest_tipset_group)) = grouped_tipsets
                .into_iter()
                .max_by(|(_, a), (_, b)| a.weight_cmp(b))
            {
                trace!("Finding range for tipset epoch = {}", epoch);
                self.state = TipsetProcessorState::FindRange {
                    epoch,
                    parents,
                    range_finder: self.find_range(heaviest_tipset_group),
                    current_sync: None,
                    next_sync: None,
                };
            }
        }
        TipsetProcessorState::FindRange {
            ref mut epoch,
            ref mut parents,
            ref mut current_sync,
            ref mut next_sync,
            ..
        } => {
            // Add tipsets to the current sync cache
            if let Some(tipset_group) = grouped_tipsets.remove(&(*epoch, parents.clone())) {
                match current_sync {
                    Some(cs) => {
                        // This check is redundant
                        if cs.is_mergeable(&tipset_group) {
                            cs.merge(tipset_group);
                        }
                    }
                    None => *current_sync = Some(tipset_group),
                }
            }
            // Update or replace the next sync
            if let Some(heaviest_tipset_group) = grouped_tipsets
                .into_iter()
                .max_by(|(_, a), (_, b)| a.weight_cmp(b))
                .map(|(_, group)| group)
            {
                // Find the heaviest tipset group and either merge it with the
                // tipset group in the next_sync or replace it.
                match next_sync {
                    None => *next_sync = Some(heaviest_tipset_group),
                    Some(ns) => {
                        if ns.is_mergeable(&heaviest_tipset_group) {
                            // Both tipset groups have the same epoch & parents, so merge them
                            ns.merge(heaviest_tipset_group);
                        } else if heaviest_tipset_group.is_heavier_than(ns) {
                            // The tipset group received is heavier than the one saved, replace it.
                            *next_sync = Some(heaviest_tipset_group);
                        }
                        // Otherwise, drop the heaviest tipset group
                    }
                }
            }
        }
        TipsetProcessorState::SyncRange {
            ref mut range_syncer,
            ref mut next_sync,
        } => {
            // Add tipsets to the current tipset range syncer
            if let Some(tipset_group) = grouped_tipsets.remove(&(
                range_syncer.proposed_head_epoch(),
                range_syncer.proposed_head_parents(),
            )) {
                tipset_group.tipsets().into_iter().for_each(|ts| {
                    let tipset_key = ts.key().clone();
                    match range_syncer.add_tipset(ts) {
                        Ok(added) => {
                            if added {
                                trace!("Successfully added tipset [key = {:?}] to running range syncer", tipset_key);
                            }
                        }
                        Err(why) => {
                            error!("Adding tipset to range syncer failed: {}", why);
                        }
                    }
                });
            }
            // Update or replace the next sync
            if let Some(heaviest_tipset_group) = grouped_tipsets
                .into_iter()
                .max_by(|(_, a), (_, b)| a.weight_cmp(b))
                .map(|(_, group)| group)
            {
                // Find the heaviest tipset group and either merge it with the
                // tipset group in the next_sync or replace it.
                match next_sync {
                    None => *next_sync = Some(heaviest_tipset_group),
                    Some(ns) => {
                        if ns.is_mergeable(&heaviest_tipset_group) {
                            // Both tipset groups have the same epoch & parents, so merge them
                            ns.merge(heaviest_tipset_group);
                        } else if heaviest_tipset_group.is_heavier_than(ns) {
                            // The tipset group received is heavier than the one saved, replace it.
                            *next_sync = Some(heaviest_tipset_group);
                        } else {
                            // Otherwise, drop the heaviest tipset group
                            trace!("Dropping collected tipset groups");
                        }
                    }
                }
            }
        }
    }
    // Drive underlying futures to completion
    loop {
        match self.state {
            TipsetProcessorState::Idle => {
                // The tipsets stream must have already registered with the task driver
                // to wake up this task when a new tipset is available.
                // Otherwise, the future can stall.
                return Poll::Pending;
            }
            TipsetProcessorState::FindRange {
                ref mut range_finder,
                ref mut current_sync,
                ref mut next_sync,
                ..
            } => match range_finder.take() {
                Some(mut range_syncer) => {
                    debug!(
                        "Determined epoch range for next sync: [{}, {}]",
                        range_syncer.current_head.epoch(),
                        range_syncer.proposed_head.epoch(),
                    );
                    // Add current_sync to the yielded range syncer.
                    // These tipsets match the range's [epoch, parents].
                    if let Some(tipset_group) = current_sync.take() {
                        tipset_group.tipsets().into_iter().for_each(|ts| {
                            if let Err(why) = range_syncer.add_tipset(ts) {
                                error!("Adding tipset to range syncer failed: {}", why);
                            }
                        });
                    }
                    self.state = TipsetProcessorState::SyncRange {
                        range_syncer: Box::pin(range_syncer),
                        next_sync: next_sync.take(),
                    };
                }
                None => {
                    self.state = TipsetProcessorState::Idle;
                }
            },
            TipsetProcessorState::SyncRange {
                ref mut range_syncer,
                ref mut next_sync,
            } => {
                let proposed_head_epoch = range_syncer.proposed_head.epoch();
                let current_head_epoch = range_syncer.current_head.epoch();
                // Drive the range_syncer to completion
                match range_syncer.as_mut().poll(cx) {
                    Poll::Ready(Ok(_)) => {
                        metrics::HEAD_EPOCH.set(proposed_head_epoch);
                        info!(
                            "Successfully synced tipset range: [{}, {}]",
                            current_head_epoch, proposed_head_epoch,
                        );
                    }
                    Poll::Ready(Err(why)) => {
                        metrics::TIPSET_RANGE_SYNC_FAILURE_TOTAL.inc();
                        error!(
                            "Syncing tipset range [{}, {}] failed: {}",
                            current_head_epoch, proposed_head_epoch, why,
                        );
                    }
                    Poll::Pending => return Poll::Pending,
                }
                // Move to the next state
                match next_sync.take() {
                    // This tipset group is the heaviest that has been received while
                    // running this tipset range syncer, so start syncing it
                    Some(tipset_group) => {
                        self.state = TipsetProcessorState::FindRange {
                            epoch: tipset_group.epoch(),
                            parents: tipset_group.parents(),
                            range_finder: self.find_range(tipset_group),
                            current_sync: None,
                            next_sync: None,
                        };
                    }
                    None => {
                        self.state = TipsetProcessorState::Idle;
                    }
                }
            }
        }
    }
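The control flow above boils down to three states. Here is a distilled, synchronous sketch of the transitions (hypothetical, simplified names; the real code interleaves this with async polling and tipset-group merging):

```rust
// Simplified TipsetProcessor states: the real versions carry range finders,
// syncers, and pending tipset groups.
#[derive(Debug, PartialEq, Clone)]
enum State {
    Idle,
    FindRange { epoch: u64 },
    SyncRange { epoch: u64 },
}

// One step of the machine. On a new heaviest tipset, Idle starts finding a
// range; a finished FindRange becomes SyncRange; a finished SyncRange starts
// the next range if another tipset arrived meanwhile, else returns to Idle.
fn step(state: State, new_heaviest: Option<u64>) -> State {
    match state {
        State::Idle => match new_heaviest {
            Some(epoch) => State::FindRange { epoch },
            None => State::Idle,
        },
        State::FindRange { epoch } => State::SyncRange { epoch },
        State::SyncRange { .. } => match new_heaviest {
            Some(epoch) => State::FindRange { epoch },
            None => State::Idle,
        },
    }
}

fn main() {
    let mut s = State::Idle;
    s = step(s, Some(100)); // a block arrives (e.g. via SyncSubmitBlock)
    assert_eq!(s, State::FindRange { epoch: 100 });
    s = step(s, None); // range determined
    assert_eq!(s, State::SyncRange { epoch: 100 });
    s = step(s, None); // range synced, nothing queued
    assert_eq!(s, State::Idle);
}
```

The key observation for this bug: if the processor is never driven (the node never enters follow mode), submitted tipsets pile up on the channel and the head is never advanced.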

@alt-ivan

alt-ivan commented May 14, 2024

@ruseinov

I tried to adapt lotus-miner so I could give you easy steps to reproduce (stopping at MinerCreateBlock so you can feed its output into SyncSubmitBlock), but I ran into so many speed bumps that I gave up for now :/

Some new interesting info:

So you can basically check the CIDs of headers to see which ones to monitor.

When I connect Forest/Lotus and execute SyncSubmitBlock on Forest, it propagates that block to Lotus and Lotus now has the correct chain head with the latest tipset. Forest also receives this block, seemingly back from Lotus via gossip, because I can query it via GetChainBlock and see it. However, Forest's ChainHead still points to the previous epoch tipset.

When I don't connect Forest/Lotus and execute SyncSubmitBlock on Forest, then that block doesn't get stored and I cannot query it via GetChainBlock

Here's where tipset channel is being consumed:

I added tracing into these locations and nothing happens. Doing some more eye-level examination I saw this:

    fn follow(&self, tipset_opt: Option<FullTipset>) -> ChainMuxerFuture<(), ChainMuxerError> {
        // Instantiate a TipsetProcessor

But in devnet it seems that Forest never enters follow mode (#3089).
From what I saw, tipset_receiver only gets consumed here via the TipsetProcessor? I might be wrong though.

This may also explain other sync issues with Lotus in devnet i.e. Forest often doesn't sync fully with Lotus with similar "symptoms" (ChainHead returns earlier tipset but later one exists in store)

@ruseinov
Contributor Author

@alt-ivan Thanks a lot for looking into this, now it starts to make a lot of sense!

When I don't connect Forest/Lotus and execute SyncSubmitBlock on Forest, then that block doesn't get stored and I cannot query it via GetChainBlock

That's probably due to the fact that the actual network send works, but the TipsetProcessor part does not work.

But in devnet it seems that Forest never enters follow mode #3089

You are correct, it seems it never does. Good spot! And logically, a submitted block can't be processed into the head if the node never considers itself "in sync". I guess we'll have to deal with that to fix this.

@ruseinov
Contributor Author

@alt-ivan Can you please try going through the process using this PR: #4328?
It's going to be merged to master automatically in a little bit. This should fix the follow issue.

@alt-ivan

@ruseinov
I will be testing shortly and reporting back here. Thanks!

@alt-ivan

alt-ivan commented May 15, 2024

@ruseinov
Good news! I tried it on the latest commit and mining works!

Couple of notes:

  1. It seems that Lotus processes the block received via SyncSubmitBlock synchronously, while Forest does it asynchronously, leading to a small delay between SyncSubmitBlock returning and the tipset being processed / the chain head being updated. This shouldn't be a big deal in practice because I've just added a small delay to the miner to work around it, but it could potentially be a problem since the delay isn't deterministic. In fact, I can just tweak the miner to wait for the tipset at the expected height, so you can ignore this altogether :D

  2. The Forest node now enters follow mode and properly syncs up with Lotus even without active miners on Lotus, hooray!

  3. Now we have a situation where Lotus fails some validation when syncing to Forest in devnet :) I'll paste the Lotus logs below for posterity, but I won't expand on this here since it's outside the scope of this issue.

lotus  | 2024-05-15T15:22:37.565Z       ERROR   chain   chain/sync_manager.go:255       error during sync in [bafy2bzaceajmcnxkqbwwe7fzw5ildicadw57kw2eyu2sprwynhqeklxlbaazs]: collectChain failed: chain linked to block marked previously as bad ([bafy2bzaceajmcnxkqbwwe7fzw5ildicadw57kw2eyu2sprwynhqeklxlbaazs], bafy2bzacedt364lbbw4xavfnmbka2k4r27245g3zc3kepzy2ejeyafyx7rio4) (reason: linked to bafy2bzaceaewkhk2ikz7d46ilbc7j2tbcfydcrl4u3qykgkank4th4x2xwu3q caused by: [bafy2bzacecf2hjz7sarl3qfphjn5u5mnbfmfbmeefepqp7atnzbc2gnrqkcz4] 1 error occurred:
lotus  |        * failed to validate blocks random beacon values:
lotus  |     github.com/filecoin-project/lotus/chain/consensus/filcns.(*FilecoinEC).ValidateBlock.func4
lotus  |         /lotus/chain/consensus/filcns/filecoin.go:248
lotus  |   - unexpected beacon round 7660175, expected 7660174 for epoch 180:
lotus  |     github.com/filecoin-project/lotus/chain/beacon.ValidateBlockValues
lotus  |         /lotus/chain/beacon/beacon.go:99

@ruseinov
Contributor Author

ruseinov commented May 15, 2024

Good news! I tried it on the latest commit and mining works!

Nice!

  1. It seems that Lotus processes the block received via SyncSubmitBlock synchronously while Forest does it asynchronously, leading to a small delay between SyncSubmitBlock and the tipset being processed/chain head getting updated. This shouldn't be a big deal in practice because I've just added a small delay to the miner to tackle this issue, but I assume it could potentially be a problem since this delay isn't deterministic

Perhaps the best option is to have some sort of retry mechanism there. It should not take long, but we currently have no means of making this synchronous. However, we're working on a new state machine that might allow us to solve this issue (cc @lemmih). I'm happy to contribute that part of the codebase if need be, but for now we have to make do with the current setup.
Ah yeah, if you can wait for a tipset, all good then. I think forcing this to be a synchronous operation is not necessary.
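For what it's worth, the "wait for the tipset at the expected height" approach on the miner side can be sketched as a simple poll-with-deadline loop (hypothetical helper; fetch_head stands in for a real ChainHead RPC call):

```rust
use std::thread;
use std::time::{Duration, Instant};

// Poll the head epoch until it reaches `target` or `timeout` elapses.
// `fetch_head` is a stand-in for querying ChainHead over JSON-RPC.
fn wait_for_epoch(
    mut fetch_head: impl FnMut() -> u64,
    target: u64,
    timeout: Duration,
    poll_interval: Duration,
) -> Result<u64, &'static str> {
    let deadline = Instant::now() + timeout;
    loop {
        let epoch = fetch_head();
        if epoch >= target {
            return Ok(epoch);
        }
        if Instant::now() >= deadline {
            return Err("timed out waiting for chain head to advance");
        }
        thread::sleep(poll_interval);
    }
}

fn main() {
    // Simulate a head that advances one epoch per poll.
    let mut epoch = 10;
    let head = wait_for_epoch(
        || {
            epoch += 1;
            epoch
        },
        13,
        Duration::from_secs(1),
        Duration::from_millis(1),
    )
    .unwrap();
    assert_eq!(head, 13);
}
```

A deadline keeps a stuck node from hanging the miner forever, which matters precisely because the submit-to-head delay isn't deterministic.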

3. Now we have a situation where Lotus fails some validation when syncing to Forest in devnet :) I'll paste Lotus logs below for posterity, but I won't expand on this here since it's outside the scope of this issue

Gotta investigate. It would be great to have a semi-automated reproducible case.

Status: Implemented