fuel-sync design #625 (Draft): wants to merge 19 commits into base `master` (changes shown from 5 commits).
Binary file added docs/diagrams/fuel-core-design-fuel-sync-part.png
345 changes: 345 additions & 0 deletions docs/fuel-sync-design.md
# The `fuel-sync` design (the version for PoA)

## Changelog

- TODO: Date of PR creation/approval

## Glossary

- FS - Fuel sync -> `fuel-sync`
- P2P - Fuel p2p -> `fuel-p2p`
- BI - Fuel block importer -> `fuel-block-importer`
- FC - Fuel core -> `fuel-core`
- CM - Consensus module
- Sealed data - The data (blocks, transactions, headers, etc.) signed by the block producer (PoA) or by the consensus committee (PoS)
- ABGT - Average block generation time in seconds
- TP - Transactions pool -> `fuel-txpool`

## Overview

This document describes the possible workflow of the FS service
and its relationship with P2P and BI.
During the review process, the team must select which approach to take, and the document's
final version will contain only the selected approach with a proper overview.

The first final version of the document will describe the implementation
of the FS for the PoA version of the FC. It may contain our assumptions
and ideas for the next evolution of the FS for PoS (or maybe FS will
have several evolution points).

None of the names are final; reviewers are welcome to suggest alternatives=)

## Relationship with P2P and BI

![img.png](diagrams/fuel-core-design-fuel-sync-part.png)

The FS is a connection point between:
- In the PoA version: P2P and BI.
> **@Voxelot (Member), Sep 24, 2022:** I think we could generalize the connection point between FS and CM regardless of whether we are in PoS mode or not.
>
> The benefit of this approach is that the synchronizer isn't required to contain business logic about how to validate the sealed block consensus metadata, and just defers that to whichever configuration the CM is running in (PoA or PoS).

> **@xgreenx (Collaborator, author):** We can=) I see CM as a descending service for FS in this case. So FS would execute something like a `check(blocks: &[Sealed<Block>]) -> bool` method and then decide what to do with the blocks (reject or commit).

- In the PoS version it additionally connects P2P with CM.

There are three possible ways to implement those relationships:

1. Every service knows about the other services related to it. It
can be interpreted as: each service struct contains references to other
services and uses the `async/await` mechanism to interact.
> **Contributor:** Shouldn't we require async/await only for operations that are I/O? It might be overkill to have everything async, unless ofc everything is I/O.

> **@xgreenx (Collaborator, author):** Yes, I meant that we could call methods directly. I will rename it to "sync/async".

1. Every service knows nothing about other services. It can be interpreted
as: each service struct uses channels for communication and knows
nothing about its subscribers/listeners.
> **Contributor:** This is ofc desirable, but then we need concrete types to be sent as messages (via channels?). We don't know about the internals of the service, but then we need to know about the messages the service is ready to accept.
>
> Or are we going to have a broker in the middle that will do that for us?

> **@xgreenx (Collaborator, author):** I didn't think about this rule a lot because the usage of channels makes the code more asynchronous but less clear. Also, most services can work synchronously, and channels make the development harder. For example, the current implementation of the TX pool uses channels. To get the list of transactions, we need to write a lot of boilerplate code using async where we don't need it.

1. Every service **may** know about descending services but nothing about
ascending services. As an example: the P2P service knows nothing about its
listeners (ascending services), and it only propagates information via
channels. But FS knows about P2P (an example of a descending service)
and may request some information directly by calling an `async` method.
The same applies to BI. FS notifies all subscribers that it synced a valid block (sealed or not sealed).

Those rules are rough, and services can have cases where some functionality
requires direct access and some a pub-sub model. So it should be decided
individually. But after comparison, the third rule covers all our needs
for now, and the description below will use this rule. We can stop
following this rule if it makes development harder, adds complexity,
or forces us to write ugly code with workarounds.
> **Member:**
>
> > So it should be decided individually.
>
> I think this is the correct approach; it focuses on using the right tool for each job instead of complicating simple APIs like the txpool with channels just for the sake of consistency.


### P2P service requirements

The P2P service should not be aware of FS but should still provide some
functionality that the FS can use. As a solution, P2P can implement the
`BlocksFetcher` trait.

```rust
pub enum BlockBroadcast {
    NewSealedHeaders([Sealed<FuelBlockHeader>]),
    NewUnsealedHeaders([FuelBlockHeader]),
```

> **Contributor:** nit: these channel types must be owned (here it's a semi-slice, and not a vec or array).

> **@xgreenx (Collaborator, author):** Yes, it is only pseudocode representing the idea of what should be received from the channel. After implementation, we will use more concrete types.

```rust

    NewSealedBlocks([Sealed<FuelBlock>]),
    NewUnsealedBlocks([FuelBlock]),
    // TODO: Some PoS related events
}

pub enum BlockReason {
    InvalidHeader,
    InvalidBody,
    // TODO: ?
}

pub enum Verdict<Reason> {
    Positive,
    Negative(Reason),
}

pub enum FetchRequest {
    HeadersByRanges([Range<BlockHeight>]),
    HeadersByHash([Bytes32]),
    BlocksByRanges([Range<BlockHeight>]),
    BlocksByHash([Bytes32]),
}

// TODO: `D` should implement some bound that allows serializing it for gossiping.
pub trait Gossiper<D> {
```

> **@Voxelot (Member), Sep 26, 2022:** I think I'd prefer using a concrete enum over all gossipable data types instead of a generic param. This makes it very clear what kind of data is actually supported for gossip, and also gives us more compiler guarantees that gossiping is implemented across all of our types via exhaustive matching.

```rust
    /// Gossips some `data` to the network.
    fn gossip(&self, data: &D);
}

pub trait Punisher<Reason> {
    /// Anyone can report the validity or invalidity of the data. The `Punisher` decides how to process it.
    ///
    /// Example: If the transaction or block header is invalid, the punisher can decrease the reputation
    /// of the data provider.
    // TODO: `D` should implement some bound that allows retrieving the data's signer and allows broadcasting it.
    fn report<D>(&mut self, data: D, verdict: Verdict<Reason>);
```

> **Contributor:** Shouldn't we have a mechanism to enforce the data to identify who is reporting? (i.e. `AsRef<SomeKey>`)

> **@xgreenx (Collaborator, author):** I think it is not essential "which" service reports it. More critical is "about what" we want to report. If we report about blocks, it can have one score system; if about transactions, another.
>
> Also, we don't need to know about other services; we only need to know about the data (P2P already knows about the data).

```rust
}

pub trait BlocksFetcher: Punisher<BlockReason> + Gossiper {
    /// Returns a structure that can be used to subscribe for `BlockBroadcast`.
    fn sender(&self) -> &Sender<BlockBroadcast>;
```

> **Contributor:** Shouldn't we return the sender as owned? I know the user can clone it himself, but I think it's always going to be used as owned.

> **@xgreenx (Collaborator, author):** Returning the reference may be helpful if you want to check something but don't subscribe, like `is_closed` or something else. On the interface level, we don't need to force the user to actually subscribe=) If he needs it, he will clone.

> **Contributor:** A pattern we use now for owned senders is a `subscribe` method. We could have both an owned version for creating a subscription and an unowned version for read-only inquiries.

> **@xgreenx (Collaborator, author):** We can have both methods, but I want to keep the interface as minimal as possible=) `sender` adds flexibility, but I don't like the naming.

> **Member:** Wouldn't this be a `Receiver`? `Sender` implies that anyone who obtains this could send a broadcast, instead of just listening for broadcasts.

> **@xgreenx (Collaborator, author):** You are right, the idea was to use `Receiver` to be able only to receive information. I thought `Sender.send` requires mutability, but it does not. So yeah, it is better to name the method `subscribe` and return `Receiver<...>`.

```rust

    /// Pings the block fetcher to fetch a new latest block header and broadcasts it.
    fn ping(&mut self);
```

> **Contributor:** Maybe we rename it? Ping is traditionally a request with a response (and the response is the indicator that everything is fine). In this case, we are not expecting a response, so we could instead call it `heartbeat`.
>
> AFAIU we are expecting the service to take some action in some channel. Do we have a way to enforce that in the signature? Otherwise, a service can implement this trait and optionally implement it correctly/the intended way.

> **@xgreenx (Collaborator, author):** Actually, maybe it is better to remove `ping` at all and add this request to `fetch`.
>
> I didn't get the second part=(

```rust

    /// Fetches the data (somehow, maybe in parallel) and broadcasts ready parts.
    fn fetch(&mut self, request: FetchRequest);
```

> **@Voxelot (Member), Sep 24, 2022:** This may be a good use case for `FuturesUnordered`, which allows joining on a set of parallel tasks.
>
> The consumer of the API would have the option of using `poll_next` if they want to process the data immediately as it becomes available instead of having to wait for all the futures to complete.

```rust
}
```

The FS stores the reference (or a wrapper structure around P2P to allow `mut`
methods) to the P2P and interacts with it periodically.
All data is received from the channel and processed by FS to update the
inner state of the synchronization. FS requests data from P2P each time synchronization requires it.
If some data is invalid or corrupted, the FS reports
it to P2P, and P2P can punish the sender of the data by decreasing its
reputation (or blacklisting it). FS calls `Gossiper::gossip` at the end of block synchronization.
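
To make the loop concrete, here is a rough sketch of how FS could consume these broadcasts and report bad data, assuming `tokio` broadcast channels, owned `Vec` payloads (as suggested in the review), and a hypothetical `verify_seal` helper:

```rust
// Sketch only: the traits are the pseudocode above, with `Vec` payloads.
async fn run_sync_loop<P: BlocksFetcher>(p2p: &mut P) {
    // Subscribe before entering the loop so no broadcast is missed.
    let mut events = p2p.sender().subscribe();
    while let Ok(event) = events.recv().await {
        match event {
            BlockBroadcast::NewSealedHeaders(headers) => {
                for header in headers {
                    if verify_seal(&header) {
                        // Update the inner sync state; maybe `fetch` more blocks.
                    } else {
                        // Punish the peer that provided the bad header.
                        p2p.report(header, Verdict::Negative(BlockReason::InvalidHeader));
                    }
                }
            }
            // Blocks, unsealed variants, PoS events, etc.
            _ => {}
        }
    }
}
```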

The same approach can be used between the P2P and TP.
The `Punisher` trait adds a mechanism for [reputation management](https://github.com/FuelLabs/fuel-core/issues/460).
The `Gossiper` trait solves [gossiping of invalid transactions](https://github.com/FuelLabs/fuel-core/issues/595).
> **Contributor:** I don't see how a trait solves this problem. Gossipsub messages are already clearly defined in fuel-p2p; the issue #595 that is referenced is more about when there is a "valid" type, like a Transaction, but it's invalid in terms of its state on the chain itself.

> **@xgreenx (Collaborator, author):** We have several validity checks:
>
> - Validate that the entity can be properly decoded from the bytes. It should be done on the P2P side.
> - Validate that transactions are valid in business logic (all input arguments are unique, all outputs are valid; use `CheckedTransaction`). It is not a P2P responsibility; it seems to be part of the TXPool.
> - Validate that the transaction is already in the TXPool (we don't need to gossip the same transaction). This can be done only by the TXPool.

> **Member:**
>
> > Validate that the transaction is already in the TXPool (we don't need to gossip the same transaction). This can be done only by the TXPool.
>
> It's debatable whether compact blocks over p2p are really useful. In practice, they can slow down consensus if nodes have to query peers for txs they are missing in their txpool. I think we should assume a broadcast of full blocks. Full blocks are also required for syncing old blocks, as the txs will no longer exist in anyone's txpool.

> **@xgreenx (Collaborator, author):** Why should nodes query transactions that are missing in the node's pool?

> **Member:** Ah wait, I misunderstood: validating whether a transaction is already in the txpool would be used for compact blocks. Re-read and realized it's just to prevent re-gossiping txs we've already received.


### BI requirements

The FS is an ascending service for the BI because BI knows nothing about
other services; it only executes blocks and updates the state of the
blockchain. At the end of the block commit, it notifies subscribers.

```rust
pub enum CommitBroadcast {
    Blocks([FuelBlock]),
}

pub trait BlocksCommitter {
    /// Returns a structure that can be used to subscribe for `CommitBroadcast`.
    fn sender(&self) -> &Sender<CommitBroadcast>;

    /// Commits the blocks to the database. Executes each block from
    /// `blocks`, validates it, and updates the state.
    ///
    /// Returns errors if something is wrong during the commit process.
    /// The state is not updated in that case.
    fn commit(&mut self, blocks: [FuelBlock]) -> Result<(), CommitError>;

    /// Returns the last committed block header.
    fn last_block_header(&self) -> FuelBlockHeader;
}
```

FS needs to subscribe to committed blocks because blocks can also be generated
by the block production service (or consensus). The FS can use the result
of the `commit` function to update the inner state of the synchronization
in case of an error. Also, the `sender` method is helpful for a TP
to prune committed transactions.
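
For illustration, a rough sketch of both sides, assuming `tokio` broadcast channels, `Vec` payloads, an assumed `CommitError`, and the `log` crate; all glue names are hypothetical:

```rust
// FS side: forward a linked batch of blocks to BI and react to the result.
fn commit_batch<B: BlocksCommitter>(bi: &mut B, blocks: Vec<FuelBlock>) {
    match bi.commit(blocks) {
        // Success: the commit event arrives on the broadcast channel,
        // where FS removes the committed range from the active phase.
        Ok(()) => {}
        // Failure: keep the range in the active phase and escalate.
        Err(err) => log::error!("commit failed: {err:?}"),
    }
}

// TP side: prune transactions that were included in committed blocks.
async fn prune_on_commit<B: BlocksCommitter>(bi: &B) {
    let mut commits = bi.sender().subscribe();
    while let Ok(CommitBroadcast::Blocks(blocks)) = commits.recv().await {
        for _block in &blocks {
            // Remove the block's transactions from the pool here.
        }
    }
}
```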

### Remark

The actual implementation may use different types with the same
meaning. If the P2P and BI provide the described functionality, it is enough
to fully implement FS and connect the services.

## The blocks synchronization flow

FS has two types of synchronization:
- During PoS consensus. It includes validation of blocks, transactions,
and signatures. A partially sealed (not all participants of consensus
signed it) block may be the starting point of the synchronization.
- During sync with the network. The starting point is a sealed block or
block header.

This section describes the flow for the second type.

### The block propagation in the network

There are three ways to implement gossiping about a new block:

1. Gossip the height and block hash of a new block.
1. Gossip the header of a new block with the producer's seal (or consensus seal).
1. Gossip the entire block.

The first and second approaches require an additional request/response
round to get the block, but the passive resource consumption of the network is
low. With the first approach, malicious peers can report unreal heights,
and FS needs to handle those cases in the code. With the second approach, FS
can verify the block's validity and start the synchronization process
only if the header is valid.

The third approach provides the entire block, and it can be a faster
way to keep the network updated. But it will spam the network with oversized,
useless packages (if each node has ten connections, then nine of the
received packages are useless).

We already need to support syncing with requests/responses in the
code for outdated nodes. So using the third approach doesn't win much
regarding the codebase size.

The second approach is our choice. When FS receives the sealed block header,
it can already be sure this is not a fake sync process. The passive load on
the network is low.

Still, it is not a problem to try to sync to an invalid height. P2P will
request blocks for that height from the peer who reported it and will
decrease that peer's reputation. So we could use the approach with heights too; it just
requires proper handling of this case.

#### Node behaviour

All nodes run the same FS, so it is enough to describe the work of FS.

##### Gossip about new data

Each node gossips only information that it has validated itself and can
prove itself. It means the node gossips the block header only after
committing the block.

The final destination of all blocks is BI. FS subscribes to events from
the BI service and gossips the information about the latest block to
the network on each commit via the `Gossiper` trait implemented by P2P.

It gossips the blocks received from the network, from the block producer, and from the consensus.

##### Ask about new data

The network gossips a new block by default at an `ABGT` interval, following
the [rule above](#gossip-about-new-data). But it is possible to have cases where
someone forgot to notify another node.

FS has an internal timer that pings the P2P every `ABGT + Const`
(for example, `Const = 5` seconds). This timer resets on each block
commit or timer tick. On each ping, P2P asks its neighbors about the latest block header.
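
A minimal sketch of such a timer, assuming `tokio` and a hypothetical `on_block_commit` notification:

```rust
use std::time::Duration;

// Ask P2P for the latest header whenever no block commit happened
// for `ABGT + CONST` seconds. Sketch only; names are illustrative.
async fn ask_about_new_data<P: BlocksFetcher>(p2p: &mut P, abgt_secs: u64) {
    const CONST_SECS: u64 = 5; // example slack
    let period = Duration::from_secs(abgt_secs + CONST_SECS);
    let sleep = tokio::time::sleep(period);
    tokio::pin!(sleep);
    loop {
        tokio::select! {
            // Timer tick: ping P2P, then reset the timer.
            () = &mut sleep => {
                p2p.ping();
                sleep.as_mut().reset(tokio::time::Instant::now() + period);
            }
            // Block commit: only reset the timer.
            _ = on_block_commit() => {
                sleep.as_mut().reset(tokio::time::Instant::now() + period);
            }
        }
    }
}
```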

### The blocks synchronization on block header event

FS has two synchronization phases: one is active, and the other is pending.
Each phase has its own range of blocks to sync. The ranges don't overlap and are
linked. The purpose of the pending phase is to collect and join new
block headers while the active phase syncs the previous range. When
the active phase finishes synchronization, it takes some range from the
pending phase, joins it with the remaining range from the active phase,
and requests blocks for the final range.

Because each block header is sealed, and FS knows the block producer
of each block, FS can verify the validity of the header
and be sure that the block range is real, without fake heights.
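
One possible shape for this state, as a sketch (type and field names are illustrative, with `u32` standing in for the real `BlockHeight`):

```rust
use std::ops::Range;

type BlockHeight = u32; // placeholder for the real type

// The two phases as linked, non-overlapping ranges:
// `pending` starts where `active` ends.
struct SyncState {
    active: Range<BlockHeight>,  // batch currently requested from P2P
    pending: Range<BlockHeight>, // headers collected in the meantime
}

impl SyncState {
    // Called when the active range is fully committed: take the next
    // batch from the pending range and make it the new active range.
    fn rotate(&mut self, batch_size: u32) {
        let start = self.pending.start;
        let end = start.saturating_add(batch_size).min(self.pending.end);
        self.active = start..end;
        self.pending.start = end;
    }
}
```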

#### Fork

If FS gets a valid sealed block header but already has another block at this height, it is a case of a fork.
- For the PoA version: We ignore it and try to find blocks that link with our version of the blockchain. We should report it to the node manager (via logs, or maybe we want to do something more).
- For the PoS version: We should have rules in the consensus for this case.

#### Queuing of the block header

When FS receives a block header, FS first checks (a compressed sketch of this decision tree follows the list):
- Is it a valid header? It should be signed by the producer/consensus expected by the node at the corresponding height.
  - No -> `Punisher::report`.
  - Yes -> Next step.
  - Don't know. The Ethereum part is not synced yet to this height. If the height is
    not far from the local ethereum height (the height returned by the
    ethereum node), we ignore this header and do not punish the sender.
> **Member:** We can defer ethereum height checking to the BI, to consolidate this check between CM->BI and FS->BI.
>
> If we are behind on the ethereum height, the BI should decide whether to wait on a better height or reject the block.

> **@xgreenx (Collaborator, author):** But it means that we should check the consensus details when the ethereum block is available. This check should be done by CM then. CM can wait for a new height from the relayer.

> **Member:** Ah, good point! Maybe we should check it in a few places then, since it'll be up to the CM to decide what details from ethereum it needs (or not). Whereas the block importer needs to provide some guarantees that the messages included in a block are going to exist before passing the block to the executor.

> **@xgreenx (Collaborator, author), Sep 28, 2022:** Then it's decided=) `SealValidator::validate` is async. If there is no Ethereum block, then CM waits for it (in the PoA, it is the default behavior for all block headers; in the PoS, we need to think about cases with fake heights). When BI asks the relayer about the information, it will already be available (because CM awaited it). `BlocksCommitter::commit` is also async.
>
> With this approach, we have a problem: we can have many async pending requests, and each consumes memory. Maybe we need to limit this somehow.

> **Member:** The first place we can restrict this is via the active set size, since that's equivalent to the buffer size of blocks we're trying to sync.
>
> The other way to restrict this is to validate blocks with the CM and commit to BI one at a time. This assumes that the CM doesn't wait for the eth height until it has already verified things like the signature first, to avoid this becoming an attack vector.

> **@xgreenx (Collaborator, author), Sep 29, 2022:** Could you elaborate more on the second solution, please? I didn't get it=(
>
> The CM can't verify the signature (that it is done by the correct signer) because it doesn't know who the signer is (Eth is not synced enough).

> **@Voxelot (Member), Sep 30, 2022:** Ah true, I suppose we could use some heuristic where we check if our relayer node is synced and the da_height is greater than some margin to reject. This way we add some tolerance in case our relayer's eth node claims to be synced when it really isn't.
>
> For example:
>
> ```mermaid
> flowchart
>     a["if block.da_height > relayer.da_height"] -- no --> b["check signature"]
>     b -- valid --> s["success"]
>     b -- invalid --> f["reject"]
>     a -- yes --> c["wait for relayer to sync"]
>     c --> d["is block.da_height > relayer.da_height + 10(adjustable margin)"]
>     d -- yes --> f
>     d -- no --> a
> ```

> **@xgreenx (Collaborator, author):** Yep, added the final description of what to do in the case when the Relayer is behind.
- Is it a new height?
  - No -> Is it already in the blockchain?
    - Yes -> Ignore (maybe report some neutral error to P2P to prevent DDoS).
    - No -> It is a case of the [fork](#fork).
> **Contributor:** Hmm, this one is a hard question. I'm not sure we can passively receive forks without it being a punishable action. Let's open a thread to discuss this one?

> **@xgreenx (Collaborator, author):** I guess you are talking about "Yes -> Ignore (maybe report some neutral error to P2P to prevent DDoS)".
>
> Yes, we need to punish. Actually, it should be solved on the P2P level, because P2P should punish nodes that send information that was not requested.
>
> Each node can report a new height one time; each next time, the reputation should decrease (if we didn't request it).

  - Yes -> Next step.
- Is synchronization in progress?
  - No -> Create an active phase and init it. Start the synchronization process.
  - Yes -> Next step.
- Is this height already handled by the active phase?
  - Yes -> Is the header the same?
    - Yes -> Ignore.
    - No -> It is a case of the [fork](#fork).
    - Don't know, because it is in the middle of the range ->
      FS inserts this header into the mapping and will check it later,
      when the blocks come.
  - No -> Insert the header into the range of the pending phase.
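
A compressed sketch of the decision tree as an outcome enum (all names hypothetical):

```rust
// The possible outcomes of queuing a received block header, matching
// the decision tree above. Sketch only.
enum HeaderAction {
    Punish(BlockReason), // the seal is invalid
    Ignore,              // duplicate, or relayer not synced far enough
    Fork,                // valid header conflicting with a known block
    StartSync,           // first header while no sync is in progress
    CheckAgainstActive,  // height inside the currently syncing range
    ExtendPending,       // height beyond the active range
}
```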

Each valid modification of the active or pending phase triggers requesting
blocks for the active phase from P2P. P2P is responsible for managing how
to process those requests: P2P decides which peers to ask, manages timeouts,
and deduplicates requests from other services.

P2P is also responsible for punishing nodes that send information that was not
requested. Other nodes can only share information about their last block header (but it also
should be limited because each node applies a block only once).

The P2P's responsibility could instead be part of the synchronizer: it could store
the information about the heights of the peers and keep an internal
reputation of whom to ask about blocks. That would make FS contain the full
synchronization logic but would duplicate reputation management.

#### The active phase

The active phase has a range of currently syncing blocks. Each time FS
receives a block header, it stores this header in a separate mapping.
It is optional for FS because it only helps find forks. When FS receives
the block for the corresponding height, it compares the headers.

If the range is extensive, FS splits it into batches, and only the first batch is
a part of the active phase; the remaining range is part of the pending phase.
FS requests only one batch at a time from P2P. It may be possible to
request several batches in parallel in the future.

When blocks come, FS checks that they are valid and linked.
If all checks pass, FS forwards the blocks to BI and starts the
[commit process](#the-commit-of-the-blocks). If blocks are not valid, FS reports
the invalidity to P2P and requests the block range again.
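
The linkage check could look roughly like this (a sketch; the `prev_hash`/`hash` accessors are assumptions, not the real API):

```rust
// Verify that a batch of blocks is internally linked and attaches to the
// current chain tip: each block must reference its predecessor's hash.
fn are_linked(tip_hash: Bytes32, blocks: &[FuelBlock]) -> bool {
    let mut prev = tip_hash;
    for block in blocks {
        if block.header.prev_hash() != prev {
            return false;
        }
        prev = block.header.hash();
    }
    true
}
```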

### The commit of the blocks

When FS receives blocks linked to the current latest block of the
blockchain, FS starts to commit them (`BlocksCommitter::commit`). Only one
commit can run in the BI at any moment, so BI holds a local mutex
around `commit` to prevent multiple concurrent commits.

`BlocksCommitter::commit` returns the result of the execution (sketched after the list):
- In the case of successful execution, FS waits for the event from the
channel to remove committed blocks from the active phase and to join a new range from
the pending phase.
- It can be an error that those blocks are already committed. It is a
good case because another job has already committed the blocks. Do the same
as in the case of successful execution.
- It can be an error that BI got some error during execution. It is a
bad case: it means that FS got a valid block signed by the producer,
but FS can't apply it to the blockchain. Report it to the logs and notify the node owner.
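
Sketched as a match over the commit result (the `CommitError::AlreadyCommitted` variant name is an assumption):

```rust
// Illustrative handling of the three outcomes listed above.
match bi.commit(blocks) {
    // Success: wait for the commit event to advance the active phase.
    Ok(()) => {}
    // Another job already committed these blocks: treat as success.
    Err(CommitError::AlreadyCommitted) => {}
    // Execution failed on a validly sealed block: surface it to the operator.
    Err(err) => log::error!("cannot apply a sealed block to the chain: {err:?}"),
}
```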

#### On block commit event

FS gossips the node's latest block via `Gossiper::gossip`.

FS removes the committed blocks from the range of the active phase. Afterwards, FS
fulfills it with a new range from the pending phase. The new range should
always be <= the batch size. FS requests the newly fulfilled range from P2P.