Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: single-pass snapshot reading #1838

Open
wants to merge 123 commits into
base: master
Choose a base branch
from

Conversation

segfault-magnet
Copy link
Contributor

@segfault-magnet segfault-magnet commented Apr 18, 2024

closes: #1823

When importing, if a table is derivable we're currently (in master) re-reading the data from which the table is to be derived. This multiplies the overhead in terms of IO, decompressing and decoding the data.

This PR adapts the import logic so that it can write to both on and off chain databases while keeping the cursor correct in the case either fails.

The downside is that we will first import a batch on-chain and then, if successful, off-chain (if any tables are to be derived from it). There is opportunity for concurrency here in the future -- the import task can split off sub-tasks for each batch, one for on-chain and another (or more) for off chain.

@segfault-magnet segfault-magnet marked this pull request as draft April 23, 2024 19:00
Base automatically changed from feature/snapshot_generation_graceful_shutdown to master April 24, 2024 07:18
@segfault-magnet segfault-magnet marked this pull request as ready for review April 24, 2024 15:23
Copy link
Member

@MitchTurner MitchTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've only made it part way through.

groups,
db,
progress_reporter,
start_imports!(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we going to get any kind of compiler error/test failure if we add new tables in the future, or are we just going to need to remember to add them here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not all tables get a snapshot file due to two reasons:

  • it is not migrated at all
  • or it is derived from an already present file

So if a table is to be migrated the developer should:

  1. export the data (if not derived)
  2. import it (by implementing ImportTable or changing the impl of an existing one in the case of derived tables).

}
let group = group?;

if Some(index) > on_chain_last_idx {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should avoid doing a comparison of Option here. I wasn't sure what the behavior was, so I went and tried it out. It's not completely intuitive.

Also, reading through this, I mistook it for an if let Some statement.

Perhaps we could change it to:

index > on_chain_last_idx.unwrap_or(0)

I guess the case where index==0 and on_chain_last_idx==None it would have different behavior, since Some(0) > None but 0 !> 0. Would that be acceptable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If no batch is handled then the database has no entry for that table. i.e. a select would return 0 rows. Here given as None from the db.

When a batch is handled it's index (parquet group number) is recorded.

So an entry of 0 in the database means the first batch is handled successfully.

So here the resume logic is identical to the comparison of options. If there is no last recorded idx for the on chain database then no batch was imported and so proceed to import.

If there is an entry in the db check that this index is greater so you don't reimport the same batch.

But unwrap_or(0) would cause problems for the first batch because of the difference in behavior you mentioned.

I agree the options comparison is unclear if you're coming across it for the first time. Maybe the second time as well :D.

The alternative would be to unpack the behavior:

let comparison_result = if let Some(chain_idx) = on_chain_last_idx {
    index > chain_idx
} else {
    true
};


if is_cancelled {
bail!("Import cancelled")
if Some(index) > off_chain_last_idx {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question here.

#[derive(Default, Clone)]
struct Spy {
on_chain_called_with: Arc<Mutex<Vec<Vec<TableEntry<Coins>>>>>,
off_chain_called_with: Arc<Mutex<Vec<Vec<TableEntry<Coins>>>>>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These names confuse me.

Something has called the on_chain data? And this is what it "called" it "with"? That's my best guess.

Can we change this to something clearer, along with the getters below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A spy records arguments, ret values etc.

This Spy constructs TestTableImporters that implement ImportTable.

ImportTable has two methods: a handler for on_chain db changes and a handler for off_chain db changes.

So the spy records what the on_chain was called_with and what the off_chain was called_with.

#[test_case::test_case(Some(1), Some(0); "off chain reverted")]
#[test_case::test_case(Some(3), Some(1); "off chain reverted multiple times")]
#[test_case::test_case(Some(1), Some(3); "on chain reverted multiple times")]
fn can_recover_when_both_tx_dont_succeed_together(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this test could have a clearer name.

What "can_recover"? What are the "tx"s here? I don't see anything called "tx" in this test. I feel like the name should have more to do with reconciling the groups being processed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What "can_recover"?

Importing/genesis/regenesis after the process exists due to an error.

What are the "tx"s here?

Database transactions. Two of them, one for on-chain and another on for off chain databases. Only when successfully committed the import progress is going to be incremented. They will be committed only if nothing failed during the import.

I feel like the name should have more to do with reconciling the groups being processed.

Maybe synchronizes_imports_upon_previous_failure ? To say that even if a batch succeeds on one of the databases and fails on another, the next time the import is run the imports are first going to be synchronized.

Thing is I don't want to commit to synchronizes because that is not a requirement. If somebody parallelized this they can very well have both databases continue importing even though they're not handling the same batch at the same time.

I guess failed batches will be retried after failure. Something along those lines?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Optimize snapshot reading when deriving tables
4 participants