perf: speed up bin/dry-run.js #214

Merged: 8 commits merged into main on May 17, 2024

Conversation

@juliangruber (Member) commented May 8, 2024

Noticed in https://github.com/filecoin-station/spark-evaluate/actions/runs/8995936280/job/24711656364?pr=205 that the dry run can take a while. Most of the time is spent waiting for w3s data to become available, but even without that it takes minutes.

@juliangruber requested a review from @bajtos on May 8, 2024 03:52
@juliangruber (Member, Author) commented:

Downside: This makes logs harder to read. We could prefix messages with the CID that's being preprocessed. Wdyt?

@bajtos (Member) left a comment:

Hmm. Let's discuss this a bit.

IMO, if we are fetching 20 CIDs from web3.storage as part of the dry run, we shouldn't wait 60 seconds for each CID. Instead, there should be a single pause at the start, if any pause is needed at all.

In a typical dry run, we fetch CIDs committed for the round before the previous one, i.e. those CIDs are more than 20 minutes old. That means we don't need to wait and can start fetching immediately.

I believe that's already the case - the pause is implemented in the event listener onMeasurementsAdded, while dry-run calls preprocess directly.

The second cause of slow dry run is the retry mechanism we use as a workaround for failing web3.storage retrievals. If I remember correctly, this has a very low success rate - once the web3.storage cache is poisoned with invalid block data, it takes very long until the cache slot is refreshed (if it's refreshed at all).

If the cache is poisoned, then I think the best thing to do in the dry run is to abort with an error immediately or retry at most once.

I propose to make the retry logic configurable and set retries=1 when calling preprocess from the dry-run script.
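A minimal sketch of what that configurable retry could look like. The `retries` option and the `fetchMeasurements` helper are hypothetical placeholders, not the actual spark-evaluate API:

```js
// Sketch only: `fetchMeasurements` stands in for the real web3.storage
// retrieval logic in lib/preprocess.js.
const fetchMeasurements = async (cid) => {
  const res = await fetch(`https://w3s.link/ipfs/${cid}`)
  if (!res.ok) throw new Error(`Cannot fetch measurements for ${cid}: ${res.status}`)
  return await res.text()
}

// Make the retry count configurable so callers can tune it.
export const preprocess = async ({ cid, retries = 5 }) => {
  let lastError
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fetchMeasurements(cid)
    } catch (err) {
      lastError = err
      if (attempt < retries) {
        // Back off before the next attempt, giving the w3s cache time to refresh.
        await new Promise((resolve) => setTimeout(resolve, 10_000 * (attempt + 1)))
      }
    }
  }
  throw lastError
}

// bin/dry-run.js could then fail fast instead of waiting on a poisoned cache:
// await preprocess({ cid, retries: 1 })
```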

This may be all that we need to change. If not, then I am fine to continue the work on this pull request and prefix log messages with the CID being fetched/preprocessed, as you suggested:

> Downside: This makes logs harder to read. We could prefix messages with the CID that's being preprocessed.

@juliangruber (Member, Author) commented May 10, 2024

> IMO, if we are fetching 20 CIDs from web3.storage as part of the dry run, we shouldn't wait 60 seconds for each CID. Instead, there should be a single pause at the start, if any pause is needed at all.

I don't understand this pause. Some CIDs take many retries until we can finally fetch them. Which pause are you talking about?

> In a typical dry run, we fetch CIDs committed for the round before the previous one, i.e. those CIDs are more than 20 minutes old. That means we don't need to wait and can start fetching immediately.

Please check the CI output of this check; unfortunately this isn't what's happening. Eg: (I have bad train internet so will look these up later)

> The second cause of slow dry run is the retry mechanism we use as a workaround for failing web3.storage retrievals. If I remember correctly, this has a very low success rate - once the web3.storage cache is poisoned with invalid block data, it takes very long until the cache slot is refreshed (if it's refreshed at all).

I'm not aware that the success rate is low; in my experience, we just need a retry count that's high enough to wait until w3s's caches have been invalidated.

> If the cache is poisoned, then I think the best thing to do in the dry run is to abort with an error immediately or retry at most once.

I agree: in prod we should retry, but in dry-run we don't need to retry at all and should just ignore the failure (unless all are failing).

> I propose to make the retry logic configurable and set retries=1 when calling preprocess from the dry-run script.

Will do and see if the performance is sufficient 👍

@juliangruber (Member, Author) commented:

The last check took 1m 29s. I'll see how much we can squeeze out of it

@juliangruber (Member, Author) commented:

1m 9s for the concurrent run. Wdyt @bajtos?
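For reference, the concurrent variant of bin/dry-run.js is roughly the following shape (a sketch, not the actual diff; `cids` and the exact `preprocess` signature are placeholders):

```js
// Sketch of the concurrent dry run: kick off all preprocess calls at once
// instead of awaiting them one by one.
const results = await Promise.allSettled(
  cids.map((cid) => preprocess({ cid, retries: 1 }))
)

// In a dry run, individual failures can be ignored, but if every fetch
// failed, something is genuinely wrong.
const failures = results.filter((r) => r.status === 'rejected')
if (failures.length === results.length) {
  throw new Error('All measurement fetches failed', { cause: failures[0].reason })
}
```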

@bajtos (Member) commented May 15, 2024

First of all, thank you for making the effort to speed up our CI builds! ❤️

> IMO, if we are fetching 20 CIDs from web3.storage as part of the dry run, we shouldn't wait 60 seconds for each CID. Instead, there should be a single pause at the start, if any pause is needed at all.

> I don't understand this pause. Some CIDs take many retries until we can finally fetch them. Which pause are you talking about?

In another discussion, we considered adding a delay between committing a CID to the chain and spark-evaluate fetching it from web3.storage. The idea was to give more time to web3.storage to propagate the content so that the retrieval attempt does not poison the cache.

My comment was referring to the situation after we implemented such a delay (pause).

> In a typical dry run, we fetch CIDs committed for the round before the previous one, i.e. those CIDs are more than 20 minutes old. That means we don't need to wait and can start fetching immediately.

> Please check the CI output of this check; unfortunately this isn't what's happening. Eg: (I have bad train internet so will look these up later)

Isn't this happening because spark-evaluate does not wait long enough to let web3.storage propagate the recently uploaded content, and therefore the cache is already poisoned when our CI runs the dry run?

Is it worth fixing spark-evaluate to not poison the web3.storage cache in the first place?

> The second cause of slow dry run is the retry mechanism we use as a workaround for failing web3.storage retrievals. If I remember correctly, this has a very low success rate - once the web3.storage cache is poisoned with invalid block data, it takes very long until the cache slot is refreshed (if it's refreshed at all).

> I'm not aware that the success rate is low; in my experience, we just need a retry count that's high enough to wait until w3s's caches have been invalidated.

I see. I wrongly assumed that the cache was invalidated after so long that we would be very unlikely to reach that point, even with several retry attempts.

> If the cache is poisoned, then I think the best thing to do in the dry run is to abort with an error immediately or retry at most once.

> I agree: in prod we should retry, but in dry-run we don't need to retry at all and should just ignore the failure (unless all are failing).

👍🏻

> I propose to make the retry logic configurable and set retries=1 when calling preprocess from the dry-run script.

> Will do and see if the performance is sufficient 👍

👍🏻

> The last check took 1m 29s. I'll see how much we can squeeze out of it
> 1m 9s for the concurrent run. Wdyt @bajtos?

If I am reading this correctly, the current serial run takes 1m29s, and the parallel run takes 1m9s, saving ~20 seconds.

To me, such a benefit does not justify the additional complexity involved in the parallelised version. I don't hold that opinion strongly, though.

@bajtos (Member) left a comment:

Let's not forget to prefix log messages with the CID being preprocessed if we are going to land the parallelisation.

> This may be all that we need to change. If not, then I am fine to continue the work on this pull request and prefix log messages with the CID being fetched/preprocessed, as you suggested:
>> Downside: This makes logs harder to read. We could prefix messages with the CID that's being preprocessed.
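A sketch of what the CID prefixing could look like; the `logger` option and the `createCidLogger` helper are hypothetical, not the existing preprocess API:

```js
// Sketch: build a per-CID logger so interleaved parallel output stays readable.
const createCidLogger = (cid) => ({
  log: (...args) => console.log(`[${cid}]`, ...args),
  error: (...args) => console.error(`[${cid}]`, ...args)
})

// e.g. in bin/dry-run.js:
// await preprocess({ cid, retries: 1, logger: createCidLogger(cid) })
```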

lib/preprocess.js: inline review comment (outdated, resolved)
@juliangruber (Member, Author) commented:

> First of all, thank you for making the effort to speed up our CI builds! ❤️

> IMO, if we are fetching 20 CIDs from web3.storage as part of the dry run, we shouldn't wait 60 seconds for each CID. Instead, there should be a single pause at the start, if any pause is needed at all.

> I don't understand this pause. Some CIDs take many retries until we can finally fetch them. Which pause are you talking about?

> In another discussion, we considered adding a delay between committing a CID to the chain and spark-evaluate fetching it from web3.storage. The idea was to give more time to web3.storage to propagate the content so that the retrieval attempt does not poison the cache.

> My comment was referring to the situation after we implemented such a delay (pause).

Ah, I wasn't aware you were referring to this here. I hadn't considered it, since it's not part of the code yet. Looks like we don't need to parallelize anyhow, so no need to solve this now.

> In a typical dry run, we fetch CIDs committed for the round before the previous one, i.e. those CIDs are more than 20 minutes old. That means we don't need to wait and can start fetching immediately.

> Please check the CI output of this check; unfortunately this isn't what's happening. Eg: (I have bad train internet so will look these up later)

> Isn't this happening because spark-evaluate does not wait long enough to let web3.storage propagate the recently uploaded content, and therefore the cache is already poisoned when our CI runs the dry run?

> Is it worth fixing spark-evaluate to not poison the web3.storage cache in the first place?

Unfortunately the cache can get poisoned even if we fetch an old round :| So I don't think this is in our control.

> The last check took 1m 29s. I'll see how much we can squeeze out of it
> 1m 9s for the concurrent run. Wdyt @bajtos?

> If I am reading this correctly, the current serial run takes 1m29s, and the parallel run takes 1m9s, saving ~20 seconds.

> To me, such a benefit does not justify the additional complexity involved in the parallelised version. I don't hold that opinion strongly, though.

👍

@juliangruber changed the title from "perf: parallelize bin/dry-run.js" to "perf: speed up bin/dry-run.js" on May 16, 2024
@juliangruber requested a review from @bajtos on May 16, 2024 10:21
@bajtos (Member) commented May 16, 2024

> Isn't this happening because spark-evaluate does not wait long enough to let web3.storage propagate the recently uploaded content, and therefore the cache is already poisoned when our CI runs the dry run?

> Is it worth fixing spark-evaluate to not poison the web3.storage cache in the first place?

> Unfortunately the cache can get poisoned even if we fetch an old round :| So I don't think this is in our control.

I vaguely remember that we wanted to implement a delay in spark-evaluate - when we receive a new MeasurementsAdded event, we can wait a bit before we fetch the CID from web3.storage, to allow web3.storage to propagate the recently uploaded content.

The trickier part is to ensure we also delay evaluation of the last round, so that we don't evaluate it before we fetch the last batch of measurements.

Sadly, I cannot find where we were discussing this - probably somewhere in Slack.

So I think there is something we can do about this, but I am not saying it's the most meaningful use of our time.
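Roughly, that delay idea looks like this (a sketch with made-up names and timing, not the actual spark-evaluate implementation):

```js
// Sketch: on MeasurementsAdded, wait before fetching the CID so web3.storage
// has time to propagate the freshly uploaded content and a premature retrieval
// doesn't poison its cache. The constant and handler name are illustrative.
const PROPAGATION_DELAY_MS = 60_000

const onMeasurementsAdded = async (cid) => {
  await new Promise((resolve) => setTimeout(resolve, PROPAGATION_DELAY_MS))
  await preprocess({ cid })
}
```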

@bajtos (Member) left a comment:

:shipit:

@bajtos (Member) commented May 16, 2024

> Isn't this happening because spark-evaluate does not wait long enough to let web3.storage propagate the recently uploaded content, and therefore the cache is already poisoned when our CI runs the dry run?
> Is it worth fixing spark-evaluate to not poison the web3.storage cache in the first place?

> Unfortunately the cache can get poisoned even if we fetch an old round :| So I don't think this is in our control.

> I vaguely remember that we wanted to implement a delay in spark-evaluate - when we receive a new MeasurementsAdded event, we can wait a bit before we fetch the CID from web3.storage, to allow web3.storage to propagate the recently uploaded content.

> The trickier part is to ensure we also delay evaluation of the last round, so that we don't evaluate it before we fetch the last batch of measurements.

> Sadly, I cannot find where we were discussing this - probably somewhere in Slack.

> So I think there is something we can do about this, but I am not saying it's the most meaningful use of our time.

Oh, I see we have already implemented and deployed that delay 😳🤔

#209

I agree there is little we can do now.

@juliangruber (Member, Author) commented May 17, 2024

> Isn't this happening because spark-evaluate does not wait long enough to let web3.storage propagate the recently uploaded content, and therefore the cache is already poisoned when our CI runs the dry run?
> Is it worth fixing spark-evaluate to not poison the web3.storage cache in the first place?

> Unfortunately the cache can get poisoned even if we fetch an old round :| So I don't think this is in our control.

> I vaguely remember that we wanted to implement a delay in spark-evaluate - when we receive a new MeasurementsAdded event, we can wait a bit before we fetch the CID from web3.storage, to allow web3.storage to propagate the recently uploaded content.

> The trickier part is to ensure we also delay evaluation of the last round, so that we don't evaluate it before we fetch the last batch of measurements.

> Sadly, I cannot find where we were discussing this - probably somewhere in Slack.

> So I think there is something we can do about this, but I am not saying it's the most meaningful use of our time.

Unfortunately caches also get poisoned with a delay. The delay will help, but not fix the problem. I also think atm this isn't high priority to fix.

@juliangruber merged commit 8ec795e into main on May 17, 2024
5 of 6 checks passed
@juliangruber deleted the perf/parallelize-dry-run branch on May 17, 2024 14:53
Labels: none · Projects: ✅ done · Linked issues: none · 2 participants