perf: speed up bin/dry-run.js #214

Merged: 8 commits merged into main on May 17, 2024

Conversation

@juliangruber (Member) commented May 8, 2024

Noticed in https://github.com/filecoin-station/spark-evaluate/actions/runs/8995936280/job/24711656364?pr=205 that the dry run can take a while. Most of the time is spent waiting for w3s data to become available, but even without that it takes minutes.

@juliangruber requested a review from @bajtos on May 8, 2024 03:52
@juliangruber (Member, Author) commented:

Downside: This makes logs harder to read. We could prefix messages with the CID that's being preprocessed. Wdyt?

@bajtos (Member) left a comment:

Hmm. Let's discuss this a bit.

IMO, if we are fetching 20 CIDs from web3.storage as part of the dry run, we shouldn't wait 60 seconds for each CID. Instead, there should be a single pause at the start, if any pause is needed at all.

In a typical dry run, we fetch CIDs committed for the round before the previous one, i.e. those CIDs are more than 20 minutes old. That means we don't need to wait and can start fetching immediately.

I believe that's already the case - the pause is implemented in the event listener onMeasurementsAdded, while dry-run calls preprocess directly.

The second cause of slow dry run is the retry mechanism we use as a workaround for failing web3.storage retrievals. If I remember correctly, this has a very low success rate - once the web3.storage cache is poisoned with invalid block data, it takes very long until the cache slot is refreshed (if it's refreshed at all).

If the cache is poisoned, then I think the best thing to do in the dry run is to abort with an error immediately or retry at most once.

I propose to make the retry logic configurable and set retries=1 when calling preprocess from the dry-run script.
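A minimal sketch of what that configurable retry could look like. The `retries` option and the `fetchMeasurements` helper are hypothetical placeholders, not the actual spark-evaluate API:

```js
// Sketch only: `fetchMeasurements` stands in for the real web3.storage
// retrieval logic in lib/preprocess.js.
const fetchMeasurements = async (cid) => {
  const res = await fetch(`https://w3s.link/ipfs/${cid}`)
  if (!res.ok) throw new Error(`Cannot fetch measurements for ${cid}: ${res.status}`)
  return await res.text()
}

// Make the retry count configurable so callers can tune it.
export const preprocess = async ({ cid, retries = 5 }) => {
  let lastError
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fetchMeasurements(cid)
    } catch (err) {
      lastError = err
      if (attempt < retries) {
        // Back off before the next attempt, giving the w3s cache time to refresh.
        await new Promise((resolve) => setTimeout(resolve, 10_000 * (attempt + 1)))
      }
    }
  }
  throw lastError
}

// bin/dry-run.js could then fail fast instead of waiting on a poisoned cache:
// await preprocess({ cid, retries: 1 })
```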

This may be all that we need to change. If not, then I am fine to continue the work on this pull request and prefix log messages with the CID being fetched/preprocessed, as you suggested:

> Downside: This makes logs harder to read. We could prefix messages with the CID that's being preprocessed.

@juliangruber (Member, Author) commented May 10, 2024

> IMO, if we are fetching 20 CIDs from web3.storage as part of the dry run, we shouldn't wait 60 seconds for each CID. Instead, there should be a single pause at the start, if any pause is needed at all.

I don't understand this pause. Some CIDs take many retries until we can finally fetch them. Which pause are you talking about?

> In a typical dry run, we fetch CIDs committed for the round before the previous one, i.e. those CIDs are more than 20 minutes old. That means we don't need to wait and can start fetching immediately.

Please check the CI output of this check; unfortunately this isn't what's happening. Eg: (I have bad train internet so will look these up later)

> The second cause of slow dry run is the retry mechanism we use as a workaround for failing web3.storage retrievals. If I remember correctly, this has a very low success rate - once the web3.storage cache is poisoned with invalid block data, it takes very long until the cache slot is refreshed (if it's refreshed at all).

I'm not aware that the success rate is low; in my experience, we just need a retry count that's high enough to wait until w3s's caches have been invalidated.

> If the cache is poisoned, then I think the best thing to do in the dry run is to abort with an error immediately or retry at most once.

I agree: in prod we should retry, but in dry-run we don't need to retry at all and should just ignore the failure (unless all are failing).

> I propose to make the retry logic configurable and set retries=1 when calling preprocess from the dry-run script.

Will do and see if the performance is sufficient 👍

@juliangruber (Member, Author) commented:

The last check took 1m 29s. I'll see how much we can squeeze out of it

@juliangruber (Member, Author) commented:

1m 9s for the concurrent run. Wdyt @bajtos?
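For reference, the concurrent variant of bin/dry-run.js is roughly the following shape (a sketch, not the actual diff; `cids` and the exact `preprocess` signature are placeholders):

```js
// Sketch of the concurrent dry run: kick off all preprocess calls at once
// instead of awaiting them one by one.
const results = await Promise.allSettled(
  cids.map((cid) => preprocess({ cid, retries: 1 }))
)

// In a dry run, individual failures can be ignored, but if every fetch
// failed, something is genuinely wrong.
const failures = results.filter((r) => r.status === 'rejected')
if (failures.length === results.length) {
  throw new Error('All measurement fetches failed', { cause: failures[0].reason })
}
```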

@bajtos (Member) commented May 15, 2024

First of all, thank you for making the effort to speed up our CI builds! ❤️

> IMO, if we are fetching 20 CIDs from web3.storage as part of the dry run, we shouldn't wait 60 seconds for each CID. Instead, there should be a single pause at the start, if any pause is needed at all.

> I don't understand this pause. Some CIDs take many retries until we can finally fetch them. Which pause are you talking about?

In another discussion, we considered adding a delay between committing a CID to the chain and spark-evaluate fetching it from web3.storage. The idea was to give more time to web3.storage to propagate the content so that the retrieval attempt does not poison the cache.

My comment was referring to the situation after we implemented such a delay (pause).

> In a typical dry run, we fetch CIDs committed for the round before the previous one, i.e. those CIDs are more than 20 minutes old. That means we don't need to wait and can start fetching immediately.

> Please check the CI output of this check; unfortunately this isn't what's happening. Eg: (I have bad train internet so will look these up later)

Isn't this happening because spark-evaluate does not wait long enough to let web3.storage propagate the recently uploaded content, and therefore the cache is already poisoned when our CI runs the dry run?

Is it worth fixing spark-evaluate to not poison the web3.storage cache in the first place?

> The second cause of slow dry run is the retry mechanism we use as a workaround for failing web3.storage retrievals. If I remember correctly, this has a very low success rate - once the web3.storage cache is poisoned with invalid block data, it takes very long until the cache slot is refreshed (if it's refreshed at all).

> I'm not aware that the success rate is low; in my experience, we just need a retry count that's high enough to wait until w3s's caches have been invalidated.

I see. I wrongly assumed that the cache was invalidated after so long that we would be very unlikely to reach that point, even with several retry attempts.

> If the cache is poisoned, then I think the best thing to do in the dry run is to abort with an error immediately or retry at most once.

> I agree: in prod we should retry, but in dry-run we don't need to retry at all and should just ignore the failure (unless all are failing).

👍🏻

> I propose to make the retry logic configurable and set retries=1 when calling preprocess from the dry-run script.

> Will do and see if the performance is sufficient 👍

👍🏻

> The last check took 1m 29s. I'll see how much we can squeeze out of it
> 1m 9s for the concurrent run. Wdyt @bajtos?

If I am reading this correctly, the current serial run takes 1m29s, and the parallel run takes 1m9s, saving ~20 seconds.

To me, such a benefit does not justify the additional complexity involved in the parallelised version. I don't hold that opinion strongly, though.

@bajtos (Member) left a comment:

Let's not forget to prefix log messages with the CID being preprocessed if we are going to land the parallelisation.

> This may be all that we need to change. If not, then I am fine to continue the work on this pull request and prefix log messages with the CID being fetched/preprocessed, as you suggested:
>> Downside: This makes logs harder to read. We could prefix messages with the CID that's being preprocessed.
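A sketch of what the CID prefixing could look like; the `logger` option and the `createCidLogger` helper are hypothetical, not the existing preprocess API:

```js
// Sketch: build a per-CID logger so interleaved parallel output stays readable.
const createCidLogger = (cid) => ({
  log: (...args) => console.log(`[${cid}]`, ...args),
  error: (...args) => console.error(`[${cid}]`, ...args)
})

// e.g. in bin/dry-run.js:
// await preprocess({ cid, retries: 1, logger: createCidLogger(cid) })
```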

lib/preprocess.js: inline review comment (outdated, resolved)
@juliangruber (Member, Author) commented:

> First of all, thank you for making the effort to speed up our CI builds! ❤️

> IMO, if we are fetching 20 CIDs from web3.storage as part of the dry run, we shouldn't wait 60 seconds for each CID. Instead, there should be a single pause at the start, if any pause is needed at all.

> I don't understand this pause. Some CIDs take many retries until we can finally fetch them. Which pause are you talking about?

> In another discussion, we considered adding a delay between committing a CID to the chain and spark-evaluate fetching it from web3.storage. The idea was to give more time to web3.storage to propagate the content so that the retrieval attempt does not poison the cache.

> My comment was referring to the situation after we implemented such a delay (pause).

Ah, I wasn't aware you were referring to this here. I hadn't considered it, since it's not part of the code yet. Looks like we don't need to parallelize anyhow, so no need to solve this now.

> In a typical dry run, we fetch CIDs committed for the round before the previous one, i.e. those CIDs are more than 20 minutes old. That means we don't need to wait and can start fetching immediately.

> Please check the CI output of this check; unfortunately this isn't what's happening. Eg: (I have bad train internet so will look these up later)

> Isn't this happening because spark-evaluate does not wait long enough to let web3.storage propagate the recently uploaded content, and therefore the cache is already poisoned when our CI runs the dry run?

> Is it worth fixing spark-evaluate to not poison the web3.storage cache in the first place?

Unfortunately the cache can get poisoned even if we fetch an old round :| So I don't think this is in our control.

> The last check took 1m 29s. I'll see how much we can squeeze out of it
> 1m 9s for the concurrent run. Wdyt @bajtos?

> If I am reading this correctly, the current serial run takes 1m29s, and the parallel run takes 1m9s, saving ~20 seconds.

> To me, such a benefit does not justify the additional complexity involved in the parallelised version. I don't hold that opinion strongly, though.

👍

@juliangruber changed the title from "perf: parallelize bin/dry-run.js" to "perf: speed up bin/dry-run.js" on May 16, 2024
@juliangruber requested a review from @bajtos on May 16, 2024 10:21
@bajtos (Member) commented May 16, 2024

> Isn't this happening because spark-evaluate does not wait long enough to let web3.storage propagate the recently uploaded content, and therefore the cache is already poisoned when our CI runs the dry run?

> Is it worth fixing spark-evaluate to not poison the web3.storage cache in the first place?

> Unfortunately the cache can get poisoned even if we fetch an old round :| So I don't think this is in our control.

I vaguely remember that we wanted to implement a delay in spark-evaluate - when we receive a new MeasurementsAdded event, we can wait a bit before we fetch the CID from web3.storage, to allow web3.storage to propagate the recently uploaded content.

The trickier part is to ensure we also delay evaluation of the last round, so that we don't evaluate it before we fetch the last batch of measurements.

Sadly, I cannot find where we were discussing this - probably somewhere in Slack.

So I think there is something we can do about this, but I am not saying it's the most meaningful use of our time.
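Roughly, that delay idea looks like this (a sketch with made-up names and timing, not the actual spark-evaluate implementation):

```js
// Sketch: on MeasurementsAdded, wait before fetching the CID so web3.storage
// has time to propagate the freshly uploaded content and a premature retrieval
// doesn't poison its cache. The constant and handler name are illustrative.
const PROPAGATION_DELAY_MS = 60_000

const onMeasurementsAdded = async (cid) => {
  await new Promise((resolve) => setTimeout(resolve, PROPAGATION_DELAY_MS))
  await preprocess({ cid })
}
```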

@bajtos (Member) left a comment:

:shipit:

@bajtos (Member) commented May 16, 2024

> Isn't this happening because spark-evaluate does not wait long enough to let web3.storage propagate the recently uploaded content, and therefore the cache is already poisoned when our CI runs the dry run?
> Is it worth fixing spark-evaluate to not poison the web3.storage cache in the first place?

> Unfortunately the cache can get poisoned even if we fetch an old round :| So I don't think this is in our control.

> I vaguely remember that we wanted to implement a delay in spark-evaluate - when we receive a new MeasurementsAdded event, we can wait a bit before we fetch the CID from web3.storage, to allow web3.storage to propagate the recently uploaded content.

> The trickier part is to ensure we also delay evaluation of the last round, so that we don't evaluate it before we fetch the last batch of measurements.

> Sadly, I cannot find where we were discussing this - probably somewhere in Slack.

> So I think there is something we can do about this, but I am not saying it's the most meaningful use of our time.

Oh, I see we have already implemented and deployed that delay 😳🤔

#209

I agree there is little we can do now.

@juliangruber (Member, Author) commented May 17, 2024

> Isn't this happening because spark-evaluate does not wait long enough to let web3.storage propagate the recently uploaded content, and therefore the cache is already poisoned when our CI runs the dry run?
> Is it worth fixing spark-evaluate to not poison the web3.storage cache in the first place?

> Unfortunately the cache can get poisoned even if we fetch an old round :| So I don't think this is in our control.

> I vaguely remember that we wanted to implement a delay in spark-evaluate - when we receive a new MeasurementsAdded event, we can wait a bit before we fetch the CID from web3.storage, to allow web3.storage to propagate the recently uploaded content.

> The trickier part is to ensure we also delay evaluation of the last round, so that we don't evaluate it before we fetch the last batch of measurements.

> Sadly, I cannot find where we were discussing this - probably somewhere in Slack.

> So I think there is something we can do about this, but I am not saying it's the most meaningful use of our time.

Unfortunately caches also get poisoned with a delay. The delay will help, but not fix the problem. I also think atm this isn't high priority to fix.

@juliangruber merged commit 8ec795e into main on May 17, 2024
5 of 6 checks passed
@juliangruber deleted the perf/parallelize-dry-run branch on May 17, 2024 14:53
Labels: none · Projects: ✅ done · Linked issues: none · 2 participants