perf: speed up bin/dry-run.js
#214
Conversation
Downside: This makes logs harder to read. We could prefix messages with the CID that's being preprocessed. Wdyt?
Hmm. Let's discuss this a bit.
IMO, if we are fetching 20 CIDs from web3.storage as part of the dry run, we shouldn't wait 60 seconds for each CID. Instead, there should be a single pause at the start, if any pause is needed at all.
In a typical dry run, we fetch CIDs committed for the round before the previous one, i.e. those CIDs are more than 20 minutes old. That means we don't need to wait and can start fetching immediately.
I believe that's already the case - the pause is implemented in the event listener `onMeasurementsAdded`, while dry-run calls `preprocess` directly.
The second cause of slow dry run is the retry mechanism we use as a workaround for failing web3.storage retrievals. If I remember correctly, this has a very low success rate - once the web3.storage cache is poisoned with invalid block data, it takes very long until the cache slot is refreshed (if it's refreshed at all).
If the cache is poisoned, then I think the best thing to do in the dry run is to abort with an error immediately or retry at most once.
I propose to make the retry logic configurable and set `retries=1` when calling `preprocess` from the dry-run script.
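
A minimal sketch of what that configurability could look like - the helper name and option shape here are assumptions for illustration, not the actual spark-evaluate API:

```js
// Hypothetical sketch: retry helper with a configurable retry count.
// Names (fetchWithRetry, retries, delayMs) are illustrative assumptions.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

async function fetchWithRetry (fetchFn, { retries = 5, delayMs = 10_000 } = {}) {
  let lastError
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fetchFn()
    } catch (err) {
      lastError = err
      // Wait before the next attempt, but not after the final failure.
      if (attempt < retries) await sleep(delayMs)
    }
  }
  throw lastError
}

// bin/dry-run.js could then fail fast by passing a low retry count, e.g.:
// await preprocess({ cid, retries: 1 })
```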
This may be all that we need to change. If not, then I am fine to continue the work on this pull request and prefix log messages with the CID being fetched/preprocessed, as you suggested:
> Downside: This makes logs harder to read. We could prefix messages with the CID that's being preprocessed.
This reverts commit 2d0043a.
I don't understand this pause. Some CIDs take many retries until we can finally fetch them. Which pause are you talking about?
Please check the CI output of this check - this unfortunately isn't what's happening. E.g.: (I have bad train internet, so I'll look these up later)
I'm not aware that the success rate is low, in my experience we just need to have a retry count that's high enough to wait until w3s's caches have been invalidated.
I agree, in prod we should retry but in dry-run we don't need to retry at all, and should just ignore this failure (unless all are failing).
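
One way the dry run could tolerate individual failures while still aborting when everything fails - a sketch, with `preprocess` as an assumed injected dependency:

```js
// Hypothetical sketch: ignore individual CID fetch failures in the dry run,
// but abort when every fetch fails.
async function dryRunPreprocess (cids, preprocess) {
  const results = await Promise.allSettled(cids.map((cid) => preprocess(cid)))
  const failures = results.filter((r) => r.status === 'rejected')
  if (failures.length > 0 && failures.length === results.length) {
    throw new Error('All CID fetches failed', { cause: failures[0].reason })
  }
  for (const { reason } of failures) {
    console.warn('Ignoring failed fetch:', reason)
  }
  return results.filter((r) => r.status === 'fulfilled').map((r) => r.value)
}
```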
Will do and see if the performance is sufficient 👍
The last check took 1m 29s. I'll see how much we can squeeze out of it.
This reverts commit 06f9c00.
1m 9s for the concurrent run. Wdyt @bajtos?
First of all, thank you for making the effort to speed up our CI builds! ❤️
In another discussion, we considered adding a delay between committing a CID to the chain and spark-evaluate fetching it from web3.storage. The idea was to give more time to web3.storage to propagate the content so that the retrieval attempt does not poison the cache. My comment was referring to the situation after we implemented such a delay (pause).
Isn't this happening because spark-evaluate does not wait long enough to let web3.storage propagate the recently uploaded content, and therefore the cache is already poisoned when our CI runs the dry run? Is it worth fixing spark-evaluate to not poison the web3.storage cache in the first place?
I see. I wrongly assumed that the cache was invalidated after so long that we are very unlikely to reach that point, even with several retry attempts.
👍🏻
👍🏻
If I am reading this correctly, the current serial run takes 1m 29s, and the parallel run takes 1m 9s, saving ~20 seconds. To me, such a benefit does not justify the additional complexity involved in the parallelised version. I don't hold that opinion strongly, though.
Let's not forget to prefix log messages with the CID being preprocessed if we are going to land the parallelisation.
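
For reference, a minimal sketch of such prefixing - the `createCidLogger` helper is a hypothetical name, not existing code:

```js
// Hypothetical sketch: wrap console so every log line carries the CID
// currently being preprocessed.
const createCidLogger = (cid) => ({
  log: (...args) => console.log(`[${cid}]`, ...args),
  error: (...args) => console.error(`[${cid}]`, ...args)
})

// Each concurrent preprocess call would receive its own logger, e.g.:
// const logger = createCidLogger(cid)
// logger.log('fetching measurements from web3.storage')
```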
> This may be all that we need to change. If not, then I am fine to continue the work on this pull request and prefix log messages with the CID being fetched/preprocessed, as you suggested:
>
> > Downside: This makes logs harder to read. We could prefix messages with the CID that's being preprocessed.
Ah, I wasn't aware you're referring to this here. I hadn't considered it, since it's not part of the code yet. Looks like we don't need this parallelization anyhow, so there's no need to solve this now.
Unfortunately the cache can get poisoned even if we fetch an old round :| So I don't think this is in our control.
👍
bin/dry-run.js
I vaguely remember that we wanted to implement a delay in spark-evaluate - when we receive a new `MeasurementsAdded` event, wait a while before fetching the measurements from web3.storage. The trickier part is to ensure we also delay evaluation of the last round, so that we don't evaluate it before we fetch the last batch of measurements. Sadly, I cannot find where we were discussing this - probably somewhere in Slack. So I think there is something we can do about this, but I am not saying it's the most meaningful use of our time.
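
As a sketch of the delay being described - the delay value, handler shape, and `preprocess` dependency are assumptions, not the actual implementation:

```js
// Hypothetical sketch: delay handling of a new MeasurementsAdded event so
// web3.storage has time to propagate the content before we fetch it.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))
const PROPAGATION_DELAY_MS = 60_000 // illustrative value

// `preprocess` is an assumed dependency, injected for the sketch.
const createOnMeasurementsAdded = (preprocess) =>
  async (cid, roundIndex) => {
    await sleep(PROPAGATION_DELAY_MS)
    await preprocess({ cid, roundIndex })
  }
```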
Oh, I see we have already implemented and deployed that delay 😳🤔 I agree there is little we can do now.
Unfortunately caches also get poisoned with a delay. The delay will help, but not fix the problem. I also think atm this isn't high priority to fix.
Noticed in https://github.com/filecoin-station/spark-evaluate/actions/runs/8995936280/job/24711656364?pr=205 that it can take a while. While most of the time is spent waiting for w3s data to become available, even without that wait it takes minutes.