
Add parameter provisioningTimeout to Kubernetes driver options. #2457

Open · wants to merge 1 commit into master from extend_k8s_provisioning_timeout

Conversation

@Arsobbiak commented May 14, 2024

While running builds on Kubernetes, we hit the timeout on pod provisioning.
Our k8s cluster runs on EKS with Karpenter.
From time to time, node provisioning takes around a minute and a half before the node is ready to use.

Error:

#1 waiting for 1 pods to be ready
#1 waiting for 1 pods to be ready 110.0s done
#1 ERROR: expected 1 replicas to be ready, got 0

This PR lets you specify provisioningTimeout as a driver-opt parameter.

If that looks good, I will also add the parameter to the Docker documentation.
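
For illustration only (the option name is what this PR proposes, and the duration-style value follows the review discussion below; neither is documented yet), the new option would be passed like any other driver-opt when creating the builder:

    docker buildx create --driver kubernetes --driver-opt provisioningTimeout=120s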


As a second option, I'm happy to just change 100 → 200 on this line:

 case <-time.After(time.Duration(100+try*20) * time.Millisecond):  

@@ -113,7 +114,7 @@ func (d *Driver) wait(ctx context.Context) error {
select {
case <-ctx.Done():
return ctx.Err()
case <-time.After(time.Duration(100+try*20) * time.Millisecond):
Member:

I think we should keep the backoff logic
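
Since the existing wait loop already selects on ctx.Done(), one way to satisfy both requests is to keep the linear backoff and bound the whole wait with a configurable deadline. A minimal sketch, assuming a hypothetical provisioningTimeout duration and a ready callback standing in for the pod readiness check (this is not the PR's actual code):

package main

import (
	"context"
	"fmt"
	"time"
)

// waitWithBackoff keeps the driver's linear backoff (100ms, 120ms, 140ms, ...)
// while bounding the whole wait with a configurable provisioning timeout.
func waitWithBackoff(ctx context.Context, provisioningTimeout time.Duration, ready func() bool) error {
	ctx, cancel := context.WithTimeout(ctx, provisioningTimeout)
	defer cancel()

	for try := 0; ; try++ {
		if ready() {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err() // context.DeadlineExceeded once the timeout elapses
		case <-time.After(time.Duration(100+try*20) * time.Millisecond):
		}
	}
}

func main() {
	// Pretend the pods never become ready; the call returns after ~2 seconds.
	err := waitWithBackoff(context.Background(), 2*time.Second, func() bool { return false })
	fmt.Println(err)
}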

Author (@Arsobbiak), May 15, 2024:

Would it be OK with you if I just change 100 → 300?

@SimonJPegg:

@crazy-max I've hit this issue too and would appreciate a timeout option. I'd be interested to know why the backoff is better when the nodes don't spin up in time.

Member:

> I'd be interested to know why the backoff is better when the nodes don't spin up in time.

@SimonJPegg I don't think you would want to hammer the API every X ms; rather, back off on each retry. Although, looking at the retry logic, it seems fair not to keep it.

> Would it be OK with you if I just change 100 → 300?

I think it's fine to keep 100 as the default, but it might be better to use a duration; see #2457 (comment).

@@ -229,6 +232,8 @@ func (f *factory) processDriverOpts(deploymentName string, namespace string, cfg
if err != nil {
return nil, "", "", false, err
}
case "provisioningTimeout":
deploymentOpt.ProvisioningTimeout, err = strconv.Atoi(v)
Member:

Can we use a time.Duration type instead? Would need to do a time.ParseDuration here.
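
For reference, a minimal sketch of that suggestion, assuming the driver-opt value becomes a Go duration string and the struct field becomes a time.Duration (the value and names here are illustrative, not the PR's final code):

package main

import (
	"fmt"
	"time"
)

func main() {
	// Parse the driver-opt value as a Go duration string (e.g. "90s" or "2m")
	// instead of a plain integer.
	v := "90s" // hypothetical driver-opt value
	provisioningTimeout, err := time.ParseDuration(v)
	if err != nil {
		fmt.Println("invalid provisioningTimeout:", err)
		return
	}
	fmt.Println("parsed timeout:", provisioningTimeout) // prints "parsed timeout: 1m30s"
}

Compared with strconv.Atoi, this makes the unit explicit and rejects malformed values at option-parsing time.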

configMapClient clientcorev1.ConfigMapInterface
podChooser podchooser.PodChooser
defaultLoad bool
provisioningTimeout int
Member:

I'm not sure about naming it provisioningTimeout, as it is applied in the retry logic. Maybe retryTimeout?

@tonistiigi (Member):
PTAL #2472

@Arsobbiak force-pushed the extend_k8s_provisioning_timeout branch from 6a18032 to fe4788b on May 30, 2024 15:34
Author (@Arsobbiak) commented May 30, 2024:

Sorry that it took me so long to come back to this PR.

Thanks, @crazy-max, for pointing out that I can use time.ParseDuration. This way, it will be more user-friendly.

With the chatbot's help and your comments, I improved it slightly.

@Arsobbiak requested a review from crazy-max on May 30, 2024 15:51
Signed-off-by: Arnold Sobanski <arnold@l4g.dev>
@Arsobbiak force-pushed the extend_k8s_provisioning_timeout branch from fe4788b to 6677e0d on May 30, 2024 16:12
tryCounter++
}
ticker.Stop()
ticker = time.NewTicker(time.Duration(100+tryCounter*20) * time.Millisecond)
Author (@Arsobbiak):

I'm not sure if that logic is correct, or whether there is a better way of doing it.
Or maybe I should keep it static at 100*time.Millisecond.
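
One possible simplification, a sketch only and assuming Go 1.15 or newer (which added Ticker.Reset): adjust the interval in place instead of stopping and re-creating the ticker on every retry. The ready callback here is a stand-in for the driver's pod readiness check; this is not the PR's actual code.

package main

import (
	"context"
	"fmt"
	"time"
)

func waitReady(ctx context.Context, ready func() bool) error {
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()

	for tryCounter := 0; ; tryCounter++ {
		if ready() {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			// Mirror the existing linear backoff: 100ms, 120ms, 140ms, ...
			ticker.Reset(time.Duration(100+(tryCounter+1)*20) * time.Millisecond)
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	fmt.Println(waitReady(ctx, func() bool { return false })) // context deadline exceeded
}

Keeping a static 100*time.Millisecond ticker would also work; the trade-off is simply more frequent readiness checks against the Kubernetes API while waiting.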
