feat: load balancing Google Vertex AI API across US/CA regions #2795

msg7086 · 2024-05-19T21:33:06Z

Summary

The Google Vertex AI API provided by Google Cloud currently has a quota of 1 request per minute per region. If you are having a conversation with Gemini 1.5 Pro / Flash and you reply more than twice in a minute, you'll hit the quota limit and have to wait. Load balancing across multiple regions solves this problem.

It also spreads the load on Google's side, preventing the us-central1 region from being flooded by requests from the same app.

The code change is minimal and should not affect user experience. The list includes only US/CA regions for now because they are close to the previous option, us-central1. Users who live close to US central should not see any performance impact. Users who don't connect well to US central may see a performance improvement.
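As a rough illustration of the idea, and not the code in this PR, per-request region selection could look like the sketch below. The region list is an assumption chosen for illustration (the regions actually included in the change may differ), and `pick_region` and `endpoint_for` are hypothetical helper names. The hostname pattern is the standard regional Vertex AI endpoint form.

```python
import random

# Illustrative US/CA region list only; the list used in the PR may differ.
REGIONS = [
    "us-central1",
    "us-east1",
    "us-east4",
    "us-west1",
    "us-west4",
    "northamerica-northeast1",
]


def pick_region(regions=REGIONS, rng=random):
    """Return one region chosen uniformly at random (hypothetical helper)."""
    return rng.choice(regions)


def endpoint_for(region):
    """Build the regional Vertex AI API hostname for a region."""
    return f"{region}-aiplatform.googleapis.com"


if __name__ == "__main__":
    # Each request picks its own region, so consecutive replies are
    # likely to be counted against different per-region quotas.
    for _ in range(3):
        print(endpoint_for(pick_region()))
```

Random selection keeps the change small and stateless. It does not guarantee that two consecutive requests land in different regions, so it reduces quota hits rather than eliminating them.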

This is a preliminary implementation to mitigate #2723.

Change Type

New feature (non-breaking change which adds functionality)

Testing

TBD

Checklist

My code adheres to this project's style guidelines
I have performed a self-review of my own code
I have commented in any complex areas of my code
My changes do not introduce new warnings
Local unit tests pass with my changes

feat: load balancing Google Vertex AI API across US/CA regions

d390947
