
Ollama 0.1.38 has high video memory usage and runs very slowly. #4497

Open · chenwei0930 opened this issue May 17, 2024 · 3 comments
Labels: bug (Something isn't working)

Comments

@chenwei0930

What is the issue?

I am using Windows 10 with an NVIDIA 2080Ti graphics card that has 22GB of video memory. I upgraded from version 0.1.32 to 0.1.38 to get support for loading multiple models and handling multiple concurrent requests. However, under version 0.1.38 the video memory usage is very high and generation has become much slower.

I am using the "codeqwen:7b-chat-v1.5-q8_0" model. Under version 0.1.32, it used around 8GB of video memory and output approximately 10 tokens per second. However, under version 0.1.38, it is using 18.8GB of video memory, and based on my observation, it is only outputting 1-2 tokens per second.
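For anyone trying to reproduce these numbers, a minimal measurement sketch (model name as reported above; the flags are standard nvidia-smi/ollama usage, but verify on your install):

```shell
# Watch VRAM usage, refreshing once per second, while the model runs.
nvidia-smi --query-gpu=memory.used --format=csv -l 1

# In another terminal, run a prompt with timing stats; --verbose prints
# the eval rate in tokens per second at the end of the generation.
ollama run codeqwen:7b-chat-v1.5-q8_0 --verbose
```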

OS: Windows
GPU: Nvidia
CPU: Intel
Ollama version: 0.1.38

chenwei0930 added the bug label on May 17, 2024
@oldmanjk

Can confirm 0.1.38 seems to want more video memory

@dhiltgen (Collaborator)

@chenwei0930 you mention enabling concurrency... what settings are you using? In particular, when you set OLLAMA_NUM_PARALLEL we have to multiply the context by that number, and it looks like this model has a default context size of 8192, so if you set a large parallel factor that might explain what you're seeing. I wouldn't expect to see a drop in token rate for a single request though. Perhaps ollama ps will help shed some light? Failing that, can you share server logs so we can see what might be going on?
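To make the arithmetic concrete, a sketch (the environment variable and commands below are standard Ollama usage; the specific numbers are illustrative):

```shell
# PowerShell on Windows. Each parallel slot gets its own KV-cache slice,
# so the effective context is OLLAMA_NUM_PARALLEL * context size. With
# this model's 8192-token default, 4 slots means cache for 32768 tokens.
$env:OLLAMA_NUM_PARALLEL = "1"   # single slot while diagnosing memory use
ollama serve

# In another terminal: shows each loaded model's size and GPU/CPU split.
ollama ps
```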

dhiltgen self-assigned this on May 21, 2024
@Kyncc commented May 27, 2024

Do you use the context size option? When I change it, the GPU memory usage drops.
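This presumably refers to the num_ctx option (the context size). A minimal sketch of overriding it per request through the local REST API (the value 2048 is illustrative; this model's default is 8192):

```shell
# A smaller num_ctx shrinks the KV cache Ollama reserves on the GPU.
curl http://localhost:11434/api/generate -d '{
  "model": "codeqwen:7b-chat-v1.5-q8_0",
  "prompt": "Hello",
  "options": { "num_ctx": 2048 }
}'
```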
