llama_model_load: error loading model: unable to allocate backend buffer #7366

Closed

phaelon74 opened this issue May 18, 2024 · 2 comments

@phaelon74

OS: Windows 11, running Text Generation WebUI, up to date on all releases.
Processor: Intel Core i5-8500 3GHz (6 Cores - no HT)
Memory: 16GB System Memory
GPUs: Five NVIDIA GeForce RTX 3060 - 12GB VRAM versions (first Covid-era revision)

Model: Coomand-R-35B-v1-OLD_Q4_K_M.gguf

Model Parameters:

  • n-gpu-layers: 41 (41 of 41, loading FULLY into VRAM)
  • n_ctx: 8192
  • tensor split: 10,10,10,10,10
  • flash-attn: Checked
  • tensorcores: checked
  • no-mmap: checked
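
For reference, here is a minimal llama-cpp-python sketch that approximates these settings (the exact keyword arguments text-generation-webui forwards under the hood are an assumption on my part):

```python
# Rough llama-cpp-python equivalent of the WebUI settings above
# (an approximation; the exact kwargs the WebUI passes are assumed).
from llama_cpp import Llama

llm = Llama(
    model_path="models/Coomand-R-35B-v1-OLD_Q4_K_M.gguf",
    n_gpu_layers=41,                    # offload all 41 layers to the GPUs
    n_ctx=8192,                         # requested context length
    tensor_split=[10, 10, 10, 10, 10],  # even split across the five 12GB cards
    flash_attn=True,                    # "flash-attn: Checked"
    use_mmap=False,                     # "no-mmap: checked"
)
# Note: the "tensorcores" checkbox selects a wheel built with tensor-core
# kernels; it is not a Llama() keyword argument.
```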

Output from Model Load:

08:02:50-987413 INFO     Loading "Coomand-R-35B-v1-OLD_Q4_K_M.gguf"
08:02:51-580810 INFO     llama.cpp weights detected: "models\Coomand-R-35B-v1-OLD_Q4_K_M.gguf"
llama_model_loader: loaded meta data with 23 key-value pairs and 322 tensors from models\Coomand-R-35B-v1-OLD_Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = command-r
llama_model_loader: - kv   1:                               general.name str              = workspace
llama_model_loader: - kv   2:                      command-r.block_count u32              = 40
llama_model_loader: - kv   3:                   command-r.context_length u32              = 131072
llama_model_loader: - kv   4:                 command-r.embedding_length u32              = 8192
llama_model_loader: - kv   5:              command-r.feed_forward_length u32              = 22528
llama_model_loader: - kv   6:             command-r.attention.head_count u32              = 64
llama_model_loader: - kv   7:          command-r.attention.head_count_kv u32              = 64
llama_model_loader: - kv   8:                   command-r.rope.freq_base f32              = 8000000.000000
llama_model_loader: - kv   9:     command-r.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 15
llama_model_loader: - kv  11:                      command-r.logit_scale f32              = 0.062500
llama_model_loader: - kv  12:                command-r.rope.scaling.type str              = none
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,256000]  = ["<PAD>", "<UNK>", "<CLS>", "<SEP>", ...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, ...
llama_model_loader: - kv  16:                      tokenizer.ggml.merges arr[str,253333]  = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ a...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 5
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 255001
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  22:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   41 tensors
llama_model_loader: - type q4_K:  240 tensors
llama_model_loader: - type q6_K:   41 tensors
llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab:
llm_load_vocab: ************************************
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!
llm_load_vocab: CONSIDER REGENERATING THE MODEL
llm_load_vocab: ************************************
llm_load_vocab:
llm_load_vocab: special tokens definition check successful ( 1008/256000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = command-r
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 256000
llm_load_print_meta: n_merges         = 253333
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 64
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 8192
llm_load_print_meta: n_embd_v_gqa     = 8192
llm_load_print_meta: f_norm_eps       = 1.0e-05
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 6.2e-02
llm_load_print_meta: n_ff             = 22528
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = none
llm_load_print_meta: freq_base_train  = 8000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 35B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 34.98 B
llm_load_print_meta: model size       = 20.04 GiB (4.92 BPW)
llm_load_print_meta: general.name     = workspace
llm_load_print_meta: BOS token        = 5 '<BOS_TOKEN>'
llm_load_print_meta: EOS token        = 255001 '<|END_OF_TURN_TOKEN|>'
llm_load_print_meta: PAD token        = 0 '<PAD>'
llm_load_print_meta: LF token         = 136 'Ċ'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 5 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
  Device 2: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
  Device 3: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
  Device 4: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    1.01 MiB
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 5099.12 MiB on device 4: cudaMalloc failed: out of memory
llama_model_load: error loading model: unable to allocate backend buffer
llama_load_model_from_file: failed to load model
CUDA error: out of memory
  current device: 4, in function ggml_backend_cuda_host_buffer_free_buffer at D:\a\llama-cpp-python-cuBLAS-wheels\llama-cpp-python-cuBLAS-wheels\vendor\llama.cpp\ggml-cuda.cu:993
  cudaFreeHost(buffer->context)
GGML_ASSERT: D:\a\llama-cpp-python-cuBLAS-wheels\llama-cpp-python-cuBLAS-wheels\vendor\llama.cpp\ggml-cuda.cu:61: !"CUDA error"

This really doesn't make any sense to me, as a 35B-parameter model at Q4 should load into 50GB of VRAM without issue.
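
For reference, a back-of-the-envelope estimate from the metadata in the log above (20.04 GiB of weights, 40 layers, n_embd_k_gqa = 8192) supports this expectation; this is only a sketch that assumes an f16 KV cache and ignores CUDA compute/scratch buffers:

```python
# Rough VRAM estimate from the logged metadata (a sketch; assumes an f16
# KV cache and ignores compute/scratch buffers).
model_gib = 20.04        # "model size = 20.04 GiB" from the log
n_layer   = 40           # command-r.block_count
n_ctx     = 8192         # requested context
n_embd_kv = 8192         # n_embd_k_gqa = n_embd_v_gqa (no GQA for command-r)
bytes_f16 = 2

kv_gib = 2 * n_layer * n_ctx * n_embd_kv * bytes_f16 / 1024**3  # K + V
print(f"KV cache ~{kv_gib:.1f} GiB, total ~{model_gib + kv_gib:.1f} GiB")
# -> KV cache ~10.0 GiB, total ~30.0 GiB, i.e. roughly 6 GiB per card when
#    split evenly across five 12 GB GPUs, before compute buffers.
```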

@slaren
Collaborator

slaren commented May 18, 2024

The error is exactly what it says: a call to cudaMalloc failed with an "out of memory" error, which is outside the control of llama.cpp. NVIDIA may be able to help you with that, but I suspect that increasing the amount of system memory would fix the issue; I have found that CUDA allocations can fail when the system is low on memory.
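
For reference, one way to confirm how much free memory each GPU actually has before loading is a quick NVML query. This is just a sketch and assumes the pynvml package is installed:

```python
# Per-GPU free-memory check before loading (a sketch; assumes pynvml is
# installed, e.g. via "pip install nvidia-ml-py").
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"device {i}: {mem.free / 2**20:.0f} MiB free of {mem.total / 2**20:.0f} MiB")
pynvml.nvmlShutdown()
```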

@slaren closed this as not planned on May 19, 2024
@phaelon74
Author

phaelon74 commented May 20, 2024

I moved all the cards to a new system with 128GB of system RAM, and the same issue is occurring. The model loads without issue with Transformers in 4-bit, 8-bit, and at full precision in VRAM, and it also loads in Q8 as a GGUF on the same system. Something about Command-R is not happy with llama.cpp. @slaren it uses ~77GB of system RAM and 96GB of VRAM before bombing out when I choose llama.cpp. That's not normal: a 35B model should fit just fine in 96GB of VRAM and 128GB of system memory at 8192 context.
