Checked other resources
Commit to Help
Example Code

```typescript
import { LlamaCpp } from "@langchain/community/llms/llama_cpp";

// Local GGUF model path (an earlier model is kept in the trailing comment).
const llamaPath = "../../tinyllama-1.1b-chat-v1.0.Q2_K.gguf"; // "../../llama-2-7b-chat.GGUF.q4_0.bin";
const model = new LlamaCpp({ modelPath: llamaPath, temperature: 0.7, maxTokens: 25, streaming: true });

// await model.invoke("You are helpful assistant. Respond to questions. Reply to greetings.");

const prompt = "Hello";
const stream = await model.stream(prompt);
for await (const chunk of stream) {
  console.log(chunk);
}
```

Description

I've tried many models, but the problem is always the same: if I don't send a question but just a simple "Hello", the responses are confusing and very long; sometimes it seems like the generation won't stop and I have to force it to close.

System Info

platform: mac, Apple M1 Max
Replies: 1 comment 2 replies
It sounds like you're encountering an issue with the LlamaCpp model generating infinitely long and confusing responses, especially when using simple prompts like "Hello". This behavior might be related to how the streaming functionality is implemented in the LlamaCpp model within the LangChain JS framework.

The `stream` method you're using is designed to return a stream of responses in chunks, which should theoretically stop once the model has finished generating its response based on the `maxTokens` parameter you've set. However, if the stream doesn't terminate as expected, it could lead to the infinite-response issue you're experiencing.

Given that you're already specifying a `maxTokens` limit, the output should be bounded to that many tokens. To address this, you might also consider implementing a manual timeout or abort mechanism around the stream.

Additionally, ensure that your environment, particularly the Apple M1 Max chip, is fully compatible with the node-llama-cpp bindings that the LlamaCpp integration relies on.

If the issue persists, providing more details about the specific models you're using and any additional configuration options might help further diagnose and resolve the problem.
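For example, a minimal sketch of such a timeout/abort mechanism, assuming the standard `signal` call option is forwarded to the model (whether the local llama.cpp generation actually halts mid-stream may vary by version):

```typescript
import { LlamaCpp } from "@langchain/community/llms/llama_cpp";

// Same local GGUF path as in the question; adjust to your setup.
const llamaPath = "../../tinyllama-1.1b-chat-v1.0.Q2_K.gguf";
const model = new LlamaCpp({ modelPath: llamaPath, temperature: 0.7, maxTokens: 25 });

// Abort the stream if it has not finished after 15 seconds.
const controller = new AbortController();
const timer = setTimeout(() => controller.abort(), 15_000);

try {
  const stream = await model.stream("Hello", { signal: controller.signal });
  for await (const chunk of stream) {
    console.log(chunk);
  }
} catch (err) {
  // An aborted stream typically surfaces here as an error.
  console.error("Generation stopped:", err);
} finally {
  clearTimeout(timer);
}
```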