# evaluation

Here are 1,116 public repositories matching this topic...

onejune2018 / Awesome-LLM-Eval

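Awesome-LLM-Eval: a curated list of tools, datasets/benchmark, demos, leaderboard, papers, docs and models, mainly for Evaluation on LLMs. A curated list of tools, benchmarks/data, demos, leaderboards, and large models, aimed mainly at evaluating foundation models and at exploring the technical boundaries of generative AI.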

nlp benchmark machine-learning leaderboard evaluation dataset openai llama bert rag awsome-list gpt3 llm awsome-lists chatgpt large-language-model chatglm qwen llm-evaluation

Updated Jun 12, 2024

langfuse / langfuse

🪢 Open source LLM engineering platform: Observability, metrics, evals, prompt management, playground, datasets. Integrates with LlamaIndex, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23

open-source playground monitoring analytics evaluation self-hosted ycombinator openai gpt observability large-language-models llm prompt-engineering langchain llmops llama-index prompt-management evals llm-evaluation

Updated Jun 12, 2024
TypeScript

langchain-ai / langsmith-sdk

LangSmith Client SDK Implementations

evaluation language-model observability

Updated Jun 12, 2024
Python

tatsu-lab / alpaca_eval

An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.

nlp deep-learning leaderboard evaluation instruction-following foundation-models large-language-models rlhf

Updated Jun 12, 2024
Jupyter Notebook

langchain-ai / langsmith-docs

Documentation for langsmith

testing documentation evaluation tracing langchain langsmith

Updated Jun 12, 2024
MDX

NTDLS / CMathParser

A fairly robust mathematics parsing engine for C++ projects.

library parsing math evaluation mathematics showcase expression-parser

Updated Jun 11, 2024
C++

ncalc / ncalc

Mathematical Expressions Evaluator for .NET

parser csharp math runtime async dotnet evaluation antlr antlr4 expressions ncalc

Updated Jun 11, 2024
C#

JieyuZ2 / TaskMeAnything

A task generation and model evaluation system.

benchmark evaluation foundation-models

Updated Jun 11, 2024
Python

Psycoy / MixEval

The official evaluation suite and dynamic data release for MixEval.

benchmark evaluation benchmarking-suite evaluation-framework benchmarking-framework foundation-models large-language-models large-language-model llm-inference llm-evaluation large-multimodal-models llm-evaluation-framework benchmark-mixture mixeval

Updated Jun 12, 2024
Python

kolenaIO / kolena

Python client for Kolena's machine learning testing platform

testing machine-learning evaluation evaluation-metrics evaluation-framework mlops evaluate-models llmops

Updated Jun 12, 2024
Python

microsoft / rag-experiment-accelerator

The RAG Experiment Accelerator is a versatile tool designed to expedite and facilitate the process of conducting experiments and evaluations using Azure Cognitive Search and the RAG pattern.

experiment information-retrieval azure evaluation indexing openai sparse vectors chunking acs embedding dense rag llm genai

Updated Jun 11, 2024
Python

promptfoo / promptfoo

Test your prompts, agents, and RAGs. Use LLM evals to improve your app's quality and catch problems. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.

testing ci evaluation ci-cd cicd prompts evaluation-framework rag llm prompt-engineering llmops prompt-testing llm-eval llm-evaluation llm-evaluation-framework

Updated Jun 12, 2024
TypeScript

VectorInstitute / cyclops-workshop

CyclOps for clinical ML evaluation & monitoring workshop

monitoring evaluation

Updated Jun 11, 2024
Jupyter Notebook

IAAR-Shanghai / NewsBench

[ACL 2024 Main] NewsBench: A Systematic Evaluation Framework for Assessing Editorial Capabilities of Large Language Models in Chinese Journalism

benchmark framework evaluation dataset gpt4 large-language-models llm chatgpt ernie-bot gpt35turbo chatglm2-6b xverse internlm-20b baichaun2 aquila2 qwen-14b chatglm3-6b acl2024

Updated Jun 11, 2024
Python

athina-ai / athina-evals

Python SDK for running evaluations on LLM generated responses

evaluation evaluation-metrics evaluation-framework llmops llm-eval llm-ops llm-evaluation llm-evaluation-toolkit

Updated Jun 11, 2024
Python

Striveworks / valor

Valor is a centralized evaluation store which makes it easy to measure, explore, and rank model performance.

computer-vision evaluation classification object-detection image-segmentation evaluation-metrics model-evaluation mlops

Updated Jun 11, 2024
Python

time-series-machine-learning / tsml-eval

Evaluation tools for time series machine learning algorithms.

python benchmarking data-science machine-learning time-series evaluation

Updated Jun 11, 2024
Jupyter Notebook

gereon-t / trajectopy

Trajectopy - Trajectory Evaluation in Python

benchmark metrics evaluation comparison alignment trajectory-analysis trajectory

Updated Jun 11, 2024
Python

CAS-SIAT-XinHai / CPsyCoun

[ACL 2024] CPsyCoun: A Report-based Multi-turn Dialogue Reconstruction and Evaluation Framework for Chinese Psychological Counseling

nlp evaluation dataset dataset-generation mental-health llm

Updated Jun 11, 2024
Jupyter Notebook

langwatch / langwatch

🤖 Build AI applications with confidence ✅ DSPy Visualizer ✅ Understand how your users are using your LLM-app ✅ Get a full picture of the quality performance of your LLM-app ✅ Collaborate with your stakeholders in ONE platform ✅ Iterate towards the most valuable & reliable LLM-app.

ai analytics evaluation openai gpt datasets observability llm prompt-engineering

Updated Jun 11, 2024
TypeScript
