Awesome-LLM-Eval: a curated list of tools, datasets/benchmark, demos, leaderboard, papers, docs and models, mainly for Evaluation on LLMs. 一个由工具、基准/数据、演示、排行榜和大模型等组成的精选列表,主要面向基础大模型评测,旨在探求生成式AI的技术边界.
-
Updated
Jun 12, 2024
Awesome-LLM-Eval: a curated list of tools, datasets/benchmark, demos, leaderboard, papers, docs and models, mainly for Evaluation on LLMs. 一个由工具、基准/数据、演示、排行榜和大模型等组成的精选列表,主要面向基础大模型评测,旨在探求生成式AI的技术边界.
🪢 Open source LLM engineering platform: Observability, metrics, evals, prompt management, playground, datasets. Integrates with LlamaIndex, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
LangSmith Client SDK Implementations
An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
Documentation for langsmith
A fairly robust mathematics parsing engine for C++ projects.
A task generation and model evaluation system.
The official evaluation suite and dynamic data release for MixEval.
Python client for Kolena's machine learning testing platform
The RAG Experiment Accelerator is a versatile tool designed to expedite and facilitate the process of conducting experiments and evaluations using Azure Cognitive Search and RAG pattern.
Test your prompts, agents, and RAGs. Use LLM evals to improve your app's quality and catch problems. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
CyclOps for clinical ML evaluation & monitoring workshop
[ACL 2024 Main] NewsBench: A Systematic Evaluation Framework for Assessing Editorial Capabilities of Large Language Models in Chinese Journalism
Python SDK for running evaluations on LLM generated responses
Valor is a centralized evaluation store which makes it easy to measure, explore, and rank model performance.
Evaluation tools for time series machine learning algorithms.
Trajectopy - Trajectory Evaluation in Python
[ACL 2024]CPsyCoun: A Report-based Multi-turn Dialogue Reconstruction and Evaluation Framework for Chinese Psychological Counseling
🤖 Build AI applications with confidence ✅ DSPy Visualizer ✅ Understand how your users are using your LLM-app ✅ Get a full picture of the quality performance of your LLM-app ✅ Collaborate with your stakeholders in ONE platform ✅ Iterate towards the most valuable & reliable LLM-app.
Add a description, image, and links to the evaluation topic page so that developers can more easily learn about it.
To associate your repository with the evaluation topic, visit your repo's landing page and select "manage topics."