Incredible Pricing

Pricing tailored to your requirements.

Free
Synthetic data (excludes ZIP files; no download)
All AI Models
1 custom metric
Library of 210 metrics
Dashboard
A/B Testing
Experiments
1 user
10 experiment runs
Community support via Discord

Startup
Synthetic data (limited)
All AI Models
3 custom metrics
Library of 210 metrics
Dashboard
A/B Testing
Experiments
3 users
500 LLM Judgements per month
Email support

Enterprise
Synthetic data generation (unlimited)
All AI Models
Unlimited custom metrics
Library of 210 metrics
Dashboard
A/B Testing
Experiments
Unlimited users
5,000 LLM Judgements per month
Dedicated account manager and Slack channel
SSO / SAML
Cloud or on-prem

Frequently Asked Questions

Have another question? Please contact our team!

Can I use RagMetrics to compare multiple LLMs?
Yes. RagMetrics was built for benchmarking large language models. You can run identical tasks across multiple LLMs, compare their outputs side by side, and score them for reasoning quality, hallucination risk, citation reliability, and output robustness.

Does RagMetrics offer an API?
Yes. RagMetrics provides a powerful API for programmatically scoring and comparing LLM outputs. Use it to integrate hallucination detection, prompt testing, and model benchmarking directly into your GenAI pipeline.
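As a rough sketch only, a scoring request over HTTP might look like the following; the endpoint path, payload fields, and criterion names are assumptions for illustration, not the documented RagMetrics API.

```python
import os

import requests

# Hypothetical endpoint, payload fields, and criterion names -- consult the
# RagMetrics API documentation for the real schema and authentication flow.
API_URL = "https://api.ragmetrics.ai/v1/judge"  # assumed URL
headers = {"Authorization": f"Bearer {os.environ['RAGMETRICS_API_KEY']}"}

payload = {
    "question": "What is the refund window for annual plans?",
    "context": ["Annual plans can be refunded within 30 days of purchase."],
    "answer": "You can get a refund within 30 days.",
    "criteria": ["hallucination", "citation_reliability"],
}

response = requests.post(API_URL, json=payload, headers=headers, timeout=30)
response.raise_for_status()
print(response.json())  # per-criterion scores for this single output
```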

How can RagMetrics be deployed?
RagMetrics can be deployed in multiple ways: as a fully managed SaaS solution, inside your private cloud environment (such as AWS, Azure, or GCP), or on-premises for organizations that require maximum control and compliance.

How do I run an experiment?
Running an experiment is simple. You connect your LLM or retrieval-augmented generation (RAG) pipeline (such as Claude, GPT-4, Gemini, or your own model), define the task you're solving, upload a labeled dataset or test prompts, select your scoring criteria such as hallucination rate or retrieval accuracy, and then run the experiment through the dashboard or API.
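For illustration, those steps could look roughly like this over HTTP; the base URL, resource names, model identifiers, and field names below are assumptions rather than the published RagMetrics API.

```python
import os

import requests

BASE_URL = "https://api.ragmetrics.ai/v1"  # assumed base URL
headers = {"Authorization": f"Bearer {os.environ['RAGMETRICS_API_KEY']}"}

experiment = {
    "name": "support-bot-comparison",
    # Models or RAG pipelines under test (identifiers are illustrative).
    "models": ["gpt-4", "claude-3-5-sonnet", "my-rag-pipeline"],
    # The task being solved and the labeled data to test it on.
    "task": "Answer customer billing questions from the help-center corpus.",
    "dataset_id": "ds_billing_v2",  # assumed: a dataset uploaded beforehand
    # Scoring criteria applied to every output.
    "criteria": ["hallucination_rate", "retrieval_accuracy"],
}

run = requests.post(f"{BASE_URL}/experiments", json=experiment, headers=headers, timeout=30)
run.raise_for_status()
print(run.json())  # run id and status; detailed results appear in the dashboard
```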

What do I need to run an evaluation?
To run an evaluation, you’ll need access to your LLM’s API key, the endpoint URL or model pipeline, a dataset or labeled test inputs, a clear task description, and a definition of success for that task. You can also include your own scoring criteria or subject matter expertise.
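A minimal sketch of that checklist, expressed as a Python dict; the field names and values are placeholders, not a RagMetrics schema.

```python
# Checklist of evaluation inputs. Field names are illustrative only.
evaluation_setup = {
    # Access to the model or pipeline under test
    "llm_api_key_env": "OPENAI_API_KEY",          # keep keys in environment variables
    "endpoint_url": "https://api.openai.com/v1",  # or your own pipeline's endpoint
    # What the model is being asked to do, and what success means
    "task": "Summarize support tickets into three plain-English bullet points.",
    "success_definition": "Every bullet is grounded in the ticket; nothing invented.",
    # What it will be tested on
    "dataset": "tickets_labeled_test_set.jsonl",  # labeled test inputs or prompts
    # How outputs will be judged (optionally your own criteria or SME rubric)
    "criteria": ["faithfulness", "coverage", "hallucination"],
}
```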

Which models does RagMetrics support?
RagMetrics is model-agnostic and supports any public, private, or open-source LLM. You can paste your custom endpoint, evaluate outputs from models such as Mistral, Llama 3, or DeepSeek, and compare results to popular models such as GPT-4, Claude, and Gemini using the same scoring framework.
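As a hypothetical example of such a mixed lineup, a comparison run might describe its models like this; the provider labels, field names, and endpoint URL are assumptions for illustration.

```python
# Illustrative lineup mixing hosted and self-hosted models for one comparison run.
models_under_test = [
    {"name": "gpt-4", "provider": "openai"},
    {"name": "claude-3-5-sonnet", "provider": "anthropic"},
    {
        "name": "llama-3-70b-internal",
        "provider": "custom",
        # Paste any OpenAI-compatible or bespoke endpoint here.
        "endpoint_url": "https://llm.internal.example.com/v1/chat/completions",
        "api_key_env": "INTERNAL_LLM_API_KEY",
    },
]

# Because the same scoring criteria are applied to every model's outputs,
# scores for the self-hosted Llama 3 deployment are directly comparable to
# GPT-4 and Claude on the same tasks.
```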