Your agent passes every test you wrote.
It fails on the ones you didn’t.

Evals test what you anticipated. Agents fail in the gaps: wrong tools, confident hallucinations, silent breakdowns. Those are the failures your users find first.

// the pain

Every team shipping agents hits the same three pains.

Pain 01
Dignity

We're essentially testing in production with real users.

The bug doesn’t show up as a failed test. It shows up as a customer email.

Pain 02
Scale

Every time something breaks, an intern manually tests the entire 37-step flow.

The math doesn’t work. Tens of thousands of possible paths. A dozen tests. You can’t close that gap by hand.

Pain 03
Visibility

LangSmith shows me what the agent did. It doesn't tell me if it was right.

No dashboard flags the wrong tool call. No alert fires on a confident hallucination. Someone else finds it first.

// catch what evals miss

The behavioral testing layer your stack is missing.

Invarium is a single platform to map agent paths, simulate edge cases, and ship with evidence.

01 — Map

Agent Intelligence Graph. Every path. Color-coded by risk.

The map your dashboard can’t draw. Every unguarded path flagged in orange. Enriched with your LangSmith or Datadog traces.

02 — Simulation

Simulation at scale. Ten thousand test cases. Not twelve.

No writing tests by hand. Generated automatically against your highest-risk paths.

Test Run Completed · E-Commerce Agent · run #847

10,000 test cases · 94% coverage

Failure categories tested: Tool Usage, Safety, Reasoning, Context, Knowledge, Instruction, Communication, Operational, Coordination

03 — Verdict

Agent Health Report. A verdict you can ship behind.

A pass-or-fail verdict on every release. No LLM-as-judge. Scored on behavior, reproducible across runs.

Invarium AI | Agent Health Report

E-Commerce Agent · custom · 10 tools · 6 guardrails

× Not Ready for Deployment (AQS ≥ 85 and ARS ≥ 75 required)

AQS Score · Threshold: ≥ 85
ARS Score · Threshold: ≥ 75
Pass Rate: 16% (4/25 passed) · Threshold: ≥ 80%
AQS Trend: Improving
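
The rule printed on the sample report (AQS ≥ 85 and ARS ≥ 75 required) amounts to a two-condition gate. Here is a minimal Python sketch of that check, useful for wiring the verdict into a release script. The names `HealthReport`, `is_ready_for_deployment`, and `verdict`, and the "Ready for Deployment" label, are hypothetical illustrations and not Invarium's API. How the scores themselves are computed is not covered here.

```python
from dataclasses import dataclass

# Thresholds copied from the sample report: AQS >= 85 and ARS >= 75 required.
AQS_MIN = 85
ARS_MIN = 75


@dataclass(frozen=True)
class HealthReport:
    """Hypothetical container for the two gated scores (not an Invarium type)."""
    aqs: float
    ars: float


def is_ready_for_deployment(report: HealthReport) -> bool:
    """Return True only when both scores meet their thresholds."""
    return report.aqs >= AQS_MIN and report.ars >= ARS_MIN


def verdict(report: HealthReport) -> str:
    """Map the gate result to a label; the passing label is an assumption."""
    if is_ready_for_deployment(report):
        return "Ready for Deployment"
    return "Not Ready for Deployment"


if __name__ == "__main__":
    # Illustrative scores only; they are not taken from a real report.
    print(verdict(HealthReport(aqs=73, ars=80)))
```

Both conditions must hold, so a strong score on one axis cannot offset a weak score on the other.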

// how it works

From install to verdict in four sentences.

MCP-native. Runs wherever your coding agent does. Ten minutes from command to shareable audit.

Cursor · Invarium MCP

$ claude mcp add invarium "https://mcp.invarium.dev/mcp"
✓ connected

Scanning your codebase…

Architecture discovered
✓ 4 tools · 2 chains · 3 guards · 1 external service
⚠ 2 unguarded paths detected
→ Agent Intelligence Graph live on dashboard

Generate behavioral tests — focus on edge cases with frustrated users.

Generating test cases…

Test cases generated
✓ 24 test cases targeting 6 high-risk paths
Persona: frustrated user · Low patience · High persistence
→ Ready on dashboard

Run the tests.

Running 24 test cases…

Results
✓ 21 passed · ✗ 3 failed
✗ Refund: skipped identity verification
✗ PII leak in multi-turn conversation
✗ Hallucinated transaction ID
Agent Quality Score: 73 (Good)
Actionable recommendations for each failure

Share this audit.

Audit report generated.

↗ invarium.dev/audit/a3f7k2x · no login required to view
Includes: Agent graph · Test results · Recommendations

// how we handle your data

Your data stays yours.

01 — Redaction

PII redaction at ingest.

Names, emails, IDs, and tokens are redacted at the edge before hitting our servers.

02 — Endpoint

Your infra. Your endpoint.

We call your agent at the URL you give us. Nothing else. No shadow crawls.

03 — Training

No training on your data.

Your traces, tests, and verdicts are yours. We don’t fine-tune models on them. Not now, not ever.

04 — Deletion

Delete on request.

One email to team@invarium.dev triggers a hard purge of everything within 30 days.

// pricing

Start free. Scale when you're ready.

Free

$0 / forever

No credit card required

Everything you need to test your first agent.

agent

200 test cases / month

200 sim credits / month

1 credit = 1 conversation turn

Teams & Enterprise

Custom / annual contract

Billed annually. Terms tailored to your team.

For teams needing higher limits, enterprise security, or dedicated onboarding.

∞ agent limits

SSO enterprise security

1:1 onboarding & support

Stop testing on your users.

Thousands of paths, tested before production.