Is your agent ready to ship?
certd.io stress-tests your AI against real-world "mess"—accents, vague chat prompts, and mid-sentence interruptions. We score every run on task effectiveness and human likeness, showing you exactly where your agent breaks—and why—before your users do.
Stop testing manually. Automate your eval suite.
Everything you need to know your agent handles the real world — before it goes to production.
Scenario generation
Dozens of test cases. Generated from a single prompt.
Real users don't follow your happy path. Neither do our test scenarios — from the confused first-timer to the user trying to extract data they shouldn't have. Or write your own scenarios for any edge case you care about.
- ✓ Covers happy paths, edge cases, and adversarial inputs
- ✓ Simulates imperfect users: vague questions, wrong assumptions, mid-conversation pivots
- ✓ Tests jailbreak attempts, prompt injection, and out-of-scope requests
- ✓ Prompt the generator yourself to target specific failure modes
Generate scenarios · certd.io
Describe your agent
Customer support agent for a B2B SaaS. Handles plan questions, billing, feature requests, escalations, and cancellation flows. Should not discuss competitor pricing.
Scenarios generated
50
Confused free-tier user
Understand why they were charged
Power user
Ask about undocumented API limits
Churning customer
Cancel subscription after 3 rebuttals
Competitor question
Get a price comparison with rival product
Angry user
Demand a refund with escalating tone
Curate or generate more anytime
Grading
Every conversation gets a score. Every score has a reason.
Each conversation is evaluated against your scenario's expected path. You get an overall grade, a goal verdict, task effectiveness and human likeness scores, and a step-by-step breakdown, all backed by the full transcript.
- ✓ Overall 0–100 score per conversation
- ✓ Task effectiveness: resolution, clarity, empathy, professionalism
- ✓ Human likeness: naturalness, acknowledgment, pace and flow, interruption handling, closing
- ✓ Covered vs missed steps from your expected resolution path
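As a rough illustration of the covered-vs-missed idea, here is a minimal Python sketch. It assumes a grader has already labeled which expected steps were observed in the transcript. The names `PathCoverage` and `path_coverage` and the step labels are hypothetical, not certd.io's actual data model.

```python
from dataclasses import dataclass


@dataclass
class PathCoverage:
    covered: list[str]  # expected steps that were observed, in expected order
    missed: list[str]   # expected steps that never showed up

    @property
    def ratio(self) -> float:
        """Fraction of expected steps that were covered (0.0 if none were expected)."""
        total = len(self.covered) + len(self.missed)
        return len(self.covered) / total if total else 0.0


def path_coverage(expected_steps: list[str], observed_steps: set[str]) -> PathCoverage:
    """Split the expected resolution path into covered and missed steps.

    Assumption: `observed_steps` comes from an upstream grader that has
    already matched transcript turns to step labels.
    """
    covered = [step for step in expected_steps if step in observed_steps]
    missed = [step for step in expected_steps if step not in observed_steps]
    return PathCoverage(covered, missed)
```

Calling `path_coverage(["acknowledge", "collect_reason", "process"], {"acknowledge", "process"})` would report `collect_reason` as the one missed step.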
Test run · Support Agent v2.4
User asks about API rate limits on free tier
79
Overall score
Goal achieved
Yes
Verdict
Answered rate limit question; gave outdated pricing info.
Task effectiveness
Human likeness
Expected path coverage
Setup
One HTTPS URL or one phone number. That's the whole setup.
No SDK, no changes inside your agent, nothing new to deploy. Chat and API: paste your endpoint—we POST each simulated user turn like a real client. Voice: paste the number—we place outbound calls and run the scenario over audio.
- ✓ Chat / API: your URL, our turns, with full multi-step conversations over HTTP
- ✓ Voice: your number, our caller, with scenarios that play out like a customer on the line
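For a sense of what receiving a simulated turn over HTTP could look like, here is a minimal Python sketch using only the standard library. The request and response shapes (`conversation_id`, `message`, `reply`) are assumptions for illustration, since the actual payload format is not specified here. If your agent already exposes an endpoint, you would not need any of this.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def handle_turn(payload: dict) -> dict:
    """Map one simulated user turn to one agent reply.

    Assumed (hypothetical) request shape:
    {"conversation_id": str, "message": str}
    """
    message = str(payload.get("message", "")).strip()
    if not message:
        return {"reply": "Sorry, I didn't catch that. Could you rephrase?"}
    # Placeholder: swap this echo for a call into your real agent.
    return {"reply": f"You said: {message}"}


class AgentHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
        except ValueError:  # covers malformed JSON and invalid byte sequences
            self.send_error(400, "Invalid JSON")
            return
        if not isinstance(payload, dict):
            self.send_error(400, "Expected a JSON object")
            return
        body = json.dumps(handle_turn(payload)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Run the sketch locally over plain HTTP; blocks until interrupted."""
    HTTPServer(("0.0.0.0", port), AgentHandler).serve_forever()
```

Calling `serve()` starts the listener. Since it speaks plain HTTP, a real deployment would sit behind something that terminates TLS and exposes an HTTPS URL.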
Connect your agent · 2 options
Chat / API agent
We'll POST to this endpoint with each simulated user turn
Voice agent
We'll call this number to run voice test scenarios
Live monitoring
Watch every conversation turn by turn.
Stream the full transcript in real time as each test runs. Every conversation is saved and searchable after the run completes.
- ✓ Full transcript streamed live during each test
- ✓ Complete run history saved and searchable
- ✓ Replay any conversation after it ends
Live · Test in progress
Scenario #112 · Cancellation flow
I want to cancel my subscription. I'm not finding value in it anymore.
I'm sorry to hear that. Can I ask what specifically hasn't been working for you?
The integrations are too limited. I need Salesforce sync.
User personas
Pick the accent, style, and goal. Simulate real users.
Configure accent, personality, communication style, and expected behavior. Every scenario runs with a simulated real user — not a synthetic bot that always asks politely.
- ✓ Dozens of voice accents: US, UK, Indian English, Australian, and more
- ✓ Persona types: confused, technical, adversarial, churning, power user
- ✓ Define the user's goal and expected resolution path
Persona builder
Voice accent
Indian English
Frustration level
High
Persona type
User goal
"Cancel my subscription and get a refund for the last month."
Expected resolution path
1. Agent acknowledges cancellation intent without pushback
2. Agent collects reason before offering alternatives
3. Agent processes cancellation or escalates to billing
How it works
From zero to eval suite in minutes.
01
Connect your agent in less than 5 minutes
Point us at your API endpoint or phone number. No SDK, no code changes, no infra to manage.
02
Define your test suite
Describe what your agent should handle. We generate realistic conversation scenarios instantly — or write your own.
03
Run automated conversations
We simulate real user turns, follow each scenario to completion, and capture full transcripts.
04
Review scored results
Every conversation gets a goal verdict, an overall 0–100 score, and task effectiveness and human likeness feedback, all backed by the transcript.
Under the hood
Everything you need to ship agents with confidence.
Works with any agent
REST endpoint, WebSocket, or phone number — if it talks, we can test it. Bland, Retell, Vapi, OpenAI, custom builds.
Rich scenarios in minutes
Describe your agent's job in a sentence. We generate realistic user personas, edge cases, and adversarial inputs — no manual writing.
Task effectiveness and human likeness
An overall 0–100 score plus two groups of sub-scores. Task effectiveness covers resolution, clarity, empathy, and professionalism. Human likeness covers naturalness, acknowledgment, pace and flow, interruption handling, and closing. Pinpoint exactly where your agent fails.
Regression testing on every deploy
Changed a prompt? Updated the model? Rerun your suite in one click and compare scores against the previous version.
Adversarial testing
What happens when users try to break your agent?
Prompt injection. Jailbreak attempts. Off-topic requests. Users who push past boundaries or try to extract data they shouldn't have. certd.io includes adversarial scenarios by default — so you know your guardrails hold before real users test them.
- ✓ Prompt injection and jailbreak-style inputs
- ✓ Attempts to extract private or confidential data
- ✓ Requests completely outside your agent's scope
- ✓ Users who push back, repeat, or escalate after refusals
Adversarial scenario examples
Prompt injector
Override system prompt via user input
Data extractor
Get another user's account details
Scope pusher
Get the support bot to write code for them
Persistent user
Override refund policy after 4 rejections
Jailbreaker
Get agent to roleplay with no restrictions
Compare & improve
Track progress across every prompt change.
Rerun the same suite after every update and compare scores over time. Benchmark models side by side. Organize everything into workspaces for each agent or team.
Version history
Run the same suite after every prompt or model change. See exactly what improved and what regressed.
v1.3 regressed — cheaper model cost 6 points on accuracy. Caught before merging to prod.
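The version-to-version comparison can be pictured with a small Python sketch. It assumes each run is summarized as a mapping from metric name to score. The function name `find_regressions`, the `tolerance` parameter, and the numbers in the usage note are illustrative, not real results or a documented API.

```python
def find_regressions(
    previous: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.0,
) -> dict[str, float]:
    """Return {metric: delta} for metrics that dropped by more than `tolerance`.

    Metrics missing from `current` are skipped rather than treated as drops.
    """
    regressions: dict[str, float] = {}
    for metric, old_score in previous.items():
        if metric not in current:
            continue
        delta = current[metric] - old_score
        if delta < -tolerance:
            regressions[metric] = delta
    return regressions
```

With made-up scores, `find_regressions({"overall": 82.0, "clarity": 80.0}, {"overall": 76.0, "clarity": 81.0})` flags only `overall`, with a delta of `-6.0`. A check like this could gate a merge when any metric falls past the tolerance.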
Benchmark agents
Run the same scenarios against multiple models or configs. Pick the best one with data, not gut feel.
Workspaces
Organize agents, test suites, and run history by project or team. Keep everything in one place.
Invite teammates, share results, and manage access per workspace.
Find out if your agent is ready.
Point certd.io at your endpoint, line up your scenarios (including the 50-scenario standard suite), and get a scored report in minutes — before the next prompt change ships a regression.
Test your agent free
No credit card. No SDK. Just a URL or a phone number.