Ibras AI Agent

What it does

Ibras AI Agent is a Claude Code plugin marketplace for building and evaluating AI agents. Its plugins automate the creation of complete evaluation systems for Mastra projects, from scorers and golden datasets to offline experiments and online sampling.

How it works

Install the marketplace once in Claude Code, then add individual plugins as needed. The flagship plugin, mastra-evals, profiles any Mastra agent or workflow and generates:

  • Scorers (code-based, LLM-judge, and rubric-based)
  • Golden datasets for benchmarking
  • Offline experiments for batch testing
  • CI regression tests via runEvals
  • Online sampling for production monitoring
  • Brainstorming entry points for defining eval goals
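
To make the first item concrete, here is a minimal sketch of what a deterministic, code-based scorer could look like. It is written in plain TypeScript for illustration only: the `ScoreResult` and `CodeScorer` types and the `keywordCoverageScorer` helper are hypothetical names, not the `@mastra/evals` API and not the code the plugin actually generates.

```typescript
// Hypothetical shapes for illustration; not the @mastra/evals API.
interface ScoreResult {
  score: number; // 0 (no keywords found) to 1 (all keywords found)
  reason: string;
}

type CodeScorer = (output: string) => ScoreResult;

// Deterministic scorer: the fraction of required keywords present in the output.
function keywordCoverageScorer(requiredKeywords: string[]): CodeScorer {
  return (output) => {
    if (requiredKeywords.length === 0) {
      return { score: 1, reason: "no keywords required" };
    }
    const haystack = output.toLowerCase();
    const missing = requiredKeywords.filter(
      (keyword) => !haystack.includes(keyword.toLowerCase()),
    );
    const found = requiredKeywords.length - missing.length;
    return {
      score: found / requiredKeywords.length,
      reason:
        missing.length === 0
          ? "all keywords present"
          : `missing: ${missing.join(", ")}`,
    };
  };
}

// Example: grade a support answer on whether it mentions the refund policy terms.
const scoreRefundAnswer = keywordCoverageScorer(["refund", "30 days"]);
console.log(scoreRefundAnswer("Refunds are accepted within 30 days."));
```

Because the scorer is a pure function of the output text, the same input always yields the same score, which is what distinguishes it from an LLM-judge scorer.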

Use cases

Product teams use this to validate AI agent performance before shipping. Research teams run offline experiments to compare agent behaviors. DevOps teams integrate CI regression tests into their deployment pipelines. Startups rapidly prototype eval frameworks without writing boilerplate.

Who benefits

AI product managers, design systems teams, and developers building with Mastra benefit most: they get evaluation rigor without the overhead of manual setup.

Frequently asked questions

How do I install ibras-ai-agent?
In Claude Code, run `/plugin marketplace add volfadar/ibras-ai-agent`, then `/plugin install mastra-evals@ibras-ai-agent`. You need a TypeScript runtime (Bun or `npx tsx`) and a Mastra project with the `@mastra/core` and `@mastra/evals` packages installed.
What does the mastra-evals plugin do?
It profiles your Mastra agent and auto-generates a complete evaluation system: code-based scorers, LLM judges, golden datasets, offline experiments, CI regression tests, and online sampling, all tested against `@mastra/core` 1.45.
Can I add custom plugins to the marketplace?
Yes. Follow the structure under `plugins/<name>/` and the submission steps in PUBLISHING.md. Each plugin lives in its own directory and can be versioned independently within the marketplace.
What are the system requirements?
You need Claude Code and a TypeScript runtime (Bun or `npx tsx`). For mastra-evals specifically, you also need a Mastra project with the `@mastra/core` and `@mastra/evals` packages. The plugins themselves use only Node.js built-ins.
How do I run CI regression tests?
The generated regression tests use `runEvals` and plug into your CI/CD pipeline. They compare current agent performance against your golden dataset baseline and fail the build if regressions exceed your thresholds.
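
The threshold check behind such a gate can be sketched as follows. This is an assumption-laden illustration in plain TypeScript, not the generated test code: `ScoreMap`, `findRegressions`, and the `maxDrop` tolerance are hypothetical names, and the scores are made-up sample values.

```typescript
// Hypothetical shapes for illustration; not the plugin's generated code.
type ScoreMap = Record<string, number>;

interface Regression {
  scorer: string;
  baseline: number;
  current: number;
  drop: number;
}

// Report every scorer whose score fell by more than maxDrop relative to the baseline.
function findRegressions(
  baseline: ScoreMap,
  current: ScoreMap,
  maxDrop: number,
): Regression[] {
  const regressions: Regression[] = [];
  for (const [scorer, baselineScore] of Object.entries(baseline)) {
    // A scorer missing from the current run is treated as a score of 0.
    const currentScore = current[scorer] ?? 0;
    const drop = baselineScore - currentScore;
    if (drop > maxDrop) {
      regressions.push({ scorer, baseline: baselineScore, current: currentScore, drop });
    }
  }
  return regressions;
}

// Made-up sample scores: accuracy dropped sharply, tone barely moved.
const regressions = findRegressions(
  { accuracy: 0.9, tone: 0.8 },
  { accuracy: 0.7, tone: 0.79 },
  0.05,
);

// A CI script would exit with this code so that any regression fails the build.
const exitCode = regressions.length > 0 ? 1 : 0;
console.log({ exitCode, regressions });
```

Allowing a small tolerance rather than requiring scores to never decrease keeps noisy scorers, such as LLM judges, from failing builds on trivial fluctuations.
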
What is online sampling?
Online sampling is the real-time evaluation of agent outputs in production. The plugin generates hooks that continuously sample live agent responses and score them against your evaluation criteria, surfacing drift or performance degradation.
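
As a rough illustration of the sampling idea, the sketch below scores only a configurable fraction of responses and tracks a running mean. Everything here is hypothetical (`createOnlineSampler`, `observe`, `meanScore`, and the injectable `random` source); the hooks the plugin generates may be structured quite differently.

```typescript
// Hypothetical names for illustration; not the plugin's generated hooks.
type Scorer = (output: string) => number;

interface SampledScore {
  output: string;
  score: number;
}

// rate is the fraction of responses to score, from 0 (none) to 1 (all).
// random is injectable so the sampling decision can be made deterministic in tests.
function createOnlineSampler(
  rate: number,
  scorer: Scorer,
  random: () => number = Math.random,
) {
  if (rate < 0 || rate > 1) {
    throw new RangeError("rate must be between 0 and 1");
  }
  const scored: SampledScore[] = [];
  return {
    // Call once per live response; returns true if the response was sampled and scored.
    observe(output: string): boolean {
      if (random() >= rate) return false;
      scored.push({ output, score: scorer(output) });
      return true;
    },
    // Mean score over all sampled responses, or null if none were sampled yet.
    meanScore(): number | null {
      if (scored.length === 0) return null;
      return scored.reduce((sum, s) => sum + s.score, 0) / scored.length;
    },
  };
}

// Toy scorer: 1 for a non-empty response, 0 for an empty one.
const nonEmptyScorer: Scorer = (output) => (output.trim().length > 0 ? 1 : 0);

// Fixed draws stand in for Math.random so the example is repeatable.
const draws = [0.05, 0.5, 0.09];
let drawIndex = 0;
const sampler = createOnlineSampler(
  0.1,
  nonEmptyScorer,
  () => draws[drawIndex++ % draws.length] ?? 0,
);

const sampledFlags = ["hello", "skipped", ""].map((output) => sampler.observe(output));
console.log(sampledFlags, sampler.meanScore());
```

Sampling a fraction of traffic keeps scoring cost bounded, which matters most when the scorer is an LLM judge that is billed per call.
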
Can I use ibras-ai-agent for non-Mastra AI agents?
Not yet. The marketplace is currently optimized for Mastra projects. Its structure can host other plugins in the future, but mastra-evals specifically requires `@mastra/core` and `@mastra/evals`.
What is a golden dataset in this context?
A curated set of representative agent inputs and expected outputs. The plugin auto-generates one based on your agent's behavior, then uses it for offline experiments and regression testing to ensure consistency.

Glossary

Scorer
An evaluation function that grades agent outputs. Can be code-based (deterministic), LLM-judge (using Claude to evaluate), or rubric-based (structured criteria).
Golden dataset
A curated collection of representative inputs and expected outputs used as the ground truth for measuring agent performance and detecting regressions.
CI regression test
Automated evaluation that runs on every commit to catch performance drops. If metrics fall below baseline, the build fails, preventing degraded agents from shipping.
Online sampling
Continuous evaluation of live agent responses in production. Detects real-world performance drift or quality issues before users report them.
Offline experiment
A batch evaluation that runs agent logic against a dataset without real user traffic, used for testing changes before deployment.
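
To tie the golden dataset and offline experiment entries together, here is a minimal sketch of a batch run. It assumes exact-match comparison against the expected output, which is a deliberate simplification; `GoldenItem`, `Agent`, `runOfflineExperiment`, and the stand-in `upperCaseAgent` are hypothetical names, not Mastra or plugin APIs.

```typescript
// Hypothetical shapes for illustration; not the plugin's generated code.
interface GoldenItem {
  input: string;
  expected: string;
}

type Agent = (input: string) => Promise<string>;

interface ExperimentResult {
  total: number;
  passed: number;
  passRate: number;
  failures: GoldenItem[];
}

// Run the agent over every golden item, with no live traffic involved,
// and count exact matches after trimming surrounding whitespace.
async function runOfflineExperiment(
  agent: Agent,
  dataset: GoldenItem[],
): Promise<ExperimentResult> {
  const failures: GoldenItem[] = [];
  for (const item of dataset) {
    const actual = await agent(item.input);
    if (actual.trim() !== item.expected.trim()) {
      failures.push(item);
    }
  }
  const passed = dataset.length - failures.length;
  return {
    total: dataset.length,
    passed,
    passRate: dataset.length === 0 ? 0 : passed / dataset.length,
    failures,
  };
}

// Stand-in agent so the sketch runs without a model: it upper-cases its input.
const upperCaseAgent: Agent = async (input) => input.toUpperCase();

const goldenDataset: GoldenItem[] = [
  { input: "ok", expected: "OK" },
  { input: "hi", expected: "Hi" },
];

runOfflineExperiment(upperCaseAgent, goldenDataset).then((result) => {
  console.log(result);
});
```

In practice exact matching is too strict for free-form agent output, so a real experiment would swap the comparison for one or more scorers.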
