Overview
BrightAgent uses a multi-layered evaluation strategy that covers every stage of the agent lifecycle. Pre-flight checks validate infrastructure before agents run. Runtime evaluations score every agent response for relevance and correctness. Post-flight checks verify outputs before they reach users. And SDLC evaluations ensure the platform maintains high standards across every release.
Pre-Flight Evaluation
Before agents handle user queries, pre-flight checks ensure the underlying infrastructure and services are functioning correctly.
Deterministic Function Tests
Core platform functions (authentication, API connectivity, data access) are validated with deterministic tests that use custom comparison functions to verify outputs match expected criteria.
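As a rough illustration, a deterministic check pairs a platform call with a custom comparison function. The check and comparator below are hypothetical stand-ins, not BrightAgent's actual health checks.

```python
# Minimal sketch of a deterministic pre-flight check with a custom comparator.
# The "check" callable and expected keys are hypothetical stand-ins for the
# platform's real authentication / connectivity / data-access probes.
from typing import Any, Callable

def has_expected_keys(actual: dict[str, Any], expected_keys: set[str]) -> bool:
    """Custom comparison: pass only if the response contains every expected key."""
    return expected_keys.issubset(actual.keys())

def run_preflight_check(
    check: Callable[[], dict[str, Any]],
    compare: Callable[[dict[str, Any]], bool],
    name: str,
) -> None:
    result = check()
    assert compare(result), f"Pre-flight check failed: {name}"

# Example: verify a (hypothetical) auth service returns the fields agents rely on.
run_preflight_check(
    check=lambda: {"status": "ok", "token_ttl": 3600},  # stand-in for a real API call
    compare=lambda r: has_expected_keys(r, {"status", "token_ttl"}) and r["status"] == "ok",
    name="authentication service health",
)
```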
Test Case Suites
Curated test suites define expected inputs and outputs for each agent. Single-turn tests validate one-off queries. Multi-turn tests validate conversational workflows across multiple interactions.
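The shape of a suite might look like the following sketch; the dataclasses and example cases are illustrative only and do not reflect the platform's real test schema.

```python
# Illustrative suite layout only; field names and cases are assumptions,
# not BrightAgent's actual test-case format.
from dataclasses import dataclass, field

@dataclass
class SingleTurnCase:
    input: str
    expected_output: str

@dataclass
class MultiTurnCase:
    # Each turn pairs a user message with the output expected from the agent.
    turns: list[tuple[str, str]] = field(default_factory=list)

analyst_suite = [
    SingleTurnCase(
        input="How many orders were placed last week?",
        expected_output="A SQL query against the orders table, grouped by day.",
    ),
    MultiTurnCase(
        turns=[
            ("Load the marketing dataset.", "Confirms the dataset is loaded."),
            ("Now chart weekly signups from it.", "Chart built from the same dataset."),
        ]
    ),
]
```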
Agent Runner Validation
Agent runners are tested against known scenarios before deployment — verifying that the Retrieval Agent finds data, the Analyst Agent generates valid SQL, and the Visualization Agent produces charts.
Context Grounding
All agent responses are grounded in actual data from Neo4j and Redshift — not hallucinated from training data. Pre-flight checks verify that context sources are accessible and returning expected schemas.
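A minimal grounding pre-flight check could look like the sketch below, assuming the official neo4j and psycopg2 drivers; connection details, the table name, and the expected column set are placeholders.

```python
# Minimal context-grounding pre-flight check. Assumes the official `neo4j`
# and `psycopg2` drivers; URIs, credentials, and expected columns are placeholders.
from neo4j import GraphDatabase
import psycopg2

def check_neo4j(uri: str, user: str, password: str) -> None:
    driver = GraphDatabase.driver(uri, auth=(user, password))
    driver.verify_connectivity()  # raises if the metadata store is unreachable
    driver.close()

def check_redshift_schema(conn_params: dict, table: str, expected_cols: set[str]) -> None:
    # Confirm the warehouse is reachable and the table exposes the expected schema.
    with psycopg2.connect(**conn_params) as conn, conn.cursor() as cur:
        cur.execute(
            "SELECT column_name FROM information_schema.columns WHERE table_name = %s",
            (table,),
        )
        actual = {row[0] for row in cur.fetchall()}
        missing = expected_cols - actual
        assert not missing, f"{table} is missing expected columns: {missing}"
```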
Runtime (Online) Evaluation
Every agent interaction is scored in real-time using DeepEval metrics, providing continuous measurement of response quality.
Single-Turn Metrics
For individual user queries, two metrics are measured:
| Metric | What It Measures | How It Works |
|---|---|---|
| Answer Relevancy | Is the response relevant to the user’s question? | DeepEval’s AnswerRelevancyMetric scores the semantic alignment between input and output |
| Correctness | Is the response accurate and logically sound? | GEval (LLM-as-a-judge) evaluates whether the agent successfully executed the request without errors |
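For example, a single-turn case can be scored with DeepEval's AnswerRelevancyMetric as sketched below; the query and response text are illustrative, and the threshold is an assumed value.

```python
# Scoring one agent response with DeepEval's AnswerRelevancyMetric.
# The query/response text and the 0.7 threshold are illustrative placeholders.
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="Which tables hold customer churn data?",
    actual_output="Churn events live in analytics.customer_churn; see also crm.accounts.",
)

relevancy = AnswerRelevancyMetric(threshold=0.7)  # fail below 0.7 semantic alignment
evaluate(test_cases=[test_case], metrics=[relevancy])
```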
Multi-Turn Metrics
For conversational workflows that span multiple interactions, four additional metrics track quality across the full conversation:
| Metric | What It Measures |
|---|---|
| Turn Relevancy | Are responses relevant throughout the entire conversation, not just the first message? |
| Knowledge Retention | Does the agent remember information from earlier turns? (e.g., “use the dataset I mentioned earlier”) |
| Conversation Completeness | Does the conversation achieve its intended goal by the final turn? |
| Goal Accuracy | How precisely did the agent accomplish what the user set out to do? |
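A rough sketch of multi-turn scoring with DeepEval follows. Exact class names and the conversational test-case shape vary across DeepEval releases, so treat the imports and constructor arguments as assumptions to verify against the installed version.

```python
# Sketch of multi-turn scoring. Assumes a DeepEval release that exposes
# KnowledgeRetentionMetric / ConversationCompletenessMetric and accepts a
# ConversationalTestCase built from LLMTestCase turns; check your version's docs.
from deepeval import evaluate
from deepeval.metrics import ConversationCompletenessMetric, KnowledgeRetentionMetric
from deepeval.test_case import ConversationalTestCase, LLMTestCase

conversation = ConversationalTestCase(
    turns=[
        LLMTestCase(
            input="Use the Q3 revenue dataset for everything that follows.",
            actual_output="Understood, the Q3 revenue dataset is selected.",
        ),
        LLMTestCase(
            input="Now break revenue down by region.",
            actual_output="Here is Q3 revenue by region from the dataset you selected.",
        ),
    ]
)

evaluate(
    test_cases=[conversation],
    metrics=[KnowledgeRetentionMetric(threshold=0.7), ConversationCompletenessMetric(threshold=0.7)],
)
```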
LLM-as-a-Judge
The Correctness metric uses GEval — an LLM-as-a-judge approach that evaluates whether agent output is a correct and appropriate response to the input. The judge evaluates:
- Whether the agent successfully executed the user’s request
- Whether the output is accurate, complete, and logically sound
- Whether the response matches expected output (when provided)
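A GEval judge along these lines can encode those criteria as evaluation steps; the step wording below paraphrases this section and the threshold is an assumed value, not the platform's actual configuration.

```python
# A GEval (LLM-as-a-judge) metric whose evaluation steps mirror the criteria above.
# Step wording is paraphrased from this section; the threshold is assumed.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

correctness = GEval(
    name="Correctness",
    evaluation_steps=[
        "Check whether the agent successfully executed the user's request.",
        "Check whether the output is accurate, complete, and logically sound.",
        "If an expected output is provided, check whether the response matches it.",
    ],
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    threshold=0.7,
)
```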
Tool Validation via MCP
Model Context Protocol (MCP) integration validates that agent tool calls are well-formed and authorized before execution. If an agent tries to call a tool with invalid parameters or access unauthorized resources, MCP blocks the call before it reaches your infrastructure.
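The guard below is a hypothetical illustration of that kind of pre-execution check, not the MCP SDK's API; the tool names, parameters, and resources are made up.

```python
# Hypothetical guard showing the kind of validation applied before a tool call
# executes. This is NOT the MCP SDK; tool names and resources are invented.
from typing import Any

ALLOWED_TOOLS = {
    "run_sql": {"required_params": {"query", "database"}, "allowed_resources": {"analytics"}},
    "render_chart": {"required_params": {"spec"}, "allowed_resources": {"analytics"}},
}

def validate_tool_call(tool: str, params: dict[str, Any], resource: str) -> None:
    spec = ALLOWED_TOOLS.get(tool)
    if spec is None:
        raise PermissionError(f"Unknown tool: {tool}")
    missing = spec["required_params"] - params.keys()
    if missing:
        raise ValueError(f"{tool} call is missing parameters: {missing}")
    if resource not in spec["allowed_resources"]:
        raise PermissionError(f"{tool} is not authorized for resource '{resource}'")

# This call would be blocked before it reaches infrastructure:
# validate_tool_call("run_sql", {"query": "SELECT 1"}, resource="prod_finance")
```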
Post-Flight Verification
Before a response reaches the user, multiple verification layers ensure quality.
Cross-Agent Verification
The BrightHive Agent can route results through multiple agents for validation. For example, the Governance Agent verifies that an Analyst Agent’s query respects data access policies before results are returned.
Human-in-the-Loop Review
Operations that modify infrastructure — dbt models, schema changes, governance policies — require explicit human approval. Generated code is submitted as a GitHub PR, not executed automatically.
Observable Output Chain
Every agent interaction is logged with the full chain of tool calls, data accessed, and decisions made. Users can inspect the SQL generated, the data assets queried, and the reasoning behind each response.
Context Source Attribution
Responses include traceability to the data sources used — which tables were queried, which Neo4j metadata informed the response — so users can verify the answer’s foundation.
SDLC Evaluation
Platform-level evaluations run as part of the software development lifecycle to ensure every release maintains quality standards.
CI/CD Pipeline
Evaluations run automatically on every pull request via GitHub Actions. Results are posted directly as PR comments, showing per-test-case scores with color-coded badges (green/yellow/red) so reviewers can see the quality impact at a glance.
Parallel Test Execution
The evaluation framework runs test cases concurrently for faster feedback (see the sketch after this list):
- Single-turn tests: Up to 10 concurrent executions
- Multi-turn tests: Up to 5 concurrent conversations (with sequential turns within each to maintain state)
- CI mode: Reduced parallelism to conserve resources in pipeline environments
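The sketch below shows one way to enforce those limits with asyncio; the runner functions and test payloads are hypothetical stand-ins for the framework's actual executors.

```python
# Illustrative concurrency harness matching the limits above; the runner
# functions and test payloads are hypothetical stand-ins.
import asyncio

SINGLE_TURN_LIMIT = 10   # concurrent single-turn test cases
MULTI_TURN_LIMIT = 5     # concurrent conversations; turns inside each stay sequential

async def run_single_turn(test: dict) -> str:
    await asyncio.sleep(0.1)          # stand-in for invoking the agent + scoring metrics
    return f"scored {test['input']}"

async def run_conversation(turns: list[dict]) -> list[str]:
    # Turns run one after another so each sees the conversation state so far.
    return [await run_single_turn(turn) for turn in turns]

async def run_with_limit(items, runner, limit: int):
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item):
        async with semaphore:
            return await runner(item)

    return await asyncio.gather(*(guarded(i) for i in items))

async def main():
    singles = [{"input": f"query {i}"} for i in range(20)]
    await run_with_limit(singles, run_single_turn, SINGLE_TURN_LIMIT)

asyncio.run(main())
```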
Observability
OpenTelemetry Instrumentation
Every evaluation metric is recorded via OpenTelemetry — tracking answer relevancy, correctness, turn relevancy, and goal accuracy across all agents and test types.
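Recording a score through the OpenTelemetry metrics API might look like this; the meter name, instrument name, and attribute keys are illustrative rather than BrightAgent's actual instrumentation.

```python
# Recording an evaluation score with the OpenTelemetry metrics API.
# Meter/instrument names and attribute keys here are illustrative.
from opentelemetry import metrics

meter = metrics.get_meter("brightagent.evaluation")
eval_score = meter.create_histogram(
    name="eval.metric.score",
    unit="1",
    description="DeepEval metric scores per agent interaction",
)

eval_score.record(
    0.92,
    attributes={"metric": "answer_relevancy", "agent": "analyst", "test_type": "single_turn"},
)
```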
LangSmith Tracing
Full trace visibility into every agent step — from initial intent classification through tool calls to final response synthesis. Traces include latency breakdowns, token usage, and error attribution.
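With the langsmith SDK, an agent step can be wrapped in the @traceable decorator as below; the function names and logic are illustrative, and traces are exported only when LangSmith credentials and project settings are configured in the environment.

```python
# Tracing agent steps with the `langsmith` SDK's @traceable decorator.
# Function names are illustrative; traces export only when LangSmith
# environment variables (API key, project) are set.
from langsmith import traceable

@traceable(name="classify_intent", run_type="chain")
def classify_intent(query: str) -> str:
    # Stand-in for the real intent classifier.
    return "data_retrieval" if "table" in query.lower() else "analysis"

@traceable(name="answer_query", run_type="chain")
def answer_query(query: str) -> str:
    intent = classify_intent(query)   # nested call appears as a child run in the trace
    return f"Handled '{query}' as {intent}"

answer_query("Which table stores invoices?")
```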
Metric Dashboards
Summary statistics across all test cases — average scores per metric, pass/fail rates, and trend analysis across releases.
Reporting Formats
Results are available in console, JSON, and Markdown formats — supporting local debugging, programmatic analysis, and GitHub PR comments.
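As a simple illustration, the same result set can be written out as both JSON and Markdown; the result fields below are assumed, not the framework's actual report schema.

```python
# Emitting the same results as JSON and Markdown. The result structure is an
# assumption for illustration, not the framework's actual report schema.
import json

results = [
    {"test": "retrieval_basic", "metric": "answer_relevancy", "score": 0.91, "passed": True},
    {"test": "analyst_sql", "metric": "correctness", "score": 0.64, "passed": False},
]

# JSON for programmatic analysis.
with open("eval_results.json", "w") as f:
    json.dump(results, f, indent=2)

# Markdown table for PR comments and local review.
lines = ["| Test | Metric | Score | Status |", "|---|---|---|---|"]
lines += [
    f"| {r['test']} | {r['metric']} | {r['score']:.2f} | {'pass' if r['passed'] else 'fail'} |"
    for r in results
]
with open("eval_results.md", "w") as f:
    f.write("\n".join(lines))
```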
What Gets Measured
| Category | Metrics |
|---|---|
| Agent Quality | Answer relevancy, correctness, turn relevancy, knowledge retention, conversation completeness, goal accuracy |
| Operational | Agent invocations, latency (p50/p95/p99), error rates, token usage per model |
| User Satisfaction | Helpfulness ratings, accuracy feedback, response speed perception |
| Infrastructure | Guardrails blocks (PII detections, content violations), hallucination rate |
Evaluation metrics and quality scores feed directly into continuous improvement — helping the BrightAgent team identify and fix quality regressions before they reach production. For more on how agents work, see the BrightAgent architecture.

