Overview
BrightAgent uses a multi-layered evaluation strategy that covers every stage of the agent lifecycle. Pre-flight checks validate infrastructure before agents run. Runtime evaluations score every agent response for relevance and correctness. Post-flight checks verify outputs before they reach users. And SDLC evaluations ensure the platform maintains high standards across every release.
Pre-Flight Evaluation
Before agents handle user queries, pre-flight checks ensure the underlying infrastructure and services are functioning correctly.
Deterministic Function Tests
Core platform functions (authentication, API connectivity, data access) are validated with deterministic tests that use custom comparison functions to verify outputs match expected criteria.
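As a rough illustration, a deterministic check pairs a platform call with a custom comparison function. The check and comparator below are hypothetical stand-ins, not BrightAgent's actual health checks.

```python
# Minimal sketch of a deterministic pre-flight check with a custom comparator.
# The "check" callable and expected keys are hypothetical stand-ins for the
# platform's real authentication / connectivity / data-access probes.
from typing import Any, Callable

def has_expected_keys(actual: dict[str, Any], expected_keys: set[str]) -> bool:
    """Custom comparison: pass only if the response contains every expected key."""
    return expected_keys.issubset(actual.keys())

def run_preflight_check(
    check: Callable[[], dict[str, Any]],
    compare: Callable[[dict[str, Any]], bool],
    name: str,
) -> None:
    result = check()
    assert compare(result), f"Pre-flight check failed: {name}"

# Example: verify a (hypothetical) auth service returns the fields agents rely on.
run_preflight_check(
    check=lambda: {"status": "ok", "token_ttl": 3600},  # stand-in for a real API call
    compare=lambda r: has_expected_keys(r, {"status", "token_ttl"}) and r["status"] == "ok",
    name="authentication service health",
)
```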
Test Case Suites
Curated test suites define expected inputs and outputs for each agent. Single-turn tests validate one-off queries. Multi-turn tests validate conversational workflows across multiple interactions.
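The shape of a suite might look like the following sketch; the dataclasses and example cases are illustrative only and do not reflect the platform's real test schema.

```python
# Illustrative suite layout only; field names and cases are assumptions,
# not BrightAgent's actual test-case format.
from dataclasses import dataclass, field

@dataclass
class SingleTurnCase:
    input: str
    expected_output: str

@dataclass
class MultiTurnCase:
    # Each turn pairs a user message with the output expected from the agent.
    turns: list[tuple[str, str]] = field(default_factory=list)

analyst_suite = [
    SingleTurnCase(
        input="How many orders were placed last week?",
        expected_output="A SQL query against the orders table, grouped by day.",
    ),
    MultiTurnCase(
        turns=[
            ("Load the marketing dataset.", "Confirms the dataset is loaded."),
            ("Now chart weekly signups from it.", "Chart built from the same dataset."),
        ]
    ),
]
```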
Agent Runner Validation
Agent runners are tested against known scenarios before deployment — verifying that the Retrieval Agent finds data, the Analyst Agent generates valid SQL, and the Visualization Agent produces charts.
Context Grounding
All agent responses are grounded in actual data from Neo4j and Redshift — not hallucinated from training data. Pre-flight checks verify that context sources are accessible and returning expected schemas.
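A minimal grounding pre-flight check could look like the sketch below, assuming the official neo4j and psycopg2 drivers; connection details, the table name, and the expected column set are placeholders.

```python
# Minimal context-grounding pre-flight check. Assumes the official `neo4j`
# and `psycopg2` drivers; URIs, credentials, and expected columns are placeholders.
from neo4j import GraphDatabase
import psycopg2

def check_neo4j(uri: str, user: str, password: str) -> None:
    driver = GraphDatabase.driver(uri, auth=(user, password))
    driver.verify_connectivity()  # raises if the metadata store is unreachable
    driver.close()

def check_redshift_schema(conn_params: dict, table: str, expected_cols: set[str]) -> None:
    # Confirm the warehouse is reachable and the table exposes the expected schema.
    with psycopg2.connect(**conn_params) as conn, conn.cursor() as cur:
        cur.execute(
            "SELECT column_name FROM information_schema.columns WHERE table_name = %s",
            (table,),
        )
        actual = {row[0] for row in cur.fetchall()}
        missing = expected_cols - actual
        assert not missing, f"{table} is missing expected columns: {missing}"
```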
Runtime (Online) Evaluation
Every agent interaction is scored in real-time using DeepEval metrics, providing continuous measurement of response quality.
Single-Turn Metrics
For individual user queries, two metrics are measured:
| Metric | What It Measures | How It Works |
|---|---|---|
| Answer Relevancy | Is the response relevant to the user’s question? | DeepEval’s AnswerRelevancyMetric scores the semantic alignment between input and output |
| Correctness | Is the response accurate and logically sound? | GEval (LLM-as-a-judge) evaluates whether the agent successfully executed the request without errors |
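For example, a single-turn case can be scored with DeepEval's AnswerRelevancyMetric as sketched below; the query and response text are illustrative, and the threshold is an assumed value.

```python
# Scoring one agent response with DeepEval's AnswerRelevancyMetric.
# The query/response text and the 0.7 threshold are illustrative placeholders.
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="Which tables hold customer churn data?",
    actual_output="Churn events live in analytics.customer_churn; see also crm.accounts.",
)

relevancy = AnswerRelevancyMetric(threshold=0.7)  # fail below 0.7 semantic alignment
evaluate(test_cases=[test_case], metrics=[relevancy])
```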
Multi-Turn Metrics
For conversational workflows that span multiple interactions, four additional metrics track quality across the full conversation:
| Metric | What It Measures |
|---|---|
| Turn Relevancy | Are responses relevant throughout the entire conversation, not just the first message? |
| Knowledge Retention | Does the agent remember information from earlier turns? (e.g., “use the dataset I mentioned earlier”) |
| Conversation Completeness | Does the conversation achieve its intended goal by the final turn? |
| Goal Accuracy | How precisely did the agent accomplish what the user set out to do? |
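A rough sketch of multi-turn scoring with DeepEval follows. Exact class names and the conversational test-case shape vary across DeepEval releases, so treat the imports and constructor arguments as assumptions to verify against the installed version.

```python
# Sketch of multi-turn scoring. Assumes a DeepEval release that exposes
# KnowledgeRetentionMetric / ConversationCompletenessMetric and accepts a
# ConversationalTestCase built from LLMTestCase turns; check your version's docs.
from deepeval import evaluate
from deepeval.metrics import ConversationCompletenessMetric, KnowledgeRetentionMetric
from deepeval.test_case import ConversationalTestCase, LLMTestCase

conversation = ConversationalTestCase(
    turns=[
        LLMTestCase(
            input="Use the Q3 revenue dataset for everything that follows.",
            actual_output="Understood, the Q3 revenue dataset is selected.",
        ),
        LLMTestCase(
            input="Now break revenue down by region.",
            actual_output="Here is Q3 revenue by region from the dataset you selected.",
        ),
    ]
)

evaluate(
    test_cases=[conversation],
    metrics=[KnowledgeRetentionMetric(threshold=0.7), ConversationCompletenessMetric(threshold=0.7)],
)
```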
LLM-as-a-Judge
The Correctness metric uses GEval — an LLM-as-a-judge approach that evaluates whether agent output is a correct and appropriate response to the input. The judge evaluates:
- Whether the agent successfully executed the user’s request
- Whether the output is accurate, complete, and logically sound
- Whether the response matches expected output (when provided)
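A GEval judge along these lines can encode those criteria as evaluation steps; the step wording below paraphrases this section and the threshold is an assumed value, not the platform's actual configuration.

```python
# A GEval (LLM-as-a-judge) metric whose evaluation steps mirror the criteria above.
# Step wording is paraphrased from this section; the threshold is assumed.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

correctness = GEval(
    name="Correctness",
    evaluation_steps=[
        "Check whether the agent successfully executed the user's request.",
        "Check whether the output is accurate, complete, and logically sound.",
        "If an expected output is provided, check whether the response matches it.",
    ],
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    threshold=0.7,
)
```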
Tool Validation via MCP
Model Context Protocol (MCP) integration validates that agent tool calls are well-formed and authorized before execution. If an agent tries to call a tool with invalid parameters or access unauthorized resources, MCP blocks the call before it reaches your infrastructure.
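The guard below is a hypothetical illustration of that kind of pre-execution check, not the MCP SDK's API; the tool names, parameters, and resources are made up.

```python
# Hypothetical guard showing the kind of validation applied before a tool call
# executes. This is NOT the MCP SDK; tool names and resources are invented.
from typing import Any

ALLOWED_TOOLS = {
    "run_sql": {"required_params": {"query", "database"}, "allowed_resources": {"analytics"}},
    "render_chart": {"required_params": {"spec"}, "allowed_resources": {"analytics"}},
}

def validate_tool_call(tool: str, params: dict[str, Any], resource: str) -> None:
    spec = ALLOWED_TOOLS.get(tool)
    if spec is None:
        raise PermissionError(f"Unknown tool: {tool}")
    missing = spec["required_params"] - params.keys()
    if missing:
        raise ValueError(f"{tool} call is missing parameters: {missing}")
    if resource not in spec["allowed_resources"]:
        raise PermissionError(f"{tool} is not authorized for resource '{resource}'")

# This call would be blocked before it reaches infrastructure:
# validate_tool_call("run_sql", {"query": "SELECT 1"}, resource="prod_finance")
```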
Post-Flight Verification
Before a response reaches the user, multiple verification layers ensure quality.
Cross-Agent Verification
The BrightHive Agent can route results through multiple agents for validation. For example, the Governance Agent verifies that an Analyst Agent’s query respects data access policies before results are returned.
Human-in-the-Loop Review
Operations that modify infrastructure — dbt models, schema changes, governance policies — require explicit human approval. Generated code is submitted as a GitHub PR, not executed automatically.
Observable Output Chain
Every agent interaction is logged with the full chain of tool calls, data accessed, and decisions made. Users can inspect the SQL generated, the data assets queried, and the reasoning behind each response.
Context Source Attribution
Responses include traceability to the data sources used — which tables were queried, which Neo4j metadata informed the response — so users can verify the answer’s foundation.
SDLC Evaluation
Platform-level evaluations run as part of the software development lifecycle to ensure every release maintains quality standards.
CI/CD Pipeline
Evaluations run automatically on every pull request via GitHub Actions. Results are posted directly as PR comments, showing per-test-case scores with color-coded badges (green/yellow/red) so reviewers can see the quality impact at a glance.
Parallel Test Execution
The evaluation framework runs test cases concurrently for faster feedback (see the sketch after this list):
- Single-turn tests: Up to 10 concurrent executions
- Multi-turn tests: Up to 5 concurrent conversations (with sequential turns within each to maintain state)
- CI mode: Reduced parallelism to conserve resources in pipeline environments
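The sketch below shows one way to enforce those limits with asyncio; the runner functions and test payloads are hypothetical stand-ins for the framework's actual executors.

```python
# Illustrative concurrency harness matching the limits above; the runner
# functions and test payloads are hypothetical stand-ins.
import asyncio

SINGLE_TURN_LIMIT = 10   # concurrent single-turn test cases
MULTI_TURN_LIMIT = 5     # concurrent conversations; turns inside each stay sequential

async def run_single_turn(test: dict) -> str:
    await asyncio.sleep(0.1)          # stand-in for invoking the agent + scoring metrics
    return f"scored {test['input']}"

async def run_conversation(turns: list[dict]) -> list[str]:
    # Turns run one after another so each sees the conversation state so far.
    return [await run_single_turn(turn) for turn in turns]

async def run_with_limit(items, runner, limit: int):
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item):
        async with semaphore:
            return await runner(item)

    return await asyncio.gather(*(guarded(i) for i in items))

async def main():
    singles = [{"input": f"query {i}"} for i in range(20)]
    await run_with_limit(singles, run_single_turn, SINGLE_TURN_LIMIT)

asyncio.run(main())
```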
Observability
OpenTelemetry Instrumentation
Every evaluation metric is recorded via OpenTelemetry — tracking answer relevancy, correctness, turn relevancy, and goal accuracy across all agents and test types.
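Recording a score through the OpenTelemetry metrics API might look like this; the meter name, instrument name, and attribute keys are illustrative rather than BrightAgent's actual instrumentation.

```python
# Recording an evaluation score with the OpenTelemetry metrics API.
# Meter/instrument names and attribute keys here are illustrative.
from opentelemetry import metrics

meter = metrics.get_meter("brightagent.evaluation")
eval_score = meter.create_histogram(
    name="eval.metric.score",
    unit="1",
    description="DeepEval metric scores per agent interaction",
)

eval_score.record(
    0.92,
    attributes={"metric": "answer_relevancy", "agent": "analyst", "test_type": "single_turn"},
)
```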
LangSmith Tracing
Full trace visibility into every agent step — from initial intent classification through tool calls to final response synthesis. Traces include latency breakdowns, token usage, and error attribution.
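With the langsmith SDK, an agent step can be wrapped in the @traceable decorator as below; the function names and logic are illustrative, and traces are exported only when LangSmith credentials and project settings are configured in the environment.

```python
# Tracing agent steps with the `langsmith` SDK's @traceable decorator.
# Function names are illustrative; traces export only when LangSmith
# environment variables (API key, project) are set.
from langsmith import traceable

@traceable(name="classify_intent", run_type="chain")
def classify_intent(query: str) -> str:
    # Stand-in for the real intent classifier.
    return "data_retrieval" if "table" in query.lower() else "analysis"

@traceable(name="answer_query", run_type="chain")
def answer_query(query: str) -> str:
    intent = classify_intent(query)   # nested call appears as a child run in the trace
    return f"Handled '{query}' as {intent}"

answer_query("Which table stores invoices?")
```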
Metric Dashboards
Summary statistics across all test cases — average scores per metric, pass/fail rates, and trend analysis across releases.
Reporting Formats
Results are available in console, JSON, and Markdown formats — supporting local debugging, programmatic analysis, and GitHub PR comments.
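As a simple illustration, the same result set can be written out as both JSON and Markdown; the result fields below are assumed, not the framework's actual report schema.

```python
# Emitting the same results as JSON and Markdown. The result structure is an
# assumption for illustration, not the framework's actual report schema.
import json

results = [
    {"test": "retrieval_basic", "metric": "answer_relevancy", "score": 0.91, "passed": True},
    {"test": "analyst_sql", "metric": "correctness", "score": 0.64, "passed": False},
]

# JSON for programmatic analysis.
with open("eval_results.json", "w") as f:
    json.dump(results, f, indent=2)

# Markdown table for PR comments and local review.
lines = ["| Test | Metric | Score | Status |", "|---|---|---|---|"]
lines += [
    f"| {r['test']} | {r['metric']} | {r['score']:.2f} | {'pass' if r['passed'] else 'fail'} |"
    for r in results
]
with open("eval_results.md", "w") as f:
    f.write("\n".join(lines))
```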
What Gets Measured
| Category | Metrics |
|---|---|
| Agent Quality | Answer relevancy, correctness, turn relevancy, knowledge retention, conversation completeness, goal accuracy |
| Operational | Agent invocations, latency (p50/p95/p99), error rates, token usage per model |
| User Satisfaction | Helpfulness ratings, accuracy feedback, response speed perception |
| Infrastructure | Guardrails blocks (PII detections, content violations), hallucination rate |
Evaluation metrics and quality scores feed directly into continuous improvement — helping the BrightAgent team identify and fix quality regressions before they reach production. For more on how agents work, see the BrightAgent architecture.

