"promptfoo for golang" - A flexible, embeddable evaluation framework for LLM applications
Evalaf is a comprehensive evaluation framework for testing and monitoring LLM applications, including RAG systems, agents, and general LLM outputs. Built as a reusable Go module, it can be embedded in any Go project or used as a standalone CLI tool.
- Embeddable: Use as a library in your Go projects
- Extensible: Plugin-based architecture for custom evaluators
- Flexible: Evaluate RAG, agents, LLMs, or any system
- Configurable: YAML/JSON configuration with runtime customization
- Parallel: Concurrent evaluation for fast execution
- Multiple Output Formats: JSON, YAML, Markdown, HTML
- LLM-as-Judge: Use LLMs to evaluate LLM outputs (via Genkit integration)
- Production-Ready: Extract and evaluate production traces
Install:

```bash
go get github.com/antflydb/anteval
```

Quick start:

```go
package main

import (
	"context"

	"github.com/antflydb/anteval/eval"
)

func main() {
	ctx := context.Background()

	// Create dataset
	dataset := eval.NewJSONDatasetFromExamples("test", []eval.Example{
		{Input: "What is 2+2?", Reference: "4"},
	})

	// Create evaluators
	evaluators := []eval.Evaluator{
		eval.NewExactMatchEvaluator("exact_match"),
	}

	// Define target function (your LLM/RAG/agent)
	target := func(ctx context.Context, example eval.Example) (any, error) {
		// Call your system here
		return "4", nil
	}

	// Run evaluation
	runner := eval.NewRunner(eval.DefaultConfig(), evaluators)
	report, _ := runner.RunWithTarget(ctx, dataset, target)

	// Print results
	report.Print()
}
```

CLI usage:

```bash
# Run evaluation with config file
evalaf run --config evalaf.yaml
# List available metrics
evalaf metrics list
# Validate dataset
evalaf datasets validate testdata/datasets/rag_quality.json
```

The module is organized into the following packages:
- eval/: Core evaluation library (minimal dependencies)
  - Dataset loading and management
  - Evaluator interface and built-in metrics (a custom-evaluator sketch follows this list)
  - Evaluation runner and reporting
  - Configuration management
- genkit/: Genkit integration for LLM-as-judge
  - LLM-based evaluators (faithfulness, relevance, safety)
  - Support for multiple models (Ollama, OpenAI, Gemini, Claude)
  - Streaming evaluation support
- rag/: RAG-specific evaluators
  - Faithfulness (answer grounded in documents)
  - Relevance (answer addresses query)
  - Citation accuracy
  - Retrieval metrics (NDCG, MRR, Precision@k, Recall@k)
- agent/: Agent-specific evaluators
  - Query classification accuracy
  - Tool selection correctness
  - Reasoning quality
- ui/: UI embedding helpers
  - Result formatting for dashboards
  - Visualization data structures
- cmd/evalaf/: CLI tool
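Since the framework is plugin-based (see Extensible above), custom evaluators can live alongside the built-ins. This README does not spell out the eval.Evaluator method set, so the sketch below defines its own illustrative interface; treat the method names and signatures as assumptions and adapt them to the real interface in the eval package.

```go
package customeval

import (
	"context"
	"strings"
)

// Evaluator is an illustrative stand-in for the real eval.Evaluator
// interface, whose exact method set may differ.
type Evaluator interface {
	Name() string
	Evaluate(ctx context.Context, input, output, reference string) (score float64, pass bool, err error)
}

// KeywordEvaluator passes when the output mentions every required keyword.
type KeywordEvaluator struct {
	name     string
	keywords []string
}

func NewKeywordEvaluator(name string, keywords ...string) *KeywordEvaluator {
	return &KeywordEvaluator{name: name, keywords: keywords}
}

func (e *KeywordEvaluator) Name() string { return e.name }

func (e *KeywordEvaluator) Evaluate(ctx context.Context, input, output, reference string) (float64, bool, error) {
	if len(e.keywords) == 0 {
		return 1, true, nil
	}
	hits := 0
	lower := strings.ToLower(output)
	for _, kw := range e.keywords {
		if strings.Contains(lower, strings.ToLower(kw)) {
			hits++
		}
	}
	score := float64(hits) / float64(len(e.keywords))
	return score, score == 1, nil
}
```

Once such a type satisfies the real interface, it can be appended to the evaluators slice passed to eval.NewRunner exactly as in the quick-start example.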
Built-in evaluators:

- Exact Match: Output exactly matches reference
- Regex: Output matches regex pattern
- Contains: Output contains substring
- Fuzzy Match: Levenshtein distance-based similarity
- Faithfulness: Answer grounded in retrieved documents (LLM-as-judge)
- Relevance: Answer addresses the query (LLM-as-judge)
- Citation Accuracy: Validates citation references like [doc_id X]
- NDCG@k: Retrieval ranking quality
- MRR: Mean Reciprocal Rank
- Precision@k / Recall@k: Retrieval effectiveness (a worked sketch of MRR and Precision@k follows this list)
- Classification Accuracy: Correct query classification
- Tool Selection: Appropriate tool chosen
- Reasoning Quality: Logical and coherent reasoning (LLM-as-judge)
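The retrieval metrics follow their standard definitions. As a point of reference (this is not the library's implementation), MRR and Precision@k over ranked document IDs can be computed as follows:

```go
package retrievalmath

// MRR averages 1/rank of the first relevant document across queries;
// a query with no relevant document retrieved contributes 0.
// Assumes len(relevant) == len(ranked).
func MRR(ranked [][]string, relevant []map[string]bool) float64 {
	if len(ranked) == 0 {
		return 0
	}
	sum := 0.0
	for i, docs := range ranked {
		for rank, id := range docs {
			if relevant[i][id] {
				sum += 1.0 / float64(rank+1)
				break
			}
		}
	}
	return sum / float64(len(ranked))
}

// PrecisionAtK is the share of the top-k retrieved documents that are
// relevant; the conventional denominator is k even if fewer were retrieved.
func PrecisionAtK(docs []string, relevant map[string]bool, k int) float64 {
	if k <= 0 {
		return 0
	}
	n := k
	if n > len(docs) {
		n = len(docs)
	}
	hits := 0
	for _, id := range docs[:n] {
		if relevant[id] {
			hits++
		}
	}
	return float64(hits) / float64(k)
}
```

Recall@k swaps the denominator for the total number of relevant documents, and NDCG@k additionally discounts gains by log2(rank + 1) and normalizes against the ideal ordering.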
Create an evalaf.yaml configuration file:

```yaml
version: 1

evaluators:
  # LLM-as-judge evaluators
  faithfulness:
    type: genkit_llm_judge
    model: ollama/mistral
    temperature: 0.0

  relevance:
    type: genkit_llm_judge
    model: ollama/mistral
    temperature: 0.0

  # RAG evaluators
  citation_accuracy:
    type: rag_citation
    pattern: '\[doc_id\s+([^\]]+)\]'

  retrieval_ndcg:
    type: rag_retrieval_metric
    metric: ndcg
    k: 10

  # Simple evaluators
  exact_match:
    type: exact_match

  regex_check:
    type: regex
    pattern: "(?i)expected_pattern"

datasets:
  - name: rag_quality
    path: testdata/datasets/rag_quality.json
    type: json

output:
  format: json
  path: results/report.json
  pretty: true

execution:
  parallel: true
  max_concurrency: 10
  timeout: 5m
```
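For programmatic use, the same settings map onto plain Go structs via gopkg.in/yaml.v3, which the core library already depends on. The struct and field names below are illustrative only; the eval package ships its own configuration types, which may be shaped differently.

```go
package main

import (
	"fmt"
	"log"
	"os"

	"gopkg.in/yaml.v3"
)

// These types mirror the YAML above for illustration only; the real
// configuration types live in the eval package and may differ.
type Config struct {
	Version    int                        `yaml:"version"`
	Evaluators map[string]EvaluatorConfig `yaml:"evaluators"`
	Datasets   []DatasetConfig            `yaml:"datasets"`
	Output     OutputConfig               `yaml:"output"`
	Execution  ExecutionConfig            `yaml:"execution"`
}

type EvaluatorConfig struct {
	Type        string  `yaml:"type"`
	Model       string  `yaml:"model,omitempty"`
	Temperature float64 `yaml:"temperature,omitempty"`
	Pattern     string  `yaml:"pattern,omitempty"`
	Metric      string  `yaml:"metric,omitempty"`
	K           int     `yaml:"k,omitempty"`
}

type DatasetConfig struct {
	Name string `yaml:"name"`
	Path string `yaml:"path"`
	Type string `yaml:"type"`
}

type OutputConfig struct {
	Format string `yaml:"format"`
	Path   string `yaml:"path"`
	Pretty bool   `yaml:"pretty"`
}

type ExecutionConfig struct {
	Parallel       bool   `yaml:"parallel"`
	MaxConcurrency int    `yaml:"max_concurrency"`
	Timeout        string `yaml:"timeout"`
}

func main() {
	raw, err := os.ReadFile("evalaf.yaml")
	if err != nil {
		log.Fatal(err)
	}
	var cfg Config
	if err := yaml.Unmarshal(raw, &cfg); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("loaded %d evaluators, %d datasets\n", len(cfg.Evaluators), len(cfg.Datasets))
}
```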
Datasets are JSON arrays of examples:

```json
[
  {
    "input": "What is the capital of France?",
    "reference": "Paris",
    "context": [
      "France is a country in Europe",
      "Paris is the capital and largest city of France"
    ],
    "metadata": {
      "domain": "geography",
      "difficulty": "easy"
    }
  }
]
```
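When loading a dataset yourself rather than through the CLI, the entries decode naturally into a struct. The type below illustrates that field mapping; it is an assumption, since the quick-start only shows Input and Reference on eval.Example and the package's real type may expose more or differently named fields.

```go
package datasetload

import (
	"encoding/json"
	"fmt"
	"os"
)

// Example mirrors one entry of the JSON dataset format shown above.
// Illustrative only; prefer the eval package's own dataset loader.
type Example struct {
	Input     string         `json:"input"`
	Reference string         `json:"reference"`
	Context   []string       `json:"context,omitempty"`
	Metadata  map[string]any `json:"metadata,omitempty"`
}

// LoadExamples reads a JSON array of examples from disk.
func LoadExamples(path string) ([]Example, error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var examples []Example
	if err := json.Unmarshal(raw, &examples); err != nil {
		return nil, fmt.Errorf("parsing %s: %w", path, err)
	}
	return examples, nil
}
```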
Evaluate Antfly's RAG system quality:

```go
// Evaluate RAG endpoint
ragEndpoint := "http://localhost:3210/api/v1/rag"

target := func(ctx context.Context, example eval.Example) (any, error) {
	return callAntflyRAG(ragEndpoint, example.Input)
}

runner.RunWithTarget(ctx, dataset, target)
```
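callAntflyRAG above is left to the integrator. A minimal sketch is shown below; it assumes the endpoint accepts a JSON body with a query field and replies with a JSON object containing an answer field, which may not match the actual Antfly RAG API.

```go
package ragclient

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// callAntflyRAG posts a query to the RAG endpoint and returns the answer
// text. The request and response shapes here are assumptions; adjust them
// to the real API.
func callAntflyRAG(endpoint, query string) (string, error) {
	body, err := json.Marshal(map[string]string{"query": query})
	if err != nil {
		return "", err
	}
	resp, err := http.Post(endpoint, "application/json", bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		msg, _ := io.ReadAll(resp.Body)
		return "", fmt.Errorf("rag endpoint returned %s: %s", resp.Status, msg)
	}

	var out struct {
		Answer string `json:"answer"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	return out.Answer, nil
}
```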
Monitor customer-specific prompts:

```go
// Evaluate customer's configuration
customerDataset := loadCustomerDataset(customerID)
customerConfig := loadCustomerConfig(customerID)

runner := eval.NewRunner(customerConfig, evaluators)
report, _ := runner.Run(ctx, customerDataset)

// Embed in searchaf dashboard
dashboardData := ui.FormatForDashboard(report)
```

Run regression checks in CI:

```bash
# In CI/CD pipeline
evalaf run --config ci_eval.yaml --dataset testdata/regression.json

# Fail if pass rate < 95%
if [ $(jq '.summary.pass_rate < 0.95' results.json) = "true" ]; then
  exit 1
fi
```

```go
// Compare two prompt variations
reportA, _ := runEvaluation(promptA, dataset)
reportB, _ := runEvaluation(promptB, dataset)

// Compare results
if reportB.Summary.AverageScore > reportA.Summary.AverageScore {
	fmt.Println("Prompt B is better!")
}
```
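runEvaluation above is a placeholder. One way to write it, reusing only the calls from the quick-start example, is sketched below; the eval.Dataset parameter, the *eval.Report return type, and the callModel helper are assumptions, not documented signatures of the library.

```go
package abtest

import (
	"context"
	"fmt"

	"github.com/antflydb/anteval/eval"
)

// callModel is a stand-in for whatever sends a rendered prompt to your
// model and returns its text output.
func callModel(ctx context.Context, prompt string) (string, error) {
	// ... call your LLM here ...
	return "", nil
}

// runEvaluation scores one prompt variation against a shared dataset,
// following the same pattern as the quick-start example. The parameter
// and return types are assumptions and may need adjusting.
func runEvaluation(prompt string, dataset eval.Dataset) (*eval.Report, error) {
	evaluators := []eval.Evaluator{
		eval.NewExactMatchEvaluator("exact_match"),
	}
	runner := eval.NewRunner(eval.DefaultConfig(), evaluators)

	target := func(ctx context.Context, example eval.Example) (any, error) {
		// Prepend the prompt variation to the example input; adapt the
		// rendering to however your prompts are templated.
		return callModel(ctx, fmt.Sprintf("%s\n\n%v", prompt, example.Input))
	}
	return runner.RunWithTarget(context.Background(), dataset, target)
}
```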
See the examples/ directory for complete examples:

- examples/simple/: Basic evaluation example
- examples/antfly/: Antfly RAG and Answer Agent evaluation
- examples/searchaf/: searchaf integration example
Project layout:

```
evalaf/
├── eval/          # Core library
├── genkit/        # Genkit integration
├── rag/           # RAG evaluators
├── agent/         # Agent evaluators
├── ui/            # UI helpers
├── cmd/evalaf/    # CLI tool
├── examples/      # Usage examples
├── testdata/      # Test datasets
├── work-log/      # Design docs
└── docs/          # Documentation
```
Run the simple example:

```bash
cd examples/simple
go run main.go
```

Run the tests:

```bash
go test ./...
```

Note: this project uses the module path github.com/antflydb/anteval while living in the evalaf/ directory of the antfly repository, so it is a nested Go module.
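Because the module is nested inside the antfly repository, local development from the repository root typically needs a go.work file (or a replace directive) so sibling code resolves github.com/antflydb/anteval to the evalaf/ directory. A minimal sketch, assuming the repository root is itself a Go module and adjusting paths to your checkout:

```
go 1.22

use (
	.
	./evalaf
)
```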
Dependencies:

- eval/ (core): minimal dependencies, stdlib plus gopkg.in/yaml.v3
- genkit/: github.com/firebase/genkit/go/genkit, github.com/firebase/genkit/go/ai
- cmd/evalaf/: github.com/spf13/cobra, github.com/spf13/viper
Roadmap:

- Core evaluation library
- Genkit integration (LLM-as-judge)
- RAG evaluators
- Agent evaluators
- CLI tool
- UI integration helpers
- Production trace extraction
- Web UI for evaluation management
- Continuous evaluation mode
- Multi-turn conversation evaluation
This project is part of the Antfly ecosystem. For contribution guidelines, see the main Antfly repository.
[Your license here]
- Antfly - Distributed key-value store and vector search engine
- Firebase Genkit - AI application framework
- promptfoo - Inspiration for this project
- Issues: GitHub Issues
- Documentation: See the docs/ directory
- Design Docs: See work-log/PLAN.md