"promptfoo for golang" - A flexible, embeddable evaluation framework for LLM applications
Evalaf is a comprehensive evaluation framework for testing and monitoring LLM applications, including RAG systems, agents, and general LLM outputs. Built as a reusable Go module, it can be embedded in any Go project or used as a standalone CLI tool.
- Embeddable: Use as a library in your Go projects
- Extensible: Plugin-based architecture for custom evaluators
- Flexible: Evaluate RAG, agents, LLMs, or any system
- Configurable: YAML/JSON configuration with runtime customization
- Parallel: Concurrent evaluation for fast execution
- Multiple Output Formats: JSON, YAML, Markdown, HTML
- LLM-as-Judge: Use LLMs to evaluate LLM outputs (via Genkit integration)
- Production-Ready: Extract and evaluate production traces
Install:

```bash
go get github.com/antflydb/anteval
```

Quick start:

```go
package main

import (
	"context"

	"github.com/antflydb/anteval/eval"
)

func main() {
	ctx := context.Background()

	// Create dataset
	dataset := eval.NewJSONDatasetFromExamples("test", []eval.Example{
		{Input: "What is 2+2?", Reference: "4"},
	})

	// Create evaluators
	evaluators := []eval.Evaluator{
		eval.NewExactMatchEvaluator("exact_match"),
	}

	// Define target function (your LLM/RAG/agent)
	target := func(ctx context.Context, example eval.Example) (any, error) {
		// Call your system here
		return "4", nil
	}

	// Run evaluation
	runner := eval.NewRunner(eval.DefaultConfig(), evaluators)
	report, _ := runner.RunWithTarget(ctx, dataset, target)

	// Print results
	report.Print()
}
```

CLI usage:

```bash
# Run evaluation with config file
evalaf run --config evalaf.yaml
# List available metrics
evalaf metrics list
# Validate dataset
evalaf datasets validate testdata/datasets/rag_quality.json
```

The module is organized into the following packages:
- eval/: Core evaluation library (minimal dependencies)
  - Dataset loading and management
  - Evaluator interface and built-in metrics (a custom-evaluator sketch follows this list)
  - Evaluation runner and reporting
  - Configuration management
- genkit/: Genkit integration for LLM-as-judge
  - LLM-based evaluators (faithfulness, relevance, safety)
  - Support for multiple models (Ollama, OpenAI, Gemini, Claude)
  - Streaming evaluation support
- rag/: RAG-specific evaluators
  - Faithfulness (answer grounded in documents)
  - Relevance (answer addresses query)
  - Citation accuracy
  - Retrieval metrics (NDCG, MRR, Precision@k, Recall@k)
- agent/: Agent-specific evaluators
  - Query classification accuracy
  - Tool selection correctness
  - Reasoning quality
- ui/: UI embedding helpers
  - Result formatting for dashboards
  - Visualization data structures
- cmd/evalaf/: CLI tool
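Since the framework is plugin-based (see Extensible above), custom evaluators can live alongside the built-ins. This README does not spell out the eval.Evaluator method set, so the sketch below defines its own illustrative interface; treat the method names and signatures as assumptions and adapt them to the real interface in the eval package.

```go
package customeval

import (
	"context"
	"strings"
)

// Evaluator is an illustrative stand-in for the real eval.Evaluator
// interface, whose exact method set may differ.
type Evaluator interface {
	Name() string
	Evaluate(ctx context.Context, input, output, reference string) (score float64, pass bool, err error)
}

// KeywordEvaluator passes when the output mentions every required keyword.
type KeywordEvaluator struct {
	name     string
	keywords []string
}

func NewKeywordEvaluator(name string, keywords ...string) *KeywordEvaluator {
	return &KeywordEvaluator{name: name, keywords: keywords}
}

func (e *KeywordEvaluator) Name() string { return e.name }

func (e *KeywordEvaluator) Evaluate(ctx context.Context, input, output, reference string) (float64, bool, error) {
	if len(e.keywords) == 0 {
		return 1, true, nil
	}
	hits := 0
	lower := strings.ToLower(output)
	for _, kw := range e.keywords {
		if strings.Contains(lower, strings.ToLower(kw)) {
			hits++
		}
	}
	score := float64(hits) / float64(len(e.keywords))
	return score, score == 1, nil
}
```

Once such a type satisfies the real interface, it can be appended to the evaluators slice passed to eval.NewRunner exactly as in the quick-start example.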
Built-in evaluators:

- Exact Match: Output exactly matches reference
- Regex: Output matches regex pattern
- Contains: Output contains substring
- Fuzzy Match: Levenshtein distance-based similarity
- Faithfulness: Answer grounded in retrieved documents (LLM-as-judge)
- Relevance: Answer addresses the query (LLM-as-judge)
- Citation Accuracy: Validates citation references like [doc_id X]
- NDCG@k: Retrieval ranking quality
- MRR: Mean Reciprocal Rank
- Precision@k / Recall@k: Retrieval effectiveness (a worked sketch of MRR and Precision@k follows this list)
- Classification Accuracy: Correct query classification
- Tool Selection: Appropriate tool chosen
- Reasoning Quality: Logical and coherent reasoning (LLM-as-judge)
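The retrieval metrics follow their standard definitions. As a point of reference (this is not the library's implementation), MRR and Precision@k over ranked document IDs can be computed as follows:

```go
package retrievalmath

// MRR averages 1/rank of the first relevant document across queries;
// a query with no relevant document retrieved contributes 0.
// Assumes len(relevant) == len(ranked).
func MRR(ranked [][]string, relevant []map[string]bool) float64 {
	if len(ranked) == 0 {
		return 0
	}
	sum := 0.0
	for i, docs := range ranked {
		for rank, id := range docs {
			if relevant[i][id] {
				sum += 1.0 / float64(rank+1)
				break
			}
		}
	}
	return sum / float64(len(ranked))
}

// PrecisionAtK is the share of the top-k retrieved documents that are
// relevant; the conventional denominator is k even if fewer were retrieved.
func PrecisionAtK(docs []string, relevant map[string]bool, k int) float64 {
	if k <= 0 {
		return 0
	}
	n := k
	if n > len(docs) {
		n = len(docs)
	}
	hits := 0
	for _, id := range docs[:n] {
		if relevant[id] {
			hits++
		}
	}
	return float64(hits) / float64(k)
}
```

Recall@k swaps the denominator for the total number of relevant documents, and NDCG@k additionally discounts gains by log2(rank + 1) and normalizes against the ideal ordering.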
Create an evalaf.yaml configuration file:

```yaml
version: 1

evaluators:
  # LLM-as-judge evaluators
  faithfulness:
    type: genkit_llm_judge
    model: ollama/mistral
    temperature: 0.0

  relevance:
    type: genkit_llm_judge
    model: ollama/mistral
    temperature: 0.0

  # RAG evaluators
  citation_accuracy:
    type: rag_citation
    pattern: '\[doc_id\s+([^\]]+)\]'

  retrieval_ndcg:
    type: rag_retrieval_metric
    metric: ndcg
    k: 10

  # Simple evaluators
  exact_match:
    type: exact_match

  regex_check:
    type: regex
    pattern: "(?i)expected_pattern"

datasets:
  - name: rag_quality
    path: testdata/datasets/rag_quality.json
    type: json

output:
  format: json
  path: results/report.json
  pretty: true

execution:
  parallel: true
  max_concurrency: 10
  timeout: 5m
```
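For programmatic use, the same settings map onto plain Go structs via gopkg.in/yaml.v3, which the core library already depends on. The struct and field names below are illustrative only; the eval package ships its own configuration types, which may be shaped differently.

```go
package main

import (
	"fmt"
	"log"
	"os"

	"gopkg.in/yaml.v3"
)

// These types mirror the YAML above for illustration only; the real
// configuration types live in the eval package and may differ.
type Config struct {
	Version    int                        `yaml:"version"`
	Evaluators map[string]EvaluatorConfig `yaml:"evaluators"`
	Datasets   []DatasetConfig            `yaml:"datasets"`
	Output     OutputConfig               `yaml:"output"`
	Execution  ExecutionConfig            `yaml:"execution"`
}

type EvaluatorConfig struct {
	Type        string  `yaml:"type"`
	Model       string  `yaml:"model,omitempty"`
	Temperature float64 `yaml:"temperature,omitempty"`
	Pattern     string  `yaml:"pattern,omitempty"`
	Metric      string  `yaml:"metric,omitempty"`
	K           int     `yaml:"k,omitempty"`
}

type DatasetConfig struct {
	Name string `yaml:"name"`
	Path string `yaml:"path"`
	Type string `yaml:"type"`
}

type OutputConfig struct {
	Format string `yaml:"format"`
	Path   string `yaml:"path"`
	Pretty bool   `yaml:"pretty"`
}

type ExecutionConfig struct {
	Parallel       bool   `yaml:"parallel"`
	MaxConcurrency int    `yaml:"max_concurrency"`
	Timeout        string `yaml:"timeout"`
}

func main() {
	raw, err := os.ReadFile("evalaf.yaml")
	if err != nil {
		log.Fatal(err)
	}
	var cfg Config
	if err := yaml.Unmarshal(raw, &cfg); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("loaded %d evaluators, %d datasets\n", len(cfg.Evaluators), len(cfg.Datasets))
}
```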
Datasets are JSON arrays of examples:

```json
[
  {
    "input": "What is the capital of France?",
    "reference": "Paris",
    "context": [
      "France is a country in Europe",
      "Paris is the capital and largest city of France"
    ],
    "metadata": {
      "domain": "geography",
      "difficulty": "easy"
    }
  }
]
```
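When loading a dataset yourself rather than through the CLI, the entries decode naturally into a struct. The type below illustrates that field mapping; it is an assumption, since the quick-start only shows Input and Reference on eval.Example and the package's real type may expose more or differently named fields.

```go
package datasetload

import (
	"encoding/json"
	"fmt"
	"os"
)

// Example mirrors one entry of the JSON dataset format shown above.
// Illustrative only; prefer the eval package's own dataset loader.
type Example struct {
	Input     string         `json:"input"`
	Reference string         `json:"reference"`
	Context   []string       `json:"context,omitempty"`
	Metadata  map[string]any `json:"metadata,omitempty"`
}

// LoadExamples reads a JSON array of examples from disk.
func LoadExamples(path string) ([]Example, error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var examples []Example
	if err := json.Unmarshal(raw, &examples); err != nil {
		return nil, fmt.Errorf("parsing %s: %w", path, err)
	}
	return examples, nil
}
```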
Evaluate Antfly's RAG system quality:

```go
// Evaluate RAG endpoint
ragEndpoint := "http://localhost:3210/api/v1/rag"

target := func(ctx context.Context, example eval.Example) (any, error) {
	return callAntflyRAG(ragEndpoint, example.Input)
}

runner.RunWithTarget(ctx, dataset, target)
```
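callAntflyRAG above is left to the integrator. A minimal sketch is shown below; it assumes the endpoint accepts a JSON body with a query field and replies with a JSON object containing an answer field, which may not match the actual Antfly RAG API.

```go
package ragclient

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// callAntflyRAG posts a query to the RAG endpoint and returns the answer
// text. The request and response shapes here are assumptions; adjust them
// to the real API.
func callAntflyRAG(endpoint, query string) (string, error) {
	body, err := json.Marshal(map[string]string{"query": query})
	if err != nil {
		return "", err
	}
	resp, err := http.Post(endpoint, "application/json", bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		msg, _ := io.ReadAll(resp.Body)
		return "", fmt.Errorf("rag endpoint returned %s: %s", resp.Status, msg)
	}

	var out struct {
		Answer string `json:"answer"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	return out.Answer, nil
}
```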
Monitor customer-specific prompts:

```go
// Evaluate customer's configuration
customerDataset := loadCustomerDataset(customerID)
customerConfig := loadCustomerConfig(customerID)

runner := eval.NewRunner(customerConfig, evaluators)
report, _ := runner.Run(ctx, customerDataset)

// Embed in searchaf dashboard
dashboardData := ui.FormatForDashboard(report)
```

Run regression checks in CI:

```bash
# In CI/CD pipeline
evalaf run --config ci_eval.yaml --dataset testdata/regression.json

# Fail if pass rate < 95%
if [ $(jq '.summary.pass_rate < 0.95' results.json) = "true" ]; then
  exit 1
fi
```

```go
// Compare two prompt variations
reportA, _ := runEvaluation(promptA, dataset)
reportB, _ := runEvaluation(promptB, dataset)

// Compare results
if reportB.Summary.AverageScore > reportA.Summary.AverageScore {
	fmt.Println("Prompt B is better!")
}
```
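runEvaluation above is a placeholder. One way to write it, reusing only the calls from the quick-start example, is sketched below; the eval.Dataset parameter, the *eval.Report return type, and the callModel helper are assumptions, not documented signatures of the library.

```go
package abtest

import (
	"context"
	"fmt"

	"github.com/antflydb/anteval/eval"
)

// callModel is a stand-in for whatever sends a rendered prompt to your
// model and returns its text output.
func callModel(ctx context.Context, prompt string) (string, error) {
	// ... call your LLM here ...
	return "", nil
}

// runEvaluation scores one prompt variation against a shared dataset,
// following the same pattern as the quick-start example. The parameter
// and return types are assumptions and may need adjusting.
func runEvaluation(prompt string, dataset eval.Dataset) (*eval.Report, error) {
	evaluators := []eval.Evaluator{
		eval.NewExactMatchEvaluator("exact_match"),
	}
	runner := eval.NewRunner(eval.DefaultConfig(), evaluators)

	target := func(ctx context.Context, example eval.Example) (any, error) {
		// Prepend the prompt variation to the example input; adapt the
		// rendering to however your prompts are templated.
		return callModel(ctx, fmt.Sprintf("%s\n\n%v", prompt, example.Input))
	}
	return runner.RunWithTarget(context.Background(), dataset, target)
}
```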
See the examples/ directory for complete examples:

- examples/simple/: Basic evaluation example
- examples/antfly/: Antfly RAG and Answer Agent evaluation
- examples/searchaf/: searchaf integration example
Project layout:

```
evalaf/
├── eval/          # Core library
├── genkit/        # Genkit integration
├── rag/           # RAG evaluators
├── agent/         # Agent evaluators
├── ui/            # UI helpers
├── cmd/evalaf/    # CLI tool
├── examples/      # Usage examples
├── testdata/      # Test datasets
├── work-log/      # Design docs
└── docs/          # Documentation
```
Run the simple example:

```bash
cd examples/simple
go run main.go
```

Run the tests:

```bash
go test ./...
```

Note: this project uses the module path github.com/antflydb/anteval while living in the evalaf/ directory of the antfly repository, so it is a nested Go module.
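Because the module is nested inside the antfly repository, local development from the repository root typically needs a go.work file (or a replace directive) so sibling code resolves github.com/antflydb/anteval to the evalaf/ directory. A minimal sketch, assuming the repository root is itself a Go module and adjusting paths to your checkout:

```
go 1.22

use (
	.
	./evalaf
)
```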
Dependencies:

- eval/ (core): minimal dependencies, stdlib plus gopkg.in/yaml.v3
- genkit/: github.com/firebase/genkit/go/genkit, github.com/firebase/genkit/go/ai
- cmd/evalaf/: github.com/spf13/cobra, github.com/spf13/viper
Roadmap:

- Core evaluation library
- Genkit integration (LLM-as-judge)
- RAG evaluators
- Agent evaluators
- CLI tool
- UI integration helpers
- Production trace extraction
- Web UI for evaluation management
- Continuous evaluation mode
- Multi-turn conversation evaluation
This project is part of the Antfly ecosystem. For contribution guidelines, see the main Antfly repository.
[Your license here]
- Antfly - Distributed key-value store and vector search engine
- Firebase Genkit - AI application framework
- promptfoo - Inspiration for this project
- Issues: GitHub Issues
- Documentation: See the docs/ directory
- Design Docs: See work-log/PLAN.md