promptdev
Promptdev is a prompt evaluation framework that provides comprehensive testing for AI agents across multiple providers.
Warning
promptdev is in preview and is not ready for production use.
We're working hard to make it stable and feature-complete, but until then, expect to encounter bugs, missing features, and fatal errors.
- Type Safe - Full Pydantic validation for inputs, outputs, and configurations
- PydanticAI Integration - Native support for PydanticAI agents (in progress) and evaluation framework
- Multi-Provider Testing - Test across OpenAI, Together.ai, Ollama, Bedrock, and more
- Performance Optimized - File-based caching with TTL for faster repeated evaluations
- Rich Reporting - Beautiful console output with detailed failure analysis and provider comparisons
- Promptfoo Compatible - Works with (some) existing promptfoo YAML configs and datasets
- Comprehensive Assertions - Built-in evaluators plus custom Python assertion support
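# Install the pre-release with pip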
pip install promptdev --pre
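# Or install from source with pip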
git clone https://github.com/artefactop/promptdev.git
cd promptdev
pip install -e .
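# Or install from source with uv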
git clone https://github.com/artefactop/promptdev.git
cd promptdev
uv sync
uv run promptdev --help
# Run evaluation (simple demo)
promptdev eval examples/demo/config.yaml
# Run evaluation (advanced example)
promptdev eval examples/calendar_event_summary/config.yaml
# Disable caching for a run
promptdev eval examples/demo/config.yaml --no-cache
# Export results
promptdev eval examples/demo/config.yaml --output json
promptdev eval examples/demo/config.yaml --output html
# Validate configuration
promptdev validate examples/demo/config.yaml
# Cache management
promptdev cache stats
promptdev cache clear
Promptdev supports a comprehensive set of evaluators for different testing scenarios:
Type | Description |
---|---|
equals | Checks if the output exactly equals the provided value |
contains | Checks if the output contains the expected value |
is_instance | Checks if the output is an instance of the type with the given name |
max_duration | Checks if the execution time is under the specified maximum |
is_json | Checks if the output is a valid JSON string (optional JSON schema validation) |
contains_json | Checks if the output contains valid JSON (optional JSON schema validation) |
python | Promptfoo compatible. Allows you to provide a custom Python function to validate the LLM output |
Promptdev uses YAML configuration files compatible with the promptfoo format, but only a subset of its options is supported for now:
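For illustration, a minimal config might look like the sketch below; the prompt text, provider id, model name, and values are placeholders, and the exact subset of keys promptdev accepts may differ:
# config.yaml - illustrative sketch in the promptfoo-compatible layout
prompts:
  - "Answer with only the capital city of {{country}}."
providers:
  - openai:gpt-4o-mini
tests:
  - vars:
      country: France
    assert:
      - type: contains
        value: Paris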
Promptdev maintains compatibility with promptfoo configurations to ease migration:
To migrate, if you are using provider ids in the format provider:chat|completion:model, drop the middle part so the id becomes provider:model; promptdev only supports chat. Some provider names also change: for example, togetherai is now together. Refer to the pydantic_ai models documentation for the full list.
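For example, a before/after sketch (model names are placeholders):
# promptfoo-style ids
#   - openai:chat:gpt-4o-mini
#   - togetherai:chat:meta-llama/Llama-3-8B-Instruct
# equivalent promptdev ids
providers:
  - openai:gpt-4o-mini
  - together:meta-llama/Llama-3-8B-Instruct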
- YAML configs - Most promptfoo YAML configs work with minimal changes
- JSONL datasets - Existing test datasets are fully supported
- Python assertions - Custom get_assert functions work without modification
- JSON schemas - Schema validation uses the same format (see the sketch below)
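As an illustration only (the keys mirror the promptfoo layout and the schema itself is a placeholder), a schema-backed assertion might look like:
assert:
  - type: is_json
    value:
      type: object
      required: [city]
      properties:
        city:
          type: string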
Warning
Promptdev can run custom Python assertions. While powerful, executing arbitrary Python code always carries security risks, so use this feature only with code you trust.
Example of a Python assertion:
# tests/data/python_assert.py
from typing import Any


def get_assert(output: str, context: dict) -> bool | float | dict[str, Any]:
    """Test assertion that checks if the output contains 'success'."""
    return "success" in str(output).lower()
# Setup development environment
uv sync
# Run tests
uv run pytest
# Format and lint code
uv run ruff check . --fix
uv run ruff format .
# Type checking
uv run ty check
- Core evaluation engine with PydanticAI integration
- Multi-provider support for major AI platforms
- YAML configuration loading with promptfoo compatibility
- Comprehensive assertion types (JSON schema, Python, LLM-based)
- File-based caching system with TTL support
- Rich console reporting with failure analysis
- Simple file disk cache
- Better integration with PydanticAI (avoid reinventing the wheel)
- Concurrent execution using PydanticAI natively, for faster large-scale evaluations
- Code cleanup
- Testing
- Testing promptfoo files
- Native support for PydanticAI agents
- Add support to run multiple config files with one command
- CI/CD integration helpers with change detection
- SQLite persistence for evaluation history and analytics
- Performance benchmarking and regression detection
We welcome contributions! Here's how to get started:
- Fork the repository
- Create a feature branch: git checkout -b feature/amazing-feature
- Install development dependencies: uv sync
- Make your changes and add tests
- Run tests: uv run pytest
- Commit your changes: git commit -m 'Add amazing feature'
- Push to the branch: git push origin feature/amazing-feature
- Open a Pull Request
We use ruff for code formatting and linting, ty for type checking, and pytest for testing. Please ensure your code follows these standards:
uv run ruff check . # Lint code
uv run ruff format . # Format code
uv run ty check # Type checking
uv run pytest # Run tests
This project is licensed under the MIT License - see the LICENSE file for details.
- Built on PydanticAI for type-safe AI agent development
- Inspired by promptfoo for evaluation concepts
- Uses Rich for beautiful console output