evalctl
v1.4.2 · MIT License
A fast, extensible CLI for running and evaluating LLM / AI model outputs from your terminal. Define prompt suites, compare models side by side, track metrics over time, and plug evaluations into your CI pipeline.
$ curl -fsSL https://evalctl.dev/install.sh | sh
# Installing evalctl v1.4.2...
# Detected linux/amd64
# Installed to /usr/local/bin/evalctl
$ evalctl run "What is the capital of France?" --model gpt-4o
→ Paris
Installation
evalctl is distributed as a single binary with zero runtime dependencies. Choose the method that works best for your environment.
macOS
# Homebrew (recommended)
brew install evalctl/evalctl/evalctl
# Or download directly
curl -fsSL https://evalctl.dev/install.sh | sh
Linux
# x86_64 / amd64
curl -fsSL https://evalctl.dev/install.sh | sh
# Via apt (Debian/Ubuntu)
echo "deb [trusted=yes] https://apt.evalctl.dev /" | sudo tee /etc/apt/sources.list.d/evalctl.list
sudo apt update && sudo apt install evalctl
npm
npm install -g evalctl
Go
go install github.com/example/evalctl@latest
Verify the Installation
evalctl version
# evalctl v1.4.2 (linux/amd64)
evalctl --help
# A CLI for running and evaluating LLM model outputs
#
# Usage:
#   evalctl [command] [flags]
#
# Available Commands:
#   run       Run a single evaluation
#   watch     Watch files and re-run on changes
#   compare   Compare outputs across multiple models
#   export    Export evaluation results
Quick Start
Get up and running in under a minute. evalctl sends prompts to LLMs and scores the outputs.
Your First Evaluation
evalctl run "What is the capital of France?" --model gpt-4o
This sends the prompt to GPT-4o and prints the response. Add --json for structured output:
evalctl run "Solve: 2x + 5 = 15" --model gpt-4o --json
Using a Config File
Create an evalctl.yaml file to define prompt suites and evaluation criteria:
# evalctl.yaml
models:
  - id: gpt-4o
    provider: openai
    temperature: 0.7
  - id: claude-3-opus
    provider: anthropic
    temperature: 0.5
prompts:
  - id: translation
    template: "Translate this to French: {{.input}}"
    cases:
      - input: "Hello, how are you?"
      - input: "The weather is nice today."
metrics:
  - name: exact_match
  - name: contains_keywords
    params:
      keywords: ["bonjour", "comment"]

evalctl run --config evalctl.yaml
Viewing Results
Results are printed to stdout by default. Pass --output results.json to save them, or use the export command to generate reports.
Configuration
evalctl can be configured via a YAML config file, environment variables, or CLI flags. The precedence (highest to lowest) is: CLI flags, environment variables, config file, defaults.
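The precedence order can be pictured as a lookup that returns the first source providing a value. The sketch below is illustrative only: `resolveSetting` and the shape of its arguments are assumptions made for this example, not evalctl internals.

```javascript
// Illustrative sketch of the documented precedence, not evalctl's actual code.
// Each source is a plain object holding the values that source provides.
function resolveSetting(key, { flags = {}, env = {}, file = {}, defaults = {} }) {
  // Highest priority first: CLI flags, environment variables, config file, defaults.
  for (const source of [flags, env, file, defaults]) {
    if (source[key] !== undefined) return source[key];
  }
  return undefined;
}

const sources = {
  flags: {},                                  // no --model flag passed
  env: { model: "claude-3-opus" },            // e.g. set via EVALCTL_DEFAULTS_MODEL
  file: { model: "gpt-4o", timeout: "30s" },  // values from the config file
  defaults: { model: "gpt-4o", timeout: "30s", format: "table" },
};

console.log(resolveSetting("model", sources));   // claude-3-opus (env beats file)
console.log(resolveSetting("timeout", sources)); // 30s (falls through to the file)
```

Passing a flag value, for example `flags: { model: "llama-3" }`, would win over every other source.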
Config File Location
evalctl looks for config files in the following order:
- ./evalctl.yaml or ./evalctl.yml in the current directory
- ~/.config/evalctl/config.yaml (user-level)
- /etc/evalctl/config.yaml (system-wide)
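The search order can be sketched as a list of candidate paths checked one by one. This example assumes the first existing file wins and that no merging across files takes place, which the documentation does not state explicitly. `configCandidates` and `findConfig` are hypothetical helper names.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Candidate paths in the documented search order.
function configCandidates(cwd = process.cwd(), home = os.homedir()) {
  return [
    path.posix.join(cwd, "evalctl.yaml"),
    path.posix.join(cwd, "evalctl.yml"),
    path.posix.join(home, ".config", "evalctl", "config.yaml"),
    "/etc/evalctl/config.yaml",
  ];
}

// Returns the first candidate that exists, or null if none do.
// `exists` is injectable so the lookup can be exercised without touching disk.
function findConfig({ cwd, home, exists = fs.existsSync } = {}) {
  return configCandidates(cwd, home).find((p) => exists(p)) ?? null;
}
```

Calling `findConfig()` with no arguments checks the real file system, starting from the current working directory.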
# ~/.config/evalctl/config.yaml
defaults:
  model: gpt-4o
  temperature: 0.7
  max_tokens: 2048
  timeout: 30s
output:
  format: table
  color: auto
plugins:
  enabled:
    - custom_scorer
  paths:
    - ~/.config/evalctl/plugins

You can override any config value with an environment variable. For example, EVALCTL_DEFAULTS_MODEL=claude-3-opus changes the default model.
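The variable names appear to follow a simple pattern: the EVALCTL_ prefix, then the config key path in upper case with dots replaced by underscores. The sketch below encodes that inferred convention. It is an assumption based on the documented variable names, and keys that contain lists or nested plugin settings may not follow it.

```javascript
// Assumed convention, inferred from EVALCTL_DEFAULTS_MODEL -> defaults.model:
// prefix with EVALCTL_, upper-case the key path, replace dots with underscores.
function envVarFor(keyPath) {
  return "EVALCTL_" + keyPath.toUpperCase().replace(/\./g, "_");
}

console.log(envVarFor("defaults.model"));       // EVALCTL_DEFAULTS_MODEL
console.log(envVarFor("defaults.temperature")); // EVALCTL_DEFAULTS_TEMPERATURE
console.log(envVarFor("output.format"));        // EVALCTL_OUTPUT_FORMAT
```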
Commands Overview
evalctl provides four main commands for running, monitoring, comparing, and exporting evaluations. Every command respects the config file and supports the common flags listed below.
Global Flags
| Flag | Type | Default | Description |
|---|---|---|---|
| --config | string | evalctl.yaml | Path to config file |
| --log-level | string | info | Log level: debug, info, warn, error |
| --json | bool | | Output results as JSON |
| --no-color | bool | | Disable colored output |
| --timeout | duration | 30s | Request timeout per prompt |
evalctl run
Execute a single evaluation against one or more models. This is the primary command for ad-hoc testing and development.
evalctl run <prompt> [flags]
# Basic usage
evalctl run "Write a haiku about Rust" --model gpt-4o
# With a system prompt
evalctl run "Sum: 1 + 1" --model gpt-4o --system "You are a math tutor. Respond concisely."
# Multiple prompts from a file
evalctl run --prompts prompts.txt --model gpt-4o
# Against multiple models
evalctl run "Explain quantum computing" --model gpt-4o,claude-3-opus
Flags
| Flag | Type | Default | Description |
|---|---|---|---|
| --model | string | gpt-4o | Model(s) to evaluate, comma-separated |
| --system | string | | System prompt for the model |
| --temperature | float | 0.7 | Sampling temperature |
| --max-tokens | int | 2048 | Maximum tokens in response |
| --prompts | string | | Path to file with one prompt per line |
| --output | string | | Path to write results (default: stdout) |
| --format | string | text | Output format: text, json, yaml, markdown |
API keys are read from the environment. Set OPENAI_API_KEY, ANTHROPIC_API_KEY, or the provider-specific variable before running. See Environment Variables for details.
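A pre-flight check can catch a missing key before any request is sent. The sketch below is a hypothetical helper, not part of evalctl. The provider names and variable names are taken from the Model Schema and Environment Variables sections of this page.

```javascript
// Provider-to-variable mapping, as listed in the Environment Variables table.
const API_KEY_VARS = {
  openai: "OPENAI_API_KEY",
  anthropic: "ANTHROPIC_API_KEY",
  google: "GOOGLE_API_KEY",
};

// Returns the names of the variables that are unset or empty for the given providers.
// Providers without a known variable (for example custom ones) are skipped.
function missingApiKeys(providers, env = process.env) {
  return providers
    .map((p) => API_KEY_VARS[p])
    .filter((name) => name !== undefined && !env[name]);
}

const missing = missingApiKeys(["openai", "anthropic"]);
if (missing.length > 0) {
  console.error(`Set these variables before running evalctl: ${missing.join(", ")}`);
}
```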
evalctl watch
Watch a directory or config file for changes and automatically re-run evaluations. Ideal for iterative prompt engineering.
evalctl watch [path] [flags]
# Watch current directory for changes
evalctl watch
# Watch a specific config file
evalctl watch --config evalctl.yaml
# Watch with notifications
evalctl watch --notify
# Custom poll interval
evalctl watch --interval 2s
Flags
| Flag | Type | Default | Description |
|---|---|---|---|
| --interval | duration | 1s | File system poll interval |
| --notify | bool | false | Send desktop notifications on completion |
| --format | string | table | Output format for each run |
| --output | string | | Accumulate results to file |
The watch command uses file system events where available (inotify on Linux, FSEvents on macOS) and falls back to polling. Pass --interval 500ms for faster feedback during active editing.
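The "events first, polling as fallback" strategy can be sketched with Node's own file-watching APIs. This is not evalctl's implementation. `watchPath` is a hypothetical helper, and the `watch` and `poll` options exist only so the two branches can be swapped out in tests.

```javascript
const fs = require("node:fs");

// Tries event-based watching first and falls back to stat polling if it fails.
function watchPath(target, onChange, { intervalMs = 1000, watch = fs.watch, poll = fs.watchFile } = {}) {
  try {
    // fs.watch relies on OS facilities such as inotify where available.
    const watcher = watch(target, () => onChange(target));
    return { mode: "events", close: () => watcher.close() };
  } catch (err) {
    // Event watching is unavailable: poll the file at the given interval instead.
    poll(target, { interval: intervalMs }, () => onChange(target));
    return { mode: "polling", close: () => fs.unwatchFile(target) };
  }
}
```

The default of 1000 ms mirrors the documented 1s default for --interval. A shorter interval gives faster feedback at the cost of more frequent stat calls.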
evalctl compare
Run the same prompt against multiple models and compare their outputs side by side. Results are scored using configurable metrics.
evalctl compare <prompt> --models <list> [flags]
# Compare two models
evalctl compare "Write a Python Fibonacci function" --models gpt-4o,claude-3-opus
# Compare with custom metrics
evalctl compare "Translate: Good morning" --models gpt-4o,llama-3 \
--metrics exact_match,contains_keywords
# Compare from config
evalctl compare --config evalctl.yaml --models gpt-4o,claude-3-opus,gemini-pro
Flags
| Flag | Type | Default | Description |
|---|---|---|---|
| --models | string | | Comma-separated list of models |
| --metrics | string | exact_match | Comma-separated metric names |
| --output | string | | Path to write comparison report |
| --format | string | table | Output format: table, json, markdown, html |
Example Output
$ evalctl compare "Explain polymorphism" --models gpt-4o,claude-3-opus --metrics exact_match,contains_keywords
┌──────────────────┬──────────────────┬──────────────────┐
│ Metric │ gpt-4o │ claude-3-opus │
├──────────────────┼──────────────────┼──────────────────┤
│ exact_match │ 0.72 │ 0.78 │
│ contains_keywords│ 0.91 │ 0.89 │
│ latency (ms) │ 1,240 │ 1,870 │
│ tokens │ 342 │ 298 │
└──────────────────┴──────────────────┴──────────────────┘
evalctl export
Export evaluation results to various formats for reporting, dashboards, or archival. Works with previous run results stored on disk or in the evalctl history database.
evalctl export <output-file> [flags]
# Export last run to Markdown
evalctl export report.md --format markdown
# Export to JSON
evalctl export results.json --format json
# Filter by date range
evalctl export weekly.md --from 2024-01-01 --to 2024-01-07
# Filter by model
evalctl export gpt4-results.json --filter model=gpt-4o
Flags
| Flag | Type | Default | Description |
|---|---|---|---|
| --format | string | json | Output format: json, yaml, markdown, html, csv |
| --from | string | | Start date (ISO 8601 or RFC 3339) |
| --to | string | | End date |
| --filter | string | | Key=value filter (e.g. model=gpt-4o) |
| --pretty | bool | false | Pretty-print output |
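To make the filtering flags concrete, here is a rough sketch of how --from, --to, and --filter might combine when selecting records. The record shape (a `timestamp` string plus arbitrary fields) and both helper names are hypothetical, since the history format is not documented on this page.

```javascript
// Parses a "key=value" filter expression into a [key, value] pair.
function parseFilter(expr) {
  const idx = expr.indexOf("=");
  if (idx <= 0) throw new Error(`invalid filter "${expr}", expected key=value`);
  return [expr.slice(0, idx), expr.slice(idx + 1)];
}

// Keeps records whose timestamp lies in [from, to] (when given) and that match the filter.
// Assumed record shape: { model: "gpt-4o", timestamp: "2024-01-03T10:00:00Z", ... }.
function selectRecords(records, { from, to, filter } = {}) {
  const start = from ? Date.parse(from) : -Infinity;
  const end = to ? Date.parse(to) : Infinity;
  const [key, value] = filter ? parseFilter(filter) : [null, null];
  return records.filter((r) => {
    const t = Date.parse(r.timestamp);
    if (t < start || t > end) return false;
    return key === null || String(r[key]) === value;
  });
}
```

In this sketch a date-only bound such as 2024-01-07 is parsed as midnight UTC at the start of that day, so later records from the same day are excluded. Whether evalctl treats the end date as inclusive of the whole day is not specified here.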
CI Integration
evalctl is designed to be a first-class citizen in your CI pipeline. Use it to automatically evaluate model outputs on every commit, PR, or deployment.
GitHub Actions
# .github/workflows/eval.yml
name: Evaluate Models
on: [pull_request]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install evalctl
        run: curl -fsSL https://evalctl.dev/install.sh | sh
      - name: Run evaluations
        run: evalctl run --config evalctl.yaml --json --output results.json
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      - name: Check quality gate
        run: evalctl gate --threshold 0.8 --results results.json
Quality Gates
Use evalctl gate to enforce minimum scores. The command exits with code 2 (quality gate failure, see Exit Codes) when scores fall below the threshold, which fails the CI step:
# Fail if overall score is below 0.75
evalctl gate --threshold 0.75 --results results.json
# Fail if any individual metric is below 0.6
evalctl gate --min-per-metric 0.6 --results results.json

evalctl gate requires no separate config: it reads the same config file as evalctl run. Set ci.threshold in your evalctl.yaml for project defaults.
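The two gate checks can be sketched as follows. This example assumes that results map each metric name to a score between 0 and 1 and that the "overall score" is the unweighted mean of those scores. Neither detail is confirmed by this page, and `checkGate` is a hypothetical name.

```javascript
// Sketch of the --threshold and --min-per-metric checks under the stated assumptions.
function checkGate(scores, { threshold, minPerMetric } = {}) {
  const values = Object.values(scores);
  if (values.length === 0) throw new Error("no scores to evaluate");
  // Assumed definition of the overall score: unweighted mean across metrics.
  const overall = values.reduce((sum, s) => sum + s, 0) / values.length;
  const failures = [];
  if (threshold !== undefined && overall < threshold) {
    failures.push(`overall score ${overall.toFixed(2)} is below ${threshold}`);
  }
  if (minPerMetric !== undefined) {
    for (const [name, score] of Object.entries(scores)) {
      if (score < minPerMetric) failures.push(`${name} score ${score} is below ${minPerMetric}`);
    }
  }
  return { passed: failures.length === 0, overall, failures };
}

const result = checkGate({ exact_match: 0.72, contains_keywords: 0.91 }, { threshold: 0.75 });
console.log(result.passed); // true, because the mean is above 0.75
```

Under these assumptions the same scores would fail a per-metric minimum of 0.8, because exact_match falls below it even though the mean does not.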
Custom Metrics
evalctl ships with a set of built-in metrics, but you can define your own using JavaScript or Lua. Custom metrics live alongside your evalctl.yaml config and are loaded automatically.
Built-in Metrics
| Metric | Type | Parameters | Description |
|---|---|---|---|
| exact_match | builtin | | Response exactly matches expected output |
| contains_keywords | builtin | | Response contains required keywords |
| json_valid | builtin | | Response is valid JSON |
| length | builtin | min/max | Response length within bounds |
| regex | builtin | | Response matches a regular expression |
| semantic_similarity | builtin | | Cosine similarity with reference (requires embedding model) |
JavaScript Metric
// metrics/sentiment.js
// Custom metric: classify response sentiment
module.exports = {
  name: "sentiment",
  description: "Classifies response sentiment (positive/negative/neutral)",
  async evaluate({ response, expected, context }) {
    const positiveWords = ["good", "great", "excellent", "helpful", "correct"];
    const negativeWords = ["bad", "wrong", "incorrect", "poor", "unhelpful"];
    const lower = response.toLowerCase();
    const pos = positiveWords.filter((w) => lower.includes(w)).length;
    const neg = negativeWords.filter((w) => lower.includes(w)).length;
    if (pos > neg) return { score: 1.0, label: "positive" };
    if (neg > pos) return { score: 0.0, label: "negative" };
    return { score: 0.5, label: "neutral" };
  },
};

# Register the metric in evalctl.yaml
metrics:
  - name: sentiment
    script: metrics/sentiment.js
  - name: exact_match

Metrics run in a sandboxed environment with access to response, expected, context, and config. The return value must include a score between 0 and 1.
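When developing a metric locally, it can help to check its return value against this contract before handing it to evalctl. The validator below is a hypothetical helper for that purpose, not something evalctl provides. It treats the bounds as inclusive, in line with the 0.0 and 1.0 scores returned by the sentiment example.

```javascript
// Checks that a metric result is an object with a numeric `score` in [0, 1].
function validateMetricResult(result) {
  if (result === null || typeof result !== "object") {
    throw new TypeError("metric must return an object");
  }
  const { score } = result;
  if (typeof score !== "number" || Number.isNaN(score) || score < 0 || score > 1) {
    throw new RangeError(`score must be a number in [0, 1], got ${score}`);
  }
  return result;
}

validateMetricResult({ score: 0.5, label: "neutral" }); // passes
// validateMetricResult({ score: 1.5 });                // would throw a RangeError
```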
Plugin System
Extend evalctl with plugins for custom output formats, model providers, data sources, and notification channels. Plugins are standalone executables or WASM modules discovered in configured paths.
How Plugins Work
evalctl communicates with plugins via stdin/stdout using a JSON-RPC protocol. Every plugin implements a capabilities method and one or more action methods:
| Name | Type | Description |
|---|---|---|
| provider | plugin | Add a new model provider (e.g. groq, together) |
| formatter | plugin | Custom output format |
| notifier | plugin | Send results to Slack, Discord, email, etc. |
| datasource | plugin | Load prompts from external sources (DB, S3, etc.) |
| scorer | plugin | Custom evaluation metrics (alternative to JS/Lua) |
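As a rough illustration of the stdin/stdout exchange, here is a minimal notifier-style plugin skeleton. Several details are assumptions made for the example: JSON-RPC 2.0 envelopes, one JSON message per line, the shape of the capabilities result, and the hypothetical `notify` action. Only the `capabilities` method name comes from the description above.

```javascript
const readline = require("node:readline");

// Maps one JSON-RPC request object to one response object.
function handleRequest(req) {
  const reply = (result) => ({ jsonrpc: "2.0", id: req.id, result });
  switch (req.method) {
    case "capabilities":
      // Assumed result shape: the plugin type and the actions it supports.
      return reply({ type: "notifier", methods: ["notify"] });
    case "notify":
      // Hypothetical action; a real notifier would deliver req.params somewhere.
      return reply({ delivered: true });
    default:
      // -32601 is the JSON-RPC 2.0 "Method not found" error code.
      return { jsonrpc: "2.0", id: req.id, error: { code: -32601, message: "Method not found" } };
  }
}

// Reads one request per line from `input` and writes one response per line to `output`.
function serve(input, output) {
  const rl = readline.createInterface({ input });
  rl.on("line", (line) => {
    if (line.trim() === "") return;
    output.write(JSON.stringify(handleRequest(JSON.parse(line))) + "\n");
  });
}
```

A standalone plugin script would end with `serve(process.stdin, process.stdout);`. Keeping `handleRequest` separate from the I/O loop makes the protocol logic easy to test on its own.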
Configuration
# evalctl.yaml
plugins:
  paths:
    - ~/.config/evalctl/plugins
    - ./plugins
  enabled:
    - provider-groq
    - notifier-slack
    - formatter-html
  # Per-plugin config
  config:
    notifier-slack:
      webhook: "https://hooks.slack.com/services/..."
      channel: "#eval-results"
    provider-groq:
      api_key_env: GROQ_API_KEY

# Install a plugin
evalctl plugin install provider-groq
# List installed plugins
evalctl plugin list
# Test a plugin
evalctl plugin test notifier-slack "Evaluation complete"
Config File Schema
The evalctl.yaml config file supports the following top-level keys:
| Key | Type | Description |
|---|---|---|
| defaults | object | Default settings for all evaluations |
| models | array | List of model definitions |
| prompts | object | Prompt suite definitions |
| metrics | array | Metric configurations |
| output | object | Output format and destination |
| plugins | object | Plugin system configuration |
| ci | object | CI-specific settings |
Defaults Schema
# Full defaults structure
defaults:
  model: string              # Default model ID
  temperature: float         # 0.0 - 2.0
  max_tokens: integer        # Max tokens per response
  top_p: float               # Nucleus sampling
  frequency_penalty: float   # -2.0 - 2.0
  presence_penalty: float    # -2.0 - 2.0
  stop: string | [string]    # Stop sequences
  timeout: duration          # Per-request timeout
  retries: integer           # Retry count on failure (default: 3)
  retry_delay: duration      # Delay between retries (default: 1s)
Model Schema
# Model definition
models:
  - id: string            # Unique identifier
    provider: string      # Provider name (openai, anthropic, google, custom)
    model_name: string    # Actual model name sent to API
    api_key_env: string   # Env var for API key (optional)
    base_url: string      # Custom API endpoint (optional)
    temperature: float    # Per-model override (optional)
    max_tokens: integer   # Per-model override (optional)
    metadata: {}          # Arbitrary metadata (optional)
Environment Variables
evalctl reads configuration from environment variables. CLI flags override these values, and environment variables in turn override the config file.
| Variable | Type | Default | Description |
|---|---|---|---|
| OPENAI_API_KEY | string | | API key for OpenAI models |
| ANTHROPIC_API_KEY | string | | API key for Anthropic models |
| GOOGLE_API_KEY | string | | API key for Google / Gemini models |
| EVALCTL_DEFAULTS_MODEL | string | | Override the default model |
| EVALCTL_DEFAULTS_TEMPERATURE | float | | Override default temperature |
| EVALCTL_OUTPUT_FORMAT | string | | Override output format |
| EVALCTL_LOG_LEVEL | string | info | Set log level |
| EVALCTL_CONFIG | string | | Path to config file |
| EVALCTL_PLUGIN_DIR | string | | Plugin directory path |
| EVALCTL_CACHE_DIR | string | ~/.cache/evalctl | Cache directory |
| EVALCTL_HOME | string | ~/.config/evalctl | evalctl home directory |
| HTTP_PROXY | string | | HTTP proxy for API requests |
| HTTPS_PROXY | string | | HTTPS proxy for API requests |
| NO_PROXY | string | | Comma-separated no-proxy domains |
API keys set via environment variables are read at startup. If you rotate keys, restart evalctl or re-source your shell profile. Never commit API keys to version control.
Exit Codes
evalctl uses the following exit codes for programmatic use in scripts and CI pipelines:
| Code | Type | Description |
|---|---|---|
| 0 | code | Success - all evaluations completed |
| 1 | code | General error - invalid flags, config, or runtime failure |
| 2 | code | Quality gate failure - score below threshold |
| 3 | code | Model error - API returned an error or timeout |
| 4 | code | Config error - invalid YAML or missing required fields |
| 5 | code | Plugin error - plugin crashed or returned invalid data |
| 6 | code | File error - input/output file not found or unwritable |
# Use in a shell script
evalctl run --config evalctl.yaml --json --output results.json
case $? in
  0) echo "All evaluations passed" ;;
  2) echo "Quality gate failed" && exit 1 ;;
  3) echo "Model API error" && exit 1 ;;
  *) echo "Unknown error" && exit 1 ;;
esac

Exit code 2 is specifically designed for CI gates. Your pipeline can use evalctl gate --threshold 0.8 to convert a low-scoring run into a build failure.
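The same handling can be done from a Node script, for instance in a custom CI step. The sketch below is a hypothetical wrapper. It assumes the evalctl binary is on PATH and uses only the exit codes listed in the table above.

```javascript
const { spawnSync } = require("node:child_process");

// Exit codes as listed in the Exit Codes table.
const EXIT_CODES = {
  0: "success",
  1: "general error",
  2: "quality gate failure",
  3: "model error",
  4: "config error",
  5: "plugin error",
  6: "file error",
};

// Translates a numeric exit code into its documented meaning.
function describeExit(code) {
  return EXIT_CODES[code] ?? `unknown exit code ${code}`;
}

// Runs evalctl with the given arguments and reports how it exited.
function runEvalctl(args) {
  const { status, error } = spawnSync("evalctl", args, { stdio: "inherit" });
  if (error) throw error; // for example, the binary was not found
  return { code: status, meaning: describeExit(status) };
}
```

For example, `runEvalctl(["gate", "--threshold", "0.8", "--results", "results.json"])` returns the exit code together with its meaning, so the caller can decide whether to fail the build.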
How to Contribute
evalctl is open source and we welcome contributions. Whether you are fixing a bug, adding a feature, or improving documentation, here is how to get started.
Development Setup
# Clone the repository
git clone https://github.com/example/evalctl
cd evalctl
# Build from source
go build ./cmd/evalctl
# Run tests
go test ./...
# Run linting
golangci-lint run ./...
Contribution Process
- Search existing issues and pull requests to avoid duplication
- Open a new issue to discuss significant changes before implementing
- Fork the repository and create a feature branch from main
- Write tests for your changes - coverage should not decrease
- Ensure all tests pass and the linter is clean
- Open a pull request with a clear description of the changes
Standards
- Follow Go idioms and best practices (use gofmt, go vet)
- Write descriptive commit messages in the imperative mood
- Keep pull requests focused - one feature or fix per PR
- Update documentation and CHANGELOG.md for user-facing changes
- Add test cases for both success and failure paths
Code of Conduct
evalctl follows the Contributor Covenant code of conduct. We are committed to providing a welcoming and inclusive environment for everyone.
Our Pledge
We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.
Our Standards
Examples of behavior that contributes to a positive environment:
- Demonstrating empathy and kindness toward other people
- Being respectful of differing opinions, viewpoints, and experiences
- Giving and gracefully accepting constructive feedback
- Accepting responsibility and apologizing to those affected by our mistakes
Examples of unacceptable behavior:
- Trolling, insulting or derogatory comments, and personal or political attacks
- Public or private harassment
- Publishing others’ private information without explicit permission
- Other conduct which could reasonably be considered inappropriate in a professional setting
Enforcement
Instances of abusive, harassing, or otherwise unacceptable behavior may be reported to the project team at conduct@evalctl.dev. All complaints will be reviewed and investigated promptly and fairly.
Changelog
v1.4.2 (2025-05-15)
- Fix: Plugin system now handles WASM modules correctly on ARM64
- Fix: evalctl watch no longer crashes on symlinked directories
- Fix: YAML config parsing handles quoted multi-line strings
- Chore: Updated Go dependencies to patch CVEs
v1.4.1 (2025-04-28)
- Fix: JSON output for evalctl compare now includes metric names
- Fix: Rate limiting no longer blocks concurrent requests incorrectly
- Improvement: Better error messages for expired API keys
v1.4.0 (2025-04-10)
- Feature: New evalctl plugin subcommand for managing plugins
- Feature: WASM-based plugins for cross-platform compatibility
- Feature: evalctl export --format csv for spreadsheet import
- Improvement: 40% faster startup time with lazy provider loading
- Docs: Added CI integration guide for GitLab and CircleCI