evalctl

v1.4.2 · MIT License

A fast, extensible CLI for running and evaluating LLM / AI model outputs from your terminal. Define prompt suites, compare models side by side, track metrics over time, and plug evaluations into your CI pipeline.

$ curl -fsSL https://evalctl.dev/install.sh | sh
# Installing evalctl v1.4.2...
# Detected linux/amd64
# Installed to /usr/local/bin/evalctl
$ evalctl run "What is the capital of France?" --model gpt-4o
→ Paris

Installation

evalctl is distributed as a single binary with zero runtime dependencies. Choose the method that works best for your environment.

macOS

# Homebrew (recommended)
brew install evalctl/evalctl/evalctl

# Or download directly
curl -fsSL https://evalctl.dev/install.sh | sh

Linux

# x86_64 / amd64
curl -fsSL https://evalctl.dev/install.sh | sh

# Via apt (Debian/Ubuntu)
echo "deb [trusted=yes] https://apt.evalctl.dev /" | sudo tee /etc/apt/sources.list.d/evalctl.list
sudo apt update && sudo apt install evalctl

npm

npm install -g evalctl

Go

go install github.com/example/evalctl/cmd/evalctl@latest

Verify the Installation

evalctl version
# evalctl v1.4.2 (linux/amd64)

evalctl --help
# A CLI for running and evaluating LLM model outputs
#
# Usage:
#   evalctl [command] [flags]
#
# Available Commands:
#   run         Run a single evaluation
#   watch       Watch files and re-run on changes
#   compare     Compare outputs across multiple models
#   export      Export evaluation results

Quick Start

Get up and running in under a minute. evalctl sends prompts to one or more LLMs and scores their outputs.

Your First Evaluation

evalctl run "What is the capital of France?" --model gpt-4o

This sends the prompt to GPT-4o and prints the response. Add --json for structured output:

evalctl run "Solve: 2x + 5 = 15" --model gpt-4o --json

Using a Config File

Create an evalctl.yaml file to define prompt suites and evaluation criteria:

# evalctl.yaml
models:
  - id: gpt-4o
    provider: openai
    temperature: 0.7
  - id: claude-3-opus
    provider: anthropic
    temperature: 0.5

prompts:
  - id: translation
    template: "Translate this to French: {{.input}}"
    cases:
      - input: "Hello, how are you?"
      - input: "The weather is nice today."

metrics:
  - name: exact_match
  - name: contains_keywords
    params:
      keywords: ["bonjour", "comment"]

evalctl run --config evalctl.yaml
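
To make the shape of a prompt suite concrete, the sketch below expands the translation template from the config above into one prompt per case. It is illustrative only: renderTemplate is a hypothetical helper that handles just the simple {{.name}} placeholder form shown here, and it is not evalctl's actual template engine.

```javascript
// Hypothetical helper: substitutes {{.name}} placeholders with values from a case.
// Only covers the simple placeholder form used in the config above.
function renderTemplate(template, values) {
  return template.replace(/\{\{\s*\.(\w+)\s*\}\}/g, (match, key) =>
    Object.prototype.hasOwnProperty.call(values, key) ? String(values[key]) : match
  );
}

const suite = {
  id: "translation",
  template: "Translate this to French: {{.input}}",
  cases: [
    { input: "Hello, how are you?" },
    { input: "The weather is nice today." },
  ],
};

// One rendered prompt per case; unknown placeholders are left untouched.
const prompts = suite.cases.map((c) => renderTemplate(suite.template, c));
console.log(prompts);
```

Leaving unknown placeholders in place (rather than replacing them with an empty string) makes a misspelled case key easy to spot in the rendered prompt.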

Viewing Results

Results are printed to stdout by default. Pass --output results.json to save them, or use the export command to generate reports.

Configuration

evalctl can be configured via a YAML config file, environment variables, or CLI flags. The precedence (highest to lowest) is: CLI flags, environment variables, config file, defaults.
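
The precedence rule can be pictured as a lookup that walks the sources from highest to lowest priority and returns the first value it finds. The following is a minimal sketch of that idea, not evalctl's implementation; resolveSetting and the flat source objects are hypothetical.

```javascript
// Hypothetical resolver: returns the first defined value, highest precedence first.
function resolveSetting(key, { flags = {}, env = {}, file = {}, defaults = {} }) {
  for (const source of [flags, env, file, defaults]) {
    if (source[key] !== undefined) return source[key];
  }
  return undefined;
}

const sources = {
  flags: {},                                  // nothing passed on the command line
  env: { model: "claude-3-opus" },            // value taken from the environment
  file: { model: "gpt-4o", timeout: "30s" },  // values from the config file
  defaults: { model: "gpt-4o", timeout: "30s", format: "table" },
};

console.log(resolveSetting("model", sources));   // claude-3-opus (environment beats config file)
console.log(resolveSetting("timeout", sources)); // 30s (falls through to the config file)
```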

Config File Location

evalctl looks for config files in the following order:

./evalctl.yaml or ./evalctl.yml in the current directory
~/.config/evalctl/config.yaml (user-level)
/etc/evalctl/config.yaml (system-wide)

# ~/.config/evalctl/config.yaml
defaults:
  model: gpt-4o
  temperature: 0.7
  max_tokens: 2048
  timeout: 30s

output:
  format: table
  color: auto

plugins:
  enabled:
    - custom_scorer
  paths:
    - ~/.config/evalctl/plugins

You can override any config value with an environment variable. For example, EVALCTL_DEFAULTS_MODEL=claude-3-opus changes the default model.
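
The documented variable names (EVALCTL_DEFAULTS_MODEL, EVALCTL_DEFAULTS_TEMPERATURE, EVALCTL_OUTPUT_FORMAT) suggest a pattern: the EVALCTL_ prefix followed by the config path in upper case, joined with underscores. The sketch below assumes that pattern holds for every key, which this guide does not confirm; envVarFor is a hypothetical helper.

```javascript
// Hypothetical helper: derives the environment variable name for a config path,
// assuming the EVALCTL_<SECTION>_<KEY> pattern seen in the documented variables.
function envVarFor(configPath) {
  return "EVALCTL_" + configPath.split(".").map((part) => part.toUpperCase()).join("_");
}

console.log(envVarFor("defaults.model")); // EVALCTL_DEFAULTS_MODEL
console.log(envVarFor("output.format"));  // EVALCTL_OUTPUT_FORMAT
```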

Commands Overview

evalctl provides four main commands for running, monitoring, comparing, and exporting evaluations. Every command respects the config file and supports the common flags listed below.

Global Flags

Flag	Type	Default	Description
`--config`	string	evalctl.yaml	Path to config file
`--log-level`	string	info	Log level: debug, info, warn, error
`--json`	bool		Output results as JSON
`--no-color`	bool		Disable colored output
`--timeout`	duration	30s	Request timeout per prompt

evalctl run

Execute a single evaluation against one or more models. This is the primary command for ad-hoc testing and development.

evalctl run <prompt> [flags]

# Basic usage
evalctl run "Write a haiku about Rust" --model gpt-4o

# With a system prompt
evalctl run "Sum: 1 + 1" --model gpt-4o --system "You are a math tutor. Respond concisely."

# Multiple prompts from a file
evalctl run --prompts prompts.txt --model gpt-4o

# Against multiple models
evalctl run "Explain quantum computing" --model gpt-4o,claude-3-opus

Flags

Flag	Type	Default	Description
`--model`	string	gpt-4o	Model(s) to evaluate, comma-separated
`--system`	string		System prompt for the model
`--temperature`	float	0.7	Sampling temperature
`--max-tokens`	int	2048	Maximum tokens in response
`--prompts`	string		Path to file with one prompt per line
`--output`	string		Path to write results (default: stdout)
`--format`	string	text	Output format: text, json, yaml, markdown

API keys are read from the environment. Set OPENAI_API_KEY, ANTHROPIC_API_KEY, or the provider-specific variable before running. See Environment Variables for details.

evalctl watch

Watch a directory or config file for changes and automatically re-run evaluations. Ideal for iterative prompt engineering during development.

evalctl watch [path] [flags]

# Watch current directory for changes
evalctl watch

# Watch a specific config file
evalctl watch --config evalctl.yaml

# Watch with notifications
evalctl watch --notify

# Custom poll interval
evalctl watch --interval 2s

Flags

Flag	Type	Default	Description
`--interval`	duration	1s	File system poll interval
`--notify`	bool	false	Send desktop notifications on completion
`--format`	string	table	Output format for each run
`--output`	string		Accumulate results to file

The watch command uses file system events where available (inotify on Linux, FSEvents on macOS) and falls back to polling. Pass --interval 500ms for faster feedback during active editing.

evalctl compare

Run the same prompt against multiple models and compare their outputs side by side. Results are scored using configurable metrics.

evalctl compare <prompt> --models <list> [flags]

# Compare two models
evalctl compare "Write a Python Fibonacci function" --models gpt-4o,claude-3-opus

# Compare with custom metrics
evalctl compare "Translate: Good morning" --models gpt-4o,llama-3 \
  --metrics exact_match,contains_keywords

# Compare from config
evalctl compare --config evalctl.yaml --models gpt-4o,claude-3-opus,gemini-pro

Flags

Flag	Type	Default	Description
`--models`	string		Comma-separated list of models
`--metrics`	string	exact_match	Comma-separated metric names
`--output`	string		Path to write comparison report
`--format`	string	table	Output format: table, json, markdown, html

Example Output

$ evalctl compare "Explain polymorphism" --models gpt-4o,claude-3-opus --metrics exact_match,contains_keywords

┌──────────────────┬──────────────────┬──────────────────┐
│ Metric           │ gpt-4o           │ claude-3-opus    │
├──────────────────┼──────────────────┼──────────────────┤
│ exact_match      │ 0.72             │ 0.78             │
│ contains_keywords│ 0.91             │ 0.89             │
│ latency (ms)     │ 1,240            │ 1,870            │
│ tokens           │ 342              │ 298              │
└──────────────────┴──────────────────┴──────────────────┘

evalctl export

Export evaluation results to various formats for reporting, dashboards, or archival. Works with previous run results stored on disk or in the evalctl history database.

evalctl export <output-file> [flags]

# Export last run to Markdown
evalctl export report.md --format markdown

# Export to JSON
evalctl export results.json --format json

# Filter by date range
evalctl export weekly.md --from 2024-01-01 --to 2024-01-07

# Filter by model
evalctl export gpt4-results.json --filter model=gpt-4o

Flags

Flag	Type	Default	Description
`--format`	string	json	Output format: json, yaml, markdown, html, csv
`--from`	string		Start date (ISO 8601 or RFC 3339)
`--to`	string		End date
`--filter`	string		Key=value filter (e.g. model=gpt-4o)
`--pretty`	bool	false	Pretty-print output
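
The date-range and key=value filters combine as a logical AND. The sketch below illustrates that selection over an in-memory list. The record shape (model, timestamp, score), the selectRecords helper, and the choice to treat a date-only --to value as covering the whole end day are all assumptions; this guide does not describe the stored history format.

```javascript
// Hypothetical record shape; the real history format is not described in this guide.
const history = [
  { model: "gpt-4o", timestamp: "2024-01-02T10:00:00Z", score: 0.82 },
  { model: "claude-3-opus", timestamp: "2024-01-03T09:30:00Z", score: 0.88 },
  { model: "gpt-4o", timestamp: "2024-01-09T14:00:00Z", score: 0.79 },
];

// Accepts a date (2024-01-01) or a full timestamp (2024-01-01T12:00:00Z).
function toMillis(value, endOfDay) {
  if (value.includes("T")) return Date.parse(value);
  return Date.parse(value + (endOfDay ? "T23:59:59.999Z" : "T00:00:00Z"));
}

// Keeps records inside [from, to] that also match the key=value filter.
function selectRecords(records, { from, to, filter } = {}) {
  const start = from ? toMillis(from, false) : -Infinity;
  const end = to ? toMillis(to, true) : Infinity;
  const eq = filter ? filter.indexOf("=") : -1;
  const key = eq > 0 ? filter.slice(0, eq) : undefined;
  const value = eq > 0 ? filter.slice(eq + 1) : undefined;

  return records.filter((record) => {
    const t = Date.parse(record.timestamp);
    const inRange = t >= start && t <= end;
    const matches = key === undefined || String(record[key]) === value;
    return inRange && matches;
  });
}

console.log(selectRecords(history, { from: "2024-01-01", to: "2024-01-07", filter: "model=gpt-4o" }));
```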

CI Integration

evalctl is designed to be a first-class citizen in your CI pipeline. Use it to automatically evaluate model outputs on every commit, PR, or deployment.

GitHub Actions

# .github/workflows/eval.yml
name: Evaluate Models
on: [pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install evalctl
        run: curl -fsSL https://evalctl.dev/install.sh | sh
      - name: Run evaluations
        run: evalctl run --config evalctl.yaml --json --output results.json
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      - name: Check quality gate
        run: evalctl gate --threshold 0.8 --results results.json

Quality Gates

Use evalctl gate to enforce minimum scores. The command exits with code 2 (see Exit Codes) when scores fall below the threshold, failing the CI step:

# Fail if overall score is below 0.75
evalctl gate --threshold 0.75 --results results.json

# Fail if any individual metric is below 0.6
evalctl gate --min-per-metric 0.6 --results results.json

evalctl gate requires no separate config - it reads the same config file as evalctl run. Set ci.threshold in your evalctl.yaml for project defaults.
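
The two gate modes differ in what they compare against the limit: --threshold looks at one overall score, while --min-per-metric checks every metric on its own. The sketch below shows that distinction over a flat object of metric scores. It is not evalctl's implementation: evaluateGate is hypothetical, and treating the overall score as the arithmetic mean of the metric scores is an assumption.

```javascript
// Hypothetical gate check over a flat { metricName: score } object.
// Assumption: the overall score is the arithmetic mean of the metric scores.
function evaluateGate(scores, { threshold, minPerMetric } = {}) {
  const entries = Object.entries(scores);
  if (entries.length === 0) throw new Error("no metric scores to evaluate");

  const overall = entries.reduce((sum, [, s]) => sum + s, 0) / entries.length;
  const failures = [];

  if (threshold !== undefined && overall < threshold) {
    failures.push(`overall ${overall.toFixed(2)} < ${threshold}`);
  }
  if (minPerMetric !== undefined) {
    for (const [name, s] of entries) {
      if (s < minPerMetric) failures.push(`${name} ${s} < ${minPerMetric}`);
    }
  }
  return { passed: failures.length === 0, overall, failures };
}

const scores = { exact_match: 0.72, contains_keywords: 0.91 };
console.log(evaluateGate(scores, { threshold: 0.75 }));   // passes on the overall score
console.log(evaluateGate(scores, { minPerMetric: 0.8 })); // fails on exact_match alone
```

A run can therefore pass an overall threshold while still failing a per-metric floor, which is why the two options are worth setting together for suites with uneven metrics.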

Custom Metrics

evalctl ships with a set of built-in metrics, but you can define your own using JavaScript or Lua. Custom metrics live alongside your evalctl.yaml config and are loaded automatically.

Built-in Metrics

Metric	Type	Parameters	Description
`exact_match`	builtin		Response exactly matches expected output
`contains_keywords`	builtin		Response contains required keywords
`json_valid`	builtin		Response is valid JSON
`length`	builtin	min/max	Response length within bounds
`regex`	builtin		Response matches a regular expression
`semantic_similarity`	builtin		Cosine similarity with reference (requires embedding model)

JavaScript Metric

// metrics/sentiment.js
// Custom metric: classify response sentiment

module.exports = {
  name: "sentiment",
  description: "Classifies response sentiment (positive/negative/neutral)",
  async evaluate({ response, expected, context }) {
    const positiveWords = ["good", "great", "excellent", "helpful", "correct"];
    const negativeWords = ["bad", "wrong", "incorrect", "poor", "unhelpful"];

    // Match whole words so "incorrect" is not also counted as "correct".
    const words = new Set(response.toLowerCase().match(/[a-z]+/g) || []);
    const pos = positiveWords.filter((w) => words.has(w)).length;
    const neg = negativeWords.filter((w) => words.has(w)).length;

    if (pos > neg) return { score: 1.0, label: "positive" };
    if (neg > pos) return { score: 0.0, label: "negative" };
    return { score: 0.5, label: "neutral" };
  },
};

# Register the metric in evalctl.yaml
metrics:
  - name: sentiment
    script: metrics/sentiment.js

  - name: exact_match

Metrics run in a sandboxed environment with access to response, expected, context, and config. The return value must include a score between 0 and 1.

Plugin System

Extend evalctl with plugins for custom output formats, model providers, data sources, and notification channels. Plugins are standalone executables or WASM modules discovered in configured paths.

How Plugins Work

evalctl communicates with plugins via stdin/stdout using a JSON-RPC protocol. Every plugin implements a capabilities method and one or more action methods:

Capability	Type	Description
`provider`	plugin	Add a new model provider (e.g. groq, together)
`formatter`	plugin	Custom output format
`notifier`	plugin	Send results to Slack, Discord, email, etc.
`datasource`	plugin	Load prompts from external sources (DB, S3, etc.)
`scorer`	plugin	Custom evaluation metrics (alternative to JS/Lua)
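
To give a feel for the request/response exchange, the sketch below handles requests for a notifier-style plugin. Much of it is assumed rather than documented here: JSON-RPC 2.0 framing with one JSON message per line, the notify action method, and the shape of every result object are hypothetical. Only the existence of a capabilities method comes from the description above.

```javascript
// Sketch of a notifier plugin's request handling.
// Assumptions (not confirmed by this guide): JSON-RPC 2.0 framing with one JSON
// message per line, a "notify" action method, and the shape of each result.
function handleLine(line) {
  const fail = (id, code, message) =>
    JSON.stringify({ jsonrpc: "2.0", id, error: { code, message } });

  let request;
  try {
    request = JSON.parse(line);
  } catch (err) {
    return fail(null, -32700, "Parse error");
  }
  if (request === null || typeof request !== "object") {
    return fail(null, -32600, "Invalid Request");
  }

  const reply = (result) => JSON.stringify({ jsonrpc: "2.0", id: request.id, result });

  switch (request.method) {
    case "capabilities":
      return reply({ type: "notifier", methods: ["notify"] });
    case "notify":
      // A real plugin would deliver request.params.message somewhere.
      return reply({ delivered: true });
    default:
      return fail(request.id, -32601, "Method not found");
  }
}

// Wires the handler to a pair of streams: one response line per request line.
function serve(input, output) {
  const readline = require("readline");
  const rl = readline.createInterface({ input });
  rl.on("line", (line) => {
    if (line.trim() !== "") output.write(handleLine(line) + "\n");
  });
  return rl;
}
```

To run the sketch as a standalone executable, call serve(process.stdin, process.stdout) at the end of the file. The error codes are the standard JSON-RPC 2.0 values for parse errors, invalid requests, and unknown methods.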

Configuration

# evalctl.yaml
plugins:
  paths:
    - ~/.config/evalctl/plugins
    - ./plugins
  enabled:
    - provider-groq
    - notifier-slack
    - formatter-html

  # Per-plugin config
  config:
    notifier-slack:
      webhook: "https://hooks.slack.com/services/..."
      channel: "#eval-results"
    provider-groq:
      api_key_env: GROQ_API_KEY

# Install a plugin
evalctl plugin install provider-groq

# List installed plugins
evalctl plugin list

# Test a plugin
evalctl plugin test notifier-slack "Evaluation complete"

Config File Schema

The evalctl.yaml config file supports the following top-level keys:

Key	Type	Description
`defaults`	object	Default settings for all evaluations
`models`	array	List of model definitions
`prompts`	array	Prompt suite definitions
`metrics`	array	Metric configurations
`output`	object	Output format and destination
`plugins`	object	Plugin system configuration
`ci`	object	CI-specific settings

Defaults Schema

# Full defaults structure
defaults:
  model: string              # Default model ID
  temperature: float         # 0.0 - 2.0
  max_tokens: integer        # Max tokens per response
  top_p: float               # Nucleus sampling
  frequency_penalty: float   # -2.0 - 2.0
  presence_penalty: float    # -2.0 - 2.0
  stop: string | [string]    # Stop sequences
  timeout: duration          # Per-request timeout
  retries: integer           # Retry count on failure (default: 3)
  retry_delay: duration      # Delay between retries (default: 1s)

Model Schema

# Model definition
models:
  - id: string               # Unique identifier
    provider: string         # Provider name (openai, anthropic, google, custom)
    model_name: string       # Actual model name sent to API
    api_key_env: string      # Env var for API key (optional)
    base_url: string         # Custom API endpoint (optional)
    temperature: float       # Per-model override (optional)
    max_tokens: integer      # Per-model override (optional)
    metadata: {}             # Arbitrary metadata (optional)

Environment Variables

evalctl reads configuration from the environment variables below. They take precedence over config file values and built-in defaults, and CLI flags override them.

Variable	Type	Default	Description
`OPENAI_API_KEY`	string		API key for OpenAI models
`ANTHROPIC_API_KEY`	string		API key for Anthropic models
`GOOGLE_API_KEY`	string		API key for Google / Gemini models
`EVALCTL_DEFAULTS_MODEL`	string		Override the default model
`EVALCTL_DEFAULTS_TEMPERATURE`	float		Override default temperature
`EVALCTL_OUTPUT_FORMAT`	string		Override output format
`EVALCTL_LOG_LEVEL`	string	info	Set log level
`EVALCTL_CONFIG`	string		Path to config file
`EVALCTL_PLUGIN_DIR`	string		Plugin directory path
`EVALCTL_CACHE_DIR`	string	~/.cache/evalctl	Cache directory
`EVALCTL_HOME`	string	~/.config/evalctl	evalctl home directory
`HTTP_PROXY`	string		HTTP proxy for API requests
`HTTPS_PROXY`	string		HTTPS proxy for API requests
`NO_PROXY`	string		Comma-separated no-proxy domains

API keys set via environment variables are read at startup. If you rotate keys, restart evalctl or re-source your shell profile. Never commit API keys to version control.

Exit Codes

evalctl uses the following exit codes for programmatic use in scripts and CI pipelines:

Code	Type	Description
`0`	code	Success - all evaluations completed
`1`	code	General error - invalid flags, config, or runtime failure
`2`	code	Quality gate failure - score below threshold
`3`	code	Model error - API returned an error or timeout
`4`	code	Config error - invalid YAML or missing required fields
`5`	code	Plugin error - plugin crashed or returned invalid data
`6`	code	File error - input/output file not found or unwritable

# Use in a shell script
evalctl run --config evalctl.yaml --json --output results.json
case $? in
  0)  echo "All evaluations passed" ;;
  2)  echo "Quality gate failed" && exit 1 ;;
  3)  echo "Model API error" && exit 1 ;;
  *)  echo "Unknown error" && exit 1 ;;
esac

Exit code 2 is specifically designed for CI gates. Your pipeline can use evalctl gate --threshold 0.8 to convert a low-scoring run into a build failure.
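
For pipelines driven from Node rather than a shell script, the same exit codes can be mapped to a decision. The sketch below is one possible policy, not part of evalctl: classifyExit is hypothetical, and treating only model errors as retryable is an assumption about what is worth retrying.

```javascript
// Exit codes as listed in the table above.
const EXIT_CODES = {
  0: "success",
  1: "general error",
  2: "quality gate failure",
  3: "model error",
  4: "config error",
  5: "plugin error",
  6: "file error",
};

// Hypothetical policy: only model errors (API failures, timeouts) are retried.
function classifyExit(code) {
  const reason = EXIT_CODES[code] || "unknown error";
  return { ok: code === 0, retryable: code === 3, reason };
}

console.log(classifyExit(2)); // { ok: false, retryable: false, reason: 'quality gate failure' }
console.log(classifyExit(3)); // { ok: false, retryable: true, reason: 'model error' }
```

In a wrapper script, the code would typically come from the status field of the object returned by child_process.spawnSync.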

How to Contribute

evalctl is open source and we welcome contributions. Whether you are fixing a bug, adding a feature, or improving documentation, here is how to get started.

Development Setup

# Clone the repository
git clone https://github.com/example/evalctl
cd evalctl

# Build from source
go build ./cmd/evalctl

# Run tests
go test ./...

# Run linting
golangci-lint run ./...

Contribution Process

Search existing issues and pull requests to avoid duplication
Open a new issue to discuss significant changes before implementing
Fork the repository and create a feature branch from main
Write tests for your changes - coverage should not decrease
Ensure all tests pass and the linter is clean
Open a pull request with a clear description of the changes

Standards

Follow Go idioms and best practices (use gofmt, go vet)
Write descriptive commit messages in the imperative mood
Keep pull requests focused - one feature or fix per PR
Update documentation and CHANGELOG.md for user-facing changes
Add test cases for both success and failure paths

Code of Conduct

evalctl follows the Contributor Covenant code of conduct. We are committed to providing a welcoming and inclusive environment for everyone.

Our Pledge

We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.

Our Standards

Examples of behavior that contributes to a positive environment:

Demonstrating empathy and kindness toward other people
Being respectful of differing opinions, viewpoints, and experiences
Giving and gracefully accepting constructive feedback
Accepting responsibility and apologizing to those affected by our mistakes

Examples of unacceptable behavior:

Trolling, insulting or derogatory comments, and personal or political attacks
Public or private harassment
Publishing others’ private information without explicit permission
Other conduct which could reasonably be considered inappropriate in a professional setting

Changelog

v1.4.2

Fix: Plugin system now handles WASM modules correctly on ARM64
Fix: evalctl watch no longer crashes on symlinked directories
Fix: YAML config parsing handles quoted multi-line strings
Chore: Updated Go dependencies to patch CVEs

v1.4.1 (2025-04-28)

Fix: JSON output for evalctl compare now includes metric names
Fix: Rate limiting no longer blocks concurrent requests incorrectly
Improvement: Better error messages for expired API keys

v1.4.0 (2025-04-10)

Feature: New evalctl plugin subcommand for managing plugins
Feature: WASM-based plugins for cross-platform compatibility
Feature: evalctl export --format csv for spreadsheet import
Improvement: 40% faster startup time with lazy provider loading
Docs: Added CI integration guide for GitLab and CircleCI