evalctl

v1.4.2 · MIT License

A fast, extensible CLI for running and evaluating LLM / AI model outputs from your terminal. Define prompt suites, compare models side by side, track metrics over time, and plug evaluations into your CI pipeline.

$ curl -fsSL https://evalctl.dev/install.sh | sh
# Installing evalctl v1.4.2...
# Detected linux/amd64
# Installed to /usr/local/bin/evalctl
$ evalctl run "What is the capital of France?" --model gpt-4o
Paris

Installation

evalctl is distributed as a single binary with zero runtime dependencies. Choose the method that works best for your environment.

macOS

# Homebrew (recommended)
brew install evalctl/evalctl/evalctl

# Or download directly
curl -fsSL https://evalctl.dev/install.sh | sh

Linux

# x86_64 / amd64
curl -fsSL https://evalctl.dev/install.sh | sh

# Via apt (Debian/Ubuntu)
echo "deb [trusted=yes] https://apt.evalctl.dev /" | sudo tee /etc/apt/sources.list.d/evalctl.list
sudo apt update && sudo apt install evalctl

npm

npm install -g evalctl

Go

go install github.com/example/evalctl/cmd/evalctl@latest

Verify the Installation

evalctl version
# evalctl v1.4.2 (linux/amd64)

evalctl --help
# A CLI for running and evaluating LLM model outputs
#
# Usage:
#   evalctl [command] [flags]
#
# Available Commands:
#   run         Run a single evaluation
#   watch       Watch files and re-run on changes
#   compare     Compare outputs across multiple models
#   export      Export evaluation results

Quick Start

Get up and running in under a minute. evalctl sends prompts to LLMs and scores the outputs.

Your First Evaluation

evalctl run "What is the capital of France?" --model gpt-4o

This sends the prompt to GPT-4o and prints the response. Add --json for structured output:

evalctl run "Solve: 2x + 5 = 15" --model gpt-4o --json

Using a Config File

Create an evalctl.yaml file to define prompt suites and evaluation criteria:

# evalctl.yaml
models:
  - id: gpt-4o
    provider: openai
    temperature: 0.7
  - id: claude-3-opus
    provider: anthropic
    temperature: 0.5

prompts:
  - id: translation
    template: "Translate this to French: {{.input}}"
    cases:
      - input: "Hello, how are you?"
      - input: "The weather is nice today."

metrics:
  - name: exact_match
  - name: contains_keywords
    params:
      keywords: ["bonjour", "comment"]

evalctl run --config evalctl.yaml
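
To make the suite semantics concrete, here is a minimal JavaScript sketch of how a template and its cases could expand into concrete prompts. It assumes only simple {{.field}} placeholders; expandPrompts is a hypothetical helper for illustration, not part of evalctl, and the real template engine may support more syntax.

```javascript
// Hypothetical helper: expands one prompt suite entry into concrete prompts.
// Assumes only simple {{.field}} placeholders, as used in the config above.
function expandPrompts(suite) {
  return suite.cases.map((testCase) =>
    suite.template.replace(/\{\{\s*\.(\w+)\s*\}\}/g, (match, field) => {
      if (!(field in testCase)) {
        throw new Error(`case is missing field "${field}" used by template`);
      }
      return String(testCase[field]);
    })
  );
}

const translation = {
  id: "translation",
  template: "Translate this to French: {{.input}}",
  cases: [
    { input: "Hello, how are you?" },
    { input: "The weather is nice today." },
  ],
};

// One prompt per case; each would then be sent to every configured model.
console.log(expandPrompts(translation));
```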

Viewing Results

Results are printed to stdout by default. Pass --output results.json to save them, or use the export command to generate reports.

Configuration

evalctl can be configured via a YAML config file, environment variables, or CLI flags. The precedence (highest to lowest) is: CLI flags, environment variables, config file, defaults.
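
The precedence rule can be pictured as a layered merge in which later sources win. The following is a rough sketch, not evalctl's actual implementation; resolveSettings and the flat key names are assumptions made for illustration.

```javascript
// Hypothetical sketch of the precedence rule: later sources win.
// Only keys that are actually set (not undefined) override earlier ones.
function resolveSettings(defaults, configFile, env, flags) {
  const resolved = {};
  for (const source of [defaults, configFile, env, flags]) {
    for (const [key, value] of Object.entries(source)) {
      if (value !== undefined) resolved[key] = value;
    }
  }
  return resolved;
}

const settings = resolveSettings(
  { model: "gpt-4o", temperature: 0.7, timeout: "30s" }, // built-in defaults
  { temperature: 0.5 },                                   // config file
  { model: "claude-3-opus" },                             // environment variables
  { timeout: "10s" }                                      // CLI flags
);
console.log(settings);
// { model: 'claude-3-opus', temperature: 0.5, timeout: '10s' }
```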

Config File Location

evalctl looks for config files in the following order:

  1. ./evalctl.yaml or ./evalctl.yml in the current directory
  2. ~/.config/evalctl/config.yaml (user-level)
  3. /etc/evalctl/config.yaml (system-wide)

# ~/.config/evalctl/config.yaml
defaults:
  model: gpt-4o
  temperature: 0.7
  max_tokens: 2048
  timeout: 30s

output:
  format: table
  color: auto

plugins:
  enabled:
    - custom_scorer
  paths:
    - ~/.config/evalctl/plugins

You can override any config value with an environment variable. For example, EVALCTL_DEFAULTS_MODEL=claude-3-opus changes the default model.
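
One plausible way to read the naming scheme is that each config path maps to EVALCTL_ plus the upper-cased path joined with underscores. The sketch below walks the config object rather than parsing variable names, which sidesteps ambiguity for keys such as max_tokens. applyEnvOverrides is a hypothetical helper, and the exact mapping rules are an assumption based on the single example above.

```javascript
// Hypothetical sketch: derive the variable name for each leaf of the config
// and apply any matching value from the environment.
// Note that environment values are always strings.
function applyEnvOverrides(config, env, prefix = "EVALCTL") {
  const result = {};
  for (const [key, value] of Object.entries(config)) {
    const name = `${prefix}_${key.toUpperCase()}`;
    if (value !== null && typeof value === "object" && !Array.isArray(value)) {
      result[key] = applyEnvOverrides(value, env, name); // recurse into sections
    } else {
      result[key] = env[name] !== undefined ? env[name] : value;
    }
  }
  return result;
}

const fileConfig = { defaults: { model: "gpt-4o", max_tokens: 2048 } };
const merged = applyEnvOverrides(fileConfig, {
  EVALCTL_DEFAULTS_MODEL: "claude-3-opus",
});
console.log(merged.defaults.model); // claude-3-opus
```

In a real process the second argument would be process.env.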

Commands Overview

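evalctl provides four main commands for running, monitoring, comparing, and exporting evaluations. Two further commands, evalctl gate and evalctl plugin, are covered under CI Integration and Plugin System. Every command respects the config file and supports the global flags listed below.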

Global Flags

Flag         Type      Default       Description
--config     string    evalctl.yaml  Path to config file
--log-level  string    info          Log level: debug, info, warn, error
--json       bool      -             Output results as JSON
--no-color   bool      -             Disable colored output
--timeout    duration  30s           Request timeout per prompt

evalctl run

Execute a single evaluation against one or more models. This is the primary command for ad-hoc testing and development.

evalctl run <prompt> [flags]

# Basic usage
evalctl run "Write a haiku about Rust" --model gpt-4o

# With a system prompt
evalctl run "Sum: 1 + 1" --model gpt-4o --system "You are a math tutor. Respond concisely."

# Multiple prompts from a file
evalctl run --prompts prompts.txt --model gpt-4o

# Against multiple models
evalctl run "Explain quantum computing" --model gpt-4o,claude-3-opus

Flags

Flag           Type    Default  Description
--model        string  gpt-4o   Model(s) to evaluate, comma-separated
--system       string  -        System prompt for the model
--temperature  float   0.7      Sampling temperature
--max-tokens   int     2048     Maximum tokens in response
--prompts      string  -        Path to file with one prompt per line
--output       string  -        Path to write results (default: stdout)
--format       string  text     Output format: text, json, yaml, markdown

API keys are read from the environment. Set OPENAI_API_KEY, ANTHROPIC_API_KEY, or the provider-specific variable before running. See Environment Variables for details.

evalctl watch

Watch a directory or config file for changes and automatically re-run evaluations. Ideal for iterative prompt engineering.

evalctl watch [path] [flags]

# Watch current directory for changes
evalctl watch

# Watch a specific config file
evalctl watch --config evalctl.yaml

# Watch with notifications
evalctl watch --notify

# Custom poll interval
evalctl watch --interval 2s

Flags

Flag        Type      Default  Description
--interval  duration  1s       File system poll interval
--notify    bool      false    Send desktop notifications on completion
--format    string    table    Output format for each run
--output    string    -        Accumulate results to file

The watch command uses file system events where available (inotify on Linux, FSEvents on macOS) and falls back to polling. Pass --interval 500ms for faster feedback during active editing.
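
The polling fallback boils down to comparing file snapshots between intervals. Here is a minimal sketch, assuming a snapshot is a map from path to modification time; diffSnapshots is a hypothetical helper and not evalctl's implementation.

```javascript
// Hypothetical sketch of the polling fallback: compare two snapshots of
// file modification times and report which paths need a re-run.
function diffSnapshots(previous, current) {
  const changed = [];
  for (const [path, mtime] of Object.entries(current)) {
    if (previous[path] !== mtime) changed.push(path); // new or modified
  }
  for (const path of Object.keys(previous)) {
    if (!(path in current)) changed.push(path); // deleted
  }
  return changed.sort();
}

const before = { "evalctl.yaml": 1000, "prompts.txt": 1000 };
const after = {
  "evalctl.yaml": 1000,
  "prompts.txt": 1500,
  "metrics/sentiment.js": 1500,
};
console.log(diffSnapshots(before, after));
// [ 'metrics/sentiment.js', 'prompts.txt' ]
```

A polling loop would rebuild the snapshot once per interval (for example from fs.statSync(path).mtimeMs) and trigger a re-run whenever the returned list is non-empty.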

evalctl compare

Run the same prompt against multiple models and compare their outputs side by side. Results are scored using configurable metrics.

evalctl compare <prompt> --models <list> [flags]

# Compare two models
evalctl compare "Write a Python Fibonacci function" --models gpt-4o,claude-3-opus

# Compare with custom metrics
evalctl compare "Translate: Good morning" --models gpt-4o,llama-3 \
  --metrics exact_match,contains_keywords

# Compare from config
evalctl compare --config evalctl.yaml --models gpt-4o,claude-3-opus,gemini-pro

Flags

Flag       Type    Default      Description
--models   string  -            Comma-separated list of models
--metrics  string  exact_match  Comma-separated metric names
--output   string  -            Path to write comparison report
--format   string  table        Output format: table, json, markdown, html

Example Output

$ evalctl compare "Explain polymorphism" --models gpt-4o,claude-3-opus --metrics exact_match,contains_keywords

┌──────────────────┬──────────────────┬──────────────────┐
│ Metric           │ gpt-4o           │ claude-3-opus    │
├──────────────────┼──────────────────┼──────────────────┤
│ exact_match      │ 0.72             │ 0.78             │
│ contains_keywords│ 0.91             │ 0.89             │
│ latency (ms)     │ 1,240            │ 1,870            │
│ tokens           │ 342              │ 298              │
└──────────────────┴──────────────────┴──────────────────┘

evalctl export

Export evaluation results to various formats for reporting, dashboards, or archival. Works with previous run results stored on disk or in the evalctl history database.

evalctl export <output-file> [flags]

# Export last run to Markdown
evalctl export report.md --format markdown

# Export to JSON
evalctl export results.json --format json

# Filter by date range
evalctl export weekly.md --format markdown --from 2024-01-01 --to 2024-01-07

# Filter by model
evalctl export gpt4-results.json --filter model=gpt-4o

Flags

Flag      Type    Default  Description
--format  string  json     Output format: json, yaml, markdown, html, csv
--from    string  -        Start date (ISO 8601 or RFC 3339)
--to      string  -        End date
--filter  string  -        Key=value filter (e.g. model=gpt-4o)
--pretty  bool    false    Pretty-print output
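
The date-range and key=value filters can be thought of as a simple predicate over stored results. The sketch below assumes a hypothetical record shape ({ model, timestamp, score }) and treats a date-only bound as midnight UTC; how evalctl stores results, and whether --to includes the whole end day, is not specified here.

```javascript
// Hypothetical record shape: { model, timestamp (ISO 8601 string), score }.
function filterResults(records, { from, to, filter } = {}) {
  const start = from ? Date.parse(from) : -Infinity;
  const end = to ? Date.parse(to) : Infinity;
  const [key, value] = filter ? filter.split("=", 2) : [];
  return records.filter((record) => {
    const time = Date.parse(record.timestamp);
    if (Number.isNaN(time)) return false; // skip records without a valid date
    if (time < start || time > end) return false;
    return key === undefined || String(record[key]) === value;
  });
}

const records = [
  { model: "gpt-4o", timestamp: "2024-01-02T10:00:00Z", score: 0.8 },
  { model: "claude-3-opus", timestamp: "2024-01-03T10:00:00Z", score: 0.9 },
  { model: "gpt-4o", timestamp: "2024-02-01T10:00:00Z", score: 0.7 },
];

// Mirrors: --from 2024-01-01 --to 2024-01-07 --filter model=gpt-4o
console.log(
  filterResults(records, { from: "2024-01-01", to: "2024-01-07", filter: "model=gpt-4o" })
);
```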

CI Integration

evalctl is designed to be a first-class citizen in your CI pipeline. Use it to automatically evaluate model outputs on every commit, PR, or deployment.

GitHub Actions

# .github/workflows/eval.yml
name: Evaluate Models
on: [pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install evalctl
        run: curl -fsSL https://evalctl.dev/install.sh | sh
      - name: Run evaluations
        run: evalctl run --config evalctl.yaml --json --output results.json
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      - name: Check quality gate
        run: evalctl gate --threshold 0.8 --results results.json

Quality Gates

Use evalctl gate to enforce minimum scores. The command exits with code 2 (quality gate failure, see Exit Codes) when scores fall below the threshold, which fails the CI step:

# Fail if overall score is below 0.75
evalctl gate --threshold 0.75 --results results.json

# Fail if any individual metric is below 0.6
evalctl gate --min-per-metric 0.6 --results results.json

evalctl gate requires no separate config; it reads the same config file as evalctl run. Set ci.threshold in your evalctl.yaml for project defaults.
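
Conceptually, the gate compares scores against the two thresholds shown above. The following sketch assumes a hypothetical results shape ({ metrics: { name: score } }) and assumes the overall score is the unweighted mean of the metric scores; neither is confirmed by this document.

```javascript
// Hypothetical results shape: { metrics: { <name>: <score 0..1> } }.
// "Overall score" is assumed to be the unweighted mean of the metric scores.
function checkGate(results, { threshold, minPerMetric } = {}) {
  const scores = Object.values(results.metrics);
  if (scores.length === 0) throw new Error("no metric scores to evaluate");
  const overall = scores.reduce((sum, s) => sum + s, 0) / scores.length;
  const failures = [];
  if (threshold !== undefined && overall < threshold) {
    failures.push(`overall ${overall.toFixed(2)} < ${threshold}`);
  }
  if (minPerMetric !== undefined) {
    for (const [name, score] of Object.entries(results.metrics)) {
      if (score < minPerMetric) failures.push(`${name} ${score} < ${minPerMetric}`);
    }
  }
  return { overall, passed: failures.length === 0, failures };
}

const gate = checkGate(
  { metrics: { exact_match: 0.72, contains_keywords: 0.91 } },
  { threshold: 0.8, minPerMetric: 0.75 }
);
console.log(gate.passed, gate.failures);
```

Here the overall mean clears the 0.8 threshold, but exact_match falls below the per-metric minimum, so the gate fails. A CLI wrapper would exit non-zero whenever passed is false.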

Custom Metrics

evalctl ships with a set of built-in metrics, but you can define your own using JavaScript or Lua. Custom metrics live alongside your evalctl.yaml config and are loaded automatically.

Built-in Metrics

Metric               Type     Params   Description
exact_match          builtin  -        Response exactly matches expected output
contains_keywords    builtin  -        Response contains required keywords
json_valid           builtin  -        Response is valid JSON
length               builtin  min/max  Response length within bounds
regex                builtin  -        Response matches a regular expression
semantic_similarity  builtin  -        Cosine similarity with reference (requires embedding model)

JavaScript Metric

// metrics/sentiment.js
// Custom metric: classify response sentiment

module.exports = {
  name: "sentiment",
  description: "Classifies response sentiment (positive/negative/neutral)",
  async evaluate({ response, expected, context }) {
    const positiveWords = ["good", "great", "excellent", "helpful", "correct"];
    const negativeWords = ["bad", "wrong", "incorrect", "poor", "unhelpful"];

    const lower = response.toLowerCase();
    const pos = positiveWords.filter((w) => lower.includes(w)).length;
    const neg = negativeWords.filter((w) => lower.includes(w)).length;

    if (pos > neg) return { score: 1.0, label: "positive" };
    if (neg > pos) return { score: 0.0, label: "negative" };
    return { score: 0.5, label: "neutral" };
  },
};

# Register the metric in evalctl.yaml
metrics:
  - name: sentiment
    script: metrics/sentiment.js

  - name: exact_match

Metrics run in a sandboxed environment with access to response, expected, context, and config. The return value must include a score between 0 and 1.
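
As a further illustration of the return contract, here is a sketch of a second metric together with a small check that a result carries a score between 0 and 1. The word_budget metric, the max_words parameter, and the idea that config arrives in the same argument object as response are all assumptions made for this example.

```javascript
// metrics/word_budget.js (hypothetical example metric)
// Scores 1.0 when the response stays within a word budget and decays
// linearly to 0.0 at twice the budget.
const wordBudget = {
  name: "word_budget",
  description: "Rewards responses that stay within a word budget",
  async evaluate({ response, config }) {
    const budget = (config && config.max_words) || 50; // assumed param name
    const words = response.trim().split(/\s+/).filter(Boolean).length;
    if (words <= budget) return { score: 1.0, label: "within_budget" };
    const overshoot = (words - budget) / budget;
    return { score: Math.max(0, 1 - overshoot), label: "over_budget" };
  },
};

// Checks the documented contract: a numeric score between 0 and 1.
function isValidResult(result) {
  return (
    typeof result.score === "number" && result.score >= 0 && result.score <= 1
  );
}

module.exports = wordBudget;
```

Running isValidResult over a metric's output in a unit test is a cheap way to catch scores that drift outside the expected range before the metric is registered.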

Plugin System

Extend evalctl with plugins for custom output formats, model providers, data sources, and notification channels. Plugins are standalone executables or WASM modules discovered in configured paths.

How Plugins Work

evalctl communicates with plugins via stdin/stdout using a JSON-RPC protocol. Every plugin implements a capabilities method and one or more action methods. The following plugin types are supported:

Name        Type    Description
provider    plugin  Add a new model provider (e.g. groq, together)
formatter   plugin  Custom output format
notifier    plugin  Send results to Slack, Discord, email, etc.
datasource  plugin  Load prompts from external sources (DB, S3, etc.)
scorer      plugin  Custom evaluation metrics (alternative to JS/Lua)
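
The stdin/stdout exchange described above might look roughly like the following notifier skeleton. Only the capabilities method name comes from this document; the notify method, the result shapes, and newline-delimited framing are assumptions, while the envelope and the -32601 error code follow JSON-RPC 2.0.

```javascript
// Hypothetical notifier plugin. "capabilities" comes from the text above;
// the "notify" method name and the result shapes are assumptions.
function handleRequest(request) {
  const reply = (body) => ({ jsonrpc: "2.0", id: request.id, ...body });
  switch (request.method) {
    case "capabilities":
      return reply({ result: { type: "notifier", methods: ["notify"] } });
    case "notify":
      // A real plugin would post request.params.message to a channel here.
      return reply({ result: { delivered: true } });
    default:
      return reply({ error: { code: -32601, message: "Method not found" } });
  }
}

// Wiring: read newline-delimited JSON requests, write one response per line.
function serve(input, output) {
  const readline = require("readline");
  const lines = readline.createInterface({ input });
  lines.on("line", (line) => {
    if (line.trim() === "") return;
    output.write(JSON.stringify(handleRequest(JSON.parse(line))) + "\n");
  });
}

// To run as an executable plugin, uncomment:
// serve(process.stdin, process.stdout);
```

Keeping handleRequest free of I/O makes the protocol logic easy to unit test without spawning the plugin process.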

Configuration

# evalctl.yaml
plugins:
  paths:
    - ~/.config/evalctl/plugins
    - ./plugins
  enabled:
    - provider-groq
    - notifier-slack
    - formatter-html

  # Per-plugin config
  config:
    notifier-slack:
      webhook: "https://hooks.slack.com/services/..."
      channel: "#eval-results"
    provider-groq:
      api_key_env: GROQ_API_KEY

# Install a plugin
evalctl plugin install provider-groq

# List installed plugins
evalctl plugin list

# Test a plugin
evalctl plugin test notifier-slack "Evaluation complete"

Config File Schema

The evalctl.yaml config file supports the following top-level keys:

Key       Type    Description
defaults  object  Default settings for all evaluations
models    array   List of model definitions
prompts   array   Prompt suite definitions
metrics   array   Metric configurations
output    object  Output format and destination
plugins   object  Plugin system configuration
ci        object  CI-specific settings

Defaults Schema

# Full defaults structure
defaults:
  model: string              # Default model ID
  temperature: float         # 0.0 - 2.0
  max_tokens: integer        # Max tokens per response
  top_p: float               # Nucleus sampling
  frequency_penalty: float   # -2.0 - 2.0
  presence_penalty: float    # -2.0 - 2.0
  stop: string | [string]    # Stop sequences
  timeout: duration          # Per-request timeout
  retries: integer           # Retry count on failure (default: 3)
  retry_delay: duration      # Delay between retries (default: 1s)

Model Schema

# Model definition
models:
  - id: string               # Unique identifier
    provider: string         # Provider name (openai, anthropic, google, custom)
    model_name: string       # Actual model name sent to API
    api_key_env: string      # Env var for API key (optional)
    base_url: string         # Custom API endpoint (optional)
    temperature: float       # Per-model override (optional)
    max_tokens: integer      # Per-model override (optional)
    metadata: {}             # Arbitrary metadata (optional)
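
A schema like this lends itself to a quick pre-flight check before any API calls are made. The sketch below is a hypothetical validator, not evalctl's own; it assumes id and provider are required (the Quick Start config sets only those two) and reuses the 0.0 to 2.0 temperature range from the defaults schema.

```javascript
// Hypothetical validator for one entry of the `models` list.
// Assumes id and provider are required; everything else is optional.
function validateModel(model) {
  const errors = [];
  for (const field of ["id", "provider"]) {
    if (typeof model[field] !== "string" || model[field] === "") {
      errors.push(`${field} must be a non-empty string`);
    }
  }
  if (
    model.temperature !== undefined &&
    (typeof model.temperature !== "number" ||
      model.temperature < 0 ||
      model.temperature > 2)
  ) {
    errors.push("temperature must be a number between 0.0 and 2.0");
  }
  if (
    model.max_tokens !== undefined &&
    (!Number.isInteger(model.max_tokens) || model.max_tokens <= 0)
  ) {
    errors.push("max_tokens must be a positive integer");
  }
  return errors;
}

console.log(validateModel({ id: "gpt-4o", provider: "openai", temperature: 0.7 }));
// []
console.log(validateModel({ id: "local", temperature: 3 }));
```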

Environment Variables

evalctl reads configuration from environment variables. These values override the config file and are themselves overridden by CLI flags.

Variable                      Type    Default            Description
OPENAI_API_KEY                string  -                  API key for OpenAI models
ANTHROPIC_API_KEY             string  -                  API key for Anthropic models
GOOGLE_API_KEY                string  -                  API key for Google / Gemini models
EVALCTL_DEFAULTS_MODEL        string  -                  Override the default model
EVALCTL_DEFAULTS_TEMPERATURE  float   -                  Override default temperature
EVALCTL_OUTPUT_FORMAT         string  -                  Override output format
EVALCTL_LOG_LEVEL             string  info               Set log level
EVALCTL_CONFIG                string  -                  Path to config file
EVALCTL_PLUGIN_DIR            string  -                  Plugin directory path
EVALCTL_CACHE_DIR             string  ~/.cache/evalctl   Cache directory
EVALCTL_HOME                  string  ~/.config/evalctl  evalctl home directory
HTTP_PROXY                    string  -                  HTTP proxy for API requests
HTTPS_PROXY                   string  -                  HTTPS proxy for API requests
NO_PROXY                      string  -                  Comma-separated no-proxy domains

API keys set via environment variables are read at startup. If you rotate keys, restart evalctl or re-source your shell profile. Never commit API keys to version control.

Exit Codes

evalctl uses the following exit codes for programmatic use in scripts and CI pipelines:

Code  Description
0     Success - all evaluations completed
1     General error - invalid flags, config, or runtime failure
2     Quality gate failure - score below threshold
3     Model error - API returned an error or timeout
4     Config error - invalid YAML or missing required fields
5     Plugin error - plugin crashed or returned invalid data
6     File error - input/output file not found or unwritable

# Use in a shell script
evalctl run --config evalctl.yaml --json --output results.json
status=$?  # capture the exit code before another command overwrites it
case $status in
  0)  echo "All evaluations passed" ;;
  2)  echo "Quality gate failed" && exit 1 ;;
  3)  echo "Model API error" && exit 1 ;;
  *)  echo "evalctl failed with exit code $status" && exit 1 ;;
esac

Exit code 2 is specifically designed for CI gates. Your pipeline can use evalctl gate --threshold 0.8 to convert a low-scoring run into a build failure.
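
For pipelines scripted in Node rather than shell, the same table can drive a small lookup. The sketch below is illustrative only: classifyExit is a hypothetical helper, and treating model errors (code 3) as retryable is a policy assumption, not evalctl behavior.

```javascript
// Exit codes as listed in the table above.
const EXIT_CODES = {
  0: "success",
  1: "general error",
  2: "quality gate failure",
  3: "model error",
  4: "config error",
  5: "plugin error",
  6: "file error",
};

// Hypothetical policy: only model errors (API error or timeout) are
// considered worth retrying by a CI wrapper.
function classifyExit(code) {
  const meaning = EXIT_CODES[code] || "unknown";
  return { code, meaning, ok: code === 0, retryable: code === 3 };
}

console.log(classifyExit(2));
// { code: 2, meaning: 'quality gate failure', ok: false, retryable: false }

// Possible usage with a real process:
// const { spawnSync } = require("child_process");
// const run = spawnSync("evalctl", ["run", "--config", "evalctl.yaml"], { stdio: "inherit" });
// const outcome = classifyExit(run.status);
```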

How to Contribute

evalctl is open source and we welcome contributions. Whether you are fixing a bug, adding a feature, or improving documentation, here is how to get started.

Development Setup

# Clone the repository
git clone https://github.com/example/evalctl
cd evalctl

# Build from source
go build ./cmd/evalctl

# Run tests
go test ./...

# Run linting
golangci-lint run ./...

Contribution Process

  1. Search existing issues and pull requests to avoid duplication
  2. Open a new issue to discuss significant changes before implementing
  3. Fork the repository and create a feature branch from main
  4. Write tests for your changes - coverage should not decrease
  5. Ensure all tests pass and the linter is clean
  6. Open a pull request with a clear description of the changes

Standards

  • Follow Go idioms and best practices (use gofmt, go vet)
  • Write descriptive commit messages in the imperative mood
  • Keep pull requests focused - one feature or fix per PR
  • Update documentation and CHANGELOG.md for user-facing changes
  • Add test cases for both success and failure paths

Code of Conduct

evalctl follows the Contributor Covenant code of conduct. We are committed to providing a welcoming and inclusive environment for everyone.

Our Pledge

We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.

Our Standards

Examples of behavior that contributes to a positive environment:

  • Demonstrating empathy and kindness toward other people
  • Being respectful of differing opinions, viewpoints, and experiences
  • Giving and gracefully accepting constructive feedback
  • Accepting responsibility and apologizing to those affected by our mistakes

Examples of unacceptable behavior:

  • Trolling, insulting or derogatory comments, and personal or political attacks
  • Public or private harassment
  • Publishing others’ private information without explicit permission
  • Other conduct which could reasonably be considered inappropriate in a professional setting

Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be reported to the project team at conduct@evalctl.dev. All complaints will be reviewed and investigated promptly and fairly.

Changelog

v1.4.2 (2025-05-15)

  • Fix: Plugin system now handles WASM modules correctly on ARM64
  • Fix: evalctl watch no longer crashes on symlinked directories
  • Fix: YAML config parsing handles quoted multi-line strings
  • Chore: Updated Go dependencies to patch CVEs

v1.4.1 (2025-04-28)

  • Fix: JSON output for evalctl compare now includes metric names
  • Fix: Rate limiting no longer blocks concurrent requests incorrectly
  • Improvement: Better error messages for expired API keys

v1.4.0 (2025-04-10)

  • Feature: New evalctl plugin subcommand for managing plugins
  • Feature: WASM-based plugins for cross-platform compatibility
  • Feature: evalctl export --format csv for spreadsheet import
  • Improvement: 40% faster startup time with lazy provider loading
  • Docs: Added CI integration guide for GitLab and CircleCI