Run Setup

Standard benchmark run with provider, model, prompt, execution, and evaluation controls.

Run & Validate uses the normal benchmark flow, but keeps validator controls in the main path.

Resume is triggered explicitly with --resume. The command recovers the previous run configuration from the output's prompt log and metrics artifacts when available; use --unclassified only when you want to re-prompt rows already labeled unclassified.
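For example, a resume that also re-prompts rows labeled unclassified might look like this (provider and output arguments from the original run are omitted here, since resume recovers the previous configuration when it can):

    python benchmark_agent.py --resume --unclassified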

Metrics Only recomputes metrics artifacts from existing output CSV files and does not emit --model.

Model & Provider
Run python benchmark_agent.py --update-models first, then refresh here.
Prompt Strategy
Validator Setup

Validator flags are emitted only when a validator path is set.

Retry message text comes from the validator script itself. The GUI does not expose a separate retry-message field because that text is returned by the validator, not passed as a benchmark flag.
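The exact contract is defined by benchmark_agent.py; as a purely illustrative sketch (the file name, function name, and signature below are assumptions, not the documented interface), a validator might accept a response by returning None and trigger a retry by returning the message text:

    # example_validator.py -- illustrative only; the real contract is
    # whatever benchmark_agent.py expects, not this exact signature.

    def validate(response: str):
        # Returning a string triggers a retry, and that string is the
        # retry message the model sees; returning None accepts the reply.
        if "ANSWER:" not in response:
            return "Reply with a line starting with 'ANSWER:' and retry."
        return None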

Execution
Provider Controls

Gemini Context Cache

Evaluation & Metadata
Logging
Inspect Prompt & Cache

Shared-prefix estimate ignores row-specific CSV fields; sampling is optional.

Sample a CSV file to preview the first data row inside the prompt payload.

The preview pane shows the assembled System and User messages for the sampled row.
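As a rough sketch of both behaviors (the template, field names, and character-count proxy below are assumptions, not the GUI's actual estimator):

    import csv
    from string import Formatter

    # Hypothetical user-prompt template with row-specific CSV fields.
    TEMPLATE = "Classify the record below.\n\nText: {text}\nLabels: {labels}\n"

    def shared_prefix_chars(template: str) -> int:
        # Everything before the first row-specific field is shared across
        # rows; characters stand in for tokens in this rough estimate.
        prefix = ""
        for literal, name, _, _ in Formatter().parse(template):
            prefix += literal
            if name is not None:
                break
        return len(prefix)

    def preview_first_row(csv_path: str, template: str) -> str:
        # Substitute the first data row into the template, mirroring the
        # sample-a-CSV preview.
        needed = {name for _, name, _, _ in Formatter().parse(template) if name}
        with open(csv_path, newline="") as fh:
            row = next(csv.DictReader(fh))
        return template.format(**{name: row.get(name, "") for name in needed})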
CLI Flag Reference

All CLI flags this GUI can currently emit. Flags not listed here can still be used, but only from the terminal.