Core Setup

Set these first to build a runnable command.

Refresh

Run python benchmark_agent.py --update-models first, then refresh here.

I/O

Model/Provider

Prompt Behavior

Optional Runtime Controls
Sampling, pacing, and service options

Prompt length and cache controls

Input token controls

Affect prompt size, cacheability, and prefix reuse.
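The cache-padding idea behind --cache_pad_target_tokens can be pictured as padding a shared prompt prefix toward a fixed token target so that later prompts reuse the same prefix. This is an illustrative sketch only, not the agent's implementation: naive whitespace splitting stands in for the real tokenizer, and the pad token and function name are invented.

```python
def pad_prefix(prefix: str, target_tokens: int, pad_token: str = "<pad>") -> str:
    """Pad a shared prompt prefix toward a token target (illustrative only).

    Token counting here is naive whitespace splitting; a real
    implementation would use the provider's tokenizer.
    """
    tokens = prefix.split()
    deficit = target_tokens - len(tokens)
    if deficit <= 0:
        return prefix  # already at or past the target; leave unchanged
    return prefix + " " + " ".join([pad_token] * deficit)
```

Padding only ever grows the prefix; prompts already at or past the target are left untouched, which matches the flag's "padded toward this shared-prefix target" wording.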

Output/reasoning token controls

Bias completion detail and hidden reasoning token budgets.
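The CLI reference below routes three effort-style flags to different payload fields depending on the target. A hedged sketch of that routing, based only on the flag descriptions in this document (real request construction in the agent may differ):

```python
def effort_payload(target: str, level: str) -> dict:
    """Map an effort-style setting to the payload shape described in the
    CLI reference (illustrative; actual request construction may differ)."""
    if target == "openai":
        return {"reasoning": {"effort": level}}  # sent as reasoning.effort
    if target == "gemini":
        return {"reasoning_effort": level}       # sent as reasoning_effort
    if target == "claude":
        return {"effort": level}                 # sent as effort
    raise ValueError(f"unknown target: {target}")
```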

Input/output extras and credentials

Advanced Validator (Optional)
Configure external validator and retry behavior

Useful for very large label spaces. The validator can accept, normalize, or request retries with a smaller candidate set.
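The NDJSON exchange format is not specified here, so the following is a hypothetical sketch of the validator side: read one JSON object per line, then accept, normalize, or request a retry with a smaller candidate set. All field names (prediction, allowed_labels, action) are invented for illustration.

```python
import json
import sys

def decide(req: dict) -> dict:
    """Hypothetical validator decision; field names are illustrative."""
    label = req.get("prediction", "")
    allowed = req.get("allowed_labels", [])
    if label in allowed:
        return {"action": "accept"}
    # Normalize case-only mismatches instead of forcing a retry.
    for cand in allowed:
        if label.lower() == cand.lower():
            return {"action": "normalize", "label": cand}
    # Otherwise ask the agent to retry with a pruned candidate set.
    return {"action": "retry", "allowed_labels": allowed[:10]}

def main() -> None:
    for line in sys.stdin:  # one JSON object per line (NDJSON)
        print(json.dumps(decide(json.loads(line))), flush=True)
```

A real validator would implement whatever protocol the agent actually speaks; the point is the three outcomes the text above describes.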

CLI Help Reference

For the full current reference, run python benchmark_agent.py --help in your terminal.

usage: benchmark_agent.py [-h] [--input INPUT [INPUT ...]] [--labels LABELS]
                          [--output OUTPUT] [--model MODEL]
                          [--temperature TEMPERATURE] [--top_p TOP_P]
                          [--top_k TOP_K]
                          [--service_tier {standard,flex,priority}]
                          [--verbosity {low,medium,high}]
                          [--reasoning_effort {low,medium,high,xhigh}]
                          [--thinking_level {minimal,low,medium,high}]
                          [--effort {low,medium,high,max}]
                          [--strict_control_acceptance]
                          [--provider PROVIDER]
                          [--system_prompt SYSTEM_PROMPT | --system_prompt_b64 SYSTEM_PROMPT_B64]
                          [--few_shot_examples FEW_SHOT_EXAMPLES]
                          [--prompt_layout {standard,compact}]
                          [--cache_pad_target_tokens CACHE_PAD_TARGET_TOKENS]
                          [--prompt_cache_key PROMPT_CACHE_KEY]
                          [--gemini_cached_content GEMINI_CACHED_CONTENT]
                          [--requesty_auto_cache | --no-requesty_auto_cache]
                          [--vertex_auto_adc_login | --no-vertex_auto_adc_login]
                          [--vertex_access_token_refresh_seconds VERTEX_ACCESS_TOKEN_REFRESH_SECONDS]
                          [--enable_cot] [--no_explanation]
                          [--logprobs | --no-logprobs] [--calibration]
                          [--api_key_var API_KEY_VAR]
                          [--api_base_var API_BASE_VAR]
                          [--max_retries MAX_RETRIES]
                          [--retry_delay RETRY_DELAY]
                          [--request_interval_ms REQUEST_INTERVAL_MS]
                          [--threads THREADS]
                          [--prompt_log_detail {full,compact}]
                          [--flush_rows FLUSH_ROWS]
                          [--flush_seconds FLUSH_SECONDS]
                          [--request_timeout_seconds REQUEST_TIMEOUT_SECONDS]
                          [--validator_cmd VALIDATOR_CMD]
                          [--validator_args VALIDATOR_ARGS]
                          [--validator_timeout VALIDATOR_TIMEOUT]
                          [--validator_prompt_max_candidates VALIDATOR_PROMPT_MAX_CANDIDATES]
                          [--validator_prompt_max_chars VALIDATOR_PROMPT_MAX_CHARS]
                          [--validator_exhausted_policy {accept_blank_confidence,unclassified,error}]
                          [--validator_debug] [--log_level LOG_LEVEL]
                          [--update-models] [--models-output MODELS_OUTPUT]
                          [--models-providers MODELS_PROVIDERS [MODELS_PROVIDERS ...]]
                          [--metrics_only]

Benchmark an OpenAI model on a linguistic classification dataset.

options:
  -h, --help            show this help message and exit
  --input INPUT [INPUT ...]
                        Path(s) to input CSV file(s) with examples.
  --labels LABELS       Optional path to CSV file that provides ground-truth
                        labels (ID;truth).
  --output OUTPUT       Optional output CSV path or directory. When omitted,
                        defaults to <input>__<provider>__<model>__<timestamp>.csv
                        alongside each input file. If the resolved output CSV
                        already exists, the run resumes from the first ID not
                        present in that file.
  --model MODEL         Model name (e.g., gpt-4-turbo).
  --metrics_only, --metrics-only
                        Skip model/API calls and compute metrics from existing
                        output CSV file(s) provided via --input.
  --temperature TEMPERATURE
                        Sampling temperature. Omit to let the provider/model
                        use its default.
  --top_p TOP_P         Nucleus sampling parameter. Omit to let the
                        provider/model use its default.
  --top_k TOP_K         Top-k sampling (ignored for APIs that do not support
                        it).
  --service_tier {standard,flex,priority}
                        Optional service-tier hint for providers that support
                        differentiated throughput.
  --verbosity {low,medium,high}
                        Optional output verbosity control for GPT models. Sent
                        as verbosity (Chat Completions) or text.verbosity
                        (Responses API).
  --reasoning_effort {low,medium,high,xhigh}
                        Optional reasoning effort level. Sent as
                        reasoning.effort for OpenAI-style models and as
                        reasoning_effort for Gemini targets.
  --thinking_level {minimal,low,medium,high}
                        Optional Gemini thinking level (minimal applies
                        to Gemini Flash models). Sent as
                        extra_body.google.thinking_config for Gemini OpenAI-compatible
                        targets.
  --effort {low,medium,high,max}
                        Optional Claude effort level. Sent as effort when
                        provided.
  --strict_control_acceptance
                        Fail an example when requested controls are rejected or
                        not present in the final successful request payload.
  --provider PROVIDER   Model provider identifier used to look up default
                        credentials. Known providers are preconfigured; custom
                        providers are inferred from <PROVIDER>_API_KEY (or
                        <PROVIDER>_ACCESS_TOKEN) and <PROVIDER>_BASE_URL.
  --system_prompt SYSTEM_PROMPT
                        System prompt injected into the chat completion.
  --system_prompt_b64 SYSTEM_PROMPT_B64
                        Base64-encoded system prompt (used by the GUI to keep
                        generated commands cross-platform).
  --few_shot_examples FEW_SHOT_EXAMPLES
                        Number of labeled examples to prepend as few-shot
                        demonstrations.
  --prompt_layout {standard,compact}
                        Prompt payload layout. standard preserves the current
                        verbose payload; compact removes duplicated fields to
                        improve cache reuse.
  --cache_pad_target_tokens CACHE_PAD_TARGET_TOKENS
                        Optional shared-prefix token target for cache padding.
                        If >0, shared-prefix length is calibrated from early
                        prompt structure; subsequent prompts are padded toward
                        this shared-prefix target.
  --prompt_cache_key PROMPT_CACHE_KEY
                        Optional provider cache-routing key (when supported)
                        to improve prompt-cache hit consistency for stable
                        prompt prefixes.
  --gemini_cached_content GEMINI_CACHED_CONTENT
                        Optional Gemini context-cache resource name passed via
                        extra_body.google.cached_content when targeting Gemini
                        OpenAI-compatible endpoints.
  --requesty_auto_cache, --no-requesty_auto_cache
                        Enable/disable Requesty automatic caching by sending
                        extra_body.requesty.auto_cache. Only used when
                        --provider requesty.
  --vertex_auto_adc_login, --no-vertex_auto_adc_login
                        Enable/disable automatic one-time ADC login for
                        Vertex when credentials are missing (browser-based
                        gcloud auth flow). Only used when --provider vertex.
  --vertex_access_token_refresh_seconds VERTEX_ACCESS_TOKEN_REFRESH_SECONDS
                        Override Vertex access-token refresh interval in
                        seconds. Only used when --provider vertex.
  --enable_cot          If set, encourages the model to reason step-by-step
                        before answering.
  --no_explanation      Skip requesting explanations to reduce token usage.
  --logprobs, --no-logprobs
                        Enable token log probabilities when supported.
                        Disabled by default for better large-run throughput.
  --calibration         Generate a calibration plot using the model's
                        confidences.
  --api_key_var API_KEY_VAR
                        Environment variable name that stores the API key or
                        access token.
  --api_base_var API_BASE_VAR
                        Environment variable name that stores the API base
                        URL.
  --max_retries MAX_RETRIES
                        Maximum number of retry attempts per example on API
                        errors.
  --retry_delay RETRY_DELAY
                        Delay (seconds) between API retries.
  --request_interval_ms REQUEST_INTERVAL_MS
                        Minimum delay in milliseconds between outgoing API
                        requests. Use 0 to disable request pacing.
  --threads THREADS     Number of concurrent worker threads used to classify
                        examples. Use 1 to keep sequential processing.
  --prompt_log_detail {full,compact}
                        Prompt-log detail level. full stores full
                        request/response text; compact omits heavy text
                        fields.
  --flush_rows FLUSH_ROWS
                        Flush CSV and NDJSON prompt log after this many
                        committed rows.
  --flush_seconds FLUSH_SECONDS
                        Flush CSV and NDJSON prompt log after this many
                        seconds even if flush_rows was not reached.
  --request_timeout_seconds REQUEST_TIMEOUT_SECONDS
                        Per-request timeout in seconds for provider API calls.
                        Use 0 or a negative value to disable timeout.
  --validator_cmd VALIDATOR_CMD
                        Optional path to an NDJSON validator
                        executable/script. When provided, the agent will
                        validate each prediction and may retry with extra
                        constraints. If the path ends with .py it will be run
                        via the current Python interpreter.
  --validator_args VALIDATOR_ARGS
                        Optional extra arguments passed to the validator
                        command as a single string (supports quoting).
                        Example: "--lexicon data/lemmas.txt --max_distance 2".
  --validator_timeout VALIDATOR_TIMEOUT
                        Timeout (seconds) for each validator request/response
                        roundtrip.
  --validator_prompt_max_candidates VALIDATOR_PROMPT_MAX_CANDIDATES
                        Maximum number of allowed_labels candidates rendered
                        into a validator retry prompt.
  --validator_prompt_max_chars VALIDATOR_PROMPT_MAX_CHARS
                        Maximum character length of the validator retry
                        instruction appended to the prompt.
  --validator_exhausted_policy {accept_blank_confidence,unclassified,error}
                        What to do when the validator keeps requesting retry
                        but --max_retries is exhausted.
                        accept_blank_confidence keeps the last label but
                        blanks confidence; unclassified forces label to
                        "unclassified"; error aborts the run.
  --validator_debug     Log validator NDJSON send/receive payloads at DEBUG
                        level.
  --log_level LOG_LEVEL
                        Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL).
  --update-models, -updatemodels
                        If set, fetch available models for configured
                        providers and update config_models.js.
  --models-output MODELS_OUTPUT
                        Output path for generated model catalog JS when
                        --update-models is used.
  --models-providers MODELS_PROVIDERS [MODELS_PROVIDERS ...]
                        Optional list of provider slugs to update when
                        --update-models is specified. Custom slugs are
                        allowed; env vars are inferred as <SLUG>_API_KEY (or
                        <SLUG>_ACCESS_TOKEN) and <SLUG>_BASE_URL.
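Flags like --max_retries, --retry_delay, and --request_interval_ms suggest a retry-with-pacing loop around each API call. A generic sketch of that pattern (not the agent's actual code):

```python
import time

def call_with_retries(fn, max_retries=3, retry_delay=0.0, request_interval_ms=0):
    """Retry fn() up to max_retries extra times, pacing outgoing requests.

    Illustrative only: the real agent also coordinates pacing across
    worker threads, which this single-threaded sketch does not.
    """
    for attempt in range(max_retries + 1):
        if request_interval_ms:
            time.sleep(request_interval_ms / 1000.0)  # simple request pacing
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted; surface the last error
            time.sleep(retry_delay)  # back off before the next attempt
```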

Tip: this GUI is a command builder. You can still edit any generated CLI flag manually before running.
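Because the GUI only builds a command line, the same command can be assembled programmatically. A sketch with a hypothetical input path, model, and prompt, encoding the system prompt the way --system_prompt_b64 describes so the command stays portable:

```python
import base64
import shlex

def build_command(input_csv: str, model: str, system_prompt: str) -> str:
    """Assemble a benchmark_agent.py invocation.

    The input path, model name, and prompt here are hypothetical examples.
    """
    b64 = base64.b64encode(system_prompt.encode("utf-8")).decode("ascii")
    args = ["python", "benchmark_agent.py",
            "--input", input_csv,
            "--model", model,
            "--system_prompt_b64", b64]
    return shlex.join(args)  # shell-safe quoting

cmd = build_command("data/examples.csv", "gpt-4-turbo", "Classify each example.")
```

shlex.join quotes any argument that needs it, which is exactly the cross-platform concern the base64 flag exists to sidestep for the prompt itself.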