Core Setup

Set these first to build a runnable command.

Refresh

Run python benchmark_agent.py --update-models first, then refresh here.

I/O

Model/Provider

Prompt Behavior

Optional Runtime Controls
Sampling, pacing, and service options

Prompt length and cache controls

Input token controls

Affect prompt size, cacheability, and prefix reuse.
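The cache-padding idea behind --cache_pad_target_tokens can be pictured as padding a shared prompt prefix toward a fixed token target so that later prompts reuse the same prefix. This is an illustrative sketch only, not the agent's implementation: naive whitespace splitting stands in for the real tokenizer, and the pad token and function name are invented.

```python
def pad_prefix(prefix: str, target_tokens: int, pad_token: str = "<pad>") -> str:
    """Pad a shared prompt prefix toward a token target (illustrative only).

    Token counting here is naive whitespace splitting; a real
    implementation would use the provider's tokenizer.
    """
    tokens = prefix.split()
    deficit = target_tokens - len(tokens)
    if deficit <= 0:
        return prefix  # already at or past the target; leave unchanged
    return prefix + " " + " ".join([pad_token] * deficit)
```

Padding only ever grows the prefix; prompts already at or past the target are left untouched, which matches the flag's "padded toward this shared-prefix target" wording.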

Output/reasoning token controls

Bias completion detail and hidden reasoning token budgets.
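The CLI reference below routes three effort-style flags to different payload fields depending on the target. A hedged sketch of that routing, based only on the flag descriptions in this document (real request construction in the agent may differ):

```python
def effort_payload(target: str, level: str) -> dict:
    """Map an effort-style setting to the payload shape described in the
    CLI reference (illustrative; actual request construction may differ)."""
    if target == "openai":
        return {"reasoning": {"effort": level}}  # sent as reasoning.effort
    if target == "gemini":
        return {"reasoning_effort": level}       # sent as reasoning_effort
    if target == "claude":
        return {"effort": level}                 # sent as effort
    raise ValueError(f"unknown target: {target}")
```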

Input/output extras and credentials

Advanced Validator (Optional)
Configure external validator and retry behavior

Useful for very large label spaces. The validator can accept, normalize, or request retries with a smaller candidate set.
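The NDJSON exchange format is not specified here, so the following is a hypothetical sketch of the validator side: read one JSON object per line, then accept, normalize, or request a retry with a smaller candidate set. All field names (prediction, allowed_labels, action) are invented for illustration.

```python
import json
import sys

def decide(req: dict) -> dict:
    """Hypothetical validator decision; field names are illustrative."""
    label = req.get("prediction", "")
    allowed = req.get("allowed_labels", [])
    if label in allowed:
        return {"action": "accept"}
    # Normalize case-only mismatches instead of forcing a retry.
    for cand in allowed:
        if label.lower() == cand.lower():
            return {"action": "normalize", "label": cand}
    # Otherwise ask the agent to retry with a pruned candidate set.
    return {"action": "retry", "allowed_labels": allowed[:10]}

def main() -> None:
    for line in sys.stdin:  # one JSON object per line (NDJSON)
        print(json.dumps(decide(json.loads(line))), flush=True)
```

A real validator would implement whatever protocol the agent actually speaks; the point is the three outcomes the text above describes.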

CLI Help Reference

For the full current reference, run python benchmark_agent.py --help in your terminal.

usage: benchmark_agent.py [-h] [--input INPUT [INPUT ...]] [--labels LABELS]
                          [--output OUTPUT] [--model MODEL]
                          [--temperature TEMPERATURE] [--top_p TOP_P]
                          [--top_k TOP_K]
                          [--service_tier {standard,flex,priority}]
                          [--verbosity {low,medium,high}]
                          [--reasoning_effort {low,medium,high,xhigh}]
                          [--thinking_level {minimal,low,medium,high}]
                          [--effort {low,medium,high,max}]
                          [--strict_control_acceptance]
                          [--provider PROVIDER]
                          [--system_prompt SYSTEM_PROMPT | --system_prompt_b64 SYSTEM_PROMPT_B64]
                          [--few_shot_examples FEW_SHOT_EXAMPLES]
                          [--prompt_layout {standard,compact}]
                          [--cache_pad_target_tokens CACHE_PAD_TARGET_TOKENS]
                          [--prompt_cache_key PROMPT_CACHE_KEY]
                          [--gemini_cached_content GEMINI_CACHED_CONTENT]
                          [--requesty_auto_cache | --no-requesty_auto_cache]
                          [--vertex_auto_adc_login | --no-vertex_auto_adc_login]
                          [--vertex_access_token_refresh_seconds VERTEX_ACCESS_TOKEN_REFRESH_SECONDS]
                          [--enable_cot] [--no_explanation]
                          [--logprobs | --no-logprobs] [--calibration]
                          [--api_key_var API_KEY_VAR]
                          [--api_base_var API_BASE_VAR]
                          [--max_retries MAX_RETRIES]
                          [--retry_delay RETRY_DELAY]
                          [--request_interval_ms REQUEST_INTERVAL_MS]
                          [--threads THREADS]
                          [--prompt_log_detail {full,compact}]
                          [--flush_rows FLUSH_ROWS]
                          [--flush_seconds FLUSH_SECONDS]
                          [--request_timeout_seconds REQUEST_TIMEOUT_SECONDS]
                          [--validator_cmd VALIDATOR_CMD]
                          [--validator_args VALIDATOR_ARGS]
                          [--validator_timeout VALIDATOR_TIMEOUT]
                          [--validator_prompt_max_candidates VALIDATOR_PROMPT_MAX_CANDIDATES]
                          [--validator_prompt_max_chars VALIDATOR_PROMPT_MAX_CHARS]
                          [--validator_exhausted_policy {accept_blank_confidence,unclassified,error}]
                          [--validator_debug] [--log_level LOG_LEVEL]
                          [--update-models] [--models-output MODELS_OUTPUT]
                          [--models-providers MODELS_PROVIDERS [MODELS_PROVIDERS ...]]
                          [--metrics_only]

Benchmark an OpenAI model on a linguistic classification dataset.

options:
  -h, --help            show this help message and exit
  --input INPUT [INPUT ...]
                        Path(s) to input CSV file(s) with examples.
  --labels LABELS       Optional path to CSV file that provides ground-truth
                        labels (ID;truth).
  --output OUTPUT       Optional output CSV path or directory. When omitted,
                        defaults to <input>__<provider>__<model>__<timestamp>.csv
                        alongside each input file. If the resolved output CSV
                        already exists, the run resumes from the first ID not
                        present in that file.
  --model MODEL         Model name (e.g., gpt-4-turbo).
  --metrics_only, --metrics-only
                        Skip model/API calls and compute metrics from existing
                        output CSV file(s) provided via --input.
  --temperature TEMPERATURE
                        Sampling temperature. Omit to let the provider/model
                        use its default.
  --top_p TOP_P         Nucleus sampling parameter. Omit to let the
                        provider/model use its default.
  --top_k TOP_K         Top-k sampling (ignored for APIs that do not support
                        it).
  --service_tier {standard,flex,priority}
                        Optional service-tier hint for providers that support
                        differentiated throughput.
  --verbosity {low,medium,high}
                        Optional output verbosity control for GPT models. Sent
                        as verbosity (Chat Completions) or text.verbosity
                        (Responses API).
  --reasoning_effort {low,medium,high,xhigh}
                        Optional reasoning effort level. Sent as
                        reasoning.effort for OpenAI-style models and as
                        reasoning_effort for Gemini targets.
  --thinking_level {minimal,low,medium,high}
                        Optional Gemini thinking level (minimal applies
                        to Gemini Flash models). Sent as
                        extra_body.google.thinking_config for Gemini OpenAI-compatible
                        targets.
  --effort {low,medium,high,max}
                        Optional Claude effort level. Sent as effort when
                        provided.
  --strict_control_acceptance
                        Fail an example when requested controls are rejected or
                        not present in the final successful request payload.
  --provider PROVIDER   Model provider identifier used to look up default
                        credentials. Known providers are preconfigured; custom
                        providers are inferred from <PROVIDER>_API_KEY (or
                        <PROVIDER>_ACCESS_TOKEN) and <PROVIDER>_BASE_URL.
  --system_prompt SYSTEM_PROMPT
                        System prompt injected into the chat completion.
  --system_prompt_b64 SYSTEM_PROMPT_B64
                        Base64-encoded system prompt (used by the GUI to keep
                        generated commands cross-platform).
  --few_shot_examples FEW_SHOT_EXAMPLES
                        Number of labeled examples to prepend as few-shot
                        demonstrations.
  --prompt_layout {standard,compact}
                        Prompt payload layout. standard preserves the current
                        verbose payload; compact removes duplicated fields to
                        improve cache reuse.
  --cache_pad_target_tokens CACHE_PAD_TARGET_TOKENS
                        Optional shared-prefix token target for cache padding.
                        If >0, shared-prefix length is calibrated from early
                        prompt structure; subsequent prompts are padded toward
                        this shared-prefix target.
  --prompt_cache_key PROMPT_CACHE_KEY
                        Optional provider cache-routing key (when supported)
                        to improve prompt-cache hit consistency for stable
                        prompt prefixes.
  --gemini_cached_content GEMINI_CACHED_CONTENT
                        Optional Gemini context-cache resource name passed via
                        extra_body.google.cached_content when targeting Gemini
                        OpenAI-compatible endpoints.
  --requesty_auto_cache, --no-requesty_auto_cache
                        Enable/disable Requesty automatic caching by sending
                        extra_body.requesty.auto_cache. Only used when
                        --provider requesty.
  --vertex_auto_adc_login, --no-vertex_auto_adc_login
                        Enable/disable automatic one-time ADC login for
                        Vertex when credentials are missing (browser-based
                        gcloud auth flow). Only used when --provider vertex.
  --vertex_access_token_refresh_seconds VERTEX_ACCESS_TOKEN_REFRESH_SECONDS
                        Override Vertex access-token refresh interval in
                        seconds. Only used when --provider vertex.
  --enable_cot          If set, encourages the model to reason step-by-step
                        before answering.
  --no_explanation      Skip requesting explanations to reduce token usage.
  --logprobs, --no-logprobs
                        Enable token log probabilities when supported.
                        Disabled by default for better large-run throughput.
  --calibration         Generate a calibration plot using the model's
                        confidences.
  --api_key_var API_KEY_VAR
                        Environment variable name that stores the API key or
                        access token.
  --api_base_var API_BASE_VAR
                        Environment variable name that stores the API base
                        URL.
  --max_retries MAX_RETRIES
                        Maximum number of retry attempts per example on API
                        errors.
  --retry_delay RETRY_DELAY
                        Delay (seconds) between API retries.
  --request_interval_ms REQUEST_INTERVAL_MS
                        Minimum delay in milliseconds between outgoing API
                        requests. Use 0 to disable request pacing.
  --threads THREADS     Number of concurrent worker threads used to classify
                        examples. Use 1 to keep sequential processing.
  --prompt_log_detail {full,compact}
                        Prompt-log detail level. full stores full
                        request/response text; compact omits heavy text
                        fields.
  --flush_rows FLUSH_ROWS
                        Flush CSV and NDJSON prompt log after this many
                        committed rows.
  --flush_seconds FLUSH_SECONDS
                        Flush CSV and NDJSON prompt log after this many
                        seconds even if flush_rows was not reached.
  --request_timeout_seconds REQUEST_TIMEOUT_SECONDS
                        Per-request timeout in seconds for provider API calls.
                        Use 0 or a negative value to disable timeout.
  --validator_cmd VALIDATOR_CMD
                        Optional path to an NDJSON validator
                        executable/script. When provided, the agent will
                        validate each prediction and may retry with extra
                        constraints. If the path ends with .py it will be run
                        via the current Python interpreter.
  --validator_args VALIDATOR_ARGS
                        Optional extra arguments passed to the validator
                        command as a single string (supports quoting).
                        Example: "--lexicon data/lemmas.txt --max_distance 2".
  --validator_timeout VALIDATOR_TIMEOUT
                        Timeout (seconds) for each validator request/response
                        roundtrip.
  --validator_prompt_max_candidates VALIDATOR_PROMPT_MAX_CANDIDATES
                        Maximum number of allowed_labels candidates rendered
                        into a validator retry prompt.
  --validator_prompt_max_chars VALIDATOR_PROMPT_MAX_CHARS
                        Maximum character length of the validator retry
                        instruction appended to the prompt.
  --validator_exhausted_policy {accept_blank_confidence,unclassified,error}
                        What to do when the validator keeps requesting retry
                        but --max_retries is exhausted.
                        accept_blank_confidence keeps the last label but
                        blanks confidence; unclassified forces label to
                        "unclassified"; error aborts the run.
  --validator_debug     Log validator NDJSON send/receive payloads at DEBUG
                        level.
  --log_level LOG_LEVEL
                        Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL).
  --update-models, -updatemodels
                        If set, fetch available models for configured
                        providers and update config_models.js.
  --models-output MODELS_OUTPUT
                        Output path for generated model catalog JS when
                        --update-models is used.
  --models-providers MODELS_PROVIDERS [MODELS_PROVIDERS ...]
                        Optional list of provider slugs to update when
                        --update-models is specified. Custom slugs are
                        allowed; env vars are inferred as <SLUG>_API_KEY (or
                        <SLUG>_ACCESS_TOKEN) and <SLUG>_BASE_URL.
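Flags like --max_retries, --retry_delay, and --request_interval_ms suggest a retry-with-pacing loop around each API call. A generic sketch of that pattern (not the agent's actual code):

```python
import time

def call_with_retries(fn, max_retries=3, retry_delay=0.0, request_interval_ms=0):
    """Retry fn() up to max_retries extra times, pacing outgoing requests.

    Illustrative only: the real agent also coordinates pacing across
    worker threads, which this single-threaded sketch does not.
    """
    for attempt in range(max_retries + 1):
        if request_interval_ms:
            time.sleep(request_interval_ms / 1000.0)  # simple request pacing
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted; surface the last error
            time.sleep(retry_delay)  # back off before the next attempt
```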

Tip: this GUI is a command builder. You can still edit any generated CLI flag manually before running.
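Because the GUI only builds a command line, the same command can be assembled programmatically. A sketch with a hypothetical input path, model, and prompt, encoding the system prompt the way --system_prompt_b64 describes so the command stays portable:

```python
import base64
import shlex

def build_command(input_csv: str, model: str, system_prompt: str) -> str:
    """Assemble a benchmark_agent.py invocation.

    The input path, model name, and prompt here are hypothetical examples.
    """
    b64 = base64.b64encode(system_prompt.encode("utf-8")).decode("ascii")
    args = ["python", "benchmark_agent.py",
            "--input", input_csv,
            "--model", model,
            "--system_prompt_b64", b64]
    return shlex.join(args)  # shell-safe quoting

cmd = build_command("data/examples.csv", "gpt-4-turbo", "Classify each example.")
```

shlex.join quotes any argument that needs it, which is exactly the cross-platform concern the base64 flag exists to sidestep for the prompt itself.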