LLM Linguistic Classification Benchmark Agent
Required inputs are highlighted first. Optional and advanced controls are collapsed to reduce visual clutter.
CLI Help Reference
For the full current reference, run python benchmark_agent.py --help in your terminal.
usage: benchmark_agent.py [-h] [--input INPUT [INPUT ...]] [--labels LABELS]
[--output OUTPUT] [--model MODEL]
[--temperature TEMPERATURE] [--top_p TOP_P]
[--top_k TOP_K]
[--service_tier {standard,flex,priority}]
[--verbosity {low,medium,high}]
[--reasoning_effort {low,medium,high,xhigh}]
[--thinking_level {minimal,low,medium,high}]
[--effort {low,medium,high,max}]
[--strict_control_acceptance]
[--provider PROVIDER]
[--system_prompt SYSTEM_PROMPT | --system_prompt_b64 SYSTEM_PROMPT_B64]
[--few_shot_examples FEW_SHOT_EXAMPLES]
[--prompt_layout {standard,compact}]
[--cache_pad_target_tokens CACHE_PAD_TARGET_TOKENS]
[--prompt_cache_key PROMPT_CACHE_KEY]
[--gemini_cached_content GEMINI_CACHED_CONTENT]
[--requesty_auto_cache | --no-requesty_auto_cache]
[--vertex_auto_adc_login | --no-vertex_auto_adc_login]
[--vertex_access_token_refresh_seconds VERTEX_ACCESS_TOKEN_REFRESH_SECONDS]
[--enable_cot] [--no_explanation]
[--logprobs | --no-logprobs] [--calibration]
[--api_key_var API_KEY_VAR]
[--api_base_var API_BASE_VAR]
[--max_retries MAX_RETRIES]
[--retry_delay RETRY_DELAY]
[--request_interval_ms REQUEST_INTERVAL_MS]
[--threads THREADS]
[--prompt_log_detail {full,compact}]
[--flush_rows FLUSH_ROWS]
[--flush_seconds FLUSH_SECONDS]
[--request_timeout_seconds REQUEST_TIMEOUT_SECONDS]
[--validator_cmd VALIDATOR_CMD]
[--validator_args VALIDATOR_ARGS]
[--validator_timeout VALIDATOR_TIMEOUT]
[--validator_prompt_max_candidates VALIDATOR_PROMPT_MAX_CANDIDATES]
[--validator_prompt_max_chars VALIDATOR_PROMPT_MAX_CHARS]
[--validator_exhausted_policy {accept_blank_confidence,unclassified,error}]
[--validator_debug] [--log_level LOG_LEVEL]
[--update-models] [--models-output MODELS_OUTPUT]
[--models-providers MODELS_PROVIDERS [MODELS_PROVIDERS ...]]
[--metrics_only]
Benchmark a model from a supported provider on a linguistic classification dataset.
options:
-h, --help show this help message and exit
--input INPUT [INPUT ...]
Path(s) to input CSV file(s) with examples.
--labels LABELS Optional path to a CSV file that provides ground-truth
labels (ID;truth).
--output OUTPUT Optional output CSV path or directory. When omitted,
defaults to <input>__<provider>__<model>__<timestamp>.csv
alongside each input file. If the resolved output CSV
already exists, the run resumes from the first ID not
present in that file.
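The default output naming and resume behavior described above can be sketched as follows. This is an illustration only: the timestamp format, the slash handling in model names, and the "ID" column name are assumptions, not the agent's exact implementation.

```python
import csv
import time
from pathlib import Path

def default_output_path(input_path: str, provider: str, model: str) -> Path:
    """Build <input>__<provider>__<model>__<timestamp>.csv next to the input file."""
    p = Path(input_path)
    stamp = time.strftime("%Y%m%d-%H%M%S")  # timestamp format is an assumption
    safe_model = model.replace("/", "_")     # model slugs may contain slashes
    return p.with_name(f"{p.stem}__{provider}__{safe_model}__{stamp}.csv")

def completed_ids(output_csv: Path) -> set[str]:
    """IDs already present in an existing output CSV; a resumed run skips these."""
    if not output_csv.exists():
        return set()
    with output_csv.open(newline="", encoding="utf-8") as f:
        return {row["ID"] for row in csv.DictReader(f) if row.get("ID")}
```

A resumed run would then process only the input rows whose IDs are not in `completed_ids(...)`.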
--model MODEL Model name (e.g., gpt-4-turbo).
--metrics_only, --metrics-only
Skip model/API calls and compute metrics from existing
output CSV file(s) provided via --input.
--temperature TEMPERATURE
Sampling temperature. Omit to let the provider/model
use its default.
--top_p TOP_P Nucleus sampling parameter. Omit to let the
provider/model use its default.
--top_k TOP_K Top-k sampling (ignored for APIs that do not support
it).
--service_tier {standard,flex,priority}
Optional service-tier hint for providers that support
differentiated throughput.
--verbosity {low,medium,high}
Optional output verbosity control for GPT models. Sent
as verbosity (Chat Completions) or text.verbosity
(Responses API).
--reasoning_effort {low,medium,high,xhigh}
Optional reasoning effort level. Sent as
reasoning.effort for OpenAI-style models and as
reasoning_effort for Gemini targets.
--thinking_level {minimal,low,medium,high}
Optional Gemini thinking level (minimal applies
to Gemini Flash models). Sent as
extra_body.google.thinking_config for Gemini OpenAI-compatible
targets.
--effort {low,medium,high,max}
Optional Claude effort level. Sent as effort when
provided.
--strict_control_acceptance
Fail an example when requested controls are rejected or
not present in the final successful request payload.
--provider PROVIDER Model provider identifier used to look up default
credentials. Known providers are preconfigured; custom
providers are inferred from <PROVIDER>_API_KEY (or
<PROVIDER>_ACCESS_TOKEN) and <PROVIDER>_BASE_URL.
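The credential inference for custom providers can be illustrated with a short sketch; the lookup order (API key before access token) matches the help text above, while everything else is an assumption.

```python
import os

def provider_env_vars(provider: str):
    """Infer credentials for a custom provider slug: <PROVIDER>_API_KEY
    (falling back to <PROVIDER>_ACCESS_TOKEN) and <PROVIDER>_BASE_URL."""
    slug = provider.upper().replace("-", "_")
    key = os.environ.get(f"{slug}_API_KEY") or os.environ.get(f"{slug}_ACCESS_TOKEN")
    base = os.environ.get(f"{slug}_BASE_URL")
    return key, base
```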
--system_prompt SYSTEM_PROMPT
System prompt injected into the chat completion.
--system_prompt_b64 SYSTEM_PROMPT_B64
Base64-encoded system prompt (used by the GUI so the
generated command stays portable across shells).
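If you build commands by hand rather than through the GUI, the encoding it applies is plain Base64 over UTF-8 text, as in this sketch:

```python
import base64

def encode_system_prompt(prompt: str) -> str:
    """UTF-8 encode, then Base64 encode, for use with --system_prompt_b64."""
    return base64.b64encode(prompt.encode("utf-8")).decode("ascii")
```

The resulting string can be pasted directly after `--system_prompt_b64` in a generated command.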
--few_shot_examples FEW_SHOT_EXAMPLES
Number of labeled examples to prepend as few-shot
demonstrations.
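Conceptually, few-shot demonstrations are labeled pairs prepended to the conversation. The message shape below (alternating user/assistant chat turns) is an assumption for illustration, not the agent's documented prompt layout.

```python
def build_few_shot_messages(labeled, n):
    """Prepend up to n labeled (text, label) pairs as alternating chat turns."""
    messages = []
    for text, label in list(labeled)[:n]:
        messages.append({"role": "user", "content": text})
        messages.append({"role": "assistant", "content": label})
    return messages
```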
--prompt_layout {standard,compact}
Prompt payload layout. standard preserves the current
verbose payload; compact removes duplicated fields to
improve cache reuse.
--cache_pad_target_tokens CACHE_PAD_TARGET_TOKENS
Optional shared-prefix token target for cache padding.
If >0, shared-prefix length is calibrated from early
prompt structure; subsequent prompts are padded toward
this shared-prefix target.
--prompt_cache_key PROMPT_CACHE_KEY
Optional provider cache-routing key (when supported)
to improve prompt-cache hit consistency for stable
prompt prefixes.
--gemini_cached_content GEMINI_CACHED_CONTENT
Optional Gemini context-cache resource name passed via
extra_body.google.cached_content when targeting Gemini
OpenAI-compatible endpoints.
--requesty_auto_cache, --no-requesty_auto_cache
Enable/disable Requesty automatic caching by sending
extra_body.requesty.auto_cache. Only used when
--provider requesty.
--vertex_auto_adc_login, --no-vertex_auto_adc_login
Enable/disable automatic one-time ADC login for
Vertex when credentials are missing (browser-based
gcloud auth flow). Only used when --provider vertex.
--vertex_access_token_refresh_seconds VERTEX_ACCESS_TOKEN_REFRESH_SECONDS
Override Vertex access-token refresh interval in
seconds. Only used when --provider vertex.
--enable_cot If set, encourages the model to reason step-by-step
before answering.
--no_explanation Skip requesting explanations to reduce token usage.
--logprobs, --no-logprobs
Enable token log probabilities when supported.
Disabled by default for better large-run throughput.
--calibration Generate a calibration plot using the model's
confidences.
--api_key_var API_KEY_VAR
Environment variable name that stores the API key or
access token.
--api_base_var API_BASE_VAR
Environment variable name that stores the API base
URL.
--max_retries MAX_RETRIES
Maximum number of retry attempts per example on API
errors.
--retry_delay RETRY_DELAY
Delay (seconds) between API retries.
--request_interval_ms REQUEST_INTERVAL_MS
Minimum delay in milliseconds between outgoing API
requests. Use 0 to disable request pacing.
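Request pacing with multiple worker threads amounts to reserving evenly spaced send slots. A minimal thread-safe sketch of that idea, assuming nothing about the agent's internals:

```python
import threading
import time

class RequestPacer:
    """Enforce a minimum interval (ms) between outgoing requests across
    threads. An interval of 0 disables pacing, as with --request_interval_ms 0."""
    def __init__(self, interval_ms: int):
        self.interval = interval_ms / 1000.0
        self._lock = threading.Lock()
        self._next_slot = 0.0

    def wait(self) -> None:
        if self.interval <= 0:
            return  # pacing disabled
        with self._lock:
            now = time.monotonic()
            delay = max(0.0, self._next_slot - now)
            # Reserve the next send slot before releasing the lock.
            self._next_slot = max(now, self._next_slot) + self.interval
        if delay > 0:
            time.sleep(delay)
```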
--threads THREADS Number of concurrent worker threads used to classify
examples. Use 1 to keep sequential processing.
--prompt_log_detail {full,compact}
Prompt-log detail level. full stores full
request/response text; compact omits heavy text
fields.
--flush_rows FLUSH_ROWS
Flush CSV and NDJSON prompt log after this many
committed rows.
--flush_seconds FLUSH_SECONDS
Flush CSV and NDJSON prompt log after this many
seconds even if flush_rows was not reached.
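The interaction between the two flush thresholds (rows OR elapsed seconds, whichever is hit first) can be sketched like this; the class and method names are illustrative, not the agent's API:

```python
import time

class FlushPolicy:
    """Flush when flush_rows committed rows accumulate, or after
    flush_seconds elapse even if the row threshold was not reached."""
    def __init__(self, flush_rows: int, flush_seconds: float):
        self.flush_rows = flush_rows
        self.flush_seconds = flush_seconds
        self._pending = 0
        self._last_flush = time.monotonic()

    def commit(self) -> bool:
        """Record one committed row; True means the caller should flush now."""
        self._pending += 1
        due = (self._pending >= self.flush_rows
               or time.monotonic() - self._last_flush >= self.flush_seconds)
        if due:
            self._pending = 0
            self._last_flush = time.monotonic()
        return due
```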
--request_timeout_seconds REQUEST_TIMEOUT_SECONDS
Per-request timeout in seconds for provider API calls.
Use 0 or a negative value to disable timeout.
--validator_cmd VALIDATOR_CMD
Optional path to an NDJSON validator
executable/script. When provided, the agent will
validate each prediction and may retry with extra
constraints. If the path ends with .py it will be run
via the current Python interpreter.
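The validator contract above is line-oriented NDJSON: one request line in, one response line out. The help text does not specify the wire schema, so every field name in this sketch (`id`, `label`, `verdict`, `allowed_labels`) is an illustrative assumption.

```python
import json

# Hypothetical label lexicon for a minimal validator; the real lexicon
# would typically come from --validator_args (e.g. a --lexicon file).
ALLOWED = {"noun", "verb", "adjective"}

def validate_line(line: str) -> str:
    """Return one NDJSON response line for one NDJSON request line."""
    req = json.loads(line)
    if req.get("label") in ALLOWED:
        resp = {"id": req["id"], "verdict": "accept"}
    else:
        # Ask the agent to retry, constrained to the candidate labels.
        resp = {"id": req["id"], "verdict": "retry",
                "allowed_labels": sorted(ALLOWED)}
    return json.dumps(resp)

def run(stdin, stdout):
    """Line-oriented loop: read requests, write flushed responses."""
    for line in stdin:
        if line.strip():
            stdout.write(validate_line(line) + "\n")
            stdout.flush()  # NDJSON protocols require per-line flushing
```

A script shaped like this (reading `sys.stdin`, writing `sys.stdout`) could be passed via --validator_cmd; a `.py` path is run with the current interpreter as noted above.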
--validator_args VALIDATOR_ARGS
Optional extra arguments passed to the validator
command as a single string (supports quoting).
Example: "--lexicon data/lemmas.txt --max_distance 2".
--validator_timeout VALIDATOR_TIMEOUT
Timeout (seconds) for each validator request/response
roundtrip.
--validator_prompt_max_candidates VALIDATOR_PROMPT_MAX_CANDIDATES
Maximum number of allowed_labels candidates rendered
into a validator retry prompt.
--validator_prompt_max_chars VALIDATOR_PROMPT_MAX_CHARS
Maximum character length of the validator retry
instruction appended to the prompt.
--validator_exhausted_policy {accept_blank_confidence,unclassified,error}
What to do when the validator keeps requesting retry
but --max_retries is exhausted.
accept_blank_confidence keeps the last label but
blanks confidence; unclassified forces label to
"unclassified"; error aborts the run.
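The three policies map to a simple decision, sketched below. The help text only says that accept_blank_confidence blanks the confidence; whether the unclassified policy also blanks it is an assumption here.

```python
def apply_exhausted_policy(policy: str, last_label: str, last_confidence):
    """Resolve a prediction after --max_retries is exhausted while the
    validator still requests retry. Mirrors the three documented policies."""
    if policy == "accept_blank_confidence":
        return last_label, ""       # keep the last label, blank the confidence
    if policy == "unclassified":
        return "unclassified", ""   # blanking confidence here is an assumption
    if policy == "error":
        raise RuntimeError("validator retries exhausted; aborting run")
    raise ValueError(f"unknown policy: {policy}")
```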
--validator_debug Log validator NDJSON send/receive payloads at DEBUG
level.
--log_level LOG_LEVEL
Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL).
--update-models, -updatemodels
If set, fetch available models for configured
providers and update config_models.js.
--models-output MODELS_OUTPUT
Output path for generated model catalog JS when
--update-models is used.
--models-providers MODELS_PROVIDERS [MODELS_PROVIDERS ...]
Optional list of provider slugs to update when
--update-models is specified. Custom slugs are
allowed; env vars are inferred as <SLUG>_API_KEY (or
<SLUG>_ACCESS_TOKEN) and <SLUG>_BASE_URL.
Tip: this GUI is a command builder. You can still edit any generated CLI flag manually before running.