# LLM Linguistic Classification Benchmark Agent

## Agent Overview
This agent is a Python-based benchmarking tool designed to evaluate large language models (LLMs) on a linguistic classification task. Each input consists of a text excerpt containing a specific **node word**, and the agent prompts an LLM (via the OpenAI API) to classify that node word according to a given linguistic criterion (e.g. a syntactic category like part-of-speech, or a semantic category). The agent automates the end-to-end process: loading a dataset of labeled examples, querying the LLM for each example, collecting the model’s predicted label along with an explanation and a self-reported confidence, and finally comparing these predictions against the ground truth labels to measure performance. This provides a reproducible way to benchmark an LLM’s accuracy on specialized linguistic classification problems.

... [truncated for brevity in this context]


## Additional Features

### Web-Based Configuration Interface

To enhance usability, the agent includes an HTML/JavaScript-based graphical user interface (GUI) for configuring the Python script. This configuration tool generates a command-line string that reflects all chosen options, making it easier for users to run the benchmarking script with their desired parameters.

### GUI Capabilities

The HTML/JavaScript GUI allows users to set the following parameters:

1. **Model Settings:**
   - **Model Name** (e.g., `gpt-4`, `gpt-3.5-turbo`)
   - **Temperature** (controls randomness; typically set to `0` for deterministic classification tasks)
   - **Top-p** (nucleus sampling; usually set to `1.0` to allow full token probability mass)
   - **Top-k** (optional: limits output to top-k tokens)

2. **Prompt Behavior:**
   - **Chain of Thought Toggle**: Enables or disables a structured reasoning style in the prompt. When enabled, the prompt encourages the model to explain its reasoning step-by-step before choosing a label.
   - **System Prompt**: Allows the user to set a custom system prompt (e.g., “You are a linguistic classifier that excels at semantic disambiguation.”). This will be inserted in the `system` role field for chat models.

3. **Input & Output:**
   - **Path to Input CSV File**
   - **Path to Ground Truth CSV File** (if separate)
   - **Output File Path** (for results)
   - **Enable Calibration Plot**: Checkbox for toggling calibration output

### Output

The GUI dynamically generates a CLI command (e.g., for `benchmark_agent.py`) such as:

```
python benchmark_agent.py --input data.csv --labels labels.csv --output results.csv \
  --model gpt-4 --temperature 0.0 --top_p 1.0 --top_k 5 \
  --system_prompt "You are a linguistic classifier..." \
  --enable_cot --calibration
```

This command can be copied and executed directly in a terminal to run the Python agent with the selected settings. The GUI is designed to run entirely client-side in the browser and does not require server-side components.