# Metrics Dashboard

The static dashboard in `web/` is for exploring `*_metrics.json` artifacts produced by benchmark runs. It is read-only: it does not run benchmarks, edit results, or need a backend service of its own.

## What It Includes

The dashboard loads metrics files and derives a run catalogue with:

- task, model, provider, timestamp, and tags
- accuracy, Cohen's Kappa, macro F1, macro precision, macro recall, and calibration metrics
- repeat-run Krippendorff's alpha when an agreement summary is available
- token and request summaries
- estimated cost and pricing metadata when available
- links back to sibling artifacts such as the heatmap, calibration chart, prompt log, output CSV, input CSV, and raw metrics JSON

The main screen includes:

- a filter sidebar for task, model, tags, time range, and missing-accuracy filtering
- KPI cards for total runs, total tasks, best accuracy, and total requests
- a leaderboard area with multiple views
- an agreement area for repeated-run and cross-model alpha
- a prompt token profile panel
- a runs table
- a run-detail modal with links and previews

## Starting The Dashboard

### Local-Only Mode

Use this when you open the dashboard directly from disk with `file://`.

1. Put your `*_metrics.json` files under `data/metrics/`.
2. Open `web/index.html` in a browser.
3. Click `Open Metrics Folder` and choose the folder that contains the metrics files.

In `file://` mode the dashboard cannot auto-scan local folders, so the browser requires this one manual folder-selection step.

### Server Mode

From the repository root:

```bash
python -m http.server 8000
```

Then open `http://localhost:8000/web/`.

In server mode the `Auto (Server)` source attempts, in order:

1. `web/metrics-manifest.json`
2. fallback directory discovery from `../data/metrics/`

The `Reload` button refreshes the current source.

When `data/metrics/agreement_summary.json` is present, the dashboard loads it alongside the run metrics and enables the Agreement tab inside `Leaderboard & Agreement`. When `data/metrics/agreement_clusters.json` is also present, the Agreement tab can render same-model and cross-model similarity trees.

Pricing metadata for the scatterplot and run details is loaded from `web/config_prices.js`. If you deploy the dashboard under a rewritten root, or in any setup that exposes only `web/`, make sure that file is published there as well. The price update flow writes a mirrored dashboard copy automatically when it generates the root `config_prices.js`.

### Manifest Notes

`web/generate_metrics_manifest.py` is optional but useful for larger collections. A minimal hand-written sketch of the manifest shape appears below, after the source-status notes.

- `metrics_files` is authoritative when present
- if a listed file 404s, the dashboard retries by filename in common metrics directories
- `metrics_base_dirs` can extend those retry directories
- if manifest loading fails, the dashboard falls back to directory discovery when possible

## Loading Modes And Source Status

The header source panel exposes:

- `Auto (Server)`: load from server-hosted metrics discovery
- `Open Metrics Folder`: choose a local folder through the browser file-system picker
- `Reload`: reload the active source

The status line reports:

- the current mode, such as `server` or `folder`
- the number of loaded files
- the warning count

Warnings are also summarized below the status line so you can spot malformed or skipped files quickly.
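If you want to see the manifest shape concretely, the sketch below hand-builds a minimal `web/metrics-manifest.json` using only the two keys documented under Manifest Notes. It is illustrative, not authoritative: the real schema is whatever `web/generate_metrics_manifest.py` emits, and the assumption that file paths are written relative to `web/` may not match the generator's output.

```python
# Illustrative sketch only: hand-builds a minimal web/metrics-manifest.json.
# The authoritative schema comes from web/generate_metrics_manifest.py; only
# the two documented keys are used here, and writing paths relative to web/
# is an assumption.
import json
from pathlib import Path

repo_root = Path(".")  # run from the repository root
metrics_dir = repo_root / "data" / "metrics"

manifest = {
    # Authoritative list of metrics files when present.
    "metrics_files": sorted(
        f"../data/metrics/{p.name}" for p in metrics_dir.glob("*_metrics.json")
    ),
    # Extra directories the dashboard may retry by filename after a 404.
    "metrics_base_dirs": ["../data/metrics/"],
}

out_path = repo_root / "web" / "metrics-manifest.json"
out_path.write_text(json.dumps(manifest, indent=2) + "\n")
print(f"Wrote {len(manifest['metrics_files'])} entries to {out_path}")
```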
## Navigating The Dashboard

The dashboard is built around a simple loop: narrow the run set in the sidebar, inspect the summary cards, compare runs in the leaderboard, and open individual runs for deeper details.

## Filters

The left sidebar controls the active subset of runs:

- `Task`: multi-select task filter
- `Model`: multi-select model filter
- `Time Ranges (OR)`: one or more timestamp windows
- `Hide runs without accuracy`: hide runs that do not expose an accuracy metric
- `Tags`: clickable chips derived from semicolon-delimited run tags

Useful behavior:

- desktop multi-select supports Ctrl/Cmd and Shift
- time ranges are additive, not exclusive
- `Reset All Filters` clears the current selection
- the sidebar can be collapsed, reopened, and used from the mobile filter drawer

The dashboard persists most UI state in browser storage, including filters, selected tab, grouping mode, and theme.

## KPIs

The KPI strip gives a fast summary of the current filtered view:

- `Total Runs`
- `Total Tasks`
- `Best Accuracy`
- `Total Requests`

These values update immediately when filters change.

## Leaderboard

The leaderboard is the main analysis area. `Main Metric` changes the ranking basis for the views below:

- Accuracy
- Cohen's Kappa
- Macro F1
- Macro Precision
- Macro Recall
- Calibration ECE

Available tabs:

- `Chart`: ranked bars for the current metric
- `Scatter`: either metric vs price or metric vs time
- `Table`: sortable metric table
- `Radar`: model profiles across tasks or tags

### Chart Tab

The chart tab shows ranked runs or grouped summaries.

- `Group By` supports `None`, `Model`, and `Task`
- grouped rows show averages; when grouping by model or task, repeated runs are averaged within each task/model first so that each task/model contributes once to the grouped summary (a small sketch of this rule follows the Table Tab notes)
- a `TOP` badge marks the best individual run for the current metric when that distinction is relevant
- `Best run per task` switches to a compact task-leader view
- clicking a row opens the run-detail modal

For metrics where lower is better, such as calibration error, the dashboard labels that explicitly. For accuracy-like metrics, it can also draw approximate 95% confidence intervals derived from the evaluated sample size.

### Scatter Tab

The scatter tab has two x-axis modes:

- `Price`: compare the current metric against estimated total cost or average cost per prediction
- `Time`: compare the current metric over run timestamps

Shared controls include:

- `Group By` for `None`, `Model`, and `Task`
- grouped price points use the same balanced averaging rule as the chart tab when grouping by model or task
- a `CI` toggle when the metric supports approximate confidence intervals
- a `Labels` toggle for point labels
- `Reset Zoom`

You can zoom by dragging a selection over the plotted area. Clicking a point opens the run-detail modal.

### Table Tab

The leaderboard table gives a dense, sortable comparison across runs.

- click a column header to sort
- highlighted cells mark the preferred value for that metric in the current selection
- `Repeat α` is filled only for runs that belong to a repeated same-model agreement group
- row labels may include selected tag badges
- clicking a row opens the run-detail modal

On narrow screens or with wide metric sets, the table can be scrolled horizontally.
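To make the grouped-average and confidence-interval behaviour above concrete, here is a small standalone sketch. It is not the dashboard's own code (that lives in the `web/` JavaScript); it only restates the two documented rules, and the normal-approximation interval is an assumption about how the approximate 95% bands are computed.

```python
# Toy sketch of the grouping and CI rules described above; illustrative only,
# not the dashboard's actual implementation.
from collections import defaultdict
from math import sqrt
from statistics import mean

runs = [
    # (model, task, accuracy, n_evaluated) -- hypothetical values
    ("model-a", "task-1", 0.82, 200),
    ("model-a", "task-1", 0.78, 200),   # repeat run on the same task
    ("model-a", "task-2", 0.91, 150),
    ("model-b", "task-1", 0.74, 200),
]

def grouped_accuracy_by_model(runs):
    """Average repeats within each (model, task) first, then average tasks."""
    per_task = defaultdict(list)
    for model, task, acc, _ in runs:
        per_task[(model, task)].append(acc)
    per_model = defaultdict(list)
    for (model, _task), accs in per_task.items():
        per_model[model].append(mean(accs))   # one value per task per model
    return {model: mean(vals) for model, vals in per_model.items()}

def approx_ci95(acc, n):
    """Normal-approximation 95% interval for a proportion (assumed here)."""
    half = 1.96 * sqrt(acc * (1 - acc) / n)
    return max(0.0, acc - half), min(1.0, acc + half)

print(grouped_accuracy_by_model(runs))   # {'model-a': 0.855, 'model-b': 0.74}
print(approx_ci95(0.82, 200))            # roughly (0.767, 0.873)
```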
## Agreement

The `Agreement` tab reads the precomputed `agreement_summary.json` artifact.

- the `Agreement` switch selects `Same model` or `Cross-Model`
- `Same model` shows Krippendorff's alpha across repeated runs of one provider/model on the same comparable task variant
- `Cross-Model` shows Krippendorff's alpha across one representative run per provider/model on the same comparable task variant
- `Compare by` appears only in `Cross-Model` mode and switches between the `Latest` and `Best Accuracy` representative policies
- the tab only shows groups fully represented inside the current filter, so restrictive time/model filters can hide otherwise valid agreement groups
- when `agreement_clusters.json` is present, both `Same model` and `Cross-Model` also show similarity trees built from pairwise disagreement distances (a toy sketch of this distance appears at the end of this document)
- same-model trees cluster repeated runs of one provider/model on the same comparable task variant
- cross-model trees cluster one representative run per provider/model; they are recomputed in the browser for the currently visible representative models, so filters can redraw the clustering even when the full-group alpha row is hidden

### Radar Tab

The radar view compares model profiles across multiple tasks or tags.

- `Group By` becomes an axis selector: `Task` or `Tag`
- the chart plots average metric values for each model across the selected axes
- `Scale` switches between `Linear` and `Contrast`
- at least three axes are required to render the radar
- when many models are present, the dashboard shows the top subset first and lets you load more

For lower-is-better metrics, smaller shapes represent better values.

## Prompt Token Profile

The `Prompt Token Profile` panel shows average tokens per prediction by run, split into:

- input
- cached input
- output
- thinking

Each row also shows the prediction count and estimated cost. Clicking a row opens the corresponding run-detail modal.

## Runs Table

The runs table is the quickest raw list view. It includes:

- task
- model
- timestamp
- accuracy
- Cohen's Kappa
- macro F1
- calibration ECE
- requests
- cached input tokens
- source filename

Clicking a row opens the run-detail modal.

## Run Detail Modal

The modal opens from leaderboard rows, scatter points, token-profile rows, and table rows. It includes:

- run metadata such as task, model, provider, tags, timestamp, and reasoning settings
- metric values and sample counts
- token usage, runtime, pricing, and request totals
- links to the metrics JSON, heatmap, calibration plot, log file, output CSV, and input CSV
- chart previews for heatmap and calibration artifacts when present
- an expandable raw JSON view of the loaded metrics file

## Tips

- If `file://` mode appears empty, use `Open Metrics Folder`; auto-loading only works in server mode.
- If a run is missing from the dashboard, check the warning summary first.
- Use tags plus time ranges together when comparing experimental slices.
- Switch `Main Metric` before interpreting rankings; the same filtered dataset can look very different by accuracy versus calibration error.
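For readers curious about the pairwise disagreement distances behind the similarity trees in the Agreement tab, the toy sketch below shows one way such a distance can be defined. The real inputs and clustering live in `agreement_clusters.json` and the dashboard's in-browser code; the assumption here is simply that each run reduces to per-item labels on the same items and that the distance between two runs is the fraction of items on which they disagree.

```python
# Toy sketch: pairwise disagreement distance between runs, illustrative only.
# Assumes each run is a list of per-item labels aligned on the same items.
from itertools import combinations

runs = {
    # hypothetical per-item labels for three runs on the same task variant
    "model-a#run1": ["yes", "no", "no", "yes", "yes"],
    "model-a#run2": ["yes", "no", "yes", "yes", "yes"],
    "model-b#run1": ["no",  "no", "yes", "yes", "no"],
}

def disagreement(labels_a, labels_b):
    """Fraction of items on which the two runs assign different labels."""
    assert len(labels_a) == len(labels_b)
    diffs = sum(a != b for a, b in zip(labels_a, labels_b))
    return diffs / len(labels_a)

# Symmetric pairwise distances; feeding them to a standard hierarchical
# clustering routine is one way to draw this kind of similarity tree.
for (name_a, a), (name_b, b) in combinations(runs.items(), 2):
    print(f"{name_a} vs {name_b}: {disagreement(a, b):.2f}")
```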