signals-benchmark

An open leaderboard for the task that matters to a foresight product: generating good "signals of change."

Plug in one OpenRouter key. Get a ranked, four-axis leaderboard across 25+ frontier models, scored over a frozen suite of multi-sector briefs. Re-runnable when new models ship — add the slug, run with --into, publish.

Status: alpha. The methodology is opinionated and intentionally open about its limitations — see Known limitations. Feedback, model additions, and methodology critiques welcome via issues and PRs.

Public leaderboard: signals-strict consumes this data at /benchmark.

What it measures

Each model is run against a frozen suite of 12 briefs spanning healthcare, fintech, defense, climate, retail, biotech, energy, education, geopolitics, AI infra, mobility, food. Every signal it produces is scored on four orthogonal axes:

Axis	Weight	What it checks	Method
Verifiability	0.40	Is the underlying evidence claim real?	Web-grounded judge classifies into one of 6 buckets: grounded / speculative / future / indicative / dubious / fabricated. The `future` bucket protects forward-looking foresight from being marked as hallucination.
Specificity	0.30	Named actors? Concrete events? No hype adjectives?	Separate non-grounded judge against an explicit writing rubric — different vendor from the verifier to reduce judge-family bias.
Currency	0.15	How recent is the supporting evidence?	Newest source date from the verifier's annotations, passed through a decay curve.
Coverage	0.15	Breadth across the brief's categories + uniqueness vs. the cohort	Mean of category-balance and unique-share (token-Jaccard < 0.3 vs. every other model's signals on the same brief).
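The unique-share half of Coverage can be illustrated with a small sketch. The tokenizer and the function names below are assumptions made for illustration; only the token-Jaccard measure and the 0.3 threshold come from the table above.

```typescript
// Lowercase word tokens; the real tokenizer is not documented here (assumption).
function tokenize(text: string): Set<string> {
  return new Set(text.toLowerCase().match(/[a-z0-9]+/g) ?? []);
}

// Jaccard similarity |A ∩ B| / |A ∪ B|; two empty sets count as identical.
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 1;
  let intersection = 0;
  for (const token of a) if (b.has(token)) intersection++;
  return intersection / (a.size + b.size - intersection);
}

// Fraction of a model's signals whose similarity to every cohort signal
// on the same brief stays below the threshold.
function uniqueShare(own: string[], cohort: string[], threshold = 0.3): number {
  if (own.length === 0) return 0;
  const cohortTokens = cohort.map(tokenize);
  const unique = own.filter((signal) => {
    const tokens = tokenize(signal);
    return cohortTokens.every((other) => jaccard(tokens, other) < threshold);
  });
  return unique.length / own.length;
}
```

Here `cohort` stands for the signals produced by every other model on the same brief; a signal counts as unique only if it clears the threshold against all of them.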

The composite is a weighted mean. Per-axis sub-scores are emitted alongside — if you weight things differently, re-compute from the raw JSON without re-running anything.
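As a rough sketch of re-weighting, the composite can be recomputed from the four sub-scores like this. The `AxisScores` type and field names are hypothetical and may not match the raw JSON; the default weights are the ones from the table above.

```typescript
// Hypothetical shape for per-axis sub-scores, each assumed to lie in [0, 1].
type AxisScores = {
  verifiability: number;
  specificity: number;
  currency: number;
  coverage: number;
};

// Default weights taken from the axis table.
const DEFAULT_WEIGHTS: AxisScores = {
  verifiability: 0.4,
  specificity: 0.3,
  currency: 0.15,
  coverage: 0.15,
};

// Weighted mean; dividing by the weight total keeps the result in range
// even when custom weights do not sum to 1.
function composite(scores: AxisScores, weights: AxisScores = DEFAULT_WEIGHTS): number {
  const axes = Object.keys(weights) as (keyof AxisScores)[];
  const totalWeight = axes.reduce((sum, axis) => sum + weights[axis], 0);
  const weighted = axes.reduce((sum, axis) => sum + weights[axis] * scores[axis], 0);
  return weighted / totalWeight;
}
```

Passing a different `weights` object gives an alternative ranking without touching the runner.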

Quickstart

Requires Node 22.6+ (uses --experimental-strip-types, no build step).

pnpm install
cp .env.example .env       # paste your OpenRouter key

Get a key at https://openrouter.ai/keys. ~$5 of credit is plenty to try things out.

# Tiny smoke run — 2 cheap models × 1 brief, ~$0.20, ~1 min
pnpm bench --preset smoke --briefs healthcare-regulated-ai

# Frontier-only on one brief (~$5, ~5 min)
pnpm bench --tiers frontier --briefs healthcare-regulated-ai

# Full top-25 cohort × all 12 briefs (~$15–25, ~60 min)
pnpm bench --preset top25

Output lands in results/<run-id>/:

meta.json — run config + cost
runs/<brief>__<model>.json — raw generations
evals/<brief>__<model>.json — per-signal verdicts + citations
leaderboard.md / leaderboard.json — final ranking
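For scripting against a run directory, the `<brief>__<model>.json` naming can be split back into its parts. This is a minimal sketch; `parseResultFile` is a hypothetical helper, and how model slugs containing `/` are encoded in file names is not covered here.

```typescript
// Split "<brief>__<model>.json" at the first double underscore.
// Returns null for names that do not follow the pattern.
function parseResultFile(fileName: string): { brief: string; model: string } | null {
  if (!fileName.endsWith(".json")) return null;
  const stem = fileName.slice(0, -".json".length);
  const sep = stem.indexOf("__");
  if (sep <= 0 || sep + 2 >= stem.length) return null;
  return { brief: stem.slice(0, sep), model: stem.slice(sep + 2) };
}
```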

Common workflows

Add a new model to an existing run (cheap!)

When a new model ships:

pnpm bench --models newprovider/new-model --into <existing-run-id>

This reuses every cached run and eval from the existing leaderboard and only generates and judges the new model. Typical cost: $3–5 (vs. $15–25 for a fresh full run).

Resume an interrupted run

Killed phase 2? Switch judges mid-run? Use bench:resume:

pnpm bench:resume --run <id> \
  --judge google/gemini-2.5-flash:online \
  --spec-judge google/gemini-2.5-flash

Skips already-completed (model × brief) pairs. Picks up where it left off.
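The skip logic can be pictured as a set difference over (model × brief) pairs. The sketch below is illustrative only: `pendingPairs` and the `<brief>__<model>` key format are assumptions borrowed from the output file naming, not the runner's actual code.

```typescript
type Pair = { brief: string; model: string };

// Return every (brief, model) pair that has no completed entry yet.
// `completed` holds keys of the form "<brief>__<model>" (assumed format).
function pendingPairs(briefs: string[], models: string[], completed: Set<string>): Pair[] {
  const pending: Pair[] = [];
  for (const brief of briefs) {
    for (const model of models) {
      if (!completed.has(`${brief}__${model}`)) pending.push({ brief, model });
    }
  }
  return pending;
}
```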

Re-render after tuning the composite

If you change the weights in src/score.ts, re-score from existing evals — no API calls:

pnpm bench:report --run <id>

Inspect the catalog

pnpm bench:list

Shows all 12 briefs + every model the runner knows about.

Web viewer (optional, local)

A small Next.js app under web/ renders an interactive leaderboard locally — sortable axes, per-brief drill-down, per-signal verdicts + citations. Useful while iterating.

cd web && npm install && npm run dev      # http://localhost:3030

Reads from ../results/ directly. Zero API calls of its own. Public consumers (e.g. signals-strict) generally consume the published JSON instead.

Cost discipline

The dominant cost is the web-grounded verifier. Empirical pricing across providers (per call):

Verifier	$/call	Notes
`openai/gpt-5.4:online`	~$0.040	Most expensive; highest quality
`google/gemini-2.5-pro:online`	~$0.015
`google/gemini-2.5-flash:online`	~$0.005	Default; reliable JSON, good cost/quality
`perplexity/sonar-pro`	~$0.008	Web-grounded by design
`perplexity/sonar`	~$0.003	Cheapest, smaller model

Each (model × brief) pair runs ~16 signals × 2 judge calls. Estimate: top-25 × 12 briefs with Gemini Flash judges ≈ $15–25 actual billed. The same run with a gpt-5.4:online verifier would cost roughly 4× more.
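A back-of-the-envelope check of the verifier line item, assuming one web-grounded verifier call per signal (the other judge call being the specificity judge) and the per-call prices from the table. The function name is hypothetical, and the figures ignore generation cost, the specificity judge, and any cache hits.

```typescript
// Verifier-only estimate: pairs × signals per pair × price per verifier call.
function estimateVerifierCost(
  models: number,
  briefs: number,
  signalsPerPair: number,
  pricePerCall: number,
): number {
  return models * briefs * signalsPerPair * pricePerCall;
}

// Top-25 cohort × 12 briefs × ~16 signals, at the listed per-call prices.
const flashEstimate = estimateVerifierCost(25, 12, 16, 0.005);
const gptEstimate = estimateVerifierCost(25, 12, 16, 0.04);
console.log(`flash verifier: ~$${flashEstimate.toFixed(2)}`);
console.log(`gpt-5.4 verifier: ~$${gptEstimate.toFixed(2)}`);
```

The verifier line item scales with the per-call price ratio, while generation and specificity-judge costs stay fixed, so the multiple for the whole run is smaller than the ratio between the two verifier prices.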

The runner prints billed-so-far: $X.XX on every progress line — read from OpenRouter's /auth/key usage delta, so it matches your dashboard exactly. Trust that number over the per-call usage.cost sums.
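A usage delta of this kind might be computed as sketched below. The full URL and the `{ data: { usage } }` response shape are assumptions about OpenRouter's key endpoint, and both function names are hypothetical; the runner's own probe lives in src/openrouter.ts and may differ.

```typescript
// Read cumulative usage (in USD) for the key.
// Assumed response shape: { data: { usage: number } }.
async function fetchUsage(apiKey: string): Promise<number> {
  const res = await fetch("https://openrouter.ai/api/v1/auth/key", {
    headers: { Authorization: `Bearer ${apiKey}` },
  });
  if (!res.ok) throw new Error(`auth/key probe failed: ${res.status}`);
  const body = (await res.json()) as { data: { usage: number } };
  return body.data.usage;
}

// Format the difference between the reading taken at run start and the latest one.
function billedSoFar(usageAtStart: number, usageNow: number): string {
  const delta = Math.max(0, usageNow - usageAtStart);
  return `billed-so-far: $${delta.toFixed(2)}`;
}
```

Taking one reading before the first call and subtracting it from each later reading gives the spend attributable to the run, independent of per-call cost fields.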

Publishing results

To ship a run to a public site (e.g. the signals-strict /benchmark page):

pnpm bench:publish                  # latest run, today's date
pnpm bench:publish --dry-run        # see what it would do
pnpm bench:publish --date 2026-08-15 # custom date for the filename

The publisher:

Copies results/<run>/leaderboard.json → ../signals-strict/public/benchmark/<date>.json
Writes per-model detail files (signals + verdicts + citations) → ../signals-strict/public/benchmark/<date>/<vendor>_<model>.json
Rewrites CURRENT_BENCHMARK_FILE in the consumer's page so the new run goes live on next deploy
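The destination layout above can be sketched as a pure path computation. `publishTargets` is a hypothetical helper, and mapping a slug's `/` to `_` is an assumption inferred from the `<vendor>_<model>.json` pattern.

```typescript
// Compute where a run's files would land under a consumer repo.
function publishTargets(dest: string, date: string, modelSlugs: string[]) {
  const base = `${dest}/public/benchmark`;
  return {
    // Single leaderboard file named after the publish date.
    leaderboard: `${base}/${date}.json`,
    // One detail file per model, with "/" in the slug replaced by "_" (assumption).
    details: modelSlugs.map((slug) => `${base}/${date}/${slug.replace(/\//g, "_")}.json`),
  };
}
```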

Override the target with --dest /path/to/some/other/repo if you're publishing elsewhere.

Known limitations

Stated up front because the methodology is opinionated:

No accepted market equivalent. The weights and rubric are ours, derived from Signals' product purpose (foresight → substance > style > recency ≈ breadth). Per-axis sub-scores ship raw so consumers can re-weight.
Judges have biases. Same-vendor models can mildly self-favor. We use different vendors for the verifier and the specificity judge, and the judges used are recorded in every leaderboard.
Frozen briefs can be gamed. Once the briefs are public, models could in principle be tuned to them. Briefs are versioned via BRIEFS_VERSION; cross-version comparisons aren't valid.
The future verdict is judgement-heavy. Forward-looking signals can't be web-verified, so they get a dedicated bucket, scored mid-range, that relies on the judge's plausibility call.
Maturation needs volume. A single run is one data point. Confidence comes from many dated runs across many judges over time.

Architecture (1-pager)

src/
├── cli.ts             # run | resume | report | list subcommands
├── publish.ts         # bench:publish — copies to a consumer repo
├── briefs.ts          # 12 frozen industry briefs
├── models.ts          # OpenRouter slugs + tier metadata
├── presets.ts         # named cohorts (top25, frontier, smoke)
├── generate.ts        # phase 1 — model produces signals
├── evaluate.ts        # phase 2 — judges score each signal
├── score.ts           # phase 3 — composite + coverage
├── cache.ts           # exact + embedding-based semantic cache
├── openrouter.ts      # thin caller, embeddings, /auth/key probe
├── env.ts             # .env loader (no dotenv dep)
├── report.ts          # markdown leaderboard renderer
└── types.ts
results/               # gitignored — raw per-run outputs
cache/                 # gitignored — semantic eval cache
web/                   # optional local Next.js viewer

Two clean dependencies: p-map (concurrency) and picocolors (CLI output). Everything else is standard Node 22.
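For readers unfamiliar with the idea behind an embedding-based semantic cache (cache.ts in the tree above), here is a minimal sketch of a similarity lookup. The `CacheEntry` type, the `semanticLookup` function, and the 0.95 threshold are all illustrative assumptions, not the repo's implementation.

```typescript
type CacheEntry<T> = { embedding: number[]; value: T };

// Cosine similarity of two equal-length vectors; zero vectors score 0.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the cached value whose embedding is closest to the query,
// provided it clears the similarity threshold; otherwise undefined.
function semanticLookup<T>(
  query: number[],
  entries: CacheEntry<T>[],
  minSimilarity = 0.95,
): T | undefined {
  let best: CacheEntry<T> | undefined;
  let bestScore = minSimilarity;
  for (const entry of entries) {
    const score = cosine(query, entry.embedding);
    if (score >= bestScore) {
      best = entry;
      bestScore = score;
    }
  }
  return best?.value;
}
```

An exact-match check on a hash of the input would normally run first, with the embedding lookup as a fallback for near-duplicate signals.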

Contributing

Adding a new model, proposing a new brief, or critiquing the methodology — see CONTRIBUTING.md.

Briefly: the most common contribution is dropping a new model slug into src/models.ts and src/presets.ts, then running pnpm bench --into <latest-run> to add it to the existing leaderboard cheaply.

License

MIT — © Envisioning. Use, fork, re-rubric, run on your own briefs. If you publish a derivative leaderboard we'd appreciate a citation back to this repo, but it's not required.
