An open leaderboard for the task that matters to a foresight product: generating good "signals of change."
Plug in one OpenRouter key. Get a ranked, four-axis leaderboard across 25+ frontier models, scored over a frozen suite of multi-sector briefs. Re-runnable when new models ship — add the slug, run with --into, publish.
Status: alpha. The methodology is opinionated and intentionally open about its limitations — see Known limitations. Feedback, model additions, and methodology critiques welcome via issues and PRs.
Public leaderboard: signals-strict consumes this data at /benchmark.
Each model is run against a frozen suite of 12 briefs spanning healthcare, fintech, defense, climate, retail, biotech, energy, education, geopolitics, AI infra, mobility, food. Every signal it produces is scored on four orthogonal axes:
| Axis | Weight | What it checks | Method |
|---|---|---|---|
| Verifiability | 0.40 | Is the underlying evidence claim real? | Web-grounded judge classifies into one of 6 buckets: grounded / speculative / future / indicative / dubious / fabricated. The future bucket protects forward-looking foresight from being marked as hallucination. |
| Specificity | 0.30 | Named actors? Concrete events? No hype adjectives? | Separate non-grounded judge against an explicit writing rubric — different vendor from the verifier to reduce judge-family bias. |
| Currency | 0.15 | How recent is the supporting evidence? | Newest source date from the verifier's annotations, passed through a decay curve. |
| Coverage | 0.15 | Breadth across the brief's categories + uniqueness vs. the cohort | Mean of category-balance and unique-share (token-Jaccard < 0.3 vs. every other model's signals on the same brief). |
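The unique-share half of Coverage can be sketched roughly as below. This is a minimal illustration, not the code in src/score.ts: the tokenizer (lowercase alphanumeric words) and the names tokenize, jaccard, and uniqueShare are assumptions. The only detail taken from the table is that a signal counts as unique when its token-Jaccard similarity is below 0.3 against every other model's signals on the same brief.

```typescript
// Hypothetical helpers; the tokenization rule is an assumption for illustration.
function tokenize(text: string): Set<string> {
  const words: string[] = text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  return new Set(words);
}

// Jaccard similarity: |A ∩ B| / |A ∪ B|, defined as 0 for two empty sets.
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 0;
  let shared = 0;
  a.forEach((token) => {
    if (b.has(token)) shared++;
  });
  return shared / (a.size + b.size - shared);
}

// Fraction of `own` signals that stay below the threshold against every cohort signal.
function uniqueShare(own: string[], cohort: string[], threshold = 0.3): number {
  if (own.length === 0) return 0;
  const cohortTokens = cohort.map(tokenize);
  const unique = own.filter((signal) => {
    const tokens = tokenize(signal);
    return cohortTokens.every((other) => jaccard(tokens, other) < threshold);
  });
  return unique.length / own.length;
}
```

Here `cohort` would hold the signals produced by all other models for the same brief; near-duplicates of any of them do not count toward the share.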
The composite is a weighted mean. Per-axis sub-scores are emitted alongside — if you weight things differently, re-compute from the raw JSON without re-running anything.
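Re-weighting is a small offline computation. A sketch, assuming each entry exposes its four sub-scores on a common scale under the field names used here (the AxisScores shape is hypothetical; check leaderboard.json for the actual keys):

```typescript
// Field names are assumptions; the real leaderboard.json keys may differ.
type AxisScores = {
  verifiability: number;
  specificity: number;
  currency: number;
  coverage: number;
};

// Default weights as listed in the axis table.
const DEFAULT_WEIGHTS: AxisScores = {
  verifiability: 0.4,
  specificity: 0.3,
  currency: 0.15,
  coverage: 0.15,
};

// Weighted mean, normalized so custom weights do not have to sum to 1.
function composite(scores: AxisScores, weights: AxisScores = DEFAULT_WEIGHTS): number {
  const axes = Object.keys(weights) as (keyof AxisScores)[];
  const totalWeight = axes.reduce((sum, axis) => sum + weights[axis], 0);
  if (totalWeight <= 0) throw new Error("weights must sum to a positive number");
  const weighted = axes.reduce((sum, axis) => sum + weights[axis] * scores[axis], 0);
  return weighted / totalWeight;
}
```

Passing equal weights, for example, turns the composite into a plain average of the four axes without touching the cached evals.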
Requires Node 22.6+ (uses --experimental-strip-types, no build step).
pnpm install
cp .env.example .env # paste your OpenRouter key

Get a key at https://openrouter.ai/keys. ~$5 of credit is plenty to try things out.
# Tiny smoke run — 2 cheap models × 1 brief, ~$0.20, ~1 min
pnpm bench --preset smoke --briefs healthcare-regulated-ai
# Frontier-only on one brief (~$5, ~5 min)
pnpm bench --tiers frontier --briefs healthcare-regulated-ai
# Full top-25 cohort × all 12 briefs (~$15–25, ~60 min)
pnpm bench --preset top25

Output lands in results/<run-id>/:
- meta.json: run config + cost
- runs/<brief>__<model>.json: raw generations
- evals/<brief>__<model>.json: per-signal verdicts + citations
- leaderboard.md / leaderboard.json: final ranking
When a new model ships:
pnpm bench --models newprovider/new-model --into <existing-run-id>

Reuses every cached run/eval from the existing leaderboard, only generates + judges the new model. Typical cost: $3–5 (vs. $15–25 for a fresh full run).
Killed phase 2? Switch judges mid-run? Use bench:resume:
pnpm bench:resume --run <id> \
--judge google/gemini-2.5-flash:online \
--spec-judge google/gemini-2.5-flash

Skips already-completed (model × brief) pairs. Picks up where it left off.
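Conceptually, resuming is a diff between the requested (model × brief) grid and the eval files already on disk. A minimal sketch under two assumptions that are not confirmed by the runner: that the presence of an eval file means the pair is complete, and that the "/" in a model slug becomes "_" in the filename. The names evalFileName and pendingPairs are hypothetical.

```typescript
type Pair = { brief: string; model: string };

// Assumed filename convention: <brief>__<model>.json, with "/" in the slug replaced by "_".
function evalFileName(pair: Pair): string {
  return `${pair.brief}__${pair.model.split("/").join("_")}.json`;
}

// Returns the pairs that still need work, given the filenames found in evals/.
function pendingPairs(briefs: string[], models: string[], existing: string[]): Pair[] {
  const done = new Set(existing);
  const pending: Pair[] = [];
  for (const brief of briefs) {
    for (const model of models) {
      const pair: Pair = { brief, model };
      if (!done.has(evalFileName(pair))) pending.push(pair);
    }
  }
  return pending;
}
```

In practice `existing` could come from listing results/<run-id>/evals/, for example with fs.readdirSync.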
If you change the weights in src/score.ts, re-score from existing evals — no API calls:
pnpm bench:report --run <id>

pnpm bench:list

Shows all 12 briefs + every model the runner knows about.
A small Next.js app under web/ renders an interactive leaderboard locally — sortable axes, per-brief drill-down, per-signal verdicts + citations. Useful while iterating.
cd web && npm install && npm run dev # http://localhost:3030

Reads from ../results/ directly. Zero API calls of its own. Public consumers (e.g. signals-strict) generally consume the published JSON instead.
The dominant cost is the web-grounded verifier. Empirical pricing across providers (per call):
| Verifier | $/call | Notes |
|---|---|---|
| openai/gpt-5.4:online | ~$0.040 | Most expensive; highest quality |
| google/gemini-2.5-pro:online | ~$0.015 | |
| google/gemini-2.5-flash:online | ~$0.005 | Default; reliable JSON, good cost/quality |
| perplexity/sonar-pro | ~$0.008 | Web-grounded by design |
| perplexity/sonar | ~$0.003 | Cheapest, smaller model |
Each (model × brief) pair runs ~16 signals × 2 judge calls. Estimate: top-25 × 12 briefs with Gemini Flash judges ≈ $15–25 actual billed. Same run with gpt-5.4:online verifier would be 4× more.
The runner prints billed-so-far: $X.XX on every progress line — read from OpenRouter's /auth/key usage delta, so it matches your dashboard exactly. Trust that number over the per-call usage.cost sums.
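The usage-delta idea can be sketched as follows. This is an illustration rather than the code in src/openrouter.ts: the full endpoint URL and the `{ data: { usage } }` response shape are assumptions based on OpenRouter's key-info endpoint, and the function names are hypothetical.

```typescript
// Assumed response shape: { data: { usage: number } }, with usage in USD credits.
async function fetchKeyUsage(apiKey: string): Promise<number> {
  const res = await fetch("https://openrouter.ai/api/v1/auth/key", {
    headers: { Authorization: `Bearer ${apiKey}` },
  });
  if (!res.ok) throw new Error(`key probe failed: HTTP ${res.status}`);
  const body = (await res.json()) as { data: { usage: number } };
  return body.data.usage;
}

// Billed-so-far is the current reading minus the reading taken when the run started.
function billedSoFar(usageAtStart: number, usageNow: number): string {
  const delta = Math.max(0, usageNow - usageAtStart);
  return `billed-so-far: $${delta.toFixed(2)}`;
}
```

One consequence of measuring a delta on the key itself: any other traffic billed to the same key during the run is counted too, so a dedicated key gives the cleanest number.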
To ship a run to a public site (e.g. the signals-strict /benchmark page):
pnpm bench:publish # latest run, today's date
pnpm bench:publish --dry-run # see what it would do
pnpm bench:publish --date 2026-08-15 # custom date for the filename

The publisher:
- Copies results/<run>/leaderboard.json → ../signals-strict/public/benchmark/<date>.json
- Writes per-model detail files (signals + verdicts + citations) → ../signals-strict/public/benchmark/<date>/<vendor>_<model>.json
- Rewrites CURRENT_BENCHMARK_FILE in the consumer's page so the new run goes live on next deploy
Override the target with --dest /path/to/some/other/repo if you're publishing elsewhere.
Stated up front because the methodology is opinionated:
- No accepted market equivalent. The weights and rubric are ours, derived from Signals' product purpose (foresight → substance > style > recency ≈ breadth). Per-axis sub-scores ship raw so consumers can re-weight.
- Judges have biases. Same-vendor models can mildly self-favor. We use different vendors for verifier + specificity, and the judges-used are recorded in every leaderboard.
- Frozen briefs can be gamed. Once the briefs are public, models could in principle be tuned to them. Briefs are versioned via BRIEFS_VERSION; cross-version comparisons aren't valid.
- The future verdict is judgement-heavy. Forward-looking signals can't be web-verified, so the dedicated bucket is scored mid-range and relies on the judge's plausibility call.
- Maturation needs volume. A single run is one data point. Confidence comes from many dated runs across many judges over time.
src/
├── cli.ts # run | resume | report | list subcommands
├── publish.ts # bench:publish — copies to a consumer repo
├── briefs.ts # 12 frozen industry briefs
├── models.ts # OpenRouter slugs + tier metadata
├── presets.ts # named cohorts (top25, frontier, smoke)
├── generate.ts # phase 1 — model produces signals
├── evaluate.ts # phase 2 — judges score each signal
├── score.ts # phase 3 — composite + coverage
├── cache.ts # exact + embedding-based semantic cache
├── openrouter.ts # thin caller, embeddings, /auth/key probe
├── env.ts # .env loader (no dotenv dep)
├── report.ts # markdown leaderboard renderer
└── types.ts
results/ # gitignored — raw per-run outputs
cache/ # gitignored — semantic eval cache
web/ # optional local Next.js viewer
Two clean dependencies: p-map (concurrency) and picocolors (CLI output). Everything else is standard Node 22.
Adding a new model, proposing a new brief, or critiquing the methodology — see CONTRIBUTING.md.
Briefly: the most common contribution is dropping a new model slug into src/models.ts and src/presets.ts, then running pnpm bench --into <latest-run> to add it to the existing leaderboard cheaply.
MIT — © Envisioning. Use, fork, re-rubric, run on your own briefs. If you publish a derivative leaderboard we'd appreciate a citation back to this repo, but it's not required.