Fix all findings from cross-model review (bugs + consistency + defense-in-depth) by kimmingul · Pull Request #7 · kimmingul/samplesize-copilot

kimmingul · 2026-05-28T14:26:45Z

Summary

Resolves every finding from the three-reviewer audit (Claude code-reviewer + Claude security-reviewer + Codex 2nd-opinion) plus a follow-up independent review pass that caught a HIGH that the first wave missed. 18 source files modified, +341/-52 lines.

Critical bugs

Area	Was	Now
`report --kind power-curve`	silent NaN curve for any two-sample method	derives sweep key from registry signature; real curve
`report --kind r-code` / `sas-code`	`TypeError` on CI methods (`achieved_power=None`)	emits `N/A` cleanly
`report --kind protocol` / `grant`	`KeyError`/crash; literal `"None"`/`"α"` placeholders for CI methods	localized `"n/a"` / `"해당 없음"` fallback for power/alpha/target_power
`report --kind sensitivity`	wrong solve mode when source audit was solve-for-power	drops n/n1/n2/power from base before sweeping

High — `decision_tree.yaml` no-fallback rule (per CLAUDE.md)

The prior fix that replaced logrank_kgroup with logrank_freedman violated the explicit "No fallback heuristics" rule in CLAUDE.md by silently downgrading k-group survival to a 2-group calculator. The same class of issue affected the count and ordinal branches (no group-count question; any group count silently routed to a single specific method).

Introduces an unimplemented: terminal marker that samplesize doctor accepts as valid. K-group survival, k-group Poisson, and one-sample ordinal now terminate with an explicit reason instead of routing to an approximate method. count and ordinal gained the missing group-count question; count → two routes to the (already-validated) tests_two_poisson_means; ordinal → three_or_more routes to kruskal_wallis_simulation.

High — Plugin defense-in-depth

samplesize-calculate and samplesize-validate SKILLs now ship explicit "shell-injection defense" preambles instructing the LLM to:

registry-validate method_id against ^[a-z][a-z0-9_]*$
prefer --json-args-file <tmpfile> (new CLI flag) over inline --json-args <json>
never pass raw user text as a shell argument

Backed at the CLI by: --json-args-file, kwarg whitelist against the resolved methods signature, and _resolve_callable module-prefix allowlist (samplesize.tests.).

New `samplesize doctor` checks (11/11; was 9/9)

registry.decision-tree-leaves-exist — every leaf: in decision_tree.yaml is a registered method id (or a deliberate unimplemented: marker).
plugin.skill-kind-choices-exist — every --kind <name> referenced in plugin/*.md is in cmd_reports argparse choices.
The existing plugin.skill-cli-flags-exist now also scans docs/**/*.md and README.md.

Other fixes

reference_intervals_clinical_lab: linear N-search → binary search (5.5s → 0.004s worst case).
audit.py: microsecond suffix + method_id filename sanitization.
gen_method_coverage.py --check: substring → word-boundary regex.
grant_aims now honors --lang.
scipy>=1.17 floor (Anderson-Darling needs method="interpolate").
randomized_block_anova / reference_intervals_clinical_lab inputs_echo adds solve_for.
.github/workflows/release.yml: new verify job (doctor + coverage --check + pytest + ruff) gates build/publish so a v* tag on an unchecked commit cannot ship.

Verification

Full pytest: 1303 passed, 232 skipped, 0 failed.
samplesize doctor: 11/11.
gen_method_coverage.py --check: up to date (234 methods, 819 fixtures).
ruff check samplesize/ tests/ scripts/: clean.
Independent code-review pass (oh-my-claudecode:code-reviewer, opus): all HIGH/MED findings from the first pass either addressed or surfaced as trade-offs.
End-to-end repros confirmed: CI-method protocol text now reads "actual power of n/a" (not crash, not "0.0000", not literal "None"); two-sample power-curve now produces a real curve.

Process notes

4 parallel executor agents (opus) for the initial fix wave with non-overlapping file scopes.
1 follow-up executor (opus) for the reviewer-surfaced issues (the rec["result"] audit shape + None handling for protocol templates + unimplemented markers + audit docstring + doctor regex comment).
Separate code-reviewer (opus) pass between the two waves (writer/reviewer separation per OMC).

🤖 Generated with Claude Code

…e-in-depth) ## Critical bugs fixed - reporting/plots.py: power_curve no longer produces silent NaN curves for two-sample methods (sweep_key now derived from registry signature, not a non-existent `params` field). - reporting/code_export.py: R/SAS export no longer TypeError-crashes on `achieved_power=None` for CI methods; emits "N/A" cleanly. - reporting/protocol.py + templates: protocol/grant text for CI methods no longer crashes and no longer emits literal "None"/"α" placeholders; localized "n/a" / "해당 없음" fallback for power/alpha/target_power. - cli.py cmd_report sensitivity: drops n/n1/n2/power from base before sweeping (was conflicting with solve_for=power audits). ## High (per CLAUDE.md "no fallback heuristics") - registry/decision_tree.yaml: introduces `unimplemented:` markers; k-group survival, k-group Poisson, one-sample ordinal now terminate explicitly instead of silently downgrading. Added missing group-count question to count and ordinal branches. - plugin/skills/samplesize-{calculate,validate}/SKILL.md: added shell-injection defense-in-depth preambles (registry-validated method_id, prefer --json-args-file). - plugin/skills/samplesize-report/SKILL.md: now enumerates all 6 --kind choices (was missing sensitivity, r-code, sas-code). ## Medium - tests/reference_intervals.py: linear N-search -> binary search (5.5s -> 0.004s on worst case). - reporting/audit.py: microsecond suffix + sanitized method_id in filename (no same-second collisions, hardened against path traversal). - cli.py cmd_calc: --json-args-file flag, kwarg whitelist against signature, JSONDecodeError handling. - registry/__init__.py: _resolve_callable module-prefix allowlist (samplesize.tests.). - scripts/gen_method_coverage.py: substring match -> word-boundary regex. ## Low - cli.py: grant_aims now honors --lang. - pyproject.toml: scipy>=1.17 (was 1.12) — Anderson-Darling needs method="interpolate". - tests/randomized_block.py + reference_intervals.py: inputs_echo includes solve_for. ## New - 2 doctor checks: registry.decision-tree-leaves-exist, plugin.skill-kind-choices-exist (doctor now 11/11; was 9/9). - doctor scope: plugin.skill-cli-flags-exist now also scans docs/ and README.md. - .github/workflows/release.yml: pre-build verify job (doctor + coverage --check + pytest + ruff) so a v* tag on an unchecked commit cannot ship. ## Verification Full pytest: 1303 passed, 232 skipped, 0 failed. samplesize doctor: 11/11. gen_method_coverage --check: passes (234 methods, 819 fixtures). ruff check samplesize/ tests/ scripts/: clean. Reproduced end-to-end: ci_one_mean -> protocol now emits "actual power of n/a" (was KeyError or literal "0.0000"); two_sample_t_equal_var -> power-curve PNG now contains a real curve (was 100% NaN). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix all findings from cross-model review (bugs + consistency + defense-in-depth)#7

Fix all findings from cross-model review (bugs + consistency + defense-in-depth)#7
kimmingul wants to merge 1 commit into
mainfrom
fix-review-findings

kimmingul commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

kimmingul commented May 28, 2026

Summary

Critical bugs

High — decision_tree.yaml no-fallback rule (per CLAUDE.md)

High — Plugin defense-in-depth

New samplesize doctor checks (11/11; was 9/9)

Other fixes

Verification

Process notes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

High — `decision_tree.yaml` no-fallback rule (per CLAUDE.md)

New `samplesize doctor` checks (11/11; was 9/9)