Skip to content

Fix all findings from cross-model review (bugs + consistency + defense-in-depth)#7

Open
kimmingul wants to merge 1 commit into
mainfrom
fix-review-findings
Open

Fix all findings from cross-model review (bugs + consistency + defense-in-depth)#7
kimmingul wants to merge 1 commit into
mainfrom
fix-review-findings

Conversation

@kimmingul

Copy link
Copy Markdown
Owner

Summary

Resolves every finding from the three-reviewer audit (Claude code-reviewer + Claude security-reviewer + Codex 2nd-opinion) plus a follow-up independent review pass that caught a HIGH that the first wave missed. 18 source files modified, +341/-52 lines.

Critical bugs

Area Was Now
report --kind power-curve silent NaN curve for any two-sample method derives sweep key from registry signature; real curve
report --kind r-code / sas-code TypeError on CI methods (achieved_power=None) emits N/A cleanly
report --kind protocol / grant KeyError/crash; literal "None"/"α" placeholders for CI methods localized "n/a" / "해당 없음" fallback for power/alpha/target_power
report --kind sensitivity wrong solve mode when source audit was solve-for-power drops n/n1/n2/power from base before sweeping

High — decision_tree.yaml no-fallback rule (per CLAUDE.md)

The prior fix that replaced logrank_kgroup with logrank_freedman violated the explicit "No fallback heuristics" rule in CLAUDE.md by silently downgrading k-group survival to a 2-group calculator. The same class of issue affected the count and ordinal branches (no group-count question; any group count silently routed to a single specific method).

Introduces an unimplemented: terminal marker that samplesize doctor accepts as valid. K-group survival, k-group Poisson, and one-sample ordinal now terminate with an explicit reason instead of routing to an approximate method. count and ordinal gained the missing group-count question; count → two routes to the (already-validated) tests_two_poisson_means; ordinal → three_or_more routes to kruskal_wallis_simulation.

High — Plugin defense-in-depth

samplesize-calculate and samplesize-validate SKILLs now ship explicit "shell-injection defense" preambles instructing the LLM to:

  • registry-validate method_id against ^[a-z][a-z0-9_]*$
  • prefer --json-args-file <tmpfile> (new CLI flag) over inline --json-args <json>
  • never pass raw user text as a shell argument

Backed at the CLI by: --json-args-file, kwarg whitelist against the resolved methods signature, and _resolve_callable module-prefix allowlist (samplesize.tests.).

New samplesize doctor checks (11/11; was 9/9)

  • registry.decision-tree-leaves-exist — every leaf: in decision_tree.yaml is a registered method id (or a deliberate unimplemented: marker).
  • plugin.skill-kind-choices-exist — every --kind <name> referenced in plugin/*.md is in cmd_reports argparse choices.
  • The existing plugin.skill-cli-flags-exist now also scans docs/**/*.md and README.md.

Other fixes

  • reference_intervals_clinical_lab: linear N-search → binary search (5.5s → 0.004s worst case).
  • audit.py: microsecond suffix + method_id filename sanitization.
  • gen_method_coverage.py --check: substring → word-boundary regex.
  • grant_aims now honors --lang.
  • scipy>=1.17 floor (Anderson-Darling needs method="interpolate").
  • randomized_block_anova / reference_intervals_clinical_lab inputs_echo adds solve_for.
  • .github/workflows/release.yml: new verify job (doctor + coverage --check + pytest + ruff) gates build/publish so a v* tag on an unchecked commit cannot ship.

Verification

  • Full pytest: 1303 passed, 232 skipped, 0 failed.
  • samplesize doctor: 11/11.
  • gen_method_coverage.py --check: up to date (234 methods, 819 fixtures).
  • ruff check samplesize/ tests/ scripts/: clean.
  • Independent code-review pass (oh-my-claudecode:code-reviewer, opus): all HIGH/MED findings from the first pass either addressed or surfaced as trade-offs.
  • End-to-end repros confirmed: CI-method protocol text now reads "actual power of n/a" (not crash, not "0.0000", not literal "None"); two-sample power-curve now produces a real curve.

Process notes

  • 4 parallel executor agents (opus) for the initial fix wave with non-overlapping file scopes.
  • 1 follow-up executor (opus) for the reviewer-surfaced issues (the rec["result"] audit shape + None handling for protocol templates + unimplemented markers + audit docstring + doctor regex comment).
  • Separate code-reviewer (opus) pass between the two waves (writer/reviewer separation per OMC).

🤖 Generated with Claude Code

…e-in-depth)

## Critical bugs fixed
- reporting/plots.py: power_curve no longer produces silent NaN curves for two-sample methods (sweep_key now derived from registry signature, not a non-existent `params` field).
- reporting/code_export.py: R/SAS export no longer TypeError-crashes on `achieved_power=None` for CI methods; emits "N/A" cleanly.
- reporting/protocol.py + templates: protocol/grant text for CI methods no longer crashes and no longer emits literal "None"/"α" placeholders; localized "n/a" / "해당 없음" fallback for power/alpha/target_power.
- cli.py cmd_report sensitivity: drops n/n1/n2/power from base before sweeping (was conflicting with solve_for=power audits).

## High (per CLAUDE.md "no fallback heuristics")
- registry/decision_tree.yaml: introduces `unimplemented:` markers; k-group survival, k-group Poisson, one-sample ordinal now terminate explicitly instead of silently downgrading. Added missing group-count question to count and ordinal branches.
- plugin/skills/samplesize-{calculate,validate}/SKILL.md: added shell-injection defense-in-depth preambles (registry-validated method_id, prefer --json-args-file).
- plugin/skills/samplesize-report/SKILL.md: now enumerates all 6 --kind choices (was missing sensitivity, r-code, sas-code).

## Medium
- tests/reference_intervals.py: linear N-search -> binary search (5.5s -> 0.004s on worst case).
- reporting/audit.py: microsecond suffix + sanitized method_id in filename (no same-second collisions, hardened against path traversal).
- cli.py cmd_calc: --json-args-file flag, kwarg whitelist against signature, JSONDecodeError handling.
- registry/__init__.py: _resolve_callable module-prefix allowlist (samplesize.tests.).
- scripts/gen_method_coverage.py: substring match -> word-boundary regex.

## Low
- cli.py: grant_aims now honors --lang.
- pyproject.toml: scipy>=1.17 (was 1.12) — Anderson-Darling needs method="interpolate".
- tests/randomized_block.py + reference_intervals.py: inputs_echo includes solve_for.

## New
- 2 doctor checks: registry.decision-tree-leaves-exist, plugin.skill-kind-choices-exist (doctor now 11/11; was 9/9).
- doctor scope: plugin.skill-cli-flags-exist now also scans docs/ and README.md.
- .github/workflows/release.yml: pre-build verify job (doctor + coverage --check + pytest + ruff) so a v* tag on an unchecked commit cannot ship.

## Verification
Full pytest: 1303 passed, 232 skipped, 0 failed.
samplesize doctor: 11/11.
gen_method_coverage --check: passes (234 methods, 819 fixtures).
ruff check samplesize/ tests/ scripts/: clean.
Reproduced end-to-end: ci_one_mean -> protocol now emits "actual power of n/a" (was KeyError or literal "0.0000"); two_sample_t_equal_var -> power-curve PNG now contains a real curve (was 100% NaN).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants