Fix all findings from cross-model review (bugs + consistency + defense-in-depth) #7
Open
kimmingul wants to merge 1 commit into
…e-in-depth)
## Critical bugs fixed
- reporting/plots.py: power_curve no longer produces silent NaN curves for two-sample methods (sweep_key now derived from registry signature, not a non-existent `params` field).
- reporting/code_export.py: R/SAS export no longer TypeError-crashes on `achieved_power=None` for CI methods; emits "N/A" cleanly.
- reporting/protocol.py + templates: protocol/grant text for CI methods no longer crashes and no longer emits literal "None"/"α" placeholders; localized "n/a" / "해당 없음" fallback for power/alpha/target_power.
- cli.py cmd_report sensitivity: drops n/n1/n2/power from base before sweeping (was conflicting with solve_for=power audits).
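The `achieved_power=None` fixes above come down to never passing `None` into a numeric format spec. A minimal sketch of that pattern follows; `format_power` and `FALLBACK` are hypothetical names for illustration, not identifiers from the repository, and the four-decimal format is an assumption.

```python
from typing import Optional

# Hypothetical per-language fallback strings; the commit message mentions
# "n/a" and "해당 없음" as the localized values.
FALLBACK = {"en": "n/a", "ko": "해당 없음"}


def format_power(value: Optional[float], lang: str = "en") -> str:
    """Format a power/alpha value, or return a localized fallback for None."""
    if value is None:
        # CI methods have no achieved power, so skip the float format spec
        # entirely instead of letting it raise TypeError.
        return FALLBACK.get(lang, FALLBACK["en"])
    return f"{value:.4f}"  # assumed precision


print(format_power(0.8123))      # 0.8123
print(format_power(None))        # n/a
print(format_power(None, "ko"))  # 해당 없음
```

Routing every optional numeric field through one helper like this keeps the R/SAS export and the protocol templates from each growing their own `None` check.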
## High (per CLAUDE.md "no fallback heuristics")
- registry/decision_tree.yaml: introduces `unimplemented:` markers; k-group survival, k-group Poisson, one-sample ordinal now terminate explicitly instead of silently downgrading. Added missing group-count question to count and ordinal branches.
- plugin/skills/samplesize-{calculate,validate}/SKILL.md: added shell-injection defense-in-depth preambles (registry-validated method_id, prefer --json-args-file).
- plugin/skills/samplesize-report/SKILL.md: now enumerates all 6 --kind choices (was missing sensitivity, r-code, sas-code).
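The shell-injection preamble can be approximated in a few lines: validate `method_id` against a strict pattern, then pass arguments through a file rather than an inline shell string. This is a sketch only. The pattern `^[a-z][a-z0-9_]*$` is the one quoted in the PR summary, while `is_safe_method_id` and `write_args_file` are hypothetical helper names.

```python
import json
import re
import tempfile

# Pattern quoted in the PR summary.
METHOD_ID_RE = re.compile(r"^[a-z][a-z0-9_]*$")


def is_safe_method_id(method_id: str) -> bool:
    """True only if the whole string is a lowercase identifier."""
    # fullmatch (rather than match) also rejects a trailing newline, which
    # "$" alone would tolerate.
    return METHOD_ID_RE.fullmatch(method_id) is not None


def write_args_file(kwargs: dict) -> str:
    """Write kwargs as JSON to a temp file and return its path.

    The path is then handed to --json-args-file, so no user-controlled
    JSON ever appears on the command line. The caller removes the file.
    """
    with tempfile.NamedTemporaryFile(
        "w", suffix=".json", delete=False, encoding="utf-8"
    ) as fh:
        json.dump(kwargs, fh)
        return fh.name


print(is_safe_method_id("two_sample_t_equal_var"))  # True
print(is_safe_method_id("x; rm -rf /"))             # False
```

The registry lookup is still the real gate; the regex only guarantees that whatever reaches the lookup cannot contain shell metacharacters.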
## Medium
- tests/reference_intervals.py: linear N-search -> binary search (5.5s -> 0.004s on worst case).
- reporting/audit.py: microsecond suffix + sanitized method_id in filename (no same-second collisions, hardened against path traversal).
- cli.py cmd_calc: --json-args-file flag, kwarg whitelist against signature, JSONDecodeError handling.
- registry/__init__.py: _resolve_callable module-prefix allowlist (samplesize.tests.).
- scripts/gen_method_coverage.py: substring match -> word-boundary regex.
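The linear-to-binary change in the N-search works when the acceptance criterion is monotone in N: once some N passes, every larger N passes too. The sketch below assumes that property. `smallest_n` and the `meets_target` predicate are hypothetical stand-ins, not the repository's function names.

```python
from typing import Callable


def smallest_n(
    meets_target: Callable[[int], bool], lo: int = 2, hi_cap: int = 10_000_000
) -> int:
    """Return the smallest n >= lo with meets_target(n) True.

    Assumes meets_target is monotone: False up to some n, True afterwards.
    """
    if lo < 1:
        raise ValueError("lo must be at least 1")
    # Phase 1: double hi until the predicate passes, which brackets the answer.
    hi = lo
    while not meets_target(hi):
        if hi >= hi_cap:
            raise ValueError("no n up to hi_cap meets the target")
        lo = hi + 1
        hi = min(hi * 2, hi_cap)
    # Phase 2: bisect. Invariant: meets_target(hi) holds, every n < lo fails.
    while lo < hi:
        mid = (lo + hi) // 2
        if meets_target(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo


# Toy predicate standing in for "the precision target is met at this n".
print(smallest_n(lambda n: n * n >= 1000))  # 32
```

The predicate is evaluated O(log N) times instead of O(N), which is consistent with the kind of speed-up reported above when each evaluation is expensive.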
## Low
- cli.py: grant_aims now honors --lang.
- pyproject.toml: scipy>=1.17 (was 1.12) — Anderson-Darling needs method="interpolate".
- tests/randomized_block.py + reference_intervals.py: inputs_echo includes solve_for.
## New
- 2 doctor checks: registry.decision-tree-leaves-exist, plugin.skill-kind-choices-exist (doctor now 11/11; was 9/9).
- doctor scope: plugin.skill-cli-flags-exist now also scans docs/ and README.md.
- .github/workflows/release.yml: pre-build verify job (doctor + coverage --check + pytest + ruff) so a v* tag on an unchecked commit cannot ship.
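A check in the spirit of `registry.decision-tree-leaves-exist` can be sketched as a walk over the parsed tree that collects every `leaf:` value and compares it with the registered method ids, while accepting `unimplemented:` terminals. The tree shape below is a hypothetical illustration, not the actual `decision_tree.yaml` schema, and `iter_terminals` / `unknown_leaves` are assumed names.

```python
from typing import Any, Iterator, List, Set, Tuple


def iter_terminals(node: Any, path: Tuple = ()) -> Iterator[Tuple[Tuple, str, str]]:
    """Yield (path, kind, value) for each 'leaf' or 'unimplemented' terminal."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key in ("leaf", "unimplemented") and isinstance(value, str):
                yield path, key, value
            else:
                yield from iter_terminals(value, path + (key,))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from iter_terminals(item, path + (index,))


def unknown_leaves(tree: Any, registered: Set[str]) -> List[Tuple[Tuple, str]]:
    """Return leaves that name no registered method; markers are accepted."""
    return [
        (path, value)
        for path, kind, value in iter_terminals(tree)
        if kind == "leaf" and value not in registered
    ]


# Hypothetical tree, shaped like the result of parsing a YAML file.
tree = {
    "survival": {
        "two": {"leaf": "logrank_freedman"},
        "three_or_more": {"unimplemented": "k-group survival is not implemented"},
    },
    "count": {
        "two": {"leaf": "tests_two_poisson_means"},
        "one": {"leaf": "typo_method"},
    },
}
registered = {"logrank_freedman", "tests_two_poisson_means"}
print(unknown_leaves(tree, registered))  # [(('count', 'one'), 'typo_method')]
```

Treating `unimplemented:` as a valid terminal is what lets the tree stop explicitly instead of pointing at an approximate method just to satisfy the check.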
## Verification
Full pytest: 1303 passed, 232 skipped, 0 failed.
samplesize doctor: 11/11.
gen_method_coverage --check: passes (234 methods, 819 fixtures).
ruff check samplesize/ tests/ scripts/: clean.
Reproduced end-to-end: ci_one_mean -> protocol now emits "actual power of n/a" (was KeyError or literal "0.0000"); two_sample_t_equal_var -> power-curve PNG now contains a real curve (was 100% NaN).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Summary
Resolves every finding from the three-reviewer audit (Claude code-reviewer + Claude security-reviewer + Codex 2nd-opinion), plus a follow-up independent review pass that caught a HIGH finding the first wave missed. 18 source files modified, +341/-52 lines.

## Critical bugs
- `report --kind power-curve`: produced silent NaN curves for two-sample methods; `sweep_key` is now derived from the registry signature.
- `report --kind r-code` / `sas-code`: raised `TypeError` on CI methods (`achieved_power=None`); now emits `N/A` cleanly.
- `report --kind protocol` / `grant`: `KeyError`/crash and literal `"None"`/`"α"` placeholders for CI methods; now uses a localized `"n/a"` / `"해당 없음"` fallback for power/alpha/target_power.
- `report --kind sensitivity`: `n`/`n1`/`n2`/`power` are now dropped from the base before sweeping (they conflicted with `solve_for=power` audits).

## High: `decision_tree.yaml` no-fallback rule (per CLAUDE.md)
The prior fix that replaced `logrank_kgroup` with `logrank_freedman` violated the explicit "No fallback heuristics" rule in `CLAUDE.md` by silently downgrading k-group survival to a 2-group calculator. The same class of issue affected the `count` and `ordinal` branches (no group-count question; any group count silently routed to a single specific method).

Introduces an `unimplemented:` terminal marker that `samplesize doctor` accepts as valid. K-group survival, k-group Poisson, and one-sample ordinal now terminate with an explicit reason instead of routing to an approximate method. `count` and `ordinal` gained the missing group-count question; `count → two` routes to the (already-validated) `tests_two_poisson_means`; `ordinal → three_or_more` routes to `kruskal_wallis_simulation`.

## High: Plugin defense-in-depth
The `samplesize-calculate` and `samplesize-validate` SKILLs now ship explicit "shell-injection defense" preambles instructing the LLM to:
- validate `method_id` against `^[a-z][a-z0-9_]*$`
- prefer `--json-args-file <tmpfile>` (new CLI flag) over inline `--json-args <json>`

Backed at the CLI by: `--json-args-file`, a kwarg whitelist against the resolved method's signature, and a `_resolve_callable` module-prefix allowlist (`samplesize.tests.`).

## New `samplesize doctor` checks (11/11; was 9/9)
- `registry.decision-tree-leaves-exist`: every `leaf:` in `decision_tree.yaml` is a registered method id (or a deliberate `unimplemented:` marker).
- `plugin.skill-kind-choices-exist`: every `--kind <name>` referenced in `plugin/*.md` is in `cmd_report`'s argparse `choices`.
- `plugin.skill-cli-flags-exist` now also scans `docs/**/*.md` and `README.md`.

## Other fixes
- `reference_intervals_clinical_lab`: linear N-search → binary search (5.5s → 0.004s worst case).
- `audit.py`: microsecond suffix + `method_id` filename sanitization.
- `gen_method_coverage.py --check`: substring → word-boundary regex.
- `grant_aims` now honors `--lang`.
- `scipy>=1.17` floor (Anderson-Darling needs `method="interpolate"`).
- `randomized_block_anova` / `reference_intervals_clinical_lab`: `inputs_echo` adds `solve_for`.
- `.github/workflows/release.yml`: new `verify` job (doctor + coverage `--check` + pytest + ruff) gates `build`/`publish` so a `v*` tag on an unchecked commit cannot ship.

## Verification
- `samplesize doctor`: 11/11.
- `gen_method_coverage.py --check`: up to date (234 methods, 819 fixtures).
- `ruff check samplesize/ tests/ scripts/`: clean.
- `protocol` text now reads `"actual power of n/a"` (not a crash, not `"0.0000"`, not a literal `"None"`); two-sample `power-curve` now produces a real curve.

## Process notes
- `executor` agents (opus) for the initial fix wave with non-overlapping file scopes.
- `executor` (opus) for the reviewer-surfaced issues (the `rec["result"]` audit shape + None handling for protocol templates + `unimplemented` markers + audit docstring + doctor regex comment).
- `code-reviewer` (opus) pass between the two waves (writer/reviewer separation per OMC).

🤖 Generated with Claude Code