riscv_fpu: correct RMM rounding, honor static rounding modes, optimize FP dispatch #233
Open
SolAstrius wants to merge 3 commits into
Signed-off-by: Sol Astrius Phoenix <sol@astrius.ink>
This was referenced Jun 19, 2026
Summary
Fixes #204, and supersedes #234 and #235 (folded in here as separate commits, see below). Three related changes to the scalar FP interpreter (`riscv_emulate_f_opc_op`), each its own commit:

1. correct RMM rounding to roundTiesToAway: the actual fix for #204 ("RMM rounding mode doesn't work correctly").
2. honor the static rounding-mode field: rounding modes encoded in the instruction's `rm` field, not just the dynamic `frm` CSR (was #235, "riscv_fpu: honor static RMM rounding-mode field").
3. don't size-optimize the FP op dispatch: drop `func_opt_size` from the dispatch (was #234, "riscv_fpu: don't size-optimize the FP op dispatch"). This is what makes the whole thing a net ~2× FP speedup over `staging` despite the extra rounding-mode work.

The three are coupled by perf (see Benchmarks), which is why they ship together.

1. Correct RMM rounding
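A minimal sketch of the tie-only fixup this section describes, for f32 addition only. It is not the PR's code: `add_ties_away` and the bit-cast helpers are hypothetical names, TwoSum is written out inline rather than calling the library's `fpu_add_error*` helpers, and it assumes the host is in round-to-nearest-even with no excess-precision float evaluation. It also skips the exception-flag isolation that the real fixups need.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical bit-cast helpers for this sketch. */
static uint32_t f32_bits(float f)
{
    uint32_t u;
    assert(sizeof f == sizeof u);
    memcpy(&u, &f, sizeof u);
    return u;
}

static float f32_from_bits(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/*
 * a + b rounded roundTiesToAway. Assumes the host FPU is in RNE.
 * Does not isolate exception flags raised by the error computation.
 */
static float add_ties_away(float a, float b)
{
    volatile float s = a + b;                  /* RNE result */
    uint32_t sb = f32_bits(s);
    if ((sb & 0x7f800000u) == 0x7f800000u)
        return s;                              /* inf or NaN: no fixup */

    /* TwoSum: a + b == s + err exactly (barring intermediate overflow) */
    volatile float bv = s - a;
    volatile float av = s - bv;
    volatile float ea = a - av;
    volatile float eb = b - bv;
    float err = ea + eb;

    /* Exact result, or error pointing toward zero: RNE already agrees. */
    if (err == 0.0f || (err > 0.0f) != (s > 0.0f))
        return s;

    float out = f32_from_bits(sb + 1);         /* next float away from zero */
    float gap = out - s;                       /* exact, signed like err */
    return (2.0f * err == gap) ? out : s;      /* step only on an exact tie */
}
```

With this sketch, `add_ties_away(1.0f, 0x1p-25f)` stays at `1.0f` because the exact sum is not a tie, while `add_ties_away(1.0f, 0x1p-24f)` steps to the next float above `1.0f` because the exact sum sits midway between two floats.
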
RMM (IEEE 754 `roundTiesToAway`) was emulated as "always round away from zero" (`riscv_prepare_rmm`), which is correct only on an exact halfway tie and wrong for every other inexact result: e.g. `1.0f + 2⁻²⁵` returned `1.0000001f` instead of `1.0f`.

`roundTiesToAway` differs from the host's round-to-nearest-even only on an exact tie. So we compute the op in RNE, recover the exact rounding error via the library's existing error-free transforms (`fpu_add_error*` / `fpu_mul_error*`, i.e. TwoSum / TwoProduct), and step one ULP outward only when that error is exactly half a ULP away from zero.

Two corrections over the original version of this patch, both surfaced by the conformance work below:

- The error-free transforms can raise spurious exceptions (`inf − inf` → NV, a term near `FLT_MAX` → OF) even when the final result is finite. Those must not leak into `fflags`: the genuine flags are already set by the base op. The per-op fixups (`riscv_rmm_add/mul/div`) now snapshot and restore the exception state around the transform, and only run it for a finite result.
- Subnormal results can be exact ties too (`−2⁻¹⁵⁰` under RMM must give `−2⁻¹⁴⁹`, not `−0`). `fdiv` now gets a dedicated subnormal fixup using the exact residual `ρ = fma(−n, b, a)`: it is a tie iff `2·ρ == gap·b`, evaluated exactly (in fp64 for f32; via two exact power-of-two scalings for f64, since the residual is bounded by `|b|·2⁻¹⁰⁷⁵`). `fsqrt` genuinely never needs it: `sqrt` of even the smallest subnormal is ~2⁻⁷⁵, always normal.

2. Honor the static rounding-mode field
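A minimal sketch of the effective-mode decision this section describes. The `rm` encodings are the ones from the RISC-V F extension; `struct rm_plan`, `host_mode_of` and `plan_rounding` are hypothetical names, not functions from the patch. The sketch also assumes the host rounding mode already mirrors `frm` (with RNE standing in when `frm` is RMM), which is an assumption about RVVM rather than something stated here.

```c
#include <assert.h>
#include <fenv.h>
#include <stdbool.h>
#include <stdint.h>

/* rm / frm encodings from the RISC-V F extension (5 and 6 are reserved). */
enum { RM_RNE = 0, RM_RTZ = 1, RM_RDN = 2, RM_RUP = 3, RM_RMM = 4, RM_DYN = 7 };

/* Hypothetical: what a wrapper would do around one rounding-capable op. */
struct rm_plan {
    bool override_host; /* switch host mode around the op, then restore it */
    int  host_mode;     /* FE_* mode the op is computed in */
    bool rmm;           /* apply the ties-away fixup afterwards */
};

/* Host mode used to compute a guest mode; RMM is synthesized in RNE. */
static int host_mode_of(uint8_t mode)
{
    switch (mode) {
    case RM_RTZ: return FE_TOWARDZERO;
    case RM_RDN: return FE_DOWNWARD;
    case RM_RUP: return FE_UPWARD;
    default:     return FE_TONEAREST; /* RM_RNE and RM_RMM */
    }
}

static struct rm_plan plan_rounding(uint8_t rm, uint8_t frm)
{
    /* Reserved encodings must trap as illegal instructions in a real
     * emulator; this sketch only asserts that they do not occur. */
    assert(rm <= RM_RMM || rm == RM_DYN);
    assert(frm <= RM_RMM);

    uint8_t eff = (rm == RM_DYN) ? frm : rm;
    struct rm_plan p;
    p.rmm = (eff == RM_RMM);
    p.host_mode = host_mode_of(eff);
    /* Assumption: the host mode currently equals host_mode_of(frm). */
    p.override_host = (p.host_mode != host_mode_of(frm));
    return p;
}
```

A caller would act on the plan by saving the current mode with `fegetround()`, calling `fesetround(p.host_mode)` only when `p.override_host` is set, running the op, and restoring the saved mode afterwards. Dynamic ops never set `override_host` under the stated assumption, so the common path pays nothing.
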
RVVM only ever drove the host rounding mode from the dynamic `frm` CSR; a static `rm` field on an arithmetic op (`fadd.s …,rtz` / `…,rmm` / etc.) was silently computed in whatever mode `frm` left set. The dispatch is split into a thin wrapper + `_impl(…, bool rmm)`: the wrapper computes the effective mode (`rm == DYN ? frm : rm`), and overrides the host mode around the op only when needed, either synthesizing RMM in RNE or applying a static host-native mode (`rne`/`rtz`/`rdn`/`rup`) that differs from `frm`. The common dynamic path is untouched. `funct3 == rm` only carries a rounding mode on rounding-capable ops, so this never misfires on `fsgnj`/`fcmp`/`fclass`/`fmv`.

3. Don't size-optimize the FP op dispatch
`riscv_emulate_f_opc_op` was tagged `func_opt_size` (`-Oz`), but rvjit does not emit FP, so every guest FP instruction is interpreted through this "slow path": it is hot for any FP workload. Dropping the size attribute lets it optimize normally; `slow_path` (cold) is kept.

Validation
MPFR vector oracle: harness `rvvm-hal/examples/rmm-test` (bare-metal RVVM firmware; `main.c`, MPFR generator `gen_mpfr.c`, baked `vectors.inc`). roundTiesToAway is synthesized from MPFR's directed rounding + an exact-midpoint compare (MPFR has no native ties-away mode). The generator computes `double`-exact ground truth for f32 `+ − ×` (no circular reference). 1503 vectors across all five ops in f32 and f64 (subnormals, the normal/subnormal boundary, infinities, NaNs), each run twice: once with `frm = RMM` (dynamic) and once with a static `,rmm` suffix, plus a check that a directed `frm` survives a static-`,rmm` op (host mode restored). 3006/3006 + restore pass. 40M random f32 division pairs confirmed zero normal-range ties.

SerenityOS `Tests/LibC/TestFenv` now passes `float_round_to_max_magnitude` and `save_restore_round` on a riscv64 guest (the cases originally reported in #204), booting the nightly image end-to-end.

RISC-V architectural conformance (riscv-arch-test / ACT4, Spike reference). The full harness (DUT config, a signature-region UART dump, the per-failure diagnostic, and the categorized analysis) is in `SolAstrius/rvvm-conformance`, reproducible via a pinned Nix dev shell on NixOS / nix-darwin / any Linux with Nix:

    nix develop github:SolAstrius/rvvm-conformance
    ./setup.sh   # clone+pin suite, gem home, z3 wired by the shell
    ./run.sh F,D,I,M /path/to/rvvm_arm64

Pass counts on this build vs `staging`: +17 from the rounding-mode fixes (the directed-rounding and subnormal-tie sub-cases, and the spurious-flag regression that this PR's flag isolation removes). The remaining F/D failures are pre-existing, independent gaps confirmed not caused by these changes: NaN-result canonicalization on `fadd`/`fmul`/`fdiv` (host payload vs canonical NaN), `fcvt`-to-int inexact-flag handling, and the FMA family (NaN canonicalization + rounding-mode + invalid-flag). Each is a separate follow-up; the ACT4 self-check pinpoints them per-instruction.

Benchmarks
Both benchmarks are bare-metal RVVM firmwares: `rvvm-hal/examples/fp-bench` (FP-op throughput) and `rvvm-hal/examples/linpack` (LINPACK LU, MFLOPS). The fp-bench numbers are from an aarch64 host, tight loops, measured in ticks (lower is faster).

Commit 1 alone is perf-neutral-to-better than `staging` (it removes the per-op `riscv_prepare_rmm` host-mode churn). Commit 2 (static rounding) adds a small per-op cost from the wrapper. Commit 3 (`func_opt_size` drop) more than pays for everything: net ~2× over `staging`. On a realistic mixed workload (LINPACK LU, double) the gain is ~3%, since the integer index/loop/memory ops are JIT-compiled and dominate; the 2× shows only where the FP handler dominates. Code size: the dispatch grows ~1.6 KB `__text` (+0.67%) from dropping `-Oz`.

What was explored / left out
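- Rounding-mode overrides stay in the wrapper, outside `_impl`, so only the rare static-override path pays.
- The subnormal `fdiv` case is the one place a residual+rescale is needed.
- Static `rm` is now honored for all modes (`rtz`/`rdn`/`rup`/`rmm`), not only RMM as in the original #235 ("riscv_fpu: honor static RMM rounding-mode field").
- Left for follow-ups: NaN canonicalization, `fcvt`-to-int flags, and the FMA family.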