riscv_fpu: correct RMM rounding, honor static rounding modes, optimize FP dispatch by SolAstrius · Pull Request #233 · LekKit/RVVM

SolAstrius · 2026-06-18T20:40:53Z

Summary

Fixes #204, and supersedes #234 and #235 (folded in here as separate commits — see below). Three related changes to the scalar FP interpreter (riscv_emulate_f_opc_op), each its own commit:

1. Correct RMM rounding to roundTiesToAway: the actual fix for #204 ("RMM rounding mode doesn't work correctly").
2. Honor the static rounding-mode field: rounding modes encoded in the instruction's rm field are applied, not just the dynamic frm CSR (was #235, "riscv_fpu: honor static RMM rounding-mode field").
3. Don't size-optimize the FP op dispatch: drop func_opt_size from the dispatch (was #234, "riscv_fpu: don't size-optimize the FP op dispatch"). This is what makes the whole thing a net ~2× FP speedup over staging despite the extra rounding-mode work.

The three are coupled by perf (see Benchmarks), which is why they ship together.

1. Correct RMM rounding

RMM (IEEE 754 roundTiesToAway) was emulated as "always round away from zero" (riscv_prepare_rmm). That matches roundTiesToAway only when the exact result is representable or lies at or beyond the midpoint between its two neighbors; it is wrong whenever the exact result falls below the midpoint, e.g. 1.0f + 2⁻²⁵ returned 1.0000001f instead of 1.0f.

roundTiesToAway differs from the host's round-to-nearest-even only on an exact tie. So we compute the op in RNE, recover the exact rounding error via the library's existing error-free transforms (fpu_add_error* / fpu_mul_error* — TwoSum / TwoProduct), and step one ULP outward only when that error is exactly half a ULP away from zero.

Two corrections over the original version of this patch, both surfaced by the conformance work below:

Flag isolation. The error-free transforms do raw host arithmetic whose intermediate steps can raise spurious exceptions (inf − inf → NV, a term near FLT_MAX → OF) even when the final result is finite. Those must not leak into fflags — the genuine flags are already set by the base op. The per-op fixups (riscv_rmm_add/mul/div) now snapshot and restore the exception state around the transform, and only run it for a finite result.
Subnormal-quotient ties. The original claim "division never produces an exact tie" holds only for normal quotients. A subnormal quotient has reduced precision and can land exactly on a tie (e.g. an exact f32 quotient of −2⁻¹⁵⁰ under RMM must give −2⁻¹⁴⁹, not −0). fdiv now gets a dedicated subnormal fixup using the exact residual ρ = fma(−n, b, a), where n is the RNE quotient of a / b: the result is a tie iff 2·ρ == gap·b (gap being the subnormal spacing), evaluated exactly (in fp64 for f32; via two exact power-of-two scalings for f64, since the residual is bounded by |b|·2⁻¹⁰⁷⁵). fsqrt genuinely never needs it: the square root of even the smallest subnormal is always normal (~2⁻⁷⁵ for f32, 2⁻⁵³⁷ for f64).

2. Honor the static rounding-mode field

RVVM only ever drove the host rounding mode from the dynamic frm CSR; a static rm field on an arithmetic op (fadd.s …,rtz / …,rmm / etc.) was silently computed in whatever mode frm left set. The dispatch is now split into a thin wrapper + _impl(…, bool rmm): the wrapper computes the effective mode (rm == DYN ? frm : rm) and overrides the host mode around the op only when needed, either to synthesize RMM in RNE or to apply a static host-native mode (rne/rtz/rdn/rup) that differs from frm. The common dynamic path is untouched. The funct3 field is decoded as rm only on rounding-capable ops, so this never misfires on fsgnj/fcmp/fclass/fmv.

3. Don't size-optimize the FP op dispatch

riscv_emulate_f_opc_op was tagged func_opt_size (-Oz), but rvjit does not emit FP, so every guest FP instruction is interpreted through this "slow path" — it is hot for any FP workload. Dropping the size attribute lets it optimize normally; slow_path (cold) is kept.

Validation

MPFR vector oracle — harness rvvm-hal/examples/rmm-test (bare-metal RVVM firmware; main.c, MPFR generator gen_mpfr.c, baked vectors.inc). roundTiesToAway is synthesized from MPFR's directed rounding + an exact-midpoint compare (MPFR has no native ties-away mode). The generator computes double-exact ground truth for f32 + − × (no circular reference). 1503 vectors across all five ops in f32 and f64 — subnormals, the normal/subnormal boundary, infinities, NaNs — each run twice: once with frm = RMM (dynamic) and once with a static ,rmm suffix, plus a check that a directed frm survives a static-,rmm op (host mode restored). 3006/3006 + restore pass. 40M random f32 division pairs confirmed zero normal-range ties.

SerenityOS Tests/LibC/TestFenv now passes float_round_to_max_magnitude and save_restore_round on a riscv64 guest — the cases originally reported in #204 — booting the nightly image end-to-end.

RISC-V architectural conformance (riscv-arch-test / ACT4, Spike reference). The full harness — DUT config, a signature-region UART dump, the per-failure diagnostic, and the categorized analysis — is in SolAstrius/rvvm-conformance, reproducible via a pinned Nix dev shell on NixOS / nix-darwin / any Linux with Nix:

nix develop github:SolAstrius/rvvm-conformance
./setup.sh                                  # clone+pin suite, gem home, z3 wired by the shell
./run.sh F,D,I,M  /path/to/rvvm_arm64

Pass counts on this build vs staging:

ext	staging	this PR
I	51/51	51/51
M	13/13	13/13
F	13/82	22/82
D	24/114	32/114

The +17 passes (F +9, D +8) come from the rounding-mode fixes (the directed-rounding and subnormal-tie sub-cases, and the spurious-flag regression that this PR's flag isolation removes). The remaining F/D failures are pre-existing, independent gaps confirmed not caused by these changes: NaN-result canonicalization on fadd/fmul/fdiv (host payload vs canonical NaN), fcvt-to-int inexact-flag handling, and the FMA family (NaN canonicalization + rounding-mode + invalid-flag). Each is a separate follow-up; the ACT4 self-check pinpoints them per-instruction.

Benchmarks

Both benchmarks are bare-metal RVVM firmwares: rvvm-hal/examples/fp-bench (FP-op throughput) and rvvm-hal/examples/linpack (LINPACK LU, MFLOPS). aarch64 host, fp-bench tight loops, ticks — lower is faster:

op	staging	this PR	speedup
addmul	20.7M	11.8M	~1.75×
div	13.5M	6.5M	~2.07×
sqrt	31.1M	13.5M	~2.3×

Commit 1 alone is perf-neutral or better relative to staging (it removes the per-op riscv_prepare_rmm host-mode churn). Commit 2 (static rounding) adds a small per-op cost from the wrapper. Commit 3 (func_opt_size drop) more than pays for everything, for a net ~2× over staging. On a realistic mixed workload (LINPACK LU, double) the gain is ~3%, since the integer index/loop/memory ops are JIT-compiled and dominate; the 2× shows only where the FP handler dominates. Code size: the dispatch grows ~1.6 KB __text (+0.67%) from dropping -Oz.

What was explored / left out

An earlier wrapper-based static-rounding rework that added a real per-op call regressed FP throughput; the shipped form keeps the common dynamic path on a tail-called _impl so only the rare static-override path pays.
The error-free-transform approach was chosen over re-rounding in higher precision because the library already ships TwoSum/TwoProduct and they give the exact tie decision directly; the subnormal fdiv case is the one place a residual+rescale is needed.
Static rm is now honored for all modes (rtz/rdn/rup/rmm), not only RMM as in the original riscv_fpu: honor static RMM rounding-mode field #235.
Out of scope (separate PRs, all pre-existing): NaN-result canonicalization, fcvt-to-int flags, and the FMA family.

Signed-off-by: Sol Astrius Phoenix <sol@astrius.ink>

SolAstrius force-pushed the fix/rmm-rounding branch from 39e4a8b to 7c6e242 Compare June 18, 2026 20:49

SolAstrius mentioned this pull request Jun 18, 2026

riscv_fpu: honor static RMM rounding-mode field #235

Closed

SolAstrius added 3 commits June 19, 2026 02:33

riscv_fpu: correct RMM rounding to roundTiesToAway

0d65212

Signed-off-by: Sol Astrius Phoenix <sol@astrius.ink>

riscv_fpu: honor the static rounding-mode field

638b7db

Signed-off-by: Sol Astrius Phoenix <sol@astrius.ink>

riscv_fpu: don't size-optimize the FP op dispatch

efbdf51

Signed-off-by: Sol Astrius Phoenix <sol@astrius.ink>

SolAstrius force-pushed the fix/rmm-rounding branch 2 times, most recently from 979050d to efbdf51 Compare June 19, 2026 00:37

SolAstrius changed the title ~~riscv_fpu: correct RMM rounding to roundTiesToAway~~ riscv_fpu: correct RMM rounding, honor static rounding modes, optimize FP dispatch Jun 19, 2026

SolAstrius wants to merge 3 commits into LekKit:staging from pufit:fix/rmm-rounding
