
[GLUTEN][CI] Add Delta Spark UT pipeline gated against a known-failur…#12371

Draft
felipepessoto wants to merge 1 commit into
apache:mainfrom
felipepessoto:delta_pipeline


felipepessoto commented Jun 25, 2026

Contributor

Fix #9296.
I created this PR to start the discussion, so we can get an idea of how it would work, whether it is worthwhile, etc.

What changes are proposed in this pull request?

Adds a CI pipeline that runs delta-io/delta's spark ScalaTest suite against the Gluten Velox bundle, so we can validate Gluten against a real Delta release and catch regressions over time.

Running the Delta UTs on Gluten produces many expected failures (Gluten does not yet offload every Delta code path, and falls back or behaves differently in places). A plain "red on any failure" gate would be useless. Instead, the pipeline keeps a committed baseline of known failures and gates each run against it:

  • regression -- a test fails that is not in the baseline -> the shard fails.
  • expected -- a failing test that is in the baseline -> ignored.
  • now-passing -- a baseline test that starts passing -> fails the shard (keeps the baseline honest), unless fail_on_fixed=false.
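The three outcomes above reduce to set operations on `<suite>#<test>` keys. The following is a minimal sketch of that rule, not the code in compare-test-results.py; the function name `gate` and its return shape are hypothetical.

```python
def gate(failed, passed, baseline, fail_on_fixed=True):
    """Classify one shard's results against the baseline.

    failed / passed: sets of "<suite>#<test>" keys observed in this shard.
    baseline: set of keys from the committed known-failures list.
    Returns (regressions, expected, now_passing, shard_ok).
    """
    regressions = failed - baseline   # failing, not in the baseline
    expected = failed & baseline      # failing, already known
    now_passing = passed & baseline   # known failure that passed this time
    shard_ok = not regressions and not (fail_on_fixed and now_passing)
    return regressions, expected, now_passing, shard_ok


baseline = {"SuiteA#t1", "SuiteA#t2"}
failed = {"SuiteA#t1", "SuiteB#t9"}
passed = {"SuiteA#t2", "SuiteB#t1"}
regressions, expected, now_passing, ok = gate(failed, passed, baseline)
```

Note that now-passing is computed as `passed & baseline` rather than `baseline - failed`: with sharding, most baseline entries do not run in a given shard, and a test that did not run should not be reported as fixed.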

How it works

  1. Builds the Velox/Gluten native libs and assembles the gluten-velox-bundle fat jar (Spark 4.1 + Scala 2.13 + JDK 17, Delta profile).
  2. Clones delta-io/delta at a release tag (currently v4.2.0), drops the bundle onto the spark project's test classpath, and patches DeltaSQLCommandTest to register GlutenPlugin.
  3. Runs sbt spark/test sharded by suite across 16 shards, with ScalaTest's JUnit XML reporter enabled, then gates each shard with compare-test-results.py against known-failures.txt. A final job aggregates all shards into a single ready-to-commit baseline and flags stale entries.
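For step 3, the gate needs the set of failed and passed tests from each JUnit XML report. A rough sketch using only the standard library follows. It assumes the common JUnit layout (`<testsuite>` containing `<testcase classname=... name=...>` with optional `<failure>`, `<error>`, or `<skipped>` children); the function name `parse_junit` is hypothetical and the real script may differ.

```python
import xml.etree.ElementTree as ET


def parse_junit(xml_text):
    """Return (failed, passed) sets of "<suite>#<test>" keys from one report."""
    root = ET.fromstring(xml_text)
    failed, passed = set(), set()
    # The root may be <testsuite> or a <testsuites> wrapper; iter() covers both
    # because it also yields the root element itself.
    for suite in root.iter("testsuite"):
        for case in suite.iter("testcase"):
            suite_name = case.get("classname") or suite.get("name")
            key = f"{suite_name}#{case.get('name')}"
            if case.find("failure") is not None or case.find("error") is not None:
                failed.add(key)
            elif case.find("skipped") is None:
                # Skipped or ignored tests count as neither passed nor failed.
                passed.add(key)
    return failed, passed
```

Treating `<error>` the same as `<failure>` is a design choice in this sketch: an aborted test is as much a regression signal as a failed assertion.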

Files

| File | Purpose |
| --- | --- |
| .github/workflows/delta_spark_ut.yml | The workflow (build bundle -> shard tests -> gate). |
| .github/workflows/util/delta-spark-ut/setup-delta.sh | Clones Delta, injects the Gluten bundle, patches DeltaSQLCommandTest. |
| .github/workflows/util/delta-spark-ut/compare-test-results.py | Parses JUnit XML and enforces / seeds / aggregates against the baseline (stdlib only). |
| .github/workflows/util/delta-spark-ut/known-failures.txt | Committed baseline of currently-expected failures (<suite>#<test> per line). |
| .github/workflows/util/delta-spark-ut/README.md | Documents the gate, bootstrapping, and baseline refresh. |
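To illustrate the baseline format and the aggregate step, here is a small sketch that loads a `<suite>#<test>`-per-line baseline and merges per-shard failures into a new one. The function names are hypothetical. Two assumptions are made that the PR does not state: blank lines and lines starting with `#` are ignored, and a "stale" entry means a baseline key that failed in no shard.

```python
def load_baseline(text):
    """Parse baseline text: one "<suite>#<test>" key per line."""
    keys = set()
    for line in text.splitlines():
        line = line.strip()
        # Assumption: blank lines and lines starting with "#" are not entries.
        if line and not line.startswith("#"):
            keys.add(line)
    return keys


def aggregate(shard_failures, baseline):
    """Merge per-shard failure sets.

    Returns (new_baseline_text, stale), where stale holds baseline keys
    that did not fail in any shard.
    """
    all_failed = set().union(*shard_failures)
    stale = baseline - all_failed
    # Sorted output keeps diffs of the committed file small and stable.
    new_text = "".join(f"{key}\n" for key in sorted(all_failed))
    return new_text, stale
```

Unlike the per-shard gate, the aggregate step sees every shard, so here it is safe to compare the full baseline against the union of failures.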

Operational hardening

  • JDK 17 + Arrow/Netty: forked test JVMs get the --add-opens set plus -Dio.netty.tryReflectionSetAccessible=true (otherwise Arrow's allocator fails to initialize).
  • Heap tuning: forked-test heap and the sbt launcher's idle G1 behavior are tuned to keep the ~16 GB runner under the cgroup OOM threshold.
  • Hang watchdog: a per-shard watchdog dumps threads and kills a forked test JVM that has gone silent too long, so a wedged suite can't stall the whole job.
  • DeletionVectorsSuite 2B-row tests: two tests build/read/delete a 2-billion-row table and balloon the fork to ~13 GB of native memory (Velox row-index materialization), OOM-killing it and hanging the shard. They are force-failed (with a clear message) rather than silently ignored, so the gap stays visible until the native memory blow-up is fixed.
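The hang watchdog described above can be pictured as a periodic check on how long the fork has been silent. The sketch below is illustrative only: the function name, parameters, and grace period are hypothetical, and the PR's watchdog may be implemented differently (for example in shell). It relies on the fact that a HotSpot JVM prints a thread dump when it receives SIGQUIT.

```python
import os
import signal
import time


def watchdog_step(pid, last_output_ts, now, silence_limit_s,
                  dump_grace_s=5.0, send_signal=os.kill, sleep=time.sleep):
    """Kill the forked test JVM `pid` if it has been silent for too long.

    Returns True if the process was killed, False if it is still within the limit.
    send_signal and sleep are injectable so the logic can be tested without a JVM.
    """
    if now - last_output_ts < silence_limit_s:
        return False
    send_signal(pid, signal.SIGQUIT)   # ask the JVM for a thread dump
    sleep(dump_grace_s)                # give it time to write the dump
    send_signal(pid, signal.SIGKILL)   # then kill the wedged fork
    return True
```

A caller would invoke this in a loop, taking `last_output_ts` from, say, the modification time of the fork's log file.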

Scope / known limitations

  • Velox backend, x86 only; Delta v4.2.0 / Spark 4.1 / Scala 2.13 / JDK 17.
  • The baseline reflects the current set of known Delta-on-Gluten failures; refresh it via a workflow_dispatch run with update_baseline=true.
  • Future work -- Delta 4.3.0: attempted, but the bundle (compiled against Delta 4.1.0) hits a binary-incompatible Delta change (the first parameter of IdentityColumn.logTableWrite changed from Snapshot to SnapshotDescriptor), which throws NoSuchMethodError on every write. Supporting 4.3.0 requires building the bundle against 4.3.0; tracked as follow-up.

How was this patch tested?

This change is CI-only. The workflow runs automatically on PRs that touch its files and via manual dispatch. In the latest runs all shards pass against the committed baseline (failures are limited to known-failures entries; no regressions).

19,073 Delta tests run (18,110 passed / 963 failed)

Delta Spark UT (Gluten) — shard count vs test parallelism

Sharding is by suite (MurmurHash3(suiteName) % NUM_SHARDS), so total test work is fixed (~1250 fork-minutes). The runners are 4-core / ~16 GB.
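Hash-based sharding by suite name can be sketched as below. This is not the workflow's code: Python's standard library has no MurmurHash3, so `zlib.crc32` is used as a stand-in hash, and the function names are hypothetical. Shard assignments will therefore differ from the real pipeline; only the mechanism is the same.

```python
import zlib


def shard_of(suite_name, num_shards):
    """Map a suite name to a shard index in [0, num_shards)."""
    # Stand-in hash: crc32 is NOT MurmurHash3. It is used because it is stable
    # across processes, unlike Python's built-in hash() for strings.
    return zlib.crc32(suite_name.encode("utf-8")) % num_shards


def partition(suites, num_shards):
    """Assign every suite to exactly one shard."""
    shards = [[] for _ in range(num_shards)]
    for name in sorted(suites):
        shards[shard_of(name, num_shards)].append(name)
    return shards
```

Because the assignment depends only on the suite name and the shard count, each runner can compute its own suite list independently, at the cost of uneven shard durations when long suites hash to the same shard.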

| Config | Runner jobs | Forks/shard | Max shard | Wall-clock | Billed job-hrs* | Outcome |
| --- | --- | --- | --- | --- | --- | --- |
| 16 shards × 1 fork | 16 | 1 | ~110 min | ~130 min | ~29 | ✅ green |
| 4 shards × 4 forks | 4 | 4 | 158 min | 178 min | ~10.5 | ✅ green |
| 4 shards × 1 fork | 4 | 1 | 360 min (hit cap) | | | ❌ cancelled |
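As a sanity check on these numbers, the arithmetic can be worked through in a few lines. This is an outside reading of the table, not the author's method: it assumes the ~1250 fork-minutes divide evenly across forks, and that billed hours are roughly runner jobs times the slowest shard's duration.

```python
TOTAL_FORK_MINUTES = 1250  # approximate total from the description above


def ideal_minutes(shards, forks_per_shard):
    """Perfectly balanced test time per shard, ignoring build and setup overhead."""
    return TOTAL_FORK_MINUTES / (shards * forks_per_shard)


def billed_hours(runner_jobs, max_shard_minutes):
    """Rough estimate: every job billed for as long as the slowest shard."""
    return runner_jobs * max_shard_minutes / 60


print(ideal_minutes(16, 1))   # 78.125
print(ideal_minutes(4, 4))    # 78.125
print(ideal_minutes(4, 1))    # 312.5
print(round(billed_hours(16, 110), 1))  # 29.3
print(round(billed_hours(4, 158), 1))   # 10.5
```

Under these assumptions the estimates line up with the reported ~29 and ~10.5 job-hours. The 4 shards × 1 fork case averages about 312 minutes per shard even when perfectly balanced, so any imbalance between shards plausibly pushes the slowest one past the 360-minute cap.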

Was this patch authored or co-authored using generative AI tooling?

Generated-by: GitHub Copilot CLI

…es baseline

Run delta-io/delta's `spark` ScalaTest suite against a Gluten Velox bundle in CI
and gate the results against a committed baseline so the many expected Delta-on-
Gluten failures stay manageable and can be fixed incrementally without letting
currently-passing tests silently regress.

What it adds (.github/workflows/util/delta-spark-ut/):
- delta_spark_ut.yml: builds the native lib + Gluten bundle, then runs the Delta
  spark suite sharded by suite into 4 shards x 4 forked test JVMs (~16-way), and
  gates each shard against the baseline.
- compare-test-results.py: the gate. Per shard, regressions (failed not in the
  baseline) fail the build; newly-passing baselined tests are flagged so the
  baseline can be tightened. Also supports seed/aggregate modes.
- known-failures.txt: the committed baseline of expected failures.
- setup-delta.sh: clones Delta, injects the Gluten bundle, patches
  DeltaSQLCommandTest, and force-fails the two DeletionVectorsSuite 2B-row tests
  whose native row-index materialization OOM-kills the runner and hangs the shard.
- README.md: how the pipeline, gating and baseline-refresh work.

The workflow also carries a hang watchdog that thread-dumps and kills a wedged
fork, and tunes the per-fork heap (2G) and off-heap (2G) to fit the ~16G runner.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

felipepessoto commented Jun 25, 2026

Contributor Author

@zhztheplayer @philo-he I re-created the PR. The previous one had so many Copilot comments, because of all the updates I made, that it was impossible to follow.

I'd like to answer the questions you posted:

Q: Can we remove the duplicated tests in Gluten's codebase, if they are covered by the new way?

A: I think it depends on the approach we agree to follow: whether we add this as a manual Action or run it on every PR (the best option, but expensive).

Q: Is this job a duplicate of the one in velox_backend_x86.yml? If so, we might consider moving the Delta tests into that file to reuse the native artifact it builds, since that artifact can't be shared across different workflows.

This is also a consideration for reducing our GHA usage, see #12288

A: I think having a separate job makes it easier to maintain. Everything in this PR is new and could be easily replaced if needed. Also, the Delta tests take so long that the time to build the native artifacts becomes relatively small.
But if you believe it is better, I don't mind moving it to velox_backend_x86.yml.

I'm concerned about #12288. Ideally, we should increase the number of shards for Delta UT to 8 or 16.
Is waiting 3 hours for CI completion ok?


Development

Successfully merging this pull request may close these issues.

[Enhancement] [Build] Run Delta unit tests during PR validation
