[GLUTEN][CI] Add Delta Spark UT pipeline gated against a known-failur… by felipepessoto · Pull Request #12371 · apache/gluten

felipepessoto · 2026-06-25T20:32:20Z

Fix #9296.
I created this PR to start the discussion, so we can get an idea of how it would work, whether it is worthwhile, etc.

What changes are proposed in this pull request?

Adds a CI pipeline that runs delta-io/delta's spark ScalaTest suite against the Gluten Velox bundle, so we can validate Gluten against a real Delta release and catch regressions over time.

Running the Delta UTs on Gluten produces many expected failures (Gluten does not yet offload every Delta code path, and falls back or behaves differently in places). A plain "red on any failure" gate would be useless. Instead, the pipeline keeps a committed baseline of known failures and gates each run against it:

regression -- a test fails that is not in the baseline -> the shard fails.
expected -- a failing test that is in the baseline -> ignored.
now-passing -- a baseline test that starts passing -> fails the shard (keeps the baseline honest), unless fail_on_fixed=false.
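
The three outcomes above can be sketched as plain set operations. This is a minimal illustration, not the code in compare-test-results.py: the function names `classify` and `gate` are hypothetical, and it assumes each test is identified by a `<suite>#<test>` string and that only tests which actually ran in the shard are passed in.

```python
def classify(failed, passed, baseline):
    """Split one shard's results into the three gate categories.

    failed / passed: sets of '<suite>#<test>' keys observed in this shard.
    baseline: set of keys listed in the known-failures file.
    """
    regressions = sorted(failed - baseline)   # failing, not in the baseline
    expected = sorted(failed & baseline)      # failing, already known
    now_passing = sorted(passed & baseline)   # known failure that now passes
    return regressions, expected, now_passing


def gate(failed, passed, baseline, fail_on_fixed=True):
    """Return True if the shard should be green, False if it should fail."""
    regressions, _expected, now_passing = classify(failed, passed, baseline)
    shard_fails = bool(regressions) or (fail_on_fixed and bool(now_passing))
    return not shard_fails
```

Because `now_passing` is computed from tests that ran and passed, a baseline entry belonging to a different shard does not trip the gate in this sketch.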

How it works

Builds the Velox/Gluten native libs and assembles the gluten-velox-bundle fat jar (Spark 4.1 + Scala 2.13 + JDK 17, Delta profile).
Clones delta-io/delta at a release tag (currently v4.2.0), drops the bundle onto the spark project's test classpath, and patches DeltaSQLCommandTest to register GlutenPlugin.
Runs sbt spark/test sharded by suite across 16 shards, with ScalaTest's JUnit XML reporter enabled, then gates each shard with compare-test-results.py against known-failures.txt. A final job aggregates all shards into a single ready-to-commit baseline and flags stale entries.
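
As a rough illustration of the reporting step, the sketch below collects failed and passed tests from one JUnit XML report using only the standard library. It is an assumption-laden example rather than the actual script: the function name `read_results` is hypothetical, the `classname` attribute is assumed to carry the suite name, and the sample report is made up.

```python
import xml.etree.ElementTree as ET


def read_results(xml_text):
    """Return (failed, passed) sets of '<suite>#<test>' keys from a JUnit XML report."""
    failed, passed = set(), set()
    root = ET.fromstring(xml_text)
    for case in root.iter("testcase"):
        key = f"{case.get('classname')}#{case.get('name')}"
        if case.find("failure") is not None or case.find("error") is not None:
            failed.add(key)
        elif case.find("skipped") is None:
            # Skipped or ignored tests count as neither passed nor failed.
            passed.add(key)
    return failed, passed


# Made-up report used only to exercise the sketch.
SAMPLE = """<testsuite name="ExampleSuite" tests="3">
  <testcase classname="org.example.ExampleSuite" name="reads table"/>
  <testcase classname="org.example.ExampleSuite" name="writes table"><failure message="boom"/></testcase>
  <testcase classname="org.example.ExampleSuite" name="ignored test"><skipped/></testcase>
</testsuite>"""

failed, passed = read_results(SAMPLE)
```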

Files

| File | Purpose |
| --- | --- |
| `.github/workflows/delta_spark_ut.yml` | The workflow (build bundle -> shard tests -> gate). |
| `.github/workflows/util/delta-spark-ut/setup-delta.sh` | Clones Delta, injects the Gluten bundle, patches `DeltaSQLCommandTest`. |
| `.github/workflows/util/delta-spark-ut/compare-test-results.py` | Parses JUnit XML and enforces / seeds / aggregates against the baseline (stdlib only). |
| `.github/workflows/util/delta-spark-ut/known-failures.txt` | Committed baseline of currently-expected failures (`<suite>#<test>` per line). |
| `.github/workflows/util/delta-spark-ut/README.md` | Documents the gate, bootstrapping, and baseline refresh. |

Operational hardening

JDK 17 + Arrow/Netty: forked test JVMs get the --add-opens set plus -Dio.netty.tryReflectionSetAccessible=true (otherwise Arrow's allocator fails to initialize).
Heap tuning: forked-test heap and the sbt launcher's idle G1 behavior are tuned to keep the ~16 GB runner under the cgroup OOM threshold.
Hang watchdog: a per-shard watchdog dumps threads and kills a forked test JVM that has gone silent too long, so a wedged suite can't stall the whole job.
DeletionVectorsSuite 2B-row tests: two tests build/read/delete a 2-billion-row table and balloon the fork to ~13 GB of native memory (Velox row-index materialization), OOM-killing it and hanging the shard. They are force-failed (with a clear message) rather than silently ignored, so the gap stays visible until the native memory blow-up is fixed.
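
The hang-watchdog idea can be sketched as a wrapper that kills a child process once it has printed nothing for too long. This is a simplified stand-in, not the workflow's implementation: `run_with_watchdog` is a hypothetical name, it supervises a single child process rather than locating forked test JVMs, and the thread-dump step is only marked by a comment.

```python
import subprocess
import sys
import threading
import time


def run_with_watchdog(cmd, silence_limit_s, poll_s=0.1):
    """Run cmd and kill it if it produces no output for silence_limit_s seconds.

    Returns (exit_code, killed_by_watchdog).
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    last_output = [time.monotonic()]

    def pump():
        # Forward the child's output and record when it was last seen.
        for line in proc.stdout:
            last_output[0] = time.monotonic()
            sys.stdout.write(line)

    reader = threading.Thread(target=pump, daemon=True)
    reader.start()

    killed = False
    while proc.poll() is None:
        if time.monotonic() - last_output[0] > silence_limit_s:
            # A real watchdog would capture a thread dump here before killing.
            proc.kill()
            killed = True
            break
        time.sleep(poll_s)
    proc.wait()
    reader.join(timeout=1)
    return proc.returncode, killed
```

Measuring silence rather than total runtime lets long but healthy suites keep running, while a wedged process is stopped soon after it goes quiet.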

Scope / known limitations

Velox backend, x86 only; Delta v4.2.0 / Spark 4.1 / Scala 2.13 / JDK 17.
The baseline reflects the current set of known Delta-on-Gluten failures; refresh it via a workflow_dispatch run with update_baseline=true.
Future work -- Delta 4.3.0: attempted, but the bundle (compiled against Delta 4.1.0) hits a binary-incompatible Delta change (the first parameter of IdentityColumn.logTableWrite changed from Snapshot to SnapshotDescriptor), which throws NoSuchMethodError on every write. Supporting 4.3.0 requires building the bundle against 4.3.0; tracked as a follow-up.

How was this patch tested?

This change only affects CI. The workflow runs automatically on PRs that touch its files and via manual dispatch. In the latest runs, all shards pass against the committed baseline (failures are limited to known-failures entries; no regressions).

19,073 Delta tests run (18,110 passed / 963 failed)

Delta Spark UT (Gluten) — shard count vs test parallelism

Sharding is by suite (MurmurHash3(suiteName) % NUM_SHARDS), so total test work is fixed (~1250 fork-minutes). The runners are 4-core / ~16 GB.
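
A minimal sketch of suite-level sharding follows. The function names are hypothetical, and `zlib.crc32` is used as a stand-in for MurmurHash3 because the Python standard library has no MurmurHash3; the resulting shard assignments will therefore differ from the real pipeline's, but the properties are the same: each suite maps deterministically to exactly one shard.

```python
import zlib


def shard_of(suite_name, num_shards):
    """Map a suite name to a shard index in [0, num_shards).

    crc32 is a stand-in for MurmurHash3; any stable hash gives the same guarantees.
    """
    return zlib.crc32(suite_name.encode("utf-8")) % num_shards


def suites_for_shard(suites, shard_index, num_shards):
    """Select the suites that a given shard is responsible for running."""
    return [s for s in suites if shard_of(s, num_shards) == shard_index]
```

Hashing balances the number of suites per shard, not their runtime. With ~1250 fork-minutes spread over 16 shards the average would be about 78 minutes per shard, so a noticeably longer slowest shard is consistent with uneven suite durations.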

| Config | Runner jobs | Forks/shard | Max shard | Wall-clock | Billed job-hrs* | Outcome |
| --- | --- | --- | --- | --- | --- | --- |
| 16 shards × 1 fork | 16 | 1 | ~110 min | ~130 min | ~29 | ✅ green |
| 4 shards × 4 forks | 4 | 4 | 158 min | 178 min | ~10.5 | ✅ green |
| 4 shards × 1 fork | 4 | 1 | 360 min (hit cap) | — | — | ❌ cancelled |

Was this patch authored or co-authored using generative AI tooling?

Generated-by: GitHub Copilot CLI

…es baseline

Run delta-io/delta's `spark` ScalaTest suite against a Gluten Velox bundle in CI and gate the results against a committed baseline so the many expected Delta-on-Gluten failures stay manageable and can be fixed incrementally without letting currently-passing tests silently regress.

What it adds (.github/workflows/util/delta-spark-ut/):

- delta_spark_ut.yml: builds the native lib + Gluten bundle, then runs the Delta spark suite sharded by suite into 4 shards x 4 forked test JVMs (~16-way), and gates each shard against the baseline.
- compare-test-results.py: the gate. Per shard, regressions (failed not in the baseline) fail the build; newly-passing baselined tests are flagged so the baseline can be tightened. Also supports seed/aggregate modes.
- known-failures.txt: the committed baseline of expected failures.
- setup-delta.sh: clones Delta, injects the Gluten bundle, patches DeltaSQLCommandTest, and force-fails the two DeletionVectorsSuite 2B-row tests whose native row-index materialization OOM-kills the runner and hangs the shard.
- README.md: how the pipeline, gating and baseline-refresh work.

The workflow also carries a hang watchdog that thread-dumps and kills a wedged fork, and tunes the per-fork heap (2G) and off-heap (2G) to fit the ~16G runner.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

felipepessoto · 2026-06-25T20:42:54Z

@zhztheplayer @philo-he I re-created the PR. The previous one had accumulated so many Copilot comments from all my updates that it was impossible to follow.

I'd like to answer the questions you posted:

Can we remove the duplicated tests in Gluten's codebase, if they are covered by the new way?

A: I think it depends on the approach we agree to follow: whether we add this as a manual Action or run it for every PR (the best option, but expensive).

Is this job a duplicate of the one in velox_backend_x86.yml? If so, we might consider moving the Delta tests into that file to reuse the native artifact it builds, since that artifact can't be shared across different workflows.

This is also a consideration for reducing our GHA usage, see #12288

A: I think having a separate job makes it easier to maintain. Everything in this PR is new and could easily be replaced if needed. Also, the Delta tests take so long that the time to build the native artifacts becomes relatively small.
But if you believe it is better, I don't mind moving it to velox_backend_x86.yml.

I'm concerned about #12288. Ideally, we should increase the number of shards for the Delta UT to 8 or 16.
Is waiting 3 hours for CI completion ok?

github-actions Bot added INFRA DOCS labels Jun 25, 2026

felipepessoto wants to merge 1 commit into apache:main from felipepessoto:delta_pipeline
