[GLUTEN][CI] Add Delta Spark UT pipeline gated against a known-failur…#12371
felipepessoto wants to merge 1 commit
…es baseline

Run delta-io/delta's `spark` ScalaTest suite against a Gluten Velox bundle in CI and gate the results against a committed baseline so the many expected Delta-on-Gluten failures stay manageable and can be fixed incrementally without letting currently-passing tests silently regress.

What it adds (.github/workflows/util/delta-spark-ut/):
- delta_spark_ut.yml: builds the native lib + Gluten bundle, then runs the Delta spark suite sharded by suite into 4 shards x 4 forked test JVMs (~16-way), and gates each shard against the baseline.
- compare-test-results.py: the gate. Per shard, regressions (failed not in the baseline) fail the build; newly-passing baselined tests are flagged so the baseline can be tightened. Also supports seed/aggregate modes.
- known-failures.txt: the committed baseline of expected failures.
- setup-delta.sh: clones Delta, injects the Gluten bundle, patches DeltaSQLCommandTest, and force-fails the two DeletionVectorsSuite 2B-row tests whose native row-index materialization OOM-kills the runner and hangs the shard.
- README.md: how the pipeline, gating and baseline-refresh work.

The workflow also carries a hang watchdog that thread-dumps and kills a wedged fork, and tunes the per-fork heap (2G) and off-heap (2G) to fit the ~16G runner.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@zhztheplayer @philo-he I re-created the PR because the previous one had accumulated so many Copilot comments from all my updates that it was impossible to follow. I'd like to answer the questions you posted:
A: I think it depends on the approach we agree to follow: whether we add this as a manual Action or run it on every PR (the best option, but expensive).
A: I think having a separate job makes it easier to maintain. Everything in this PR is new and could be easily replaced if needed. Also, the Delta tests take so long that the time to build the native artifacts becomes relatively small. I'm concerned about #12288; ideally, we should increase the number of shards for Delta UT to 8 or 16.
Fix #9296.
I wanted to create this PR to start discussing this, so we can get an idea of how it would work, whether it is worthwhile, etc.
What changes are proposed in this pull request?
Adds a CI pipeline that runs delta-io/delta's `spark` ScalaTest suite against the Gluten Velox bundle, so we can validate Gluten against a real Delta release and catch regressions over time.

Running the Delta UTs on Gluten produces many expected failures (Gluten does not yet offload every Delta code path, and falls back or behaves differently in places). A plain "red on any failure" gate would be useless. Instead, the pipeline keeps a committed baseline of known failures and gates each run against it:
- … `fail_on_fixed=false`.

How it works
- … `gluten-velox-bundle` fat jar (Spark 4.1 + Scala 2.13 + JDK 17, Delta profile).
- … `v4.2.0`), drops the bundle onto the `spark` project's test classpath, and patches `DeltaSQLCommandTest` to register `GlutenPlugin`.
- … `sbt spark/test` sharded by suite across 16 shards, with ScalaTest's JUnit XML reporter enabled, then gates each shard with `compare-test-results.py` against `known-failures.txt`. A final job aggregates all shards into a single ready-to-commit baseline and flags stale entries.

Files
- `.github/workflows/delta_spark_ut.yml`
- `.github/workflows/util/delta-spark-ut/setup-delta.sh`: … `DeltaSQLCommandTest`.
- `.github/workflows/util/delta-spark-ut/compare-test-results.py`
- `.github/workflows/util/delta-spark-ut/known-failures.txt`: … `<suite>#<test>` per line).
- `.github/workflows/util/delta-spark-ut/README.md`

Operational hardening
- … `--add-opens` set plus `-Dio.netty.tryReflectionSetAccessible=true` (otherwise Arrow's allocator fails to initialize).

Scope / known limitations
- … `v4.2.0` / Spark 4.1 / Scala 2.13 / JDK 17.
- … `workflow_dispatch` run with `update_baseline=true`.
- … `IdentityColumn.logTableWrite` first param `Snapshot` -> `SnapshotDescriptor`), which `NoSuchMethodError`s on every write. Supporting 4.3.0 needs the bundle built against 4.3.0; tracked as follow-up.

How was this patch tested?
This change is CI. The workflow runs automatically on PRs that touch its files and via manual dispatch. In the latest runs all shards pass against the committed baseline (failures limited to known-failures entries; no regressions).
19,073 Delta tests run (18,110 passed / 963 failed)
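To make the gating rule concrete, here is a minimal sketch of how a shard's results could be compared against the baseline. It is not the actual `compare-test-results.py`: the function names are hypothetical, the mapping of the JUnit XML `classname`/`name` attributes onto the `<suite>#<test>` baseline format is an assumption, and so is the exact meaning given to `fail_on_fixed`.

```python
import xml.etree.ElementTree as ET


def parse_baseline(text):
    """Parse baseline text with one '<suite>#<test>' entry per line.

    Blank lines are skipped (an assumption about the file format).
    """
    return {line.strip() for line in text.splitlines() if line.strip()}


def failed_and_passed(junit_xml):
    """Collect failed and passed test ids from one JUnit XML report string.

    Assumes each <testcase> carries 'classname' (the suite) and 'name'
    (the test), and that a <failure> or <error> child marks a failed test.
    """
    failed, passed = set(), set()
    root = ET.fromstring(junit_xml)
    for case in root.iter("testcase"):
        test_id = f"{case.get('classname')}#{case.get('name')}"
        if case.find("failure") is not None or case.find("error") is not None:
            failed.add(test_id)
        elif case.find("skipped") is None:
            passed.add(test_id)
    return failed, passed


def gate(failed, passed, baseline, fail_on_fixed=False):
    """Return (ok, regressions, fixed) for one shard.

    regressions: tests that failed but are not in the baseline.
    fixed: baselined tests that now pass (candidates for removal).
    With fail_on_fixed=False, fixed tests are only reported, not fatal.
    """
    regressions = sorted(failed - baseline)
    fixed = sorted(baseline & passed)
    ok = not regressions and not (fail_on_fixed and fixed)
    return ok, regressions, fixed


if __name__ == "__main__":
    # Hypothetical report and baseline, for illustration only.
    report = (
        '<testsuite name="S">'
        '<testcase classname="S" name="a"/>'
        '<testcase classname="S" name="b"><failure message="x"/></testcase>'
        "</testsuite>"
    )
    failed, passed = failed_and_passed(report)
    print(gate(failed, passed, parse_baseline("S#b\n")))
```

Keeping regressions and newly-passing tests as separate lists lets a run stay green while still surfacing baseline entries that can be tightened.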
Delta Spark UT (Gluten) — shard count vs test parallelism
Sharding is by suite (`MurmurHash3(suiteName) % NUM_SHARDS`), so total test work is fixed (~1250 fork-minutes). The runners are 4-core / ~16 GB.

Was this patch authored or co-authored using generative AI tooling?
Generated-by: GitHub Copilot CLI