[CH] Implement map_from_entries function by exmy · Pull Request #12349 · apache/gluten

exmy · 2026-06-24T04:07:48Z

What changes are proposed in this pull request?

Add ClickHouse backend support for map_from_entries and its LAST_WIN variant, including Substrait function mapping and tests for null and duplicate-key behavior.

close #12348

How was this patch tested?

Add UT.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Codex gpt 5.5

github-actions · 2026-06-24T04:08:18Z

Run Gluten Clickhouse CI on x86

Copilot

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

github-actions · 2026-06-24T04:18:21Z

Run Gluten Clickhouse CI on x86

Copilot

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

github-actions · 2026-06-24T07:08:25Z

Run Gluten Clickhouse CI on x86

github-actions · 2026-06-24T10:07:06Z

Run Gluten Clickhouse CI on x86

Copilot

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

Add ClickHouse backend support for map_from_entries and its LAST_WIN variant, including Substrait function mapping and tests for null and duplicate-key behavior.

github-actions · 2026-06-25T01:52:52Z

Run Gluten Clickhouse CI on x86

github-actions · 2026-06-25T06:35:03Z

Run Gluten Clickhouse CI on x86

Copilot

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

lgbo-ustc · 2026-06-26T06:31:43Z

+                    for (size_t entry = previous_entry_offset; entry < current_entry_offset; ++entry)
+                    {
+                        if ((*entry_null_map)[entry])
+                        {
+                            if (result_null_map_data)
+                                (*result_null_map_data)[row] = 1;
+                            break;
+                        }
+                    }


bool entry_map_null = false if (entry_null_map) { for (size_t entry = previous_entry_offset; entry < current_entry_offset; ++entry) { if ((*entry_null_map)[entry]) { entry_map_null = true; break; } } } if (result_null_map_data) (*result_null_map_data)[row] = entry_map_null || (input_null_map && (*input_null_map)[row]);

github-actions · 2026-06-26T06:54:01Z

Run Gluten Clickhouse CI on x86

lgbo-ustc · 2026-06-26T06:54:06Z

+                bool has_duplicate_key = false;
+                for (auto & selected_entry : selected_entries)
+                {
+                    if (key_column.compareAt(entry, selected_entry.first, key_column, 1) == 0)


This duplicate-key lookup scans all previously selected entries for every input entry, making the per-row complexity O(n^2). For large input arrays this can become expensive. Could we use a per-row hash map/set, similar to SparkFunctionArrayDistinct, to track seen keys and only fall back to compareAt on hash matches to preserve exact equality semantics?

For example, the duplicate lookup could be structured like this:

std::unordered_map<UInt128, size_t> seen; for (size_t entry = previous_entry_offset; entry < current_entry_offset; ++entry) { if (key_column.isNullAt(entry)) throw Exception(...); SipHash hash_function; key_column.updateHashWithValue(entry, hash_function); UInt128 hash = hash_function.get128(); auto it = seen.find(hash); if (it != seen.end() && key_column.compareAt(entry, it->second, key_column, 1) == 0) { if constexpr (last_win) { // update value index } else { throw Exception(...); } } else { seen.emplace(hash, entry); selected_entries.emplace_back(entry, entry); } }

For LAST_WIN, the stored value may need to point to the selected_entries index instead of the raw entry index, so the duplicate branch can update selected_entries[it->second].second = entry while preserving the first key position/output order.

lgbo-ustc · 2026-06-26T07:02:16Z

+        return map_type;
+    }
+
+    ColumnPtr executeImpl(


executeImpl is getting quite long and currently mixes several concerns: unwrapping nullable input, handling nullable array entries, the Array(Nothing) special case, duplicate-key selection, and final map-column construction. Could we split the row-level null handling / duplicate-key selection into small helpers? That would make the Spark semantics easier to audit and also make the duplicate-key lookup optimization more localized.

For example, the main loop could become closer to:

auto appendNullMap = [&]() { if (result_null_map_data) (*result_null_map_data)[row] = 1; result_offsets.push_back(result_offset); }; for (size_t row = 0; row < input_rows_count; ++row) { const auto current_entry_offset = entries_offsets[row]; if ((input_null_map && (*input_null_map)[row]) || hasNullEntry(entry_null_map, previous_entry_offset, current_entry_offset)) { appendNullMap(); previous_entry_offset = current_entry_offset; continue; } auto selected_entries = selectEntriesForRow( key_column, previous_entry_offset, current_entry_offset); appendSelectedEntries( selected_entries, *key_insert_column, value_column, *result_key_column, *result_value_column, result_offset); result_offsets.push_back(result_offset); previous_entry_offset = current_entry_offset; }

Then selectEntriesForRow could own the duplicate-key policy, including the possible hash-based implementation.

lgbo-ustc · 2026-06-26T07:07:00Z

+        val mapKeyDedupPolicy = SQLConf.get.getConf(SQLConf.MAP_KEY_DEDUP_POLICY)
        BackendsApiManager.getSparkPlanExecApiInstance.genMapFromEntriesTransformer(
-          substraitExprName,
+          if (mapKeyDedupPolicy.toString == SQLConf.MapKeyDedupPolicy.LAST_WIN.toString) {


I am a bit concerned about putting the LAST_WIN function-name rewrite in the common ExpressionConverter. This changes the Substrait function name for all backends when spark.sql.mapKeyDedupPolicy=LAST_WIN, but map_from_entries_last_win is currently a ClickHouse-specific function mapping. Other backends may receive this new function name without supporting it.

Could we keep the common converter backend-agnostic and move this policy-to-function-name decision into CHSparkPlanExecApi.genMapFromEntriesTransformer instead? For example, ExpressionConverter can continue passing substraitExprName:

case m: MapFromEntries => BackendsApiManager.getSparkPlanExecApiInstance.genMapFromEntriesTransformer( substraitExprName, replaceWithExpressionTransformer0(m.child, attributeSeq, expressionsMap), m)

Then ClickHouse can make the backend-specific function-name choice locally:

override def genMapFromEntriesTransformer( substraitExprName: String, child: ExpressionTransformer, expr: Expression): ExpressionTransformer = { val mapKeyDedupPolicy = SQLConf.get.getConf(SQLConf.MAP_KEY_DEDUP_POLICY) val functionName = if (mapKeyDedupPolicy == SQLConf.MapKeyDedupPolicy.LAST_WIN) { ExpressionNames.MAP_FROM_ENTRIES_LAST_WIN } else { substraitExprName } GenericExpressionTransformer(functionName, Seq(child), expr) }

This keeps the common converter backend-agnostic and limits the new map_from_entries_last_win function name to the backend that registers it.

github-actions · 2026-06-26T07:13:42Z

Run Gluten Clickhouse CI on x86

Copilot

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

github-actions · 2026-06-26T07:25:18Z

Run Gluten Clickhouse CI on x86

Copilot AI review requested due to automatic review settings June 24, 2026 04:07

github-actions Bot added CORE works for Gluten Core CLICKHOUSE labels Jun 24, 2026

Copilot started reviewing on behalf of exmy June 24, 2026 04:08 View session

Copilot AI reviewed Jun 24, 2026

exmy force-pushed the gluten-12348 branch from a739a5b to 536534b Compare June 24, 2026 04:17

exmy requested a review from Copilot June 24, 2026 06:43

Copilot started reviewing on behalf of exmy June 24, 2026 06:44 View session

Copilot AI reviewed Jun 24, 2026

exmy force-pushed the gluten-12348 branch from 536534b to 2690fd7 Compare June 24, 2026 07:07

Copilot AI review requested due to automatic review settings June 24, 2026 10:01

Copilot started reviewing on behalf of exmy June 24, 2026 10:06 View session

Copilot AI reviewed Jun 24, 2026

exmy added 2 commits June 25, 2026 09:51

[CH] Implement map_from_entries function

0f6fb3f

Add ClickHouse backend support for map_from_entries and its LAST_WIN variant, including Substrait function mapping and tests for null and duplicate-key behavior.

[CH] Fix map_from_entries empty array handling

163cdf3

exmy force-pushed the gluten-12348 branch from 565bc73 to 163cdf3 Compare June 25, 2026 01:52

fix

5be7f8b

Copilot AI review requested due to automatic review settings June 25, 2026 06:34

Copilot started reviewing on behalf of exmy June 25, 2026 06:34 View session

Copilot AI reviewed Jun 25, 2026

lgbo-ustc reviewed Jun 26, 2026

View reviewed changes

fix

1a0e0d6

lgbo-ustc reviewed Jun 26, 2026

View reviewed changes

Optimize map_from_entries duplicate key lookup

d91bcb1

Copilot AI review requested due to automatic review settings June 26, 2026 07:13

Copilot started reviewing on behalf of exmy June 26, 2026 07:13 View session

Copilot AI reviewed Jun 26, 2026

Move map_from_entries LAST_WIN rewrite to CH backend

daef642

Uh oh!

Conversation

exmy commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes are proposed in this pull request?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

github-actions Bot commented Jun 24, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Jun 24, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Jun 24, 2026

Uh oh!

github-actions Bot commented Jun 24, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Jun 25, 2026

Uh oh!

github-actions Bot commented Jun 25, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

lgbo-ustc Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Jun 26, 2026

Uh oh!

lgbo-ustc Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lgbo-ustc Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

lgbo-ustc Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Jun 26, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

exmy commented Jun 24, 2026 •

edited

Loading

lgbo-ustc Jun 26, 2026 •

edited

Loading

lgbo-ustc Jun 26, 2026 •

edited

Loading