Skip to content

[CH] Implement map_from_entries function#12349

Open
exmy wants to merge 6 commits into
apache:mainfrom
exmy:gluten-12348
Open

[CH] Implement map_from_entries function#12349
exmy wants to merge 6 commits into
apache:mainfrom
exmy:gluten-12348

Conversation

@exmy

@exmy exmy commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

What changes are proposed in this pull request?

Add ClickHouse backend support for map_from_entries and its LAST_WIN variant, including Substrait function mapping and tests for null and duplicate-key behavior.

close #12348

How was this patch tested?

Add UT.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Codex gpt 5.5

Copilot AI review requested due to automatic review settings June 24, 2026 04:07
@github-actions github-actions Bot added CORE works for Gluten Core CLICKHOUSE labels Jun 24, 2026
@github-actions

Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

@github-actions

Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

@github-actions

Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

Copilot AI review requested due to automatic review settings June 24, 2026 10:01
@github-actions

Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

exmy added 2 commits June 25, 2026 09:51
Add ClickHouse backend support for map_from_entries and its LAST_WIN variant, including Substrait function mapping and tests for null and duplicate-key behavior.
@github-actions

Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

Copilot AI review requested due to automatic review settings June 25, 2026 06:34
@github-actions

Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

Comment on lines +156 to +164
for (size_t entry = previous_entry_offset; entry < current_entry_offset; ++entry)
{
if ((*entry_null_map)[entry])
{
if (result_null_map_data)
(*result_null_map_data)[row] = 1;
break;
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bool entry_map_null = false
if (entry_null_map)
{
    for (size_t entry = previous_entry_offset; entry < current_entry_offset; ++entry)
    {
        if ((*entry_null_map)[entry])
        {
             entry_map_null = true;
             break;
         }
     }
}
if (result_null_map_data)
    (*result_null_map_data)[row] = entry_map_null || (input_null_map && (*input_null_map)[row]);

@github-actions

Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

bool has_duplicate_key = false;
for (auto & selected_entry : selected_entries)
{
if (key_column.compareAt(entry, selected_entry.first, key_column, 1) == 0)

@lgbo-ustc lgbo-ustc Jun 26, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This duplicate-key lookup scans all previously selected entries for every input entry, making the per-row complexity O(n^2). For large input arrays this can become expensive. Could we use a per-row hash map/set, similar to SparkFunctionArrayDistinct, to track seen keys and only fall back to compareAt on hash matches to preserve exact equality semantics?

For example, the duplicate lookup could be structured like this:

std::unordered_map<UInt128, size_t> seen;

for (size_t entry = previous_entry_offset; entry < current_entry_offset; ++entry)
{
    if (key_column.isNullAt(entry))
        throw Exception(...);

    SipHash hash_function;
    key_column.updateHashWithValue(entry, hash_function);
    UInt128 hash = hash_function.get128();

    auto it = seen.find(hash);
    if (it != seen.end() &&
        key_column.compareAt(entry, it->second, key_column, 1) == 0)
    {
        if constexpr (last_win)
        {
            // update value index
        }
        else
        {
            throw Exception(...);
        }
    }
    else
    {
        seen.emplace(hash, entry);
        selected_entries.emplace_back(entry, entry);
    }
}

For LAST_WIN, the stored value may need to point to the selected_entries index instead of the raw entry index, so the duplicate branch can update selected_entries[it->second].second = entry while preserving the first key position/output order.

return map_type;
}

ColumnPtr executeImpl(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

executeImpl is getting quite long and currently mixes several concerns: unwrapping nullable input, handling nullable array entries, the Array(Nothing) special case, duplicate-key selection, and final map-column construction. Could we split the row-level null handling / duplicate-key selection into small helpers? That would make the Spark semantics easier to audit and also make the duplicate-key lookup optimization more localized.

For example, the main loop could become closer to:

auto appendNullMap = [&]()
{
    if (result_null_map_data)
        (*result_null_map_data)[row] = 1;
    result_offsets.push_back(result_offset);
};

for (size_t row = 0; row < input_rows_count; ++row)
{
    const auto current_entry_offset = entries_offsets[row];

    if ((input_null_map && (*input_null_map)[row])
        || hasNullEntry(entry_null_map, previous_entry_offset, current_entry_offset))
    {
        appendNullMap();
        previous_entry_offset = current_entry_offset;
        continue;
    }

    auto selected_entries = selectEntriesForRow(
        key_column,
        previous_entry_offset,
        current_entry_offset);

    appendSelectedEntries(
        selected_entries,
        *key_insert_column,
        value_column,
        *result_key_column,
        *result_value_column,
        result_offset);

    result_offsets.push_back(result_offset);
    previous_entry_offset = current_entry_offset;
}

Then selectEntriesForRow could own the duplicate-key policy, including the possible hash-based implementation.

val mapKeyDedupPolicy = SQLConf.get.getConf(SQLConf.MAP_KEY_DEDUP_POLICY)
BackendsApiManager.getSparkPlanExecApiInstance.genMapFromEntriesTransformer(
substraitExprName,
if (mapKeyDedupPolicy.toString == SQLConf.MapKeyDedupPolicy.LAST_WIN.toString) {

@lgbo-ustc lgbo-ustc Jun 26, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a bit concerned about putting the LAST_WIN function-name rewrite in the common ExpressionConverter. This changes the Substrait function name for all backends when spark.sql.mapKeyDedupPolicy=LAST_WIN, but map_from_entries_last_win is currently a ClickHouse-specific function mapping. Other backends may receive this new function name without supporting it.

Could we keep the common converter backend-agnostic and move this policy-to-function-name decision into CHSparkPlanExecApi.genMapFromEntriesTransformer instead? For example, ExpressionConverter can continue passing substraitExprName:

case m: MapFromEntries =>
  BackendsApiManager.getSparkPlanExecApiInstance.genMapFromEntriesTransformer(
    substraitExprName,
    replaceWithExpressionTransformer0(m.child, attributeSeq, expressionsMap),
    m)

Then ClickHouse can make the backend-specific function-name choice locally:

override def genMapFromEntriesTransformer(
    substraitExprName: String,
    child: ExpressionTransformer,
    expr: Expression): ExpressionTransformer = {
  val mapKeyDedupPolicy = SQLConf.get.getConf(SQLConf.MAP_KEY_DEDUP_POLICY)
  val functionName =
    if (mapKeyDedupPolicy == SQLConf.MapKeyDedupPolicy.LAST_WIN) {
      ExpressionNames.MAP_FROM_ENTRIES_LAST_WIN
    } else {
      substraitExprName
    }

  GenericExpressionTransformer(functionName, Seq(child), expr)
}

This keeps the common converter backend-agnostic and limits the new map_from_entries_last_win function name to the backend that registers it.

Copilot AI review requested due to automatic review settings June 26, 2026 07:13
@github-actions

Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

@github-actions

Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLICKHOUSE CORE works for Gluten Core

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CH] Support map_from_entries function

3 participants