[CH] Implement map_from_entries function#12349
Conversation
|
Run Gluten Clickhouse CI on x86 |
|
Run Gluten Clickhouse CI on x86 |
|
Run Gluten Clickhouse CI on x86 |
|
Run Gluten Clickhouse CI on x86 |
Add ClickHouse backend support for map_from_entries and its LAST_WIN variant, including Substrait function mapping and tests for null and duplicate-key behavior.
|
Run Gluten Clickhouse CI on x86 |
|
Run Gluten Clickhouse CI on x86 |
| for (size_t entry = previous_entry_offset; entry < current_entry_offset; ++entry) | ||
| { | ||
| if ((*entry_null_map)[entry]) | ||
| { | ||
| if (result_null_map_data) | ||
| (*result_null_map_data)[row] = 1; | ||
| break; | ||
| } | ||
| } |
There was a problem hiding this comment.
bool entry_map_null = false
if (entry_null_map)
{
for (size_t entry = previous_entry_offset; entry < current_entry_offset; ++entry)
{
if ((*entry_null_map)[entry])
{
entry_map_null = true;
break;
}
}
}
if (result_null_map_data)
(*result_null_map_data)[row] = entry_map_null || (input_null_map && (*input_null_map)[row]);|
Run Gluten Clickhouse CI on x86 |
| bool has_duplicate_key = false; | ||
| for (auto & selected_entry : selected_entries) | ||
| { | ||
| if (key_column.compareAt(entry, selected_entry.first, key_column, 1) == 0) |
There was a problem hiding this comment.
This duplicate-key lookup scans all previously selected entries for every input entry, making the per-row complexity O(n^2). For large input arrays this can become expensive. Could we use a per-row hash map/set, similar to SparkFunctionArrayDistinct, to track seen keys and only fall back to compareAt on hash matches to preserve exact equality semantics?
For example, the duplicate lookup could be structured like this:
std::unordered_map<UInt128, size_t> seen;
for (size_t entry = previous_entry_offset; entry < current_entry_offset; ++entry)
{
if (key_column.isNullAt(entry))
throw Exception(...);
SipHash hash_function;
key_column.updateHashWithValue(entry, hash_function);
UInt128 hash = hash_function.get128();
auto it = seen.find(hash);
if (it != seen.end() &&
key_column.compareAt(entry, it->second, key_column, 1) == 0)
{
if constexpr (last_win)
{
// update value index
}
else
{
throw Exception(...);
}
}
else
{
seen.emplace(hash, entry);
selected_entries.emplace_back(entry, entry);
}
}For LAST_WIN, the stored value may need to point to the selected_entries index instead of the raw entry index, so the duplicate branch can update selected_entries[it->second].second = entry while preserving the first key position/output order.
| return map_type; | ||
| } | ||
|
|
||
| ColumnPtr executeImpl( |
There was a problem hiding this comment.
executeImpl is getting quite long and currently mixes several concerns: unwrapping nullable input, handling nullable array entries, the Array(Nothing) special case, duplicate-key selection, and final map-column construction. Could we split the row-level null handling / duplicate-key selection into small helpers? That would make the Spark semantics easier to audit and also make the duplicate-key lookup optimization more localized.
For example, the main loop could become closer to:
auto appendNullMap = [&]()
{
if (result_null_map_data)
(*result_null_map_data)[row] = 1;
result_offsets.push_back(result_offset);
};
for (size_t row = 0; row < input_rows_count; ++row)
{
const auto current_entry_offset = entries_offsets[row];
if ((input_null_map && (*input_null_map)[row])
|| hasNullEntry(entry_null_map, previous_entry_offset, current_entry_offset))
{
appendNullMap();
previous_entry_offset = current_entry_offset;
continue;
}
auto selected_entries = selectEntriesForRow(
key_column,
previous_entry_offset,
current_entry_offset);
appendSelectedEntries(
selected_entries,
*key_insert_column,
value_column,
*result_key_column,
*result_value_column,
result_offset);
result_offsets.push_back(result_offset);
previous_entry_offset = current_entry_offset;
}Then selectEntriesForRow could own the duplicate-key policy, including the possible hash-based implementation.
| val mapKeyDedupPolicy = SQLConf.get.getConf(SQLConf.MAP_KEY_DEDUP_POLICY) | ||
| BackendsApiManager.getSparkPlanExecApiInstance.genMapFromEntriesTransformer( | ||
| substraitExprName, | ||
| if (mapKeyDedupPolicy.toString == SQLConf.MapKeyDedupPolicy.LAST_WIN.toString) { |
There was a problem hiding this comment.
I am a bit concerned about putting the LAST_WIN function-name rewrite in the common ExpressionConverter. This changes the Substrait function name for all backends when spark.sql.mapKeyDedupPolicy=LAST_WIN, but map_from_entries_last_win is currently a ClickHouse-specific function mapping. Other backends may receive this new function name without supporting it.
Could we keep the common converter backend-agnostic and move this policy-to-function-name decision into CHSparkPlanExecApi.genMapFromEntriesTransformer instead? For example, ExpressionConverter can continue passing substraitExprName:
case m: MapFromEntries =>
BackendsApiManager.getSparkPlanExecApiInstance.genMapFromEntriesTransformer(
substraitExprName,
replaceWithExpressionTransformer0(m.child, attributeSeq, expressionsMap),
m)Then ClickHouse can make the backend-specific function-name choice locally:
override def genMapFromEntriesTransformer(
substraitExprName: String,
child: ExpressionTransformer,
expr: Expression): ExpressionTransformer = {
val mapKeyDedupPolicy = SQLConf.get.getConf(SQLConf.MAP_KEY_DEDUP_POLICY)
val functionName =
if (mapKeyDedupPolicy == SQLConf.MapKeyDedupPolicy.LAST_WIN) {
ExpressionNames.MAP_FROM_ENTRIES_LAST_WIN
} else {
substraitExprName
}
GenericExpressionTransformer(functionName, Seq(child), expr)
}This keeps the common converter backend-agnostic and limits the new map_from_entries_last_win function name to the backend that registers it.
|
Run Gluten Clickhouse CI on x86 |
|
Run Gluten Clickhouse CI on x86 |
What changes are proposed in this pull request?
Add ClickHouse backend support for map_from_entries and its LAST_WIN variant, including Substrait function mapping and tests for null and duplicate-key behavior.
close #12348
How was this patch tested?
Add UT.
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Codex gpt 5.5