Skip to content

GH-45937: [C++][Parquet] support to encode, write and validate variant#50252

Draft
HuaHuaY wants to merge 2 commits into
apache:mainfrom
HuaHuaY:variant
Draft

GH-45937: [C++][Parquet] support to encode, write and validate variant#50252
HuaHuaY wants to merge 2 commits into
apache:mainfrom
HuaHuaY:variant

Conversation

@HuaHuaY

@HuaHuaY HuaHuaY commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Rationale for this change

This PR supports:

  • Mapping between arrow extension variant type and Parquet variant type
  • Encoding unshredded variant arrays
  • Assembling and verifying shredded variant arrays given a typed_value array

This PR does not support:

  • Inferring typed_value shredded data from the value array

What changes are included in this PR?

Are these changes tested?

Yes.

Are there any user-facing changes?

  1. There is a new properties variant_validation_enabled_ in ArrowWriterProperties.
  2. Variant builders at cpp/src/arrow/extension/variant/.

::arrow::internal::Executor* executor_;

bool write_time_adjusted_to_utc_;
bool variant_validation_enabled_;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this belongs to ArrowWriterProperties but not WriterProperties? If users the low-level parquet writer without Arrow API, serialized variant values cannot be validated any more?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Variant is a logical type. So its validation happens in arrow's writer.

Comment thread cpp/src/arrow/extension/CMakeLists.txt Outdated
return field->type()->storage_id() == Type::BINARY ||
field->type()->storage_id() == Type::LARGE_BINARY;

bool IsSupportedPrimitiveTypedValue(const std::shared_ptr<DataType>& type) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that some Arrow primitive types are missing from here: https://arrow.apache.org/docs/format/CanonicalExtensions.html#primitive-type-mappings

Comment thread cpp/src/parquet/variant/validate.h
Comment thread cpp/src/arrow/extension/variant/encoding.h Outdated
Comment thread cpp/src/arrow/extension/variant/encoding.h Outdated
@github-actions github-actions Bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jun 26, 2026

struct ARROW_EXPORT VariantObjectField {
std::string_view name;
uint32_t field_id = 0;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use int32_t

uint8_t offset_size() const { return offset_size_; }
uint32_t dictionary_size() const { return static_cast<uint32_t>(strings_.size()); }

std::string_view string(uint32_t id) const;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add some comment to help understand that this is the dictionary.

Comment thread cpp/src/arrow/extension/variant/encoding.h Outdated
}

Status PrimitivePayloadSize(std::string_view value, size_t offset,
VariantPrimitiveType primitive, size_t* size) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not returning Result<size_t>?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, if we decide to move to parquet folder, we should use ParquetException instead of arrow::Result/Status.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants