feat(schema): represent, serialize and validate v3 column default values (1/4)#746
Conversation
445f7ae to
4aade00
Compare
1ee5b32 to
34470af
Compare
There was a problem hiding this comment.
Pull request overview
Adds Iceberg v3 column default value support at the schema layer by carrying, JSON-(de)serializing, validating, and projecting initial-default / write-default literals (foundation for later read/write-path work).
Changes:
- Extend
SchemaFieldto storeinitial-default/write-defaultas shared immutable literals and include them in equality/ID-reassignment rebuilds. - Add JSON serde for
initial-default/write-defaultusing existing single-value literal serialization. - Update schema projection to use
FieldProjection::Kind::kDefaultwhen an expected field is missing but hasinitial-default, and add/extend unit tests + v3 metadata fixture.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| src/iceberg/test/schema_util_test.cc | Adds projection tests for missing fields with initial-default and for ignoring defaults when the field is present. |
| src/iceberg/test/schema_test.cc | Adds schema validation/version-gating and ID-reassignment preservation tests for default values. |
| src/iceberg/test/schema_json_test.cc | Adds round-trip and failure tests for JSON serialization/parsing of default values (incl. nested structs). |
| src/iceberg/test/resources/TableMetadataV3Valid.json | Adds a v3-valid table metadata JSON fixture. |
| src/iceberg/schema.cc | Preserves defaults during ID reassignment and adds default-related validation in Schema::Validate. |
| src/iceberg/schema_util.cc | Projects missing fields with initial-default as kDefault rather than error/null. |
| src/iceberg/schema_field.h | Extends SchemaField API/storage to carry default literals and expose them via optional reference accessors. |
| src/iceberg/schema_field.cc | Implements default accessors, validation of defaults, and includes defaults in equality. |
| src/iceberg/json_serde.cc | Serializes/parses initial-default / write-default on schema fields via literal single-value serde. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
First of a multi-part split of column default value support (apache#730) — the schema foundation the read and evolution paths build on. Purely additive; no read/write behavior change on its own. - SchemaField carries `initial-default` / `write-default` (immutable std::shared_ptr<const Literal>) with copy-preserving WithInitialDefault / WithWriteDefault modifiers; getters return optional<reference_wrapper>. - JSON serde reads/writes `initial-default` / `write-default` via the existing single-value serialization. - Schema::Validate rejects default values below format v3 and validates they are non-null primitive literals matching the field type. - Generic schema projection maps a column missing from a data file with an initial-default to FieldProjection::Kind::kDefault. Read-path application (Parquet/Avro) and schema evolution follow in separate PRs. See apache#731 for the full end-to-end proof-of-concept.
34470af to
f663e0e
Compare
This file was added by this PR but is referenced by no test (table-metadata resources are loaded by explicit name via ReadTableMetadataFromResource) and carries no column defaults, so it is dead. Remove it.
Add SchemaField default-value coverage directly to json_serde_test (the binary that exercises json_serde.cc): one test pins the initial-default / write-default wire format (incl. their absence when unset), and one round-trips a field with both defaults through ToJson -> FieldFromJson for every primitive type, exercising the non-trivial date/timestamp/decimal/uuid/binary encodings the default path reuses.
…-schema # Conflicts: # src/iceberg/test/json_serde_test.cc
wgtmac
left a comment
There was a problem hiding this comment.
Static review notes; I did not build or run tests.
Addresses wgtmac's review feedback: - Enforce UTC offset for timestamptz / timestamptz_ns default values in FieldFromJson. The shared timestamp parser accepts any offset and silently normalizes to UTC, so C++ could accept default metadata Java rejects and rewrite the offset on write. Added TemporalUtils::IsUtcOffset, which reuses the existing timezone-suffix parser. - ValidateDefault now casts the default literal to the field type (Literal::CastTo) instead of requiring an exact type match, matching Java's Types.NestedField (e.g. an int default on a long field). Uncastable or out-of-range defaults are still rejected. - Documented that non-primitive (struct/list/map) defaults remain unsupported, matching Java's current model. - Extended the json_serde round-trip tests with time/timestamptz/timestamp_ns/ timestamptz_ns and a negative case for non-UTC timestamptz defaults; added a schema test for the cast-to-field-type behavior.
Directly cover the UTC/non-UTC/unparseable offset cases the timestamptz default-value enforcement relies on.
The AMD64 Windows job failed in its cleanup step (rm -rf example/build: Device or resource busy) after the example compiled and linked successfully; re-run CI.
wgtmac
left a comment
There was a problem hiding this comment.
Two remaining static review notes; I did not build or run tests.
| // be cast to the field type or fall outside its range (CastTo signals out-of-range as | ||
| // an above-max/below-min sentinel). | ||
| auto field_type = std::static_pointer_cast<PrimitiveType>(field.type()); | ||
| auto cast = value.CastTo(field_type); |
There was a problem hiding this comment.
This validates the cast but keeps storing the original literal. Java stores the result of castDefault, so a programmatic default like Literal::Long(...) on a timestamp field would validate here, then later serialize or project as a long instead of a timestamp default. Either the field needs to retain the casted literal, or this should keep requiring an exact type match.
|
|
||
| Status ValidateDefault(const SchemaField& field, const Literal& value, | ||
| std::string_view kind) { | ||
| if (value.IsNull() || value.IsAboveMax() || value.IsBelowMin()) { |
There was a problem hiding this comment.
This rejects explicit null defaults. The spec allows optional field defaults to be null and says they may be explicitly set. Java mostly models a null default as absence, but C++ now has a present-null Literal path that will fail validation and, for initial-default, also trips the format-version gate by presence rather than by non-null value.
wgtmac
left a comment
There was a problem hiding this comment.
I've just finished a human review over non-test files. Let me know what you think. Thanks!
| /// | ||
| /// The returned reference is a non-owning view into a value owned by this field; | ||
| /// it remains valid for the lifetime of this SchemaField. | ||
| [[nodiscard]] std::optional<std::reference_wrapper<const Literal>> initial_default() |
There was a problem hiding this comment.
I think we should remove [[nodiscard]] because users should always know what they are doing when these are called. I think we need also clean up existing functions as well.
| /// \brief Get the shared owning pointer to the `initial-default` value, or null if | ||
| /// absent. Prefer initial_default() for reading; this exists so a rebuilt field can | ||
| /// share the (immutable) value rather than copy it. | ||
| [[nodiscard]] const std::shared_ptr<const Literal>& initial_default_ptr() const; |
There was a problem hiding this comment.
It looks odd to provide both initial_default and initial_default_ptr. Should we keep one, like const std::shared_ptr<const Literal>& initial_default() const?
| std::shared_ptr<Type> type_; | ||
| bool optional_; | ||
| std::string doc_; | ||
| // Default values are owned by this field and never mutated after being set; copies |
There was a problem hiding this comment.
Not a strong opinion but the comment looks too verbose.
| } | ||
|
|
||
| Result<bool> TemporalUtils::IsUtcOffset(std::string_view str) { | ||
| ICEBERG_ASSIGN_OR_RAISE(auto timestamp_with_offset, ParseTimestampWithZoneSuffix(str)); |
There was a problem hiding this comment.
This allows Z and -00:00 as well. This is not allowed by the spec, right? However, Java impl seems doing the same thing. Perhaps we need to document this behavior explicitly?
There was a problem hiding this comment.
Thanks for raising this — I dug into both the spec and the reference implementation, and I think accepting Z / +00:00 / -00:00 is correct here.
What the spec says: default values use the single-value JSON serialization (Appendix D). For timestamptz / timestamptz_ns the value must be UTC, serialized as an ISO-8601 timestamp with a +00:00 offset (e.g. 2017-11-16T22:31:08.123456+00:00). The requirement is that the offset is UTC; Z and -00:00 are just equivalent zero-offset spellings of the same instant.
What the reference impl does: SingleValueParser gates timestamptz defaults on DateTimeUtil.isUTCTimestamptz, which is:
OffsetDateTime odt = OffsetDateTime.parse(s, DateTimeFormatter.ISO_DATE_TIME);
return odt.getOffset().equals(ZoneOffset.UTC);So it checks the parsed offset equals UTC — Z, +00:00 and -00:00 all pass, and only a genuinely non-zero offset (e.g. +05:00) is rejected. That's exactly what IsUtcOffset does here. (Java's error string reads "offset must be +00:00", but the underlying check is offset == UTC.)
Refs:
- Spec, Appendix D — Single-value serialization: https://iceberg.apache.org/spec/#appendix-d-single-value-serialization
DateTimeUtil.isUTCTimestamptz: https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/util/DateTimeUtil.java
Happy to tighten it to +00:00-only if you'd prefer, though that would reject UTC values the Java reference accepts.
| #include "iceberg/util/macros.h" | ||
|
|
||
| namespace iceberg { | ||
|
|
There was a problem hiding this comment.
This always create a new literal even when their types are the same which is unlikely?
There was a problem hiding this comment.
Done, remove the normalization method
| // an above-max/below-min sentinel). | ||
| auto field_type = std::static_pointer_cast<PrimitiveType>(field.type()); | ||
| auto cast = value.CastTo(field_type); | ||
| if (!cast.has_value() || cast->IsAboveMax() || cast->IsBelowMin()) { |
There was a problem hiding this comment.
Perhaps it is better to use ICEBERG_ASSIGN_OR_RAISE to respect the original error instead of checking cast.has_value()?
| // Defaults are validated to be primitive literals matching the field type, so | ||
| // single-value serialization cannot fail here. | ||
| if (field.initial_default().has_value()) { | ||
| ICEBERG_ASSIGN_OR_THROW(json[kInitialDefault], |
There was a problem hiding this comment.
Though it cannot not fail, we shouldn't use ICEBERG_ASSIGN_OR_THROW here. Let's change the signature to return Result<nlohmann::json> just in case any error.
There was a problem hiding this comment.
| if (!is_utc) { | ||
| return JsonParseError( | ||
| "Invalid timestamptz default '{}' for {}: default values must use UTC " | ||
| "(offset 'Z' or '+00:00')", |
There was a problem hiding this comment.
As I have commented earlier, this can also accept -00:00 so the error message is not consistent.
| return {}; | ||
| } | ||
| if (!value.is_string()) { | ||
| // Let LiteralFromJson report the type mismatch. |
There was a problem hiding this comment.
Why not directly error out here?
| ICEBERG_ASSIGN_OR_RAISE(auto name, GetJsonValue<std::string>(json, kName)); | ||
| ICEBERG_ASSIGN_OR_RAISE(auto required, GetJsonValue<bool>(json, kRequired)); | ||
| ICEBERG_ASSIGN_OR_RAISE(auto doc, GetJsonValueOrDefault<std::string>(json, kDoc)); | ||
| ICEBERG_ASSIGN_OR_RAISE(std::optional<nlohmann::json> initial_default_json, |
- ValidateDefault: use internal::checked_pointer_cast and ICEBERG_ASSIGN_OR_RAISE so an uncastable default surfaces the original CastTo error instead of a generic message; keep the out-of-range (sentinel) case as an invalid-schema error. - json_serde: error out directly when a timestamptz default is not a string, drop the offset-specific wording from the UTC error message (it also accepts -00:00), and use auto for the parsed default JSON. - Update schema tests for the propagated cast error and the out-of-range case.
…erializers Column default values make single-value (Literal) serialization fallible, so the schema/type serializers must propagate errors instead of throwing. Change ToJson for SchemaField, Type, Schema, TableMetadata and TableUpdate (plus the REST CreateTableRequest/CommitTableRequest/LoadTableResult/CommitTableResponse, which embed them) to return Result<nlohmann::json>, replacing the ICEBERG_ASSIGN_OR_THROW in SchemaField default serialization with ICEBERG_ASSIGN_OR_RAISE. - The shared ToJsonList template (also used by infallible serializers) is left unchanged; TableMetadata serializes its schema list with an explicit loop. - REST adds an ICEBERG_DECLARE_JSON_SERDE_FALLIBLE macro variant for the four models that embed a schema/metadata; the other REST models keep the bare-json ToJson. - Callers bottom out at existing Result boundaries (ToJsonString wrappers, rest_catalog.cc, TableMetadataUtil::Write). - Tests unwrap via ICEBERG_UNWRAP_OR_FAIL.
Extract the FromJson declarations into ICEBERG_DECLARE_FROM_JSON so the four schema/metadata-bearing models declare a Result-returning ToJson without duplicating the whole serde macro.
Keep ICEBERG_DECLARE_JSON_SERDE identical to upstream and declare the four schema/metadata-bearing models (LoadTableResult, CreateTableRequest, CommitTableRequest, CommitTableResponse) explicitly, since only their ToJson return type (Result) differs.
…-schema # Conflicts: # src/iceberg/catalog/rest/rest_catalog.cc
…lizers (#785) ## What Change the schema/type/metadata `ToJson` serializers from returning bare `nlohmann::json` to `Result<nlohmann::json>`: - `json_serde`: `ToJson(SchemaField | Type | Schema | TableMetadata | TableUpdate)` - REST: `ToJson(CreateTableRequest | CommitTableRequest | LoadTableResult | CommitTableResponse)` Errors propagate via `ICEBERG_ASSIGN_OR_RAISE`; callers bottom out at the existing `Result`-returning boundaries (`ToJsonString`, `rest_catalog`, `TableMetadataUtil::Write`). ## Why Preparation for v3 column default values (#730 / #746). Single-value (`Literal`) serialization is fallible, and column defaults invoke it from schema serialization, so these serializers need to propagate the error instead of throwing. Splitting it out keeps the feature PR focused on the feature. ## Notes - **No behavior change** — every conversion still succeeds today; only the return type changes. - The shared `ToJsonList` template stays bare (it also serializes infallible types such as partition specs and snapshots); `TableMetadata` serializes its schema list with an explicit loop. - The REST `ICEBERG_DECLARE_JSON_SERDE` macro is unchanged; the four schema/metadata-bearing models are declared explicitly so only their `ToJson` return type differs. ## Testing No behavior change; existing tests are adapted to the new return type. Full build and `ctest` pass locally.
| if (lhs == nullptr || rhs == nullptr) { | ||
| return lhs == rhs; | ||
| } | ||
| return *lhs == *rhs; |
There was a problem hiding this comment.
Just a reminder that Literal::operator<=> returns unordered when any side IsNull() returns true so two null defaults do not equal. I think we should fix Literal::operator<=> to be null safe?
There was a problem hiding this comment.
Or add a overload Literal::NullSafeEquals()
…-schema # Conflicts: # src/iceberg/catalog/rest/json_serde_internal.h # src/iceberg/json_serde.cc # src/iceberg/test/schema_json_test.cc
… comment Addresses review feedback: document that any zero-offset spelling (Z, +00:00, -00:00) is accepted by TemporalUtils::IsUtcOffset, and shorten the verbose comment on the default-value members.
A cross-type default (e.g. an int literal on a long field, accepted to match Java's cast behavior) was stored un-normalized, so projection materialized a wrong-typed fill value and a ToJson/FieldFromJson round-trip compared unequal (Int(34) vs Long(34)). Cast the default to the field type when constructing the field; values that already match are untouched and uncastable ones are left for Validate() to reject. Fixes the projection and round-trip-equality paths.
…::Make factory Move default-value normalization out of the constructor (which cannot report errors) into a fallible factory, per review feedback. SchemaField::Make casts a cross-type default to the field type and propagates the cast error / rejects out-of-range values instead of swallowing them; it is a no-op when the default already matches the field type. The constructor stores defaults verbatim, and ValidateDefault requires a stored default to match the field type exactly (Make is the cast-accepting path). This keeps the cross-type-accept behavior while ensuring stored defaults are always field-typed, so projection and JSON round-trip equality are correct.
Per review: replace the four default-value getters (optional<reference_wrapper> readers + _ptr sharers) with a single shared_ptr-returning getter per default, and remove the [[nodiscard]] attributes from SchemaField's getters.
Per review, a present-null default (Literal::Null) is no longer rejected: it is modeled as the absence of a default (matching Java), and dropped at construction so it is never stored. This also makes two such fields compare equal with no change to Literal::operator<=> (no null Literal is ever stored or compared), so the global comparison semantics used by expression predicates are left untouched.
…-null types, tests - LiteralFromJson accepts an integer JSON token for float/double defaults (Java writes a float default as a bare integer), so Java-produced v3 metadata parses in C++. - ValidateDefault rejects a non-null default on unknown/variant/geometry/geography fields (the spec requires those to default to null). - Tests: full TableMetadata-level v3 default round-trip; non-primitive + must-be-null default rejection; SchemaField equality distinguishes default values; present-null write-default dropped to absence; integer-encoded float default parses.
Revert the float/double LiteralFromJson leniency change and its test: that is the pre-existing shared literal parser, not code this PR introduces. The integer-encoded float-default interop gap (Java writes float defaults as bare ints) is a separate follow-up in a PR that owns the expression serde.
Per scope feedback, only drop [[nodiscard]] from this PR's own additions (the default getters). Restore it on the pre-existing field_id/name/type/optional/ToString/Equals; the broader [[nodiscard]] cleanup belongs in a separate PR.
The unknown/variant/geometry/geography branch reported "{type} fields must
default to null", which is ambiguous for the unknown type (reads as
"unrecognized fields") and awkwardly interpolates a parametrized type
(e.g. geometry(srid:...)) into a categorical sentence. Reword to
"type {} cannot have a default value" and align with the sibling
"Invalid {} value for {}: ..." messages.
SchemaField::Make and the file-local NormalizeDefault cast a cross-type default (e.g. an int literal on a long field) to the field type, but they have no production callers: defaults arrive already at the field type (FieldFromJson parses with the field type), and ValidateDefault's exact type match already rejects any cross-type default. The cast was therefore dead in production and, as wgtmac noted in review, allocated a new Literal even in the same-type case. Remove Make and NormalizeDefault (and their tests). The constructor stores defaults verbatim and Validate rejects type mismatches. The cast belongs in the follow-up that adds a real caller (e.g. schema evolution), where Java performs it in the NestedField constructor (castDefault).
wgtmac
left a comment
There was a problem hiding this comment.
Thanks for addressing all the feedback! Now this looks good to me.
One minor issue is that ValidateDefault now rejects any default whose literal type is not exactly the field type. That still diverges from Java: Types.NestedField.castDefault casts the provided literal to the field type and stores the converted literal. I don't think this is a blocker for now.
Part 1 of a multi-part split of #730 (column default values, item 2 of #637). The full
end-to-end implementation is in #731, kept open as the proof-of-concept; this series
lands it in reviewable pieces.
This PR is the schema foundation — representing, serializing and validating v3
column default values. It is purely additive and changes no read or write behavior on
its own.
What's in this PR
SchemaFieldcarriesinitial-default/write-default, stored asstd::shared_ptr<const Literal>(immutable payload shared across copies, like theadjacent
type_; the C++ analog of Java'sfinal Literal<?>). They are set via theconstructor. Getters return
std::optional<std::reference_wrapper<const Literal>>forreading (the
Schema::FindFieldByNameidiom);initial_default_ptr()/write_default_ptr()expose the shared pointer so a rebuilt field (e.g. IDreassignment) shares the value instead of copying it.
initial-default/write-defaultusing the existingsingle-value serialization (all primitive types).
Schema::Validate: version-gates theinitial-defaultto format v3(
kMinFormatVersionDefaultValues) — it reinterprets how existing data files are read,so it requires the v3 reader contract. The
write-defaultonly affects values writtengoing forward and is not version-gated (matching Java's
Schema.checkCompatibility,which gates only the initial default). Both defaults are otherwise validated to be
non-null primitive literals matching the field type.
initial-defaultmaps to
FieldProjection::Kind::kDefaultcarrying the literal (the per-format readersconsume this in the follow-up PRs).
Follow-ups (stacked on this PR)
literal_util+ parquet projection/materialization)UpdateSchemaadd/update column defaults)Testing
Added tests