Skip to content

refactor(arrow/avro): migrate from hamba/avro to twmb/avro#830

Open
prochac wants to merge 2 commits into
apache:mainfrom
prochac:refactor/avro-lib-migration
Open

refactor(arrow/avro): migrate from hamba/avro to twmb/avro#830
prochac wants to merge 2 commits into
apache:mainfrom
prochac:refactor/avro-lib-migration

Conversation

@prochac

@prochac prochac commented May 26, 2026

Copy link
Copy Markdown

Rationale for this change

Closes #813

What changes are included in this PR?

Primarily, the lib was changed. New UUID type support was added. Parsing simplified, that's thanks to the new lib that doesn't use map[string]any wrapper.

Are these changes tested?

All tests pass, new test for the UUID type was added

Are there any user-facing changes?

The ArrowSchemaFromAvro function was deprecated, as it causes coupling with 3rd-party Avro module

@prochac prochac requested a review from zeroshade as a code owner May 26, 2026 23:09

@zeroshade zeroshade left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you rebase and resolve the conflicts please?

Comment thread arrow/avro/schema.go Outdated
Comment on lines +91 to +94
node.namedCache[s.Name] = s
if s.Namespace != "" {
node.namedCache[s.Namespace+"."+s.Name] = s
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's worth noting that the hamba SchemaCache keyed on full name only, so for schemas with two records named Foo in different namespaces a bare {"type": "Foo"} will resolve to whichever was defined latest as opposed to being an error (the previous behavior under hamba).

We should either restrict to full-name lookups or document the change in behavior here that where a malformed schema may now silently produce something where hamba would have errored.

Comment on lines -663 to -676
case []byte:
buf := bytes.NewBuffer(dt)
if len(dt) <= 38 {
var intData int64
err := binary.Read(buf, binary.BigEndian, &intData)
if err != nil {
return err
}
b.Append(decimal128.FromI64(intData))
} else {
var bigIntData big.Int
b.Append(decimal128.FromBigInt(bigIntData.SetBytes(buf.Bytes())))
}
case map[string]any:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If anyone was using custom-decoders that emitted []byte or map[string]any etc. they are now going to see "unsupported value of type X" errors where old code would have accepted the input. Let's update the description on the PR (and therefore the resulting changelog later) to note this change in behavior for anyone who was using custom-decoders.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was possible only if they used the global type registry. The map[string]any shouldn't affect people who were using only the arrow-go API. But those who used https://pkg.go.dev/github.com/hamba/avro/v2#Register ... There is no good way to handle that :/

Comment on lines +570 to +571
case 36:
return b.AppendValueFromString(string(dt))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's add a comment here explaining why we have a case for 36 (i guess, hex-dash UUID string?)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as far as I'm aware, I don't think twmb ever yields a 36-byte []byte for a uuid case, so what scenario is this testing? If this is unreachable can we just remove it?

Comment thread arrow/avro/schema.go Outdated
Comment on lines +350 to +351
for i := range s.Branches {
b := s.Branches[i]

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for i := range s.Branches {
b := s.Branches[i]
for _, b := range s.Branches {

Comment thread arrow/avro/schema.go
Comment on lines +144 to +147
root := schema.Root()
n := newSchemaNode()
n.schema = schema
c := n.newChild(n.schema.(avro.NamedSchema).Name(), n.schema)
n.node = root
c := n.newChild(root.Name, root)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OCF requires a named record at the top, we should probably add a guard or explicit "must be named record" check here (the hamba code would have panic'd).

@zeroshade

Copy link
Copy Markdown
Member

@prochac are you still working on this?

prochac and others added 2 commits June 20, 2026 00:34
Switch the avro reader from hamba/avro to twmb/avro for schema parsing
and value decoding. The new library hands back typed Go values (e.g.
*big.Rat for decimals, [16]byte for fixed-uuid) instead of []byte, which
shortens the data-side helpers and lets us fix a stack of bugs the old
[]byte arms were masking.

The exported arrow/avro surface is preserved: NewOCFReader, OCFReader
and its methods, the Option/With… helpers, and ArrowSchemaFromAvro
all keep their signatures. A new ArrowSchemaFromAvroJSON is added
alongside as the recommended entry point — it doesn't couple callers
to a particular Avro library through its type signature.
ArrowSchemaFromAvro is now marked Deprecated and serializes the hamba
schema via json.Marshal before re-parsing with twmb; the
github.com/hamba/avro/v2 dependency is kept so downstream callers
continue to compile.

Newly supported logical types (the mapping had them commented out
under hamba):
- local-timestamp-millis
- local-timestamp-micros
- local-timestamp-nanos
- timestamp-nanos

Fixes:
- fixed(16)+uuid rows are no longer silently dropped — the
  UUIDBuilder now handles [16]byte directly.
- ArrowSchemaFromAvro (deprecated wrapper) preserves logical-type
  annotations on fixed (uuid/decimal/duration). The wrapper now uses
  json.Marshal, which dispatches to each schema type's MarshalJSON;
  hambaAvro.Schema.String() returns Avro Parsing Canonical Form,
  which strips logical types by spec, and was the wrong serializer
  to bridge between libraries.
- Heterogeneous and non-nullable multi-branch unions (e.g.
  ["null", A, B], ["int", "string"]) used to silently mis-decode
  against the wrong branch or fall through to an opaque downstream
  nil-field error. They now fail upfront with "unsupported avro
  union at <path>".
- Nullable unions written as ["T", "null"] (with null in the second
  position) now resolve to T as expected; the old code always took
  Types()[1] and produced a Null-typed column for that ordering.
- Map-of-primitive schemas (e.g. {"type":"map","values":"int"}) now
  parse; the old code asserted the value type to NamedSchema and
  panicked on primitive values.
- Unknown SchemaNode types used to leave a nameless nil-typed field
  that exploded in the record builder; now they fail with
  "unhandled avro type %q", surfaced as "invalid avro schema".
- Duration's []byte arm read Uint16 with gaps and overflowed uint32
  on the milli multiply — gone; only the avro.Duration arm remains.
- append*Data helpers no longer silently no-op or fmt.Sprint-coerce
  unknown inputs; they return "unsupported value of type %T".
- Enum-symbol mismatches in appendBinaryDictData surface a clear
  error instead of a generic dictionary builder failure.
- Schema-conversion errors wrap with %w so callers can
  errors.Unwrap past arrow.ErrInvalid.

OCF wire format is unchanged: twmb/avro/ocf supports the same codecs
(null, deflate, snappy, zstd) and files written by hamba remain
readable.
- Reject ambiguous bare named-type references: key the named-type cache by
  full name and error when an unqualified reference matches a short name
  defined under multiple namespaces, instead of resolving to an arbitrary one.
- Guard that the top-level schema is a record (OCF requirement) and fail up
  front otherwise; the parser already guarantees a record has a name.
- Simplify the union-branch loop in nullableBranch.
- Document the 36-byte (hex-dash UUID text) case in appendUUIDData.
- Add tests for the root guard (non-record and parser-rejected empty name) and
  the ambiguous-reference rejection.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@prochac prochac force-pushed the refactor/avro-lib-migration branch from ce682aa to 6092d57 Compare June 20, 2026 14:08

@zeroshade zeroshade left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Only the loadDatum comment is blocking, the rest are nitpicks.

Also can we add some test scenarios for:

  • timestamp-nanos/local-timestamp-nanos: you added them but they aren't in the alltypes.avsc or testdata.go fixtures.
  • non-utc local-timestamp: a small unit test to assert the local-timestamp value in a non-utc TZ
  • heterogeneous-union rejection: something like ["int", "string"], ["null", A, B] will properly panic -> error, but the test TestComplexUnionReportsError was removed and not replaced, let's add a test for it and assert the new "unsupported avro union" message
  • We need tests for Reuse(), a positive WithChunk(n) and enum-only columns. The code looks correct, but there's no tests for them currently.

return err
}

func (d *dataLoader) loadDatum(data any) error {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this currently propagates errors for top-level scalar fields, but discards them everywhere else:

  • mapValue struct loader (line 96)
  • recursive calls to a child's loadDatum (lines 137, 155, 160, 162, 164, 195, 198, 222-223)
  • list append and items: lines 172, 174, 177, 203, 206, 227-228
  • map branch append/key/value: lines 234, 237, 239, 241, 243

The original test being replaced asserted that the map-value and list-item type mismatches would have surfaced as errors. Can we re-wrap the nested appendFunc/loadDatum calls with the same error/ErrNullStructData handling used in the scalar loops? And can we restore a TestLoadDatumPropagatesNestedAppendErrors unit test (using twmb of course)

Comment on lines +570 to +571
case 36:
return b.AppendValueFromString(string(dt))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as far as I'm aware, I don't think twmb ever yields a 36-byte []byte for a uuid case, so what scenario is this testing? If this is unreachable can we just remove it?

Comment thread arrow/avro/reader.go
if err != nil {
return fmt.Errorf("%w: could not parse avro header", arrow.ErrInvalid)
}
if rr.avroSchema != schema.String() {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should we change this to doing an actual schema check instead? just so that we don't end up rejecting cases with different whitespace/formatting

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Avro] hamba/avro is abandoned

2 participants