
[Python] PyArrow copies large BinaryView buffers during construction and scalar extraction #50246

Description

@paleolimbot

Describe the enhancement requested

In SedonaDB we're using BinaryView arrays to pass around large buffers, and in the process we discovered a number of places where these buffers are copied. This is a somewhat non-standard use of BinaryView, so most of those copies are no problem, but the copy during scalar extraction was surprising (direct access to the view's buffer seems like a useful feature, possibly for regular binary/string arrays as well). Our workaround is at apache/sedona-db#999, but here's a more minimal reproducer:

import pyarrow as pa
import numpy as np

buf = np.arange(1000, dtype=np.uint8)

# Creating via a memoryview doesn't keep the original memory
pa_array = pa.array([memoryview(buf)], pa.binary_view())
buf_from_pa_array_via_memoryview = np.frombuffer(pa_array.buffers()[2])
np.shares_memory(buf_from_pa_array_via_memoryview, buf)
#> False

# You can force this by creating a binary array manually and casting to a view
pa_array = pa.Array.from_buffers(
    type=pa.binary(),
    length=1,
    buffers=[
        None,
        pa.py_buffer(np.array([0, 1000], dtype=np.int32)),
        pa.py_buffer(buf),
    ],
).cast(pa.binary_view())

buf_from_pa_array_via_memoryview = np.frombuffer(pa_array.buffers()[2])
np.shares_memory(buf_from_pa_array_via_memoryview, buf)
#> True

# However, the act of extracting a scalar forces a copy
buf_from_pa_array_scalar = np.frombuffer(pa_array[0].as_buffer())
np.shares_memory(buf_from_pa_array_scalar, buf)
#> False

Component(s)

Python
