GH-45747: [C++] Remove deprecated ObjectType and FileStatistics, refactor hdfs code #45998

Open
AlenkaF wants to merge 39 commits into apache:main from AlenkaF:gh-45747-remove-deprecated-ObjectType-FileStatistics

Conversation

AlenkaF commented Apr 1, 2025

Member

Rationale for this change

ObjectType and FileStatistics in io/hdfs.h have been deprecated for a while and can be removed.

What changes are included in this PR?

The ObjectType and FileStatistics structs are removed and the FileSystem API in arrow::fs is used instead. Together with this change, the HDFS-related code is moved from cpp/src/arrow/io to cpp/src/arrow/filesystem, merging the FileSystem and HadoopFileSystem classes from arrow::io into the public HadoopFileSystem class.

Are these changes tested?

Existing tests should pass.

Are there any user-facing changes?

The deprecated structs are removed and all HDFS-related code is now part of the filesystem module.

Also closes: #22457 (not sure about io/interfaces.h?)

AlenkaF commented Apr 1, 2025

Member Author

Ah, this will not work. If we want to remove ObjectType and FileStatistics from io/hdfs.h and use the FileSystem API instead, then the filesystem component would have to be enabled by default. Also, if I understand correctly, we want to refactor io/hdfs anyway, which would include the changes in this PR? (#22457)

cc @pitrou

pitrou commented Apr 2, 2025

Member

> Ah, this will not work. If we want to remove ObjectType and FileStatistics from io/hdfs.h and instead use FileSystem API, then filesystem component would have to be enabled by default.

Well, ARROW_HDFS=ON could imply ARROW_FILESYSTEM=ON. I don't think that's a problem.

> Also, if I understand correctly, we want to do a refactoring of io/hdfs anyways which would include the changes in this PR?

Yes, indeed. io/hdfs could be moved to filesystem/hdfs_internal or something similar.

AlenkaF commented Apr 2, 2025

Member Author

OK, I will then move io/hdfs to filesystem/hdfs_internal. Thanks!

AlenkaF force-pushed the gh-45747-remove-deprecated-ObjectType-FileStatistics branch from e95f3f3 to 287cb9b on April 8, 2025 13:24
AlenkaF changed the title from "GH-45747: [C++] Remove deprecated ObjectType and FileStatistics" to "GH-45747: [C++] Remove deprecated ObjectType and FileStatistics, refactor hdfs code" on Apr 9, 2025
AlenkaF marked this pull request as ready for review on April 9, 2025 11:40
AlenkaF commented Apr 9, 2025

Member Author

@pitrou I think this is ready for review. The failing builds have an issue opened: #46077

AlenkaF force-pushed the gh-45747-remove-deprecated-ObjectType-FileStatistics branch from 1bd3f43 to 8e26730 on April 28, 2025 05:10
AlenkaF requested review from raulcd and rok as code owners on April 28, 2025 05:10

pitrou left a comment

Member

Thanks for doing this @AlenkaF! Here are some comments.

Comment thread python/pyarrow/includes/libarrow_fs.pxd Outdated

Member

Are all these declarations actually needed by PyArrow?

Member Author

No, most of them aren't; they were copied from libarrow.pxd. I can remove the unused ones, but I'm not sure whether some external application might actually use them?

Comment thread cpp/src/arrow/CMakeLists.txt Outdated
Comment thread cpp/src/arrow/CMakeLists.txt Outdated

Member

Don't we need to link to arrow::hadoop as was done above? cc @kou for advice

Member Author

Hm, yeah. I will add a link as above as it makes sense.

Member Author

Ah, it is there already (that explains why nothing failed =) )

if(ARROW_HDFS)
  foreach(ARROW_FILESYSTEM_TARGET ${ARROW_FILESYSTEM_TARGETS})
    target_link_libraries(${ARROW_FILESYSTEM_TARGET} PRIVATE arrow::hadoop)
  endforeach()
endif()

Not sure if the line with CMAKE_DL_LIBS is also needed here then?

Comment thread cpp/src/arrow/filesystem/CMakeLists.txt Outdated
Comment thread cpp/src/arrow/filesystem/api.h Outdated
Comment thread cpp/src/arrow/filesystem/hdfs_io.h Outdated

Member

Ok, but we don't want to keep those two unofficial FileSystem and HadoopFileSystem classes which create confusion with the other (public) filesystem classes.

Ideally, those two classes disappear and their implementation code gets folded into the public HadoopFileSystem class.

If that's too annoying, we should at least merge those two classes and give them a less ambiguous name, for example HdfsClient.

Member Author

Ok, will go with the disappearing =)
IIUC hdfs_io.h will be removed altogether:

  • FileSystem and HadoopFileSystem will go into hdfs.cc, folded into the public HadoopFileSystem
  • HdfsConnectionConfig will also go into hdfs.cc
  • declarations that are left will go into hdfs_internal.cc

Comment thread cpp/src/arrow/filesystem/hdfs_io.h Outdated

Member

Most of these declarations should IMHO go into the arrow::filesystem::internal namespace, except for HdfsConnectionConfig which can go into arrow::filesystem.

github-actions (bot) added the "awaiting committer review" label and removed the "awaiting review" label on Apr 29, 2025
AlenkaF marked this pull request as draft on May 19, 2025 07:36
AlenkaF commented May 19, 2025

Member Author

Hi @pitrou, could you please take a quick look at the changes when you have a moment?

I've done my best to implement the suggested changes, but I'm sure there's still room for improvement.
A couple of issues I could use your input on:

  • Some C++ tests are failing in hdfs_test, with the following error:

    /arrow/cpp/src/arrow/filesystem/hdfs_test.cc:90: HadoopFileSystem::Make failed, it is possible when we don't have proper 
    driver on this node, err msg is IOError: Unable to load libjvm
    

    I'm not sure how best to resolve this — any guidance would be appreciated.

  • The MSVC compiler is complaining about a forward-declared friend function I'm using in hdfs.cc. Do you have any advice on how to better organise this?

The Python and MATLAB test failures are not related.
Thanks in advance!

pitrou commented May 19, 2025

Member

Hi @AlenkaF

> • Some C++ tests are failing in hdfs_test, with the following error:

I think you're misreading the output: the test is actually skipped when the driver fails to load, which is normal:

(...)
dlopen(/usr/java/latest//lib/amd64/server/libjvm.so) failed: /usr/java/latest//lib/amd64/server/libjvm.so: cannot open shared object file: No such file or directory
/arrow/cpp/src/arrow/filesystem/hdfs_internal.cc:294  try_dlopen(libjvm_potential_paths, "libjvm")
/arrow/cpp/src/arrow/filesystem/hdfs.cc:95  ConnectLibHdfs(&driver_)
/arrow/cpp/src/arrow/filesystem/hdfs.cc:725  ptr->impl_->Init()
/arrow/cpp/src/arrow/filesystem/hdfs_test.cc:205: Skipped
Driver not loaded, skipping

https://github.com/apache/arrow/actions/runs/15109276550/job/42464862030?pr=45998#step:7:3277

The problem is in the other tests, because it seems a destructor crashes:


[ RUN      ] TestHadoopFileSystem.DeleteDirContents
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Running '/build/cpp/debug/arrow-hdfs-test' produced core dump at '/tmp/core.arrow-hdfs-test.25630', printing backtrace:
[New LWP 25630]
[New LWP 25631]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/build/cpp/debug/arrow-hdfs-test'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fc37b23d5be in arrow::io::internal::LibHdfsShim::Disconnect () at /arrow/cpp/src/arrow/filesystem/hdfs_internal.cc:340
340	int LibHdfsShim::Disconnect(hdfsFS fs) { return this->hdfsDisconnect(fs); }
[Current thread is 1 (Thread 0x7fc374633dc0 (LWP 25630))]
(...)

https://github.com/apache/arrow/actions/runs/15109276550/job/42464862030?pr=45998#step:7:3281
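
The backtrace ends in a call made through a shim member function pointer from a destructor path. The thread does not state the root cause, so the following is only a sketch of one defensive pattern, under the assumption that the crash comes from disconnecting without a loaded driver or a live connection. `FakeFs`, `FakeShim`, `FakeClient` and `FakeDisconnectOk` are hypothetical stand-ins, not the real `hdfsFS` or `LibHdfsShim`.

```cpp
#include <cassert>

// Hypothetical stand-in for the libhdfs handle type.
using FakeFs = void*;

// Hypothetical shim; the real one resolves its pointers at load time.
struct FakeShim {
  // Stays null if the library (or this symbol) could not be loaded.
  int (*hdfsDisconnect)(FakeFs) = nullptr;

  // Returns -1 instead of calling through a null pointer.
  int Disconnect(FakeFs fs) {
    if (hdfsDisconnect == nullptr) return -1;
    return hdfsDisconnect(fs);
  }
};

// Hypothetical client whose destructor only disconnects when there is
// both a driver and a live connection handle.
class FakeClient {
 public:
  FakeClient(FakeShim* driver, FakeFs fs) : driver_(driver), fs_(fs) {}
  ~FakeClient() {
    if (driver_ != nullptr && fs_ != nullptr) {
      driver_->Disconnect(fs_);
    }
  }

 private:
  FakeShim* driver_;
  FakeFs fs_;
};

// Hypothetical successful disconnect used to populate the shim.
inline int FakeDisconnectOk(FakeFs) { return 0; }
```

With this shape, destroying a client that never connected is a no-op, and a shim whose symbol was never resolved reports an error code rather than dereferencing a null pointer.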

pitrou commented May 19, 2025

Member
> • The MSVC compiler is complaining about a forward-declared friend function I'm using in hdfs.cc. Do you have any advice on how to better organise this?

Hmm, rather than trying to find the exact explanation, a simple solution would be to change these functions into static methods, for example this:

ARROW_EXPORT Status MakeReadableFile(const std::string& path, int32_t buffer_size,
  const io::IOContext& io_context, LibHdfsShim* driver,
  hdfsFS fs, hdfsFile file,
  std::shared_ptr<HdfsReadableFile>* out);

would become:

class ARROW_EXPORT HdfsReadableFile : public RandomAccessFile {
 public:
   (...)

  static Result<std::shared_ptr<HdfsReadableFile>> Make(
      const std::string& path, int32_t buffer_size,
      const io::IOContext& io_context, LibHdfsShim* driver,
      hdfsFS fs, hdfsFile file);
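
The suggestion above replaces a free friend function that fills an out-parameter with a static factory method that returns its result. A minimal, self-contained sketch of that pattern follows. `SimpleResult` and `ReadableFile` are hypothetical stand-ins (not Arrow's `Result` or `HdfsReadableFile`), used only so the example compiles without Arrow or libhdfs.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a Result<T>: holds either a value or an error message.
template <typename T>
class SimpleResult {
 public:
  static SimpleResult Ok(T value) {
    SimpleResult r;
    r.value_ = std::move(value);
    r.ok_ = true;
    return r;
  }
  static SimpleResult Error(std::string message) {
    SimpleResult r;
    r.error_ = std::move(message);
    return r;
  }
  bool ok() const { return ok_; }
  const T& value() const { return value_; }
  const std::string& error() const { return error_; }

 private:
  SimpleResult() = default;
  T value_{};
  std::string error_;
  bool ok_ = false;
};

// Hypothetical file class. The constructor is private, and Make() is a member,
// so it can reach the constructor without any friend declaration.
class ReadableFile {
 public:
  static SimpleResult<std::shared_ptr<ReadableFile>> Make(const std::string& path,
                                                          int32_t buffer_size) {
    if (path.empty()) {
      return SimpleResult<std::shared_ptr<ReadableFile>>::Error("empty path");
    }
    // std::make_shared cannot call a private constructor, hence the explicit new.
    return SimpleResult<std::shared_ptr<ReadableFile>>::Ok(
        std::shared_ptr<ReadableFile>(new ReadableFile(path, buffer_size)));
  }

  const std::string& path() const { return path_; }
  int32_t buffer_size() const { return buffer_size_; }

 private:
  ReadableFile(std::string path, int32_t buffer_size)
      : path_(std::move(path)), buffer_size_(buffer_size) {}

  std::string path_;
  int32_t buffer_size_;
};
```

A caller would write `auto file = ReadableFile::Make("/tmp/data.bin", 4096);` and check `file.ok()` before using `file.value()`. Because no friend declaration is involved, the forward-declaration issue mentioned for MSVC does not arise in this shape.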

AlenkaF commented May 19, 2025

Member Author

Aha, I see! Thanks, will look into it.

AlenkaF commented May 20, 2025

Member Author

@pitrou I cleaned up the CI failures (others are not related) and I'm hoping these changes will not be too bad to review :)

AlenkaF marked this pull request as ready for review on May 27, 2025 09:37
AlenkaF requested a review from pitrou on May 27, 2025 09:37
AlenkaF force-pushed the gh-45747-remove-deprecated-ObjectType-FileStatistics branch from 72eae6e to 7810940 on June 5, 2025 08:38

benibus left a comment

Contributor

Thanks! This looks pretty good to me. Just a few comments.

Comment thread cpp/src/arrow/filesystem/hdfs_internal.h Outdated
Comment thread cpp/src/arrow/filesystem/hdfs.h Outdated
Comment thread python/pyarrow/includes/libarrow_fs.pxd Outdated
AlenkaF commented Jun 10, 2025

Member Author

@benibus @pitrou would you mind taking another look?

AlenkaF force-pushed the gh-45747-remove-deprecated-ObjectType-FileStatistics branch from c7beefc to 281b51a on June 23, 2025 13:13
AlenkaF commented Jun 23, 2025

Member Author

@pitrou gentle ping. Would I be too optimistic to try to get it into 21.0.0?

Copilot AI review requested due to automatic review settings June 8, 2026 14:31
AlenkaF force-pushed the gh-45747-remove-deprecated-ObjectType-FileStatistics branch from f09d6d9 to 6d1ec18 on June 8, 2026 14:31

Copilot AI left a comment

Contributor

Pull request overview

This PR removes long-deprecated HDFS types (ObjectType, FileStatistics) and shifts HDFS integration away from arrow::io toward the arrow::fs FileSystem APIs, consolidating/relocating implementation and updating Python bindings accordingly.

Changes:

  • Deletes the deprecated arrow::io HDFS header/implementation and stops exporting it from arrow/io/api.h.
  • Refactors HDFS implementation into arrow::fs::HadoopFileSystem (and moves the internal libhdfs shim / stream implementations under arrow/filesystem).
  • Updates PyArrow bindings to use the filesystem HDFS API and rehomes have_libhdfs() to the top-level pyarrow module.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| python/pyarrow/io.pxi | Removes the have_libhdfs() helper from the pyarrow.io extension module. |
| python/pyarrow/includes/libarrow.pxd | Drops deprecated ObjectType / FileStatistics and removes arrow::io HDFS bindings. |
| python/pyarrow/includes/libarrow_fs.pxd | Adds filesystem-level HaveLibHdfs() and HdfsConnectionConfig declarations for PyArrow. |
| python/pyarrow/_hdfs.pyx | Adds an internal _have_libhdfs() implemented via the filesystem API. |
| python/pyarrow/__init__.py | Reintroduces have_libhdfs() at the top-level Python API (delegating to _hdfs). |
| cpp/src/arrow/meson.build | Removes io/hdfs*.cc from the IO component sources under Meson. |
| cpp/src/arrow/io/hdfs.h | Deletes the deprecated arrow::io HDFS API (including deprecated structs). |
| cpp/src/arrow/io/hdfs.cc | Deletes the deprecated arrow::io HDFS implementation. |
| cpp/src/arrow/io/CMakeLists.txt | Removes the arrow-io HDFS test registration from the IO test suite. |
| cpp/src/arrow/io/api.h | Stops exporting HDFS via arrow/io/api.h. |
| cpp/src/arrow/filesystem/hdfs.h | Makes filesystem HDFS self-contained (new config struct + extra methods + HaveLibHdfs). |
| cpp/src/arrow/filesystem/hdfs.cc | Refactors implementation to directly use the libhdfs shim and new internal stream types. |
| cpp/src/arrow/filesystem/hdfs_internal.h | Moves/expands internal shim/types/streams into filesystem internals. |
| cpp/src/arrow/filesystem/hdfs_internal.cc | Moves stream implementations and related logic into filesystem internals. |
| cpp/src/arrow/filesystem/hdfs_internal_test.cc | Ports the internal HDFS tests to the filesystem implementation. |
| cpp/src/arrow/filesystem/CMakeLists.txt | Adds hdfs_internal_test to the filesystem test suite and links required libs. |
| cpp/src/arrow/CMakeLists.txt | Removes IO-level HDFS sources and adds filesystem-level hdfs_internal.cc to build when ARROW_HDFS=ON. |
Comments suppressed due to low confidence (2)

cpp/src/arrow/filesystem/hdfs_internal_test.cc:165

  • ConnectsAgain declares a local client but then assigns the new filesystem to client_ instead, leaving client unused. With common warning settings this can break the build (-Wunused-variable).

cpp/src/arrow/filesystem/hdfs_internal_test.cc:74

  • WriteDummyFile() no longer calls Close() on the HdfsOutputStream. Relying on the destructor to close means close/flush failures won't fail the test (they're only warned), and it can make test behavior depend on destructor timing.
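
The point about closing explicitly can be illustrated with a small self-contained sketch. `FakeOutputStream` and the two helper functions are hypothetical (they are not the PR's `HdfsOutputStream` or `WriteDummyFile`); the sketch only assumes a stream whose destructor can log, but not report, a failed close.

```cpp
#include <cassert>
#include <iostream>

// Hypothetical output stream: Close() reports failure to the caller, while
// the destructor can only log it, because destructors cannot return errors.
class FakeOutputStream {
 public:
  explicit FakeOutputStream(bool fail_on_close) : fail_on_close_(fail_on_close) {}

  ~FakeOutputStream() {
    if (!closed_ && !Close()) {
      std::cerr << "warning: close failed in destructor\n";
    }
  }

  // Returns true on success. Calling it again after the first call is a no-op.
  bool Close() {
    if (closed_) return true;
    closed_ = true;
    return !fail_on_close_;
  }

 private:
  bool fail_on_close_;
  bool closed_ = false;
};

// Helper that closes explicitly, so a close failure reaches the caller.
bool WriteDummyFileChecked(bool fail_on_close) {
  FakeOutputStream stream(fail_on_close);
  // ... writes would go here ...
  return stream.Close();
}

// Helper that relies on the destructor: a close failure is only logged.
bool WriteDummyFileUnchecked(bool fail_on_close) {
  FakeOutputStream stream(fail_on_close);
  // ... writes would go here ...
  return true;
}
```

With a failing close, `WriteDummyFileChecked(true)` returns false and a test asserting on it fails, whereas `WriteDummyFileUnchecked(true)` still returns true and the problem only shows up as a warning on stderr.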

Comment thread cpp/src/arrow/filesystem/hdfs.h
Comment thread cpp/src/arrow/filesystem/hdfs.h Outdated
Comment thread cpp/src/arrow/filesystem/hdfs.cc
Comment thread cpp/src/arrow/filesystem/hdfs.cc
Comment thread cpp/src/arrow/meson.build
Copilot AI review requested due to automatic review settings June 8, 2026 15:17

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (2)

cpp/src/arrow/filesystem/hdfs_internal_test.cc:165

  • ConnectsAgain declares a local client but assigns the new filesystem instance to the fixture member client_ instead. This makes client unused (potential -Werror build break) and also mutates the shared fixture state unexpectedly.

cpp/src/arrow/filesystem/hdfs_internal_test.cc:74

  • WriteDummyFile no longer calls Close() on the output stream, so close/flush errors won’t be asserted and could be silently ignored (the destructor only warns). It’s better for the test helper to explicitly close and propagate any failure.

Comment thread cpp/src/arrow/filesystem/hdfs.h
Comment thread cpp/src/arrow/filesystem/hdfs.h Outdated
AlenkaF commented Jun 19, 2026

Member Author

@raulcd @pitrou would you mind having a look at this refactor? Copilot ran a couple of reviews and actually found one bug =)

pitrou left a comment

Member

Thanks for updating this @AlenkaF. I see this can be further simplified; see below.

Comment thread cpp/src/arrow/filesystem/hdfs_internal_test.cc Outdated
Comment thread cpp/src/arrow/filesystem/hdfs_internal.cc

Status DeleteFile(const std::string& path) override;

Status MakeDirectory(const std::string& path);

Member

I understand these methods existed on the legacy HDFS filesystem class, but I think we should only implement the standard FileSystem methods.

Comment on lines +30 to +33
class HdfsReadableFile;
class HdfsOutputStream;

struct HdfsPathInfo;

Member

I don't think we should expose these?

Member Author

Correct 😬 Will move to internal.

Copilot AI review requested due to automatic review settings June 26, 2026 07:45

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (2)

cpp/src/arrow/filesystem/hdfs_internal.cc:39

  • This file now uses std::min and std::numeric_limits in the newly added HDFS file implementations, but it doesn’t include <algorithm> or <limits>. This will fail to compile on toolchains that don’t transitively include these headers.

cpp/src/arrow/filesystem/hdfs_internal.cc:570

  • GetPathInfoFailed is defined inside an anonymous namespace, so it isn’t actually arrow::fs::internal::GetPathInfoFailed. The later call internal::GetPathInfoFailed(path_) therefore won’t compile. Define the helper directly in arrow::fs::internal (or change the call to match the actual scope).
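
The missing-include remark can be made concrete with a short sketch that includes both headers explicitly. `ClampChunkSize` and `CountChunks` are hypothetical helpers, not code from the PR; the sketch assumes a read or write call whose length parameter is a 32-bit integer, so that larger requests have to be split.

```cpp
#include <algorithm>  // std::min
#include <cassert>
#include <cstdint>
#include <limits>     // std::numeric_limits

// Hypothetical helper: clamp a 64-bit byte count to what a 32-bit length
// parameter can carry in a single call.
inline int32_t ClampChunkSize(int64_t nbytes) {
  constexpr int64_t kMaxChunk = std::numeric_limits<int32_t>::max();
  return static_cast<int32_t>(std::min<int64_t>(nbytes, kMaxChunk));
}

// Hypothetical helper: number of calls needed to transfer `nbytes` when each
// call moves at most ClampChunkSize() bytes.
inline int64_t CountChunks(int64_t nbytes) {
  int64_t calls = 0;
  while (nbytes > 0) {
    nbytes -= ClampChunkSize(nbytes);
    ++calls;
  }
  return calls;
}
```

Including `<algorithm>` and `<limits>` directly keeps the file compiling even when no other header happens to pull them in.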

RETURN_NOT_OK(ConnectLibHdfs(&driver_shim));
RETURN_NOT_OK(io::HadoopFileSystem::Connect(&options_.connection_config, &client_));
const HdfsConnectionConfig* config = &options_.connection_config;
RETURN_NOT_OK(ConnectLibHdfs(&driver_));
Comment on lines +103 to +107
bool Exists(const std::string& path);

Status GetPathInfoStatus(const std::string& path, HdfsPathInfo* info);

Status ListDirectory(const std::string& path, std::vector<HdfsPathInfo>* listing);
Comment thread python/pyarrow/io.pxi
from queue import Queue, Empty as QueueEmpty

from pyarrow.lib cimport check_status, HaveLibHdfs
from pyarrow.lib cimport check_status
Development

Successfully merging this pull request may close these issues.

[C++] Refactor arrow/io/hdfs.h to use common FileSystem API

4 participants