Skip to content

Extract: Selective chunk extraction for regular files (#5638)#9703

Open
alighazi288 wants to merge 3 commits into
borgbackup:masterfrom
alighazi288:issue-5638-selective-extract
Open

Extract: Selective chunk extraction for regular files (#5638)#9703
alighazi288 wants to merge 3 commits into
borgbackup:masterfrom
alighazi288:issue-5638-selective-extract

Conversation

@alighazi288

Copy link
Copy Markdown
Contributor

Supersedes #8632 (which stalled as a draft). This is a clean re-implementation off current master that addresses all of the review feedback from that PR.

Description

This PR makes extraction of an existing regular file update it in place: it hashes the on-disk content using the archived chunk sizes, compares chunk-by-chunk against the archive's chunk list, and fetches only the chunks that actually differ.

Changes (Archive.compare_and_extract_chunks)

  1. Read pass: read each archived chunk's size bytes from the existing file and build a parallel fs_chunks list of ChunkListEntry(id=id_hash(data), size=...).
  2. zip(fs_chunks, item.chunks) → the chunks whose ids differ are the ones to fetch.
  3. fetch_many() only those (non-preloaded) chunks.
  4. Write pass: for each archived chunk, seek(size, SEEK_CUR) if it already matches, otherwise write() the freshly fetched data. Truncate to the archived length.
  5. Clear stale metadata, then restore_attrs().

There is no new CLI option: the behavior is driven purely by the filesystem state, as requested in review.

Safety: in-place updating is a pure optimization that never changes observable semantics

It is only used when it is provably safe, otherwise we fall back to the normal unlink+recreate path (which behaves exactly as before). can_patch_in_place() requires:

  • destination is a regular file and the archive item is a regular file.
  • not a hard-linked archive item (hlid): keeps preload bookkeeping correct.
  • st_nlink == 1, otherwise an in-place write would alter content seen through other hard links to the same inode; unlink+recreate gives this path its own fresh inode.
  • no extended ACL on the existing file (see "metadata" below).

Preloading

Since in-place updating fetches only the differing chunks, the command now skips preloading for patch candidates (will_patch_in_place()), and a new preloaded flag on extract_item() ensures the full-extraction fallback does not wait on preloaded chunks that were never requested.

Metadata of the existing file

For an in-place update, a new clear_attrs() (next to restore_attrs()) wipes pre-existing extended attributes and BSD flags first. xattr removal required a new removexattr platform primitive (added across linux/darwin/freebsd/netbsd + the base fallback), and a resilient xattr.clear_all() that skips attributes it isn't allowed to drop (e.g. security.* namespaces, or filesystems without xattr support) rather than aborting the extraction.

ACLs are intentionally not cleared. Instead, files that carry an extended ACL fall back to the normal extraction path, so they always get a clean metadata state. This is irrelevant for the big-file/block-device target of #5638.

Resolution of the #8632 review comments

Review comment Resolution
"not sure we need an option … determined by the fs state, not by an option" No CLI option; decision is purely fs-state based.
"read-only pass → build fs chunks → zip → fetch_many → write/seek pass" Implemented in exactly that order.
"if one preloads some chunks, one must fetch all of them" Preload skipped for patch candidates; preloaded flag prevents a deadlock on fallback.
"don't catch Exception … notice how backup_io() is used" All I/O wrapped in backup_io(...); no broad except.
"seek in if-block, write in else-block" / "don't track offset" / "use os.SEEK_CUR" Done; no offset variable.
"progress is made in the if-block also → do it after if/else" (+ unbound chunk_data) pi.show(increase=item_chunk.size) after the if/else; no unbound variable.
"caller already determined st; don't do unneeded os calls" can_patch_in_place(item, path, st) reuses the caller's st.
"remove default value for st" st is keyword-only and required.
"create a clear_attrs method close to restore_attrs" Added.
"setting an xattr to empty is not the same as removing an xattr" Real removexattr primitive, not set-to-empty.
"you didn't read the acl_set code" / "there can be already acls or xattrs that don't match" xattrs+flags cleared; ACL-bearing files fall back to normal extraction.
"close the fs_file (you removed the with)" / "fileno() called too often" with fs_file:; fileno() taken once.
"use small chunk size & visual contents; add these edge cases" Tests use chunk_size=4, literal contents, and all four requested cases.
"no prints / move test code out of archive.py" No prints; tests live in the test modules.

The old draft's extra chunks_healthy check was deliberately dropped. The normal extract path does not have it, so adding it only here would introduce an inconsistency.

Testing

  • Unit tests for compare_and_extract_chunks (parametrized: no-change, single/both/cross-boundary changes, partial last chunk, fs shorter/longer, empty fs, empty item) asserting both the resulting content and that only the differing chunks were fetched.
  • Unit tests for the safety fallbacks (non-regular, st=None, hard-linked item, st_nlink > 1, extended ACL) and for stale-xattr clearing.
  • End-to-end archiver test: extract → modify bytes on disk → extract again → content fully restored and the inode preserved (proving in-place update), passing for both the local and remote archiver.

Notes

Closes #5638 (regular-file case).

Checklist

  • PR is against master (or maintenance branch if only applicable there)
  • New code has tests and docs where appropriate
  • Tests pass (run tox or the relevant test subset)
  • Commit messages are clean and reference related issues

Signed-off-by: alighazi288 <51366992+alighazi288@users.noreply.github.com>
@codecov

codecov Bot commented Jun 2, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 84.61538% with 18 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.86%. Comparing base (d8564b9) to head (f0fb796).
⚠️ Report is 3 commits behind head on master.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
src/borg/xattr.py 41.17% 9 Missing and 1 partial ⚠️
src/borg/archive.py 91.76% 3 Missing and 4 partials ⚠️
src/borg/platform/__init__.py 83.33% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #9703      +/-   ##
==========================================
- Coverage   83.88%   83.86%   -0.02%     
==========================================
  Files          93       93              
  Lines       15659    15764     +105     
  Branches     2351     2374      +23     
==========================================
+ Hits        13135    13221      +86     
- Misses       1790     1803      +13     
- Partials      734      740       +6     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

…backup#5638)

Signed-off-by: alighazi288 <51366992+alighazi288@users.noreply.github.com>
Signed-off-by: alighazi288 <51366992+alighazi288@users.noreply.github.com>
@alighazi288 alighazi288 force-pushed the issue-5638-selective-extract branch from 11b13fd to f0fb796 Compare June 2, 2026 05:04
@alighazi288

Copy link
Copy Markdown
Contributor Author

Hey @ThomasWaldmann, hope you're doing well. Just wanted to follow up on this PR and I would love to get your thoughts when you have a chance. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

restore big files/block devices with considering already present data

1 participant