Extract: Selective chunk extraction for regular files (#5638)#9703
Open
alighazi288 wants to merge 3 commits into
Signed-off-by: alighazi288 <51366992+alighazi288@users.noreply.github.com>
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #9703      +/-   ##
==========================================
- Coverage   83.88%   83.86%   -0.02%
==========================================
  Files          93       93
  Lines       15659    15764     +105
  Branches     2351     2374      +23
==========================================
+ Hits        13135    13221      +86
- Misses       1790     1803      +13
- Partials      734      740       +6
…backup#5638) Signed-off-by: alighazi288 <51366992+alighazi288@users.noreply.github.com>
Signed-off-by: alighazi288 <51366992+alighazi288@users.noreply.github.com>
11b13fd to f0fb796
Hey @ThomasWaldmann, hope you're doing well. Just wanted to follow up on this PR and I would love to get your thoughts when you have a chance. Thanks!
Supersedes #8632 (which stalled as a draft). This is a clean re-implementation off current master that addresses all of the review feedback from that PR.
Description
This PR makes extraction of an existing regular file update it in place: it hashes the on-disk content using the archived chunk sizes, compares chunk-by-chunk against the archive's chunk list, and fetches only the chunks that actually differ.
Changes (Archive.compare_and_extract_chunks)
1. For each archived chunk, read size bytes from the existing file and build a parallel fs_chunks list of ChunkListEntry(id=id_hash(data), size=...).
2. Compare with zip(fs_chunks, item.chunks) → the chunks whose ids differ are the ones to fetch.
3. fetch_many() only those (non-preloaded) chunks.
4. For each chunk, seek(size, SEEK_CUR) if it already matches, otherwise write() the freshly fetched data. Truncate to the archived length.
5. restore_attrs().

There is no new CLI option: the behavior is driven purely by the filesystem state, as requested in review.
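The steps above can be sketched in a few lines of Python. This is a minimal, self-contained illustration and not the code from the PR: the function name patch_in_place, the SHA-256 based id_hash stand-in, and the fetch_many callable (which takes a list of ids and returns the chunk data in the same order) are assumptions made for the example.

```python
import hashlib
import os
from typing import BinaryIO, Callable, Iterable, List, NamedTuple


class ChunkListEntry(NamedTuple):
    id: bytes
    size: int


def id_hash(data: bytes) -> bytes:
    # Assumption: stand-in for the repository's real (keyed) id hash.
    return hashlib.sha256(data).digest()


def patch_in_place(
    fs_file: BinaryIO,
    item_chunks: List[ChunkListEntry],
    fetch_many: Callable[[List[bytes]], Iterable[bytes]],
) -> List[bytes]:
    """Make fs_file match item_chunks; return the ids that had to be fetched."""
    # 1. Hash the on-disk content using the archived chunk sizes.
    fs_file.seek(0)
    fs_chunks = []
    for entry in item_chunks:
        data = fs_file.read(entry.size)  # may be short if the file is shorter
        fs_chunks.append(ChunkListEntry(id_hash(data), len(data)))

    # 2. Compare chunk by chunk; entries that differ must be fetched.
    to_fetch = [want.id for have, want in zip(fs_chunks, item_chunks) if have != want]

    # 3. Fetch only the differing chunks.
    fetched = iter(fetch_many(to_fetch))

    # 4. Walk the file again: skip matching chunks, overwrite the others.
    fs_file.seek(0)
    for have, want in zip(fs_chunks, item_chunks):
        if have == want:
            fs_file.seek(want.size, os.SEEK_CUR)
        else:
            fs_file.write(next(fetched))

    # Cut off any trailing data beyond the archived length.
    fs_file.truncate(sum(entry.size for entry in item_chunks))
    return to_fetch
```

With an io.BytesIO holding b"aaaaXXXXccEXTRA" and archived chunks for b"aaaa", b"bbbb" and b"cc", only the middle chunk is requested and the buffer ends up as b"aaaabbbbcc". Restoring attributes (step 5) is left out of the sketch.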
Safety: in-place updating is a pure optimization that never changes observable semantics
It is only used when it is provably safe; otherwise we fall back to the normal unlink+recreate path (which behaves exactly as before).
can_patch_in_place() requires:
- The item is not hard-linked (no hlid): keeps preload bookkeeping correct.
- The existing file has st_nlink == 1, otherwise an in-place write would alter content seen through other hard links to the same inode; unlink+recreate gives this path its own fresh inode.

Preloading
Since in-place updating fetches only the differing chunks, the command now skips preloading for patch candidates (will_patch_in_place()), and a new preloaded flag on extract_item() ensures the full-extraction fallback does not wait on preloaded chunks that were never requested.

Metadata of the existing file
For an in-place update, a new clear_attrs() (next to restore_attrs()) wipes pre-existing extended attributes and BSD flags first. xattr removal required a new removexattr platform primitive (added across linux/darwin/freebsd/netbsd + the base fallback), and a resilient xattr.clear_all() that skips attributes it isn't allowed to drop (e.g. security.* namespaces, or filesystems without xattr support) rather than aborting the extraction.

ACLs are intentionally not cleared. Instead, files that carry an extended ACL fall back to the normal extraction path, so they always get a clean metadata state. This is irrelevant for the big-file/block-device target of #5638.
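Taken together, the safety conditions and the ACL fallback described above amount to a small predicate. The following is a rough sketch under stated assumptions, not the PR's implementation: the item is modelled as a plain dict with an optional "hlid" key, the has_extended_acl parameter is hypothetical (the caller is assumed to have checked for an ACL already), and the regular-file check is inferred from the PR title. Only the keyword-only, required st parameter is taken from the description.

```python
import os
import stat
from typing import Mapping, Optional


def can_patch_in_place(
    item: Mapping,
    path: str,
    *,
    st: Optional[os.stat_result],
    has_extended_acl: bool = False,  # hypothetical: result of a caller-side ACL check
) -> bool:
    """Return True only if updating the file at path in place is safe."""
    if st is None:
        return False  # nothing on disk to patch
    if not stat.S_ISREG(st.st_mode):
        return False  # assumption: only regular files are patched
    if "hlid" in item:
        return False  # hard-linked item in the archive
    if st.st_nlink != 1:
        return False  # other paths share this inode; writing would change them too
    if has_extended_acl:
        return False  # ACLs are not cleared, so use the normal extraction path
    return True
```

Any False result means the caller takes the unchanged unlink+recreate path, so a conservative answer costs performance but never correctness.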
Resolution of the #8632 review comments
- preloaded flag prevents a deadlock on fallback.
- backup_io(...); no broad except.
- chunk_data) pi.show(increase=item_chunk.size) after the if/else; no unbound variable.
- can_patch_in_place(item, path, st) reuses the caller's st.
- st is keyword-only and required.
- removexattr primitive, not set-to-empty.
- with fs_file:; fileno() taken once.
- chunk_size=4, literal contents, and all four requested cases.

The old draft's extra chunks_healthy check was deliberately dropped. The normal extract path does not have it, so adding it only here would introduce an inconsistency.

Testing
- compare_and_extract_chunks (parametrized: no-change, single/both/cross-boundary changes, partial last chunk, fs shorter/longer, empty fs, empty item) asserting both the resulting content and that only the differing chunks were fetched.
- st=None, hard-linked item, st_nlink > 1, extended ACL) and for stale-xattr clearing.

Notes
- removexattr binding locally. The linux/freebsd/netbsd bindings were written by mirroring the existing setxattr patterns.

Closes #5638 (regular-file case).
Checklist
- master (or maintenance branch if only applicable there)
- tox or the relevant test subset)