feat(pipeline): add source-change fingerprint and reprocess decision #451
Open
ddeboer wants to merge 2 commits into
- Add sourceSignal(distribution, probeResult): derive a cheap, opaque source-change signal from probe metadata. Live SPARQL endpoints yield null (always reprocess); data dumps combine max(dct:modified, Last-Modified) with byte size (probe Content-Length, falling back to declared dcat:byteSize).
- Add shouldReprocess(current, stored): pure equality on the two change fields, with a null signal that never compares equal so it always reprocesses.
- Add the ProcessingRecord type (records status for failure handling, though the skip rule is equality-only and ignores it).
- Export the new provenance module from the package index.

Pure building blocks for issue #450; the ProvenanceStore, two-phase resolver split, and pipeline gate follow in later changes.
…lformed dates

- Rename sourceSignal() to sourceFingerprint() (function and file): the value is an opaque composite of max(date) and byte size, derived from observed source metadata, not a timestamp. 'Fingerprint' conveys equality-only, don't-parse semantics; 'signal' was vaguer and invited treating it as a date.
- Rename ProcessingRecord.sourceModified to sourceFingerprint to match.
- Rename ChangeFields to ChangeKey and define it as Pick<ProcessingRecord, 'sourceFingerprint' | 'pipelineVersion'> so there is one source of truth and it expresses the change-determining subset.
- Keep pipelineVersion a separate field, never folded into the fingerprint: the data side is observed, the logic side is declared.
- Fix a RangeError: a malformed HTTP Last-Modified or dct:modified (an Invalid Date) was selected by mostRecent and crashed toISOString(). Invalid Dates are now skipped, which also stops an invalid date from sticking ahead of a valid one (validDate > invalidDate is num > NaN, always false).
- Treat a non-numeric Content-Length (NaN) as absent, falling back to the declared dcat:byteSize, so the fingerprint stays stable.
- Add regression tests for the malformed-metadata paths.
What
Implements the two pure building blocks of the skip-unchanged-datasets mechanism from #450, the source-change fingerprint and the reprocess decision, in a new `provenance` module. These carry no I/O and are not yet wired into the pipeline; the `ProvenanceStore`, the two-phase resolver split, and the pipeline gate follow in later changes.

Public API
- `sourceFingerprint(distribution, probeResult): string | null`: derives a cheap, opaque source-change fingerprint from metadata the probe already collected (no body download):
  - Live SPARQL endpoint: `null` (always reprocess; it exposes no signal).
  - Data dump: the most recent of `dct:modified` and the artifact’s HTTP `Last-Modified`, combined with byte size (probe `Content-Length`, falling back to declared `dcat:byteSize`).
  - […] `null`.
  - […] `ImportResolver` already computes for the downloader, so the skip layer and the download/import layer agree.
- `shouldReprocess(current, stored): boolean`: pure equality on the two change fields (`sourceFingerprint`, `pipelineVersion`); no record, a mismatch, or a `null` fingerprint ⇒ reprocess. A `null` fingerprint never compares equal, even against a stored `null`.
- `ProcessingRecord`: the per-dataset memory, including `status` for failure handling. The skip rule is equality-only and does not gate on `status`: a failed-but-unchanged dataset is skipped until its source changes or the version rotates.
- `ChangeKey`: `Pick<ProcessingRecord, 'sourceFingerprint' | 'pipelineVersion'>`, the change-determining subset compared by `shouldReprocess`.

The fingerprint and version strings are opaque: only ever compared for equality, never parsed or ordered.
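The reprocess decision described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `status` values, the use of `undefined` for "no record", and the example fingerprint string are assumptions; only the field names and the equality-only rule come from the description.

```typescript
// Hypothetical shape: the field names come from the description,
// the concrete `status` values are assumed for illustration.
interface ProcessingRecord {
  sourceFingerprint: string | null;
  pipelineVersion: string;
  status: "success" | "failed";
}

// The change-determining subset, as defined in the description.
type ChangeKey = Pick<ProcessingRecord, "sourceFingerprint" | "pipelineVersion">;

// Assumption: a missing record is represented as `undefined`.
function shouldReprocess(current: ChangeKey, stored: ChangeKey | undefined): boolean {
  // No record yet: nothing to compare against.
  if (stored === undefined) return true;
  // A null fingerprint never compares equal, even against a stored null.
  if (current.sourceFingerprint === null) return true;
  // Otherwise: pure equality on the two opaque strings; `status` is ignored.
  return (
    current.sourceFingerprint !== stored.sourceFingerprint ||
    current.pipelineVersion !== stored.pipelineVersion
  );
}

// Usage: a failed-but-unchanged dataset is still skipped.
const previous: ProcessingRecord = {
  sourceFingerprint: "opaque-fingerprint-a", // opaque; never parsed
  pipelineVersion: "1",
  status: "failed",
};
const unchanged: ChangeKey = { sourceFingerprint: "opaque-fingerprint-a", pipelineVersion: "1" };
const skip = !shouldReprocess(unchanged, previous); // true: skipped despite the failed status
```

Keeping the `null` check ahead of the equality comparison is what makes a live endpoint reprocess every time, regardless of what was stored.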
`pipelineVersion` is kept as a separate field, never folded into the fingerprint: the data side is observed automatically, the logic side is intentionally declared.

Robustness
`sourceFingerprint` is hardened against malformed third-party metadata:

- A malformed `Last-Modified` or `dct:modified` (an Invalid Date) is treated as absent rather than crashing `toISOString()` with a `RangeError`, and can no longer stick ahead of a valid date (`validDate > invalidDate` is `number > NaN`, always `false`).
- A non-numeric `Content-Length` (`NaN`) is treated as absent, falling back to the declared `dcat:byteSize`, so the fingerprint stays stable.

Tests
Unit tests covering both modules through their public interface: date precedence (most-recent wins, both orderings), byte-size participation and Content-Length-over-declared precedence, `null` cases, the malformed-metadata paths above, and every branch of the reprocess decision including the `null`-never-equal guard and status being ignored.

Part of #450.
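For readers who want to see the malformed-metadata handling from the Robustness section in concrete form, here is a rough sketch for the data-dump case. It is not the code in this PR: `DumpMetadata`, `byteSize`, `dumpFingerprint`, the `|` separator, and returning `null` when neither a date nor a size is available are all assumptions made for illustration. Only the skip-invalid-dates rule and the `Content-Length` fallback come from the description.

```typescript
// Hypothetical input shape; the real distribution and probe result types
// are not shown in this description.
interface DumpMetadata {
  modified?: Date;           // parsed from dct:modified
  lastModified?: Date;       // parsed from HTTP Last-Modified
  contentLength?: number;    // parsed from HTTP Content-Length
  declaredByteSize?: number; // declared dcat:byteSize
}

// Returns the most recent valid date. Invalid Dates (getTime() is NaN) are
// skipped, so they can neither crash toISOString() nor win a comparison.
function mostRecent(...dates: (Date | undefined)[]): Date | undefined {
  let best: Date | undefined;
  for (const d of dates) {
    if (d === undefined || Number.isNaN(d.getTime())) continue;
    if (best === undefined || d.getTime() > best.getTime()) best = d;
  }
  return best;
}

// A non-numeric Content-Length (NaN) is treated as absent,
// falling back to the declared size.
function byteSize(meta: DumpMetadata): number | undefined {
  if (meta.contentLength !== undefined && Number.isFinite(meta.contentLength)) {
    return meta.contentLength;
  }
  return meta.declaredByteSize;
}

// Assumed composition: ISO date and size joined by "|"; the result is opaque.
function dumpFingerprint(meta: DumpMetadata): string | null {
  const date = mostRecent(meta.modified, meta.lastModified);
  const size = byteSize(meta);
  if (date === undefined && size === undefined) return null; // assumption
  return `${date?.toISOString() ?? ""}|${size ?? ""}`;
}

const fp = dumpFingerprint({
  modified: new Date("2024-01-01T00:00:00Z"),
  lastModified: new Date("not a date"), // Invalid Date, skipped
  contentLength: Number("abc"),         // NaN, falls back to declared size
  declaredByteSize: 2048,
});
// fp === "2024-01-01T00:00:00.000Z|2048"
```

Filtering invalid dates before comparing matters because any comparison against `NaN` is `false`, so an invalid date seen first would otherwise never be replaced by a later valid one.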