feat(pipeline): add source-change fingerprint and reprocess decision#451

Open
ddeboer wants to merge 2 commits into main from feat/pipeline-source-signal-reprocess-decision

Conversation

ddeboer commented Jun 10, 2026

What

Implements the two pure building blocks of the skip-unchanged-datasets mechanism from #450 — the source-change fingerprint and the reprocess decision — in a new provenance module. These carry no I/O and are not yet wired into the pipeline; the ProvenanceStore, the two-phase resolver split, and the pipeline gate follow in later changes.

Public API

  • sourceFingerprint(distribution, probeResult): string | null — derives a cheap, opaque source-change fingerprint from metadata the probe already collected (no body download):
    • Live SPARQL endpoint → null (always reprocess; it exposes no signal).
    • Data dump → most recent of the register’s dct:modified and the artifact’s HTTP Last-Modified, combined with byte size (probe Content-Length, falling back to declared dcat:byteSize).
    • No usable date and no size → null.
    • Mirrors the change signal ImportResolver already computes for the downloader, so the skip layer and the download/import layer agree.
  • shouldReprocess(current, stored): boolean — pure equality on the two change fields (sourceFingerprint, pipelineVersion); no record, a mismatch, or a null fingerprint ⇒ reprocess. A null fingerprint never compares equal, even against a stored null.
  • ProcessingRecord — the per-dataset memory, including status for failure handling. The skip rule is equality-only and does not gate on status: a failed-but-unchanged dataset is skipped until its source changes or the version rotates.
  • ChangeKey: Pick<ProcessingRecord, 'sourceFingerprint' | 'pipelineVersion'>, the change-determining subset compared by shouldReprocess.
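
To make the equality-only rule concrete, here is a minimal sketch of how the decision could look. It is not the code from this PR: the `status` values, the use of `undefined` for "no record", and the exact parameter types are assumptions made for illustration. Only the field names `sourceFingerprint`, `pipelineVersion` and `status`, and the `ChangeKey` definition, come from the description above.

```typescript
// Assumed shape: the PR names these three fields, but the status values
// ('success' | 'failed') are hypothetical.
interface ProcessingRecord {
  sourceFingerprint: string | null;
  pipelineVersion: string;
  status: 'success' | 'failed';
}

// The change-determining subset, as defined in the PR description.
type ChangeKey = Pick<ProcessingRecord, 'sourceFingerprint' | 'pipelineVersion'>;

// Assumption: a missing record is represented as undefined.
function shouldReprocess(current: ChangeKey, stored: ChangeKey | undefined): boolean {
  // No record yet: there is nothing to compare against, so reprocess.
  if (stored === undefined) return true;

  // A null fingerprint means "no signal" and never compares equal,
  // not even against a stored null.
  if (current.sourceFingerprint === null || stored.sourceFingerprint === null) {
    return true;
  }

  // Pure equality on the two change fields; status is deliberately not consulted.
  return (
    current.sourceFingerprint !== stored.sourceFingerprint ||
    current.pipelineVersion !== stored.pipelineVersion
  );
}

// Usage: a failed-but-unchanged dataset is still skipped.
const storedRecord: ProcessingRecord = {
  sourceFingerprint: 'abc',
  pipelineVersion: '1',
  status: 'failed',
};
const unchanged: ChangeKey = { sourceFingerprint: 'abc', pipelineVersion: '1' };
console.log(shouldReprocess(unchanged, storedRecord)); // false
```

Because a full `ProcessingRecord` is structurally a `ChangeKey`, a stored record can be passed directly as the second argument without projecting it first.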

The fingerprint and version strings are opaque — only ever compared for equality, never parsed or ordered. pipelineVersion is kept as a separate field, never folded into the fingerprint: the data side is observed automatically, the logic side is intentionally declared.

Robustness

sourceFingerprint is hardened against malformed third-party metadata:

  • A malformed HTTP Last-Modified or dct:modified (an Invalid Date) is treated as absent rather than crashing toISOString() with a RangeError, and can no longer stick ahead of a valid date (validDate > invalidDate is number > NaN, always false).
  • A non-numeric Content-Length (NaN) is treated as absent, falling back to the declared dcat:byteSize, so the fingerprint stays stable.
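
The hardening above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: the `Distribution` and `ProbeResult` shapes, the helper names `isValidDate` and `usableSize`, and the `date|size` output format are all hypothetical. Only `sourceFingerprint`, `mostRecent`, and the described behaviour come from the PR text.

```typescript
// Hypothetical input shapes; the real types in the codebase may differ.
interface Distribution {
  isSparqlEndpoint: boolean;
  modified?: Date | null;   // register's dct:modified
  byteSize?: number | null; // declared dcat:byteSize
}
interface ProbeResult {
  lastModified?: Date | null;    // HTTP Last-Modified
  contentLength?: number | null; // HTTP Content-Length
}

// An Invalid Date has a NaN timestamp; treat it as absent.
function isValidDate(d: Date | null | undefined): d is Date {
  return d instanceof Date && !Number.isNaN(d.getTime());
}

// Skips invalid dates entirely, so an invalid first candidate cannot
// "stick" just because `valid > NaN` is always false.
function mostRecent(...dates: Array<Date | null | undefined>): Date | undefined {
  let latest: Date | undefined;
  for (const d of dates) {
    if (!isValidDate(d)) continue;
    if (latest === undefined || d.getTime() > latest.getTime()) latest = d;
  }
  return latest;
}

// A NaN or otherwise non-finite size is treated as absent.
function usableSize(n: number | null | undefined): number | undefined {
  return typeof n === 'number' && Number.isFinite(n) ? n : undefined;
}

function sourceFingerprint(distribution: Distribution, probe: ProbeResult): string | null {
  // A live SPARQL endpoint exposes no change signal: always reprocess.
  if (distribution.isSparqlEndpoint) return null;

  const date = mostRecent(distribution.modified, probe.lastModified);
  // Probe Content-Length wins; fall back to the declared dcat:byteSize.
  const size = usableSize(probe.contentLength) ?? usableSize(distribution.byteSize);

  if (date === undefined && size === undefined) return null;

  // Assumed format. The value is opaque and only ever compared for equality.
  return `${date?.toISOString() ?? ''}|${size ?? ''}`;
}
```

With this shape, a malformed `Last-Modified` paired with a valid `dct:modified` yields the same fingerprint as the valid date alone, so a transient bad header does not by itself trigger a reprocess.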

Tests

Unit tests covering both modules through their public interface: date precedence (most-recent wins, both orderings), byte-size participation and Content-Length-over-declared precedence, null cases, the malformed-metadata paths above, and every branch of the reprocess decision including the null-never-equal guard and status being ignored.

Part of #450.

ddeboer added 2 commits June 10, 2026 15:26
- Add sourceSignal(distribution, probeResult): derive a cheap, opaque
  source-change signal from probe metadata. Live SPARQL endpoints yield
  null (always reprocess); data dumps combine max(dct:modified,
  Last-Modified) with byte size (probe Content-Length, falling back to
  declared dcat:byteSize).
- Add shouldReprocess(current, stored): pure equality on the two change
  fields, with a null signal that never compares equal so it always
  reprocesses.
- Add the ProcessingRecord type (records status for failure handling,
  though the skip rule is equality-only and ignores it).
- Export the new provenance module from the package index.

Pure building blocks for issue #450; the ProvenanceStore, two-phase
resolver split, and pipeline gate follow in later changes.
…lformed dates

- Rename sourceSignal() to sourceFingerprint() (function and file): the
  value is an opaque composite of max(date) and byte size, derived from
  observed source metadata, not a timestamp. 'Fingerprint' conveys
  equality-only, don't-parse semantics; 'signal' was vaguer and invited
  treating it as a date.
- Rename ProcessingRecord.sourceModified to sourceFingerprint to match.
- Rename ChangeFields to ChangeKey and define it as
  Pick<ProcessingRecord, 'sourceFingerprint' | 'pipelineVersion'> so there
  is one source of truth and it expresses the change-determining subset.
- Keep pipelineVersion a separate field, never folded into the
  fingerprint: the data side is observed, the logic side is declared.
- Fix a RangeError: a malformed HTTP Last-Modified or dct:modified (an
  Invalid Date) was selected by mostRecent and crashed toISOString().
  Invalid Dates are now skipped, which also stops an invalid date from
  sticking ahead of a valid one (validDate > invalidDate is num > NaN,
  always false).
- Treat a non-numeric Content-Length (NaN) as absent, falling back to the
  declared dcat:byteSize, so the fingerprint stays stable.
- Add regression tests for the malformed-metadata paths.
ddeboer changed the title from "feat(pipeline): add source-change signal and reprocess decision" to "feat(pipeline): add source-change fingerprint and reprocess decision" on Jun 10, 2026