feat(fingerprint): switch to new fingerprint algorithm [3/3] #243
dmcilvaney wants to merge 8 commits into
Pull request overview
This PR switches the component fingerprinting substrate from hashstructure over live Go structs to a frozen, canonical v1 projection (projectV1) hashed with stdlib sha256, and stores fingerprints as an atomic content-version token (v1:sha256:<digest>). It also tightens/clarifies the fingerprint field-decision model by requiring explicit fingerprint tags on all fingerprinted fields and removes the hashstructure module dependency.
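To make the token shape concrete, here is a minimal sketch of stamping a `v1:sha256:<digest>` token from already-canonical projection bytes. It is not the PR's `ComputeIdentity`: the names `stampToken` and `contentVersion` are hypothetical, hex encoding of the digest is an assumption, and the real path also folds in the non-config inputs before stamping.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// contentVersion stands in for the "v1" prefix described in the overview.
// The constant name is hypothetical.
const contentVersion = 1

// stampToken hashes canonical projection bytes with stdlib sha256 and wraps
// the digest in the atomic "v<N>:sha256:<hex>" shape. Hex encoding is an
// assumption; the PR text only shows "<digest>".
func stampToken(projection []byte) string {
	sum := sha256.Sum256(projection)
	return fmt.Sprintf("v%d:sha256:%s", contentVersion, hex.EncodeToString(sum[:]))
}

func main() {
	fmt.Println(stampToken([]byte("abc")))
}
```

Because the version prefix and the digest travel in one string, a consumer can compare tokens as opaque values without parsing them.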
Changes:
- Rewired `ComputeIdentity` to hash `projectV1(canonicalizeForFingerprint(cfg))` and to emit the atomic `v1:sha256:` token; removed the old config-hash artifact and the `hashstructure` dependency.
- Introduced and guarded the v1 projection substrate (canonical encoder, version-set tag parser, golden vectors, emission probe) and enforced mandatory per-field fingerprint decisions.
- Relaxed lockfile read gating to accept format versions in `[1..currentVersion]` while explicitly pinning `formatVersion == 1` independent of token content version.
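The relaxed read gate can be pictured roughly as follows. This is a sketch under assumptions: `checkFormatVersion`, `errUnsupportedVersion`, and the error wording are hypothetical and are not taken from `internal/lockfile/lockfile.go`.

```go
package main

import (
	"errors"
	"fmt"
)

// currentVersion is the highest lockfile format version this build writes.
// The PR pins it at 1, independent of the token's content version.
const currentVersion = 1

// errUnsupportedVersion is a hypothetical sentinel for out-of-range versions.
var errUnsupportedVersion = errors.New("unsupported lockfile format version")

// checkFormatVersion accepts any format version in [1..currentVersion] and
// rejects everything else, instead of requiring an exact match.
func checkFormatVersion(v int) error {
	if v < 1 || v > currentVersion {
		return fmt.Errorf("%w: got %d, want 1..%d", errUnsupportedVersion, v, currentVersion)
	}
	return nil
}

func main() {
	fmt.Println(checkFormatVersion(1))
	fmt.Println(checkFormatVersion(2))
}
```

With `currentVersion` at 1 the range check and an equality check accept the same inputs; the range form only starts to differ once the format version is bumped.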
Reviewed changes
Copilot reviewed 44 out of 46 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| report/schema-version-parts/report-phase1.md | Phase 1 implementation report (encoder + tag parser + combiner). |
| report/schema-version-parts/report-phase2.md | Phase 2 implementation report (projectV1 + golden vectors + tags). |
| report/schema-version-parts/report-phase3.md | Phase 3 implementation report (cutover wiring + atomic token + hashstructure removal). |
| report/schema-version-parts/.gitkeep | Documents expected phase report files for this workstream. |
| plan/schema-version-parts/phase1-encoder-tag-parser.md | Phase 1 plan/checklist updated to completed. |
| plan/schema-version-parts/phase2-projection-golden-vectors.md | Phase 2 plan/checklist updated to completed. |
| plan/schema-version-parts/phase3-reset-cutover.md | Phase 3 plan/checklist updated to completed. |
| plan/schema-version-parts/overview.md | Workstream overview updated to completed, including guardrails and phase status. |
| plan/schema-version-parts/handoff-prompt.md | Handoff prompt for phased execution of the workstream. |
| internal/projectconfig/component.go | Adds fingerprint:"v1..*"/"-" tags and hazard comments on pruned subtrees. |
| internal/projectconfig/build.go | Adds fingerprint tags and hazard comments for excluded composites. |
| internal/projectconfig/distro.go | Adds fingerprint tags to distro reference fields. |
| internal/projectconfig/overlay.go | Adds fingerprint tags to overlay fields that affect build inputs. |
| internal/projectconfig/package.go | Adds hazard comment for excluded publish subtree in package config. |
| internal/projectconfig/render.go | Adds fingerprint tag to render config field. |
| internal/projectconfig/specsource.go | Adds fingerprint tags to spec source fields; keeps path excluded. |
| internal/projectconfig/fingerprint_test.go | Enforces mandatory fingerprint-tag decisions via fingerprint.ValidateFieldTag and central type list. |
| internal/fingerprint/fingerprint.go | Switches ComputeIdentity to projection-based hashing and stamps v1:sha256: token. |
| internal/fingerprint/combine.go | Defines combineProjection for folding projection bytes + non-config inputs. |
| internal/fingerprint/combine_internal_test.go | Unit tests for combineProjection. |
| internal/fingerprint/canonical.go | Canonical length-prefixed encoder for projection bytes. |
| internal/fingerprint/canonical_internal_test.go | Unit tests for canonical encoder behaviors and edge cases. |
| internal/fingerprint/versiontag.go | Version-set fingerprint tag parser (vN..*, !, key=) and validation. |
| internal/fingerprint/versiontag_internal_test.go | Unit tests for tag parsing/validation and emit-key resolution. |
| internal/fingerprint/project.go | Implements projectV1, canonicalizer, tag validation, and the fingerprinted-type list. |
| internal/fingerprint/project_internal_test.go | Tests canonicalization, projection behaviors, emission probe, and composite-! placeholder. |
| internal/fingerprint/golden_internal_test.go | Golden-vector freeze + append-only guard + -update-golden support. |
| internal/fingerprint/testdata/golden_v1.json | Frozen v1 (config -> digest) golden vector table. |
| internal/fingerprint/fingerprint_test.go | Updates identity tests to assert v1:sha256: token shape. |
| internal/lockfile/lockfile.go | Relaxes format-version read gate to [1..currentVersion] with updated error message. |
| internal/lockfile/lockfile_test.go | Adds format-version pinning/round-trip test independent of token content version. |
| internal/app/azldev/core/components/resolver.go | Documents force-rehash behavior via string inequality for legacy tokens. |
| internal/app/azldev/cmds/component/update.go | Documents force-rehash behavior via string inequality at update restamp site. |
| internal/app/azldev/cmds/component/update_test.go | Adds test verifying legacy prefix-less tokens force-rehash to v1:sha256: on update. |
| docs/developer/reference/component-identity-and-locking.md | Updates developer reference to projection substrate + version-set tags + atomic token. |
| docs/developer/schema-migration/README.md | Adds executive summary for the RFC/workstream. |
| docs/developer/schema-migration/problem-and-motivation.md | Adds problem statement and motivation summary doc. |
| docs/developer/schema-migration/part-1-the-reset.md | Adds Part 1 (reset) summary doc. |
| docs/developer/schema-migration/part-2-lazy-migration.md | Adds Part 2 (deferred) summary doc. |
| docs/developer/schema-migration/delivery-plan.md | Adds delivery plan summary doc. |
| .github/instructions/projectconfig-fingerprint.instructions.md | Adds repo guidance for safe edits to config structs/fingerprint substrate. |
| .github/instructions/go.instructions.md | Updates Go instructions to point at the new fingerprint/config guidance. |
| .github/copilot-instructions.md | Adds a critical note to read the fingerprint/config instruction doc before such edits. |
| go.mod | Removes github.com/mitchellh/hashstructure/v2 dependency. |
| go.sum | Removes hashstructure checksums. |
| // maximalConfig returns a config with every measured scalar-leaf and map field | ||
| // maximalConfig is the frozen v1-cutover field set: every field measured at the | ||
| // cutover, each set to a distinct non-zero value, golden-vectored as "maximal". |
d33009a to 0c2a32a
| // ComponentIdentity holds the computed fingerprint for a single component. | ||
| type ComponentIdentity struct { | ||
| // Fingerprint is the overall SHA256 hash combining all inputs. | ||
| // Fingerprint is the atomic "v<N>:sha256:..." content token combining the | ||
| // canonical config projection with the non-config inputs. | ||
| Fingerprint string `json:"fingerprint"` | ||
| // Inputs provides the individual input hashes that were combined. | ||
| Inputs ComponentInputs `json:"inputs"` | ||
| } |
| 1. **Config projection digest** - `sha256` of the canonical `projectV1` projection of the resolved `ComponentConfig` (after all merging). Only fields whose `fingerprint` tag measures them at v1 are emitted; `fingerprint:"-"` fields are excluded. A nil-or-empty scalar slice is treated as zero and omitted by the projection's omit predicate, so a merge-order nil-vs-`[]` difference never moves the digest. | ||
| 2. **Source identity** - content hash for local specs (all files in the spec directory), commit hash for upstream. | ||
| 3. **Overlay file hashes** - SHA256 of each file referenced by overlay `Source` fields. | ||
| 4. **Distro name + version** | ||
| 5. **Manual release bump counter** — increments with each manual release bump, ensuring a new fingerprint even if there are no config or source changes. | ||
| 5. **Manual release bump counter** - increments with each manual release bump, ensuring a new fingerprint even if there are no config or source changes. |
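As an illustration of how the five inputs listed above could be folded into one digest, here is a hedged sketch. The `inputs` struct, the label strings, and the 8-byte length framing are all assumptions made for the example; the PR's `combineProjection` may frame its inputs differently.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"encoding/hex"
	"fmt"
	"hash"
	"sort"
)

// inputs is a hypothetical stand-in for the non-config inputs.
type inputs struct {
	SourceIdentity    string
	OverlayFileHashes map[string]string // overlay path -> file hash
	Distro            string            // name + version
	ManualBump        uint64
}

// writeField writes a label and a value, each preceded by an 8-byte
// big-endian length, so adjacent fields cannot be confused with each other.
func writeField(h hash.Hash, label string, value []byte) {
	var n [8]byte
	binary.BigEndian.PutUint64(n[:], uint64(len(label)))
	h.Write(n[:])
	h.Write([]byte(label))
	binary.BigEndian.PutUint64(n[:], uint64(len(value)))
	h.Write(n[:])
	h.Write(value)
}

// combine folds sha256(projection) and the other inputs into one digest.
func combine(projection []byte, in inputs) string {
	h := sha256.New()
	cfg := sha256.Sum256(projection)
	writeField(h, "config", cfg[:])
	writeField(h, "source", []byte(in.SourceIdentity))

	// Map iteration order is random in Go, so sort paths for determinism.
	paths := make([]string, 0, len(in.OverlayFileHashes))
	for p := range in.OverlayFileHashes {
		paths = append(paths, p)
	}
	sort.Strings(paths)
	for _, p := range paths {
		writeField(h, "overlay:"+p, []byte(in.OverlayFileHashes[p]))
	}

	writeField(h, "distro", []byte(in.Distro))
	var bump [8]byte
	binary.BigEndian.PutUint64(bump[:], in.ManualBump)
	writeField(h, "bump", bump[:])
	return "sha256:" + hex.EncodeToString(h.Sum(nil))
}

func main() {
	in := inputs{SourceIdentity: "abc123", Distro: "azl 4.0"}
	fmt.Println(combine([]byte("projection"), in))
}
```

Incrementing `ManualBump` alone changes the result, which matches the stated purpose of the bump counter.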
| from the module. Because Phases 1-2 already shipped `canonicalizeForFingerprint`, | ||
| `projectV1`, and `combineProjection` beside the live path, this phase is | ||
| **deletion-heavy rewiring** (net +139/-95 across source), not new machinery: | ||
|
|
||
| 1. **3.1 substrate swap** - the `hashstructure.Hash` config-hash step is replaced | ||
| by `projectV1(canonicalizeForFingerprint(component))` invoked inside the hash | ||
| boundary, so every path into the hasher is canonicalized. The `uint64` | ||
| `ComponentInputs.ConfigHash` artifact and the old `combineInputs` fold are | ||
| deleted; everything is `sha256` now. |
0c2a32a to f44abbf
| // | ||
| // A golden vector is a frozen (config -> digest) pair that pins the v1 byte | ||
| // encoding irreversibly. The freeze is append-only and these rules are load-bearing | ||
| // because the encoding becomes a one-way door the moment Phase 3 ships: |
Drop the Phase 3 wording
or reference RFC
|
|
||
| const goldenPath = "testdata/golden_v1.json" | ||
|
|
||
| // goldenConfigs is the v1 freeze corpus: one maximal config (every measured field |
instructions should be worded to account for v2 etc.
| // -update-golden flag only APPENDS; a moved existing digest is a deliberate | ||
| // FATAL. If an existing digest moves, your change was NOT output-preserving for | ||
| // the fleet - that is a bug to fix, not a value to -update away. | ||
| // - maximalConfig (the "maximal" vector) is FROZEN: it is a corpus config, so |
The `maximalConfig` name still feels weird; pick a better name.
| // - edge cases: name the property exercised ("defines-empty-value", | ||
| // "single-overlay", ...). | ||
| // - an additive field: "<toml-key>-set", one isolated vector per field. | ||
| func goldenConfigs() map[string]projectconfig.ComponentConfig { |
As above, should this be v1 only, or all versions? Should the generator add new files for each version and leave this as-is?
| // currentContentVersion is the highest content version that exists. The reset | ||
| // establishes v1, so projectV1 is the current (and only) projection and the tag | ||
| // parser rejects any field that references a future version. | ||
| const currentContentVersion = 1 |
Should each projector have its own file (`v1.go`, `v2.go`, etc.), with common bits in `project.go`?
| // proj is a thin accumulator over [canonicalBuf] that defers the first emit | ||
| // error (bufio-style), so the hand-written projectors above stay readable | ||
| // instead of checking every call. The deferred error surfaces from [proj.bytes]. | ||
| type proj struct { |
Make this v1-specific as well, to allow for bug fixes?
| // Includes the reset's force-rehash case: a pre-reset prefix-less token | ||
| // never equals the recomputed v1:sha256: token, so it reads as Stale and | ||
| // is re-stamped on the next update. Inequality IS the reconciliation; do | ||
| // not make it version-aware before the PR C replay registry. |
What is PR C? Reference the RFC if needed.
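For readers following the force-rehash comment in the snippet above, the reconciliation rule reduces to plain string comparison. This sketch uses hypothetical names (`classify`, `Fresh`); only `Stale` appears in the PR text.

```go
package main

import "fmt"

// status is a hypothetical two-state result for a stored fingerprint.
type status int

const (
	Fresh status = iota // stored token equals the recomputed one
	Stale               // any difference, including a missing "v1:sha256:" prefix
)

// classify treats both tokens as opaque strings. A pre-reset, prefix-less
// token can never equal a recomputed "v1:sha256:..." token, so it reads as
// Stale without any version-aware parsing.
func classify(stored, recomputed string) status {
	if stored == recomputed {
		return Fresh
	}
	return Stale
}

func main() {
	recomputed := "v1:sha256:0123abcd"
	fmt.Println(classify("0123abcd", recomputed) == Stale)
	fmt.Println(classify(recomputed, recomputed) == Fresh)
}
```

Keeping the comparison parser-free means a legacy lock entry is simply re-stamped on the next update, with no special-case code path to maintain.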
| // with a domain-separating label. The config contribution is sha256(projection), | ||
| // unifying everything on sha256. [ComputeIdentity] stamps the returned | ||
| // "sha256:..." digest into the atomic "v<version>:sha256:..." content token. | ||
| func combineProjection(projection []byte, inputs componentInputs) string { |
PR 3 seems to edit this; why add then remove the code?
| // so it is rejected here rather than silently guessing an encoding. | ||
| func ValidateFieldTag(field reflect.StructField) error { | ||
| set, err := parseVersionSet(field.Tag.Get(hashstructureTagName), currentContentVersion) | ||
| set, err := parseVersionSet(field.Tag.Get(fingerprintTagName), currentContentVersion) |
Drop `hashstructureTagName`?
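To show what a mandatory version-set decision might look like in code, here is a deliberately reduced sketch. It only handles the `"-"` and `"vN..*"` forms; the `!` and `key=` forms mentioned in the file summary are omitted because their exact grammar is not shown on this page. `parseTag`, `versionSet`, and the error messages are hypothetical and are not the PR's `parseVersionSet`.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// currentContentVersion mirrors the constant quoted in the diff above.
const currentContentVersion = 1

// versionSet is a hypothetical parsed form of a fingerprint tag.
type versionSet struct {
	Excluded bool // tag was "-"
	From     int  // first content version that measures the field
}

// parseTag accepts "-" or "vN..*" and rejects everything else, including an
// empty tag (no decision made) and any N above currentContentVersion.
func parseTag(tag string) (versionSet, error) {
	if tag == "" {
		return versionSet{}, errors.New("missing fingerprint tag: every field needs an explicit decision")
	}
	if tag == "-" {
		return versionSet{Excluded: true}, nil
	}
	if !strings.HasPrefix(tag, "v") || !strings.HasSuffix(tag, "..*") {
		return versionSet{}, fmt.Errorf("unrecognized fingerprint tag %q", tag)
	}
	num := strings.TrimSuffix(strings.TrimPrefix(tag, "v"), "..*")
	n, err := strconv.Atoi(num)
	// The round-trip check rejects forms such as "+1" or "01".
	if err != nil || n < 1 || strconv.Itoa(n) != num {
		return versionSet{}, fmt.Errorf("bad version in fingerprint tag %q", tag)
	}
	if n > currentContentVersion {
		return versionSet{}, fmt.Errorf("tag %q references future version v%d", tag, n)
	}
	return versionSet{From: n}, nil
}

// measuredAt reports whether the field is emitted at content version v.
func (s versionSet) measuredAt(v int) bool { return !s.Excluded && v >= s.From }

func main() {
	for _, tag := range []string{"v1..*", "-", "", "v2..*"} {
		set, err := parseTag(tag)
		fmt.Printf("%q -> %+v, err=%v\n", tag, set, err)
	}
}
```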
035a29e to a89ecc5
a89ecc5 to 2a07072
| ## Known Limitations | ||
|
|
||
| - It is difficult to determine WHY a diff occurred (e.g., which specific field changed) since the fingerprint is a single opaque hash. The JSON output includes an `inputs` breakdown (`configHash`, `sourceIdentity`, `overlayFileHashes`, etc.) that can help narrow it down by comparing the two identity files manually. | ||
| - It is difficult to determine WHY a diff occurred (e.g., which specific field changed) since the fingerprint is a single opaque token. The JSON output includes an `inputs` breakdown (`sourceIdentity`, `overlayFileHashes`, manual bump, release ver) that can help narrow it down by comparing two identity files manually. |
| - **`ComponentIdentity.Inputs` kept, marked with a `ponytail:` note.** The breakdown | ||
| has no production reader today (only tests inspect it); left in place rather than | ||
| expanding scope, flagged for removal with its call sites if it stays unused. |
| - Depends on: Phase 1 (`canonicalBuf`, `combineProjection`) and Phase 2 | ||
| (`projectV1`, `canonicalizeForFingerprint`, golden vectors, the | ||
| `currentContentVersion = 1` constant). |
| | `internal/fingerprint/` | New canonical encoder, tag parser, `projectV1`, `canonicalizeForFingerprint`, sha256 combiner; `ComputeIdentity` switch-over; remove `hashstructure`. (`fingerprint.go`) | | ||
| | `internal/projectconfig/` | `fingerprint:"vN..*"` tags on `ComponentConfig` and nested structs (`component.go`, `build.go`); extend `TestAllFingerprintedFieldsHaveDecision` (`fingerprint_test.go`); `Packages` mandatory-tag correction; resolver scalar-slice canonicalization (`component.go`). | |
| | ---- | ---------- | | ||
| | Hand-written `projectV1` silently forgets to emit a measured field (G5 stale) | Emission probe (sentinel-filled config asserts every measured emit-key appears) + extended decision test; recoverable by shipping a corrected version. | | ||
| | Wrong/accidental byte encoding frozen at the reset (irreversible after Phase 3) | Pin the full v1 encoding table up front; append-only golden vectors authored in Phase 2 before the switch; encoding decisions are RFC-settled, not discovered. | | ||
| | nil-vs-empty scalar slice produces non-deterministic bytes | `canonicalizeForFingerprint` collapses nil-or-empty scalar slices to one canonical form at the hash boundary; its canonical-form test is written FIRST, before any golden vector. | |
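The emission-probe mitigation in the first row can be sketched as a comparison between the keys a hand-written projector emits and the keys reflection says are measured. Everything here is illustrative: the `overlay` struct, its fields, and the map-based projector are hypothetical simplifications (the real projection emits canonical bytes, not a map), and the projector deliberately forgets one field so the probe has something to report.

```go
package main

import (
	"fmt"
	"reflect"
	"sort"
	"strings"
)

// overlay is a hypothetical fingerprinted struct.
type overlay struct {
	Source      string `toml:"source" fingerprint:"v1..*"`
	Description string `toml:"description" fingerprint:"-"`
	Type        string `toml:"type" fingerprint:"v1..*"`
}

// projectOverlay is a hand-written projector that forgets to emit "type".
func projectOverlay(o overlay) map[string]string {
	return map[string]string{"source": o.Source}
}

// measuredKeys lists the toml keys of every field not tagged "-".
func measuredKeys(t reflect.Type) []string {
	var keys []string
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		if f.Tag.Get("fingerprint") == "-" {
			continue
		}
		keys = append(keys, strings.Split(f.Tag.Get("toml"), ",")[0])
	}
	sort.Strings(keys)
	return keys
}

// missingKeys returns the measured keys that the projector did not emit.
func missingKeys(emitted map[string]string, want []string) []string {
	var missing []string
	for _, k := range want {
		if _, ok := emitted[k]; !ok {
			missing = append(missing, k)
		}
	}
	return missing
}

// probeOverlay fills every field with a sentinel and runs the comparison.
func probeOverlay() []string {
	sentinel := overlay{Source: "S", Description: "D", Type: "T"}
	return missingKeys(projectOverlay(sentinel), measuredKeys(reflect.TypeOf(sentinel)))
}

func main() {
	fmt.Println(probeOverlay()) // prints: [type]
}
```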
| Replace the `hashstructure.Hash(component, ...)` config-hash step with | ||
| `canonicalizeForFingerprint(cfg)` immediately followed by `projectV1` + `sha256`, invoked | ||
| **inside the hash boundary** so every path into the hasher is canonicalized regardless of how | ||
| the caller obtained the config. Retire the `uint64` `ConfigHash` field on `ComponentInputs`; | ||
| one hash format (`sha256`) everywhere. |
| 1. A scalar-slice **canonicalizer** (`canonicalizeForFingerprint`) - a generic | ||
| reflective normalizer that collapses every nil-or-empty scalar slice to nil at | ||
| the hash boundary, with no hand-maintained field inventory. Its canonical-form | ||
| test was written **first**, before any golden vector (the ordering gate). |
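A reflective normalizer of the kind described above could look roughly like this. It is a sketch, not the PR's `canonicalizeForFingerprint`: it mutates in place rather than returning a value, has no cycle detection, and the `config` and `build` types are hypothetical.

```go
package main

import (
	"fmt"
	"reflect"
)

type build struct {
	Defines []string `fingerprint:"v1..*"`
	Notes   []string `fingerprint:"-"` // pruned: never visited
}

type config struct {
	Name  string   `fingerprint:"v1..*"`
	Tags  []string `fingerprint:"v1..*"`
	Build build    `fingerprint:"v1..*"`
}

// isScalar reports whether k is a string, bool, or numeric kind.
func isScalar(k reflect.Kind) bool {
	switch k {
	case reflect.String, reflect.Bool,
		reflect.Int, reflect.Int8, reflect.Int16, reflect.Int32, reflect.Int64,
		reflect.Uint, reflect.Uint8, reflect.Uint16, reflect.Uint32, reflect.Uint64,
		reflect.Float32, reflect.Float64:
		return true
	}
	return false
}

// canonicalize walks v and sets every empty, non-nil scalar slice to nil.
// It stops at fields tagged fingerprint:"-" and needs no field inventory.
func canonicalize(v reflect.Value) {
	switch v.Kind() {
	case reflect.Ptr:
		if !v.IsNil() {
			canonicalize(v.Elem())
		}
	case reflect.Struct:
		t := v.Type()
		for i := 0; i < t.NumField(); i++ {
			f := t.Field(i)
			if !f.IsExported() || f.Tag.Get("fingerprint") == "-" {
				continue
			}
			canonicalize(v.Field(i))
		}
	case reflect.Slice:
		if isScalar(v.Type().Elem().Kind()) {
			if !v.IsNil() && v.Len() == 0 && v.CanSet() {
				v.Set(reflect.Zero(v.Type()))
			}
			return
		}
		for i := 0; i < v.Len(); i++ {
			canonicalize(v.Index(i))
		}
	}
}

// canonicalizeConfig is a typed convenience wrapper.
func canonicalizeConfig(c *config) { canonicalize(reflect.ValueOf(c)) }

func main() {
	c := config{Name: "demo", Tags: []string{}, Build: build{Defines: []string{}, Notes: []string{}}}
	canonicalizeConfig(&c)
	fmt.Println(c.Tags == nil, c.Build.Defines == nil, c.Build.Notes == nil) // prints: true true false
}
```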
| - `project.go` - `currentContentVersion = 1`; `projectV1` + nested sub-projectors; | ||
| the `proj` accumulator (defers the first emit error, bufio-style); | ||
| `canonicalizeForFingerprint` / `canonicalizeInto` / `canonicalizeSlice`; | ||
| `ValidateFieldTag` (mandatory-decision + composite-`!` gate) and `tomlKeyOf`. |
f96eb74 to 9352243
9352243 to 29ee99d
| 4. **If measured, emit it in `projectV1`** (`internal/fingerprint/project.go`, | ||
| hand-written as of v1) inside the correct sub-projector, under its **frozen `toml` | ||
| emit-key** (or an explicit `key=` in the tag - never the Go field name). |
| - `internal/fingerprint/fingerprint.go` - `ComputeIdentity` calls | ||
| `projectV1(canonicalizeForFingerprint(...))` + `combineProjection`, stamps the | ||
| `v1:` token; `ComponentInputs.ConfigHash` and `combineInputs` deleted; | ||
| `hashstructure` import removed; `hashstructureTagName` const renamed | ||
| `fingerprintTagName`. |
| - `internal/fingerprint/combine.go` - `combineProjection` retyped to take | ||
| `ComponentInputs` directly; the duplicate private `projectionInputs` mirror | ||
| struct deleted (the two were identical once `ConfigHash` was dropped). |
29ee99d to 97f425a
| - **Build optimization**: only rebuild changed components and their dependents, skipping unchanged ones. | ||
| - **Automatic release bumping**: increment the release tag of changed packages automatically, and include the commits that caused the change in the changelog. | ||
|
|
||
| > **Design & how-to.** The substrate design (version-set tags, the `v1:sha256:` content token, force-rehash reconciliation, and the future lazy schema migration) is specified in the [Lock-File Fingerprint Reset RFC](../rfc/lazy-schema-migration.md). The step-by-step rules for adding a fingerprinted field live in [`projectconfig-fingerprint.instructions.md`](../../../.github/instructions/projectconfig-fingerprint.instructions.md). |
…256 combiner

Phase 1 (PR A1) of the schema-version-parts cutover. Adds the pure projection-substrate primitives in internal/fingerprint, beside the existing hashstructure path: the canonicalBuf length-prefixed encoder with the split omit-predicate, the fingerprint version-set tag parser, and the sha256 combiner step. Nothing is wired into ComputeIdentity and hashstructure is untouched, so no lock byte or scenario snapshot changes. Includes in-package unit tests and the phase 1 report; updates plan status.
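The reason a canonical encoder is length-prefixed can be shown with a small sketch. The `canonBuf` type, its 4-byte big-endian framing, and the "omit empty strings" predicate are assumptions for illustration; the PR's `canonicalBuf` pins its own encoding table.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// canonBuf is a hypothetical length-prefixed encoder.
type canonBuf struct{ buf bytes.Buffer }

// writeBytes writes a 4-byte big-endian length followed by the bytes.
func (c *canonBuf) writeBytes(b []byte) {
	var n [4]byte
	binary.BigEndian.PutUint32(n[:], uint32(len(b)))
	c.buf.Write(n[:])
	c.buf.Write(b)
}

// emitString writes key and value, or nothing at all when the value is
// empty (a simple omit predicate, so zero values never move the bytes).
func (c *canonBuf) emitString(key, value string) {
	if value == "" {
		return
	}
	c.writeBytes([]byte(key))
	c.writeBytes([]byte(value))
}

// encode is a helper that emits the given key/value pairs in order.
func encode(pairs ...[2]string) []byte {
	var c canonBuf
	for _, p := range pairs {
		c.emitString(p[0], p[1])
	}
	return c.buf.Bytes()
}

func main() {
	// Without length prefixes both pairs would concatenate to "abc".
	a := encode([2]string{"ab", "c"})
	b := encode([2]string{"a", "bc"})
	fmt.Println(bytes.Equal(a, b)) // prints: false
}
```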
…er, sha256 combiner
97f425a to 29dc657
29dc657 to 9ca8774
| // failure is the "you forgot to emit the new key" guard working); | ||
| // 3. append a new named vector that sets foo in isolation, then run: | ||
| // go test ./internal/fingerprint -run TestGoldenVectors -update-golden | ||
| // 4. confirm the run APPENDS ONLY - every existing digest must be byte-identical. |
| // updateGolden, when set, appends newly-named golden vectors to the frozen table. | ||
| // It MUST NOT change an existing entry's digest - see goldenPath's header comment. | ||
| // Bootstrap a missing table with: go test ./internal/fingerprint -run TestGoldenVectors -update-golden | ||
| // |
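The append-only rule for `-update-golden` described in these comments can be sketched as a pure function over two name-to-digest tables. `reconcileGolden` and its error messages are hypothetical, and treating a vector that disappears from the corpus as an error is an assumption; the PR's golden test may differ in those details.

```go
package main

import "fmt"

// reconcileGolden compares the frozen table with freshly computed digests.
// Existing entries must match exactly; new names are appended only when
// update is true. A moved digest is always an error, even with update set.
func reconcileGolden(frozen, computed map[string]string, update bool) (map[string]string, error) {
	out := make(map[string]string, len(computed))
	for name, digest := range frozen {
		got, ok := computed[name]
		if !ok {
			return nil, fmt.Errorf("golden vector %q disappeared from the corpus", name)
		}
		if got != digest {
			return nil, fmt.Errorf("golden vector %q moved: frozen %s, got %s", name, digest, got)
		}
		out[name] = digest
	}
	for name, digest := range computed {
		if _, ok := frozen[name]; ok {
			continue
		}
		if !update {
			return nil, fmt.Errorf("golden vector %q is not in the table; rerun with -update-golden", name)
		}
		out[name] = digest
	}
	return out, nil
}

func main() {
	frozen := map[string]string{"maximal": "aaa"}
	_, err := reconcileGolden(frozen, map[string]string{"maximal": "bbb"}, true)
	fmt.Println(err)
}
```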
| 6. **Append a `<toml-key>-set` golden vector** via | ||
| `go test ./internal/fingerprint -run TestGoldenVectors -update-golden`. | ||
| 7. **Regenerate the schema in all three places** if the field is user-facing (has |
9ca8774 to ca017ea
Add the v1 projection layer beside the existing hashstructure path, additive and not yet wired into ComputeIdentity:

- projectV1 + frozen nested sub-projectors, emitting measured fields by literal Go path under their frozen TOML emit-key
- canonicalizeForFingerprint: reflective nil-or-empty scalar-slice normalizer at the hash boundary, pruning at fingerprint:"-" edges (cycle-safe, no field inventory)
- fingerprint:"v1..*" tags on every measured field across the 10 fingerprinted structs; hazard comments on each '-'-pruned composite; Packages kept measured
- golden vectors with an append-only -update-golden guard; emission probe; composite-'!' placeholder gate; mandatory-tag decision test
- frozen maximalConfig vs growing emissionProbeConfig split + documented additive-field workflow and naming convention

No lock byte or scenario snapshot moves; hashstructure untouched.
…a256 token (PR B)

The one-time substrate swap: ComputeIdentity hashes the canonical projectV1 projection via sha256 and stamps the atomic v1:sha256: content token; hashstructure is removed; lock format Version stays 1 with force-rehash reconciliation of pre-reset tokens.

Includes the review-driven consolidations and fixups (squashed):

- single FingerprintedStructTypes() source of truth for the decision test and emission probe, with a completeness/reachability guard (TestFingerprintedStructTypesIsComplete);
- dropped the dead ComponentIdentity.Inputs breakdown (componentInputs is now the internal combiner input only);
- SkipReason documented as deliberately unmeasured (render-only comment);
- v1: token test-hardening (HasPrefix(v1:sha256:), meaningful bump placeholder);
- config/fingerprinting guardrail instructions, the as-built RFC reconciliation (read-gate floor; parser-free reconciliation), and the component-identity doc pointer to the RFC.
ca017ea to ddad6ff
| - `maximalConfig()` is **FROZEN** - it is a golden-corpus config at the time of | ||
| the initial implementation of that fingerprint algorithm, so its digest is | ||
| pinned. **Never add a new field to `maximalConfig()`** (it would move the frozen | ||
| `maximal` digest, a hard CI failure). Field growth goes in | ||
| `emissionProbeConfig()` only. |
Full e2e set of all changes. Designed to be split into multiple PRs