Voyage · EventFarm Parity Methodology
No feature ships unless it passes all four axes. Each axis catches a different class of failure that the others miss. The methodology earned each axis the hard way: a failure escaped detection that the methodology should have caught, and the new axis was added as the documented response.
Each is an independent dimension of "is this shippable?" — a feature that passes three of four still fails.
Axis 01 · Functional · Does it work?
Axis 02 · Visual · Does it look right?
Axis 03 · Semantic · Does it make sense?
Axis 04 · Honesty · Did we cheat to pass the others?
Axis 01 · Product surface
Strict probes verify transport correctness AND end-state correctness against the deployed surface.
The story-runner (Playwright) executes each story's happyPath, failureModes, and pageEvaluation probes against the deployed URL. Each probe's expect array is evaluated under the strict-predicate iron rule: every passOrFail() requires both the network/transport to succeed AND the resulting end-state (DOM, D1 row, audit log, response payload) to be correct.
Every probe in the story's happyPath + failureModes + pageEvaluation returns { ok: true } under the strict iron rule. No soft-passes. No probe is allowed to require less than transport + end-state correctness.
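To make the iron rule concrete, here is a minimal sketch of a single probe under that predicate. The ProbeResult shape, the data-testid selector, and the "Synced" end-state are illustrative assumptions, not the story-runner's real types; the two-part pass condition is the point.

```typescript
// Minimal sketch of one probe under the strict-predicate iron rule.
// The selector and expected end-state are hypothetical examples; the
// real stories declare their own probes.
import { chromium } from 'playwright';

interface ProbeResult { ok: boolean; reason?: string }

async function runProbe(deployedUrl: string): Promise<ProbeResult> {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  try {
    // 1. Transport correctness: the navigation itself must succeed.
    const response = await page.goto(deployedUrl);
    if (!response || !response.ok()) {
      return { ok: false, reason: `transport failed: ${response?.status()}` };
    }
    // 2. End-state correctness: the resulting surface must be right.
    //    A 200 with a broken page is still a fail under the iron rule.
    const status = await page.locator('[data-testid="sync-status"]').textContent();
    if (status?.trim() !== 'Synced') {
      return { ok: false, reason: `end-state wrong: saw "${status}"` };
    }
    return { ok: true }; // both conditions held: strict pass
  } finally {
    await browser.close();
  }
}
```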
A page can pass every functional probe and still ship broken UI (visual axis catches that), display nonsense values (semantic axis catches that), or be the product of a cycle that gamed its own probes (honesty axis catches that).
Axis 02 · Product surface
A vision evaluator scores the rendered page on five aesthetic and accessibility axes; axe-clean codifies WCAG. Both must pass.
The harness captures full-page screenshots at canonical viewports (1920×1080 for big-screen visualizers, 1280×800 for laptop, 375×812 for mobile). Each screenshot is sent to a locally-installed agent CLI (codex) with a strict structured-output prompt that scores 5 axes 1–5 and returns a would_a_designer_ship boolean. axe-clean runs in parallel and asserts zero serious WCAG violations.
All 5 sub-axes ≥ 4 AND would_a_designer_ship === true AND axe-clean passes at severityAtLeast: serious. Below 4 on any sub-axis, a false designer-ship verdict, or any serious axe violation fails the predicate.
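A sketch of that gate as a predicate, assuming the codex CLI's structured output has already been parsed into a verdict object. The field names mirror the prose above; the exact CLI invocation and the axe run are not shown.

```typescript
// Hedged sketch of the visual-axis gate. The verdict shape is an
// assumption about the evaluator's structured output; the axe count is
// whatever the accessibility scan reports at severity >= serious.
interface VisionVerdict {
  scores: Record<string, number>; // five sub-axes, each scored 1-5
  would_a_designer_ship: boolean;
}

function visualAxisPasses(verdict: VisionVerdict, seriousAxeViolations: number): boolean {
  const subAxes = Object.values(verdict.scores);
  return (
    subAxes.length === 5 &&
    subAxes.every((s) => s >= 4) &&           // below 4 on any sub-axis fails
    verdict.would_a_designer_ship === true && // designer-ship false fails
    seriousAxeViolations === 0                // axe-clean must also pass
  );
}
```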
The vision evaluator scores visual quality — it doesn't audit arithmetic on rendered values (semantic axis), and it can't tell whether the page's beauty was achieved by hardcoding the right-looking values into a fixture (honesty axis).
Axis 03 · Product surface
Domain invariants assert that displayed values reconcile arithmetically and respect their natural constraints.
Each story declares a semanticInvariants block listing the domain-specific constraints its surface must satisfy. The harness extracts the values via Playwright DOM selectors, applies the invariant's check, and emits a structured pass/fail. Initial supported kinds: percentages-sum-to-100, displayed-total-equals-sum-of-parts, non-negative-count, monotonic-order, value-equals-attribute. The vocabulary grows as new bug classes are caught.
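As a sketch, one semanticInvariants entry and its check for the percentages-sum-to-100 kind might look like the following; the selector and the result shape are illustrative assumptions, not the harness's real types.

```typescript
// Sketch of one semanticInvariants entry and its check. The selector is
// a hypothetical example; the harness's real vocabulary lives in the
// story files.
import { type Page } from 'playwright';

interface PercentagesInvariant {
  kind: 'percentages-sum-to-100';
  selector: string; // elements whose text content reads like "37%"
}

async function checkInvariant(page: Page, inv: PercentagesInvariant) {
  const texts = await page.locator(inv.selector).allTextContents();
  const sum = texts
    .map((t) => Number.parseFloat(t)) // parseFloat("37%") -> 37
    .reduce((a, b) => a + b, 0);
  // 37 + 27 + 21 + 16 = 101 would fail here: exactly the bug this kind exists for.
  return sum === 100
    ? { ok: true as const }
    : { ok: false as const, reason: `percentages sum to ${sum}, expected 100` };
}
```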
Every semanticInvariants entry returns ok: true. No soft-passes. The fix must address the underlying correctness bug, not adjust the display to mask the math (which would be an honesty axis violation).
The vision evaluator can score a page 5/5 on visual polish while it displays 37% + 27% + 21% + 16% = 101%. The model sees pretty bars; it doesn't audit the numbers. Semantic invariants are the audit. The two axes catch genuinely different failures and must run together.
Axis 04 · Cycle integrity
A meta-axis that audits the implementation and the messaging for cheating, gaming, or ownership-shirking — the failure modes the other three axes can't see.
Static code scan. Audits the cycle's diff + test corpus for hallmarks of cheating — hardcoded values that match test expectations, algorithms that force data into expected shapes by hammering rather than by correctness, tests that always return true, mocks that bypass the actual logic, comments that betray gaming intent.
Post-hoc message audit. Audits the cycle's commit messages + closure markdown + exit reports for ownership-shirking patterns — "pre-existing condition" excuses applied to in-scope failures, illegitimate scope-shifting, verdicts broader than the underlying evidence supports, suppressed surfacing of findings that appear in the diff but not in the closure.
Hallmark examples the scan flags (a detector sketch follows the gate below):
- tally = [37, 27, 21, 16] in a fixture when the test asserts those exact percentages: the implementation is hardcoded to pass.
- if (sum !== 100) values[last] += 100 - sum: hammers percentages to sum without fixing the rounding bug.
- expect(true).toBe(true) dressed up as something meaningful; if (errors > 0) expect(...).toBe(...) where the if makes the assertion optional.
- expect(errors).toBeLessThan(5) when the strict expectation should be expect(errors).toBe(0).

Gate: per cycle, 0 critical findings and 0 unaddressed serious findings. Critical = automatic cycle-level fail. Serious = fails unless explicitly addressed within the cycle. Minor = noted, doesn't gate the verdict.
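A minimal sketch of the static scan, expressing a few of the hallmarks above as patterns. The rule set here is illustrative and deliberately small; the real scan's rules and severity mapping are broader, and the same finding shape would serve the post-hoc message audit.

```typescript
// Sketch of the static honesty scan. The patterns encode the hallmark
// examples above; the real rule set is broader and grows per escape.
type Severity = 'critical' | 'serious' | 'minor';

interface Finding { severity: Severity; line: string; note: string }

const HALLMARKS: Array<{ pattern: RegExp; severity: Severity; note: string }> = [
  { pattern: /expect\(true\)\.toBe\(true\)/, severity: 'critical', note: 'tautological assertion' },
  { pattern: /\+=\s*100\s*-\s*sum/, severity: 'critical', note: 'hammers values into shape instead of fixing the bug' },
  { pattern: /toBeLessThan\(\s*[1-9]/, severity: 'serious', note: 'loosened threshold where exact zero is expected' },
];

function scanDiff(diff: string): Finding[] {
  return diff.split('\n').flatMap((line) =>
    HALLMARKS.filter((h) => h.pattern.test(line))
             .map((h) => ({ severity: h.severity, note: h.note, line })),
  );
}

// Gate: any critical finding fails the cycle outright; serious findings
// fail unless explicitly addressed within the cycle.
const cycleFails = (findings: Finding[]): boolean =>
  findings.some((f) => f.severity === 'critical');
```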
Without it, an agent can pass functional + visual + semantic by gaming each one independently, and the only line of defense is a discerning human eye. Codifying the human's audit pattern as a fourth axis closes that loop. The methodology audits itself.
What the methodology obligates us to do, every time.
No factory completes without producing per-axis verdicts. No row promotes to Shippable until all four axes pass for it. Three of four is a fail. The cycle's third commit is responsible for the matrix re-stamp AND the per-axis evidence bundle.
Pre-cycle measure, post-cycle measure, screenshots, evaluator JSON, closure narrative, deployed-version pointer. All published to the deployed stories surface, all linked from the cycle's roadmap card and findings ledger entry. The trail is persistent — past evidence stays browsable.
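A sketch of what the four-axis matrix and the evidence bundle might look like as data. Every field name here is an assumption; the all-four-axes promotion rule is the methodology's literal requirement.

```typescript
// Sketch of the per-row verdict matrix and per-cycle evidence bundle.
// Field names are assumptions; the all-four-axes rule is literal.
type Axis = 'functional' | 'visual' | 'semantic' | 'honesty';

type VerdictMatrix = Record<Axis, { pass: boolean; evidence: string }>;

interface EvidenceBundle {
  preCycleMeasure: string;  // metric snapshot before the cycle
  postCycleMeasure: string; // metric snapshot after
  screenshots: string[];    // canonical-viewport captures
  evaluatorJson: string;    // raw vision-evaluator verdict
  closureNarrative: string;
  deployedVersion: string;  // pointer to the exact deployed build
}

// Three of four is a fail: a row promotes to Shippable only if every axis passes.
const promotable = (m: VerdictMatrix): boolean =>
  Object.values(m).every((v) => v.pass);
```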
If an axis fails and can't be closed within the time budget, the row demotes with documented reason. Demotion is not a failure of the methodology — it's the methodology working. Soft-passing is the failure mode.
When a human catches a bug the methodology should have caught, the case is logged in the escape ledger. Each escape requires a healing plan: what was missed, root cause, the change to the methodology (new axis, tightened predicate, new invariant kind, new honesty hallmark) that prevents the same class of escape next time. Healing plans are tracked to completion. The audit trail visualizes escape rate over time so we can see whether the methodology is getting better.
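As a sketch, an escape-ledger entry with its healing plan might carry fields like these. The field names are assumptions; the change taxonomy and the tracked-to-completion flag come straight from the prose above.

```typescript
// Sketch of an escape-ledger entry; field names are assumptions, the
// healing-plan change taxonomy mirrors the methodology's own list.
interface EscapeEntry {
  id: string;       // e.g. the founding escape described below
  whatWasMissed: string;
  rootCause: string;
  healingPlan: {
    change: 'new-axis' | 'tightened-predicate' | 'new-invariant-kind' | 'new-honesty-hallmark';
    description: string;
    completed: boolean; // healing plans are tracked to completion
  };
  loggedAt: string;     // ISO date, feeds the escape-rate trend
}
```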
The EF-074 cycle 3 over-claim is the founding escape: a row was declared Shippable based on probe coverage that didn't include aggregate visual quality. The visual axis was added in response. The methodology now treats "Shippable verdict" as a claim that must be proven across all four axes, not a checkbox on probe execution.