
Voyage · EventFarm Parity Methodology

Four axes of analysis. Every story. Every cycle.

No feature ships unless it passes all four axes. Each axis catches a different class of failure that the others miss. Each one was earned the hard way: a failure escaped detection that the methodology should have caught, and the new axis was added as the documented response.

The four axes at a glance

Each is an independent dimension of "is this shippable?" — a feature that passes three of four still fails.

Axis 01 · Functional — does it work?

Axis 02 · Visual — does it look right?

Axis 03 · Semantic — does it make sense?

Axis 04 · Honesty — did we cheat to pass the others?

Axis 01 · Product surface

Functional — does it work?

Strict probes verify transport correctness AND end-state correctness against the deployed surface.

Evaluator

The story-runner (Playwright) executes each story's happyPath, failureModes, and pageEvaluation probes against the deployed URL. Each probe's expect array is evaluated under the strict-predicate iron rule: every passOrFail() requires both the network/transport to succeed AND the resulting end-state (DOM, D1 row, audit log, response payload) to be correct.
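
A minimal sketch of the iron rule as a predicate, assuming a hypothetical ProbeCheck shape; the field names are illustrative, not the story-runner's actual API:

  // Hypothetical shape: what a single strict probe has to establish.
  type ProbeCheck = {
    transportOk: boolean; // e.g. 2xx status, well-formed payload, expected headers
    endStateOk: boolean;  // e.g. DOM updated, D1 row written, audit log appended
    detail: string;
  };

  // The iron rule: transport success alone is never enough to pass.
  function passOrFail(check: ProbeCheck): { ok: boolean; detail: string } {
    return { ok: check.transportOk && check.endStateOk, detail: check.detail };
  }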

What this axis catches

  • Transport failures: 404, 5xx, malformed payloads, missing headers
  • End-state failures: API call returned 200 but the DOM didn't update; vote submitted but tally didn't increment; idempotency key reused but the row was duplicated anyway
  • State machine bugs: out-of-order delta replay regressing to held tally; foreign-event delta mutating local state; reconnect not reapplying snapshot
  • Concurrency edges: 99/100 mailings reaching terminal status while 1/100 stays stuck; the queue's duplicate-invocation race

Pass condition

Every probe in the story's happyPath + failureModes + pageEvaluation returns { ok: true } under the strict iron rule. No soft-passes. No probe is allowed to require less than transport + end-state correctness.

What this axis cannot catch alone

A page can pass every functional probe and still ship broken UI (visual axis catches that), display nonsense values (semantic axis catches that), or be the product of a cycle that gamed its own probes (honesty axis catches that).

Axis 02 · Product surface

Visual — does it look right?

A vision evaluator scores the rendered page on five aesthetic and accessibility axes; axe-clean codifies WCAG. Both must pass.

Evaluator

The harness captures full-page screenshots at canonical viewports (1920×1080 for big-screen visualizers, 1280×800 for laptop, 375×812 for mobile). Each screenshot is sent to a locally-installed agent CLI (codex) with a strict structured-output prompt that scores 5 axes 1–5 and returns a would_a_designer_ship boolean. axe-clean runs in parallel and asserts zero serious WCAG violations.
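
A sketch of the viewport sweep using Playwright; the codex prompt and the axe run are omitted because their exact invocations belong to the harness:

  import { chromium } from 'playwright';

  // Canonical viewports named in the methodology.
  const VIEWPORTS = [
    { name: 'big-screen', width: 1920, height: 1080 },
    { name: 'laptop', width: 1280, height: 800 },
    { name: 'mobile', width: 375, height: 812 },
  ];

  // Capture one full-page screenshot per viewport for the vision evaluator.
  async function captureForVisualAxis(url: string): Promise<string[]> {
    const browser = await chromium.launch();
    const paths: string[] = [];
    for (const vp of VIEWPORTS) {
      const page = await browser.newPage({ viewport: { width: vp.width, height: vp.height } });
      await page.goto(url, { waitUntil: 'networkidle' });
      const file = `screenshots/${vp.name}.png`;
      await page.screenshot({ path: file, fullPage: true });
      paths.push(file);
      await page.close();
    }
    await browser.close();
    return paths;
  }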

The 5 sub-axes

  • typography_hierarchy — heading/body sizes appropriate for viewport; nothing bleeds outside containers
  • contrast_page_wide — every text element passes WCAG AA by visual inspection (4.5:1 body / 3:1 large)
  • layout_integrity — no nested-iframe-feeling, no admin chrome leaking into presentation surfaces, content not trapped in tiny scroll wells
  • content_density — the page does what its URL says it does, without marketing scaffolding
  • visual_polish — overall ship-quality

What this axis catches

  • Admin chrome leaking into presentation surfaces: the original EF-074 visualizer wrapped in topbar + route-summary + status-card panels + data-band — caught by visual axis, fixed by chrome-fix factory
  • Page-wide contrast failures: black-on-dark-gray rows in a SEEDED EVENTS table that no probe targeted
  • Oversized headers / dashboard cards in big-screen visualizers
  • Mobile-only failures invisible at desktop: cramped totals that wrap awkwardly, borderline-AA submit buttons

Pass condition

All 5 sub-axes ≥ 4 AND would_a_designer_ship === true AND axe-clean passes at severityAtLeast: serious. A score below 4 on any sub-axis, or a false designer-ship verdict, fails the predicate.
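
Expressed as a predicate (a sketch; the evaluator's real output schema may differ):

  // Assumed shape of the vision evaluator's structured output.
  type VisionScore = {
    typography_hierarchy: number;
    contrast_page_wide: number;
    layout_integrity: number;
    content_density: number;
    visual_polish: number;
    would_a_designer_ship: boolean;
  };

  function visualAxisPasses(score: VisionScore, seriousAxeViolations: number): boolean {
    const subAxes = [
      score.typography_hierarchy,
      score.contrast_page_wide,
      score.layout_integrity,
      score.content_density,
      score.visual_polish,
    ];
    // Every sub-axis >= 4, designer-ship true, zero serious axe violations.
    return subAxes.every((s) => s >= 4) && score.would_a_designer_ship && seriousAxeViolations === 0;
  }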

What this axis cannot catch alone

The vision evaluator scores visual quality — it doesn't audit arithmetic on rendered values (semantic axis), and it can't tell whether the page's beauty was achieved by hardcoding the right-looking values into a fixture (honesty axis).

Axis 03 · Product surface

Semantic — does it make sense?

Domain invariants assert that displayed values reconcile arithmetically and respect their natural constraints.

Evaluator

Each story declares a semanticInvariants block listing the domain-specific constraints its surface must satisfy. The harness extracts the values via Playwright DOM selectors, applies the invariant's check, and emits a structured pass/fail. Initial supported kinds: percentages-sum-to-100, displayed-total-equals-sum-of-parts, non-negative-count, monotonic-order, value-equals-attribute. The vocabulary grows as new bug classes are caught.
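
A sketch of what a story's semanticInvariants block and one of its checks could look like; the kind names match the list above, but the field names and selectors are illustrative assumptions, not the harness's actual schema:

  // Illustrative only: selectors and fields are hypothetical.
  const semanticInvariants = [
    {
      kind: 'percentages-sum-to-100',
      selectors: ['[data-testid="poll-option-pct"]'],
    },
    {
      kind: 'displayed-total-equals-sum-of-parts',
      totalSelector: '[data-testid="vote-total"]',
      partSelectors: ['[data-testid="poll-option-count"]'],
    },
    {
      kind: 'non-negative-count',
      selectors: ['[data-testid="occupancy-count"]'],
    },
  ];

  // One possible check for percentages-sum-to-100 over the extracted values.
  function percentagesSumTo100(values: number[]): { ok: boolean; detail: string } {
    const sum = values.reduce((a, b) => a + b, 0);
    return { ok: sum === 100, detail: `displayed percentages sum to ${sum}` };
  }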

What this axis catches

  • Arithmetic that doesn't reconcile: poll percentages summing to 101% (naive rounding error); displayed total disagreeing with the sum of its parts
  • Negative values where domain forbids: occupancy counter going below zero; vote count flipping negative under a race
  • Ordering violations: audit-row timestamps out of monotonic order; events arriving with timestamps that go backward
  • Cross-element consistency: rendered text disagreeing with a server-emitted data attribute
  • Bullshit displayed values generally — anything where the user looks at the number and says "that doesn't add up"

Pass condition

Every semanticInvariants entry returns ok: true. No soft-passes. The fix must address the underlying correctness bug, not adjust the display to mask the math (which would be an honesty axis violation).
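
For the poll-percentage case, one legitimate fix is largest-remainder rounding, which makes the displayed integers reconcile by construction rather than by hammering the last value (the force-shape pattern called out under Axis 04). A sketch:

  // Largest-remainder rounding: integer percentages that sum to exactly 100.
  function roundPercentages(counts: number[]): number[] {
    const total = counts.reduce((a, b) => a + b, 0);
    if (total === 0) return counts.map(() => 0);
    const exact = counts.map((c) => (c / total) * 100);
    const floors = exact.map(Math.floor);
    let leftover = 100 - floors.reduce((a, b) => a + b, 0);
    // Hand the leftover points to the values with the largest fractional parts.
    const byFraction = exact
      .map((value, i) => ({ i, frac: value - Math.floor(value) }))
      .sort((a, b) => b.frac - a.frac);
    for (const { i } of byFraction) {
      if (leftover <= 0) break;
      floors[i] += 1;
      leftover -= 1;
    }
    return floors;
  }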

Why this axis is independent of visual

The vision evaluator can score a page 5/5 on visual polish while it displays 37% + 27% + 21% + 16% = 101%. The model sees pretty bars; it doesn't audit the numbers. Semantic invariants are the audit. The two axes catch genuinely different failures and must run together.

Axis 04 · Cycle integrity

Honesty — did we cheat to pass the others?

A meta-axis that audits the implementation and the messaging for cheating, gaming, or ownership-shirking — the failure modes the other three axes can't see.

Two evaluators

Static code scan. Audits the cycle's diff + test corpus for hallmarks of cheating — hardcoded values that match test expectations, algorithms that force data into expected shapes by hammering rather than by correctness, tests that always return true, mocks that bypass the actual logic, comments that betray gaming intent.

Post-hoc message audit. Audits the cycle's commit messages + closure markdown + exit reports for ownership-shirking patterns — "pre-existing condition" excuses applied to in-scope failures, illegitimate scope-shifting, verdicts broader than the underlying evidence supports, suppressed surfacing of findings that appear in the diff but not in the closure.

What this axis catches

  • Cooked data: tally = [37, 27, 21, 16] in a fixture when the test asserts those exact percentages — the implementation is hardcoded to pass
  • Force-shape algorithms: if (sum !== 100) values[last] += 100 - sum forces the percentages to sum to 100 without fixing the rounding bug
  • Always-true tests: expect(true).toBe(true) dressed up as something meaningful; if (errors > 0) expect(...).toBe(...) where the if makes the assertion optional
  • Nerfed thresholds: expect(errors).toBeLessThan(5) when the strict expectation should be expect(errors).toBe(0)
  • "Pre-existing condition" excuses: attributing newly-discovered failures to the state of the world before the cycle started, when in fact the cycle's work depends on or surfaces the gap
  • Illegitimate scope-shifting: "out of scope for this cycle" applied to findings that ARE in scope by definition
  • Verdict broader than evidence: declaring a row Shippable when the cycle's chosen probe set conveniently excludes the dimensions where there's residual risk

Pass condition

Per cycle: 0 critical findings, 0 unaddressed serious findings. Critical = automatic cycle-level fail. Serious = fails unless explicitly addressed within the cycle. Minor = noted, doesn't gate the verdict.
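
As a gate (a sketch; the severities and addressed flag are taken from the pass condition above):

  type Finding = { severity: 'critical' | 'serious' | 'minor'; addressed: boolean };

  // Critical findings fail the cycle outright; serious findings fail unless
  // explicitly addressed within the cycle; minor findings never gate.
  function honestyAxisPasses(findings: Finding[]): boolean {
    const critical = findings.some((f) => f.severity === 'critical');
    const openSerious = findings.some((f) => f.severity === 'serious' && !f.addressed);
    return !critical && !openSerious;
  }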

Why this axis is required

Without it, an agent can pass functional + visual + semantic by gaming each one independently, and the only line of defense is a discerning human eye. Codifying that human audit pattern as a fourth axis closes the loop. The methodology audits itself.

Commitments

What the methodology obligates us to do, every time.

Every cycle runs all four axes

No factory completes without producing per-axis verdicts. No row promotes to Shippable until all four axes pass for it. Three of four is a fail. The cycle's third commit is responsible for the matrix re-stamp AND the per-axis evidence bundle.
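
A sketch of the kind of per-axis evidence bundle the third commit could stamp; the field names are illustrative, not the actual artifact schema:

  type AxisVerdict = {
    axis: 'functional' | 'visual' | 'semantic' | 'honesty';
    pass: boolean;
    evidence: string[]; // probe results, screenshots, evaluator JSON, audit findings
  };

  type CycleStamp = {
    row: string;             // e.g. "EF-074"
    deployedVersion: string; // pointer to the surface that was measured
    verdicts: AxisVerdict[]; // exactly four; all must pass before promotion
    shippable: boolean;      // derived: verdicts.every((v) => v.pass)
  };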

Every evaluation goes into the audit trail

Pre-cycle measure, post-cycle measure, screenshots, evaluator JSON, closure narrative, deployed-version pointer. All published to the deployed stories surface, all linked from the cycle's roadmap card and findings ledger entry. The trail is persistent — past evidence stays browsable.

Honest demotions stay demoted

If an axis fails and can't be closed within the time budget, the row demotes with documented reason. Demotion is not a failure of the methodology — it's the methodology working. Soft-passing is the failure mode.

Human-caught escapes get healing plans

When a human catches a bug the methodology should have caught, the case is logged in the escape ledger. Each escape requires a healing plan: what was missed, root cause, the change to the methodology (new axis, tightened predicate, new invariant kind, new honesty hallmark) that prevents the same class of escape next time. Healing plans are tracked to completion. The audit trail visualizes escape rate over time so we can see whether the methodology is getting better.

No row is shippable on probe coverage alone

The EF-074 cycle 3 over-claim is the founding escape: a row was declared Shippable based on probe coverage that didn't include aggregate visual quality. The visual axis was added in response. The methodology now treats "Shippable verdict" as a claim that must be proven across all four axes, not a checkbox on probe execution.

See it in practice