Methodology

Confidence stops where evidence stops.

The scoring model is intentionally conservative. It separates what was inspected, what executed, what a person observed, and what another reviewer challenged.

Six evidence levels

  • L1 Static covers repository structure and source signals.
  • L2 Deterministic records declared command execution.
  • L3 Runtime points to an observed critical flow.
  • L4 Visual flow covers responsive and interaction states.
  • L5 CI reproducible shows that the checks run outside one workstation.
  • L6 Independent requires a second reviewer or agent to challenge the report.
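
As a rough illustration, the six levels could be modeled as an ordered enumeration. This is a sketch, not the tool's actual data model: the names EvidenceLevel and missing_levels are hypothetical, and it assumes each level is recorded independently rather than implying the levels below it.

```python
from enum import IntEnum


class EvidenceLevel(IntEnum):
    """The six evidence levels, ordered L1 to L6 (hypothetical names)."""
    STATIC = 1           # repository structure and source signals
    DETERMINISTIC = 2    # declared command execution was recorded
    RUNTIME = 3          # an observed critical flow
    VISUAL_FLOW = 4      # responsive and interaction states
    CI_REPRODUCIBLE = 5  # checks run outside one workstation
    INDEPENDENT = 6      # a second reviewer or agent challenged the report


def missing_levels(present: set[EvidenceLevel]) -> list[EvidenceLevel]:
    """Return the levels with no recorded evidence, lowest first.

    Assumption: levels are tracked independently, so L5 evidence
    does not imply that L4 evidence exists.
    """
    return [level for level in EvidenceLevel if level not in present]


if __name__ == "__main__":
    have = {EvidenceLevel.STATIC, EvidenceLevel.DETERMINISTIC, EvidenceLevel.RUNTIME}
    for level in missing_levels(have):
        print(f"L{level.value} {level.name} has no evidence")
```

Listing the gaps explicitly, instead of reporting only the highest level reached, keeps the report honest about what was not inspected.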

Four verdicts

  • READY means the required checks and evidence for the declared destination are present.
  • ALMOST READY means no P0 blocker was established, but a command failed or a required proof is still missing.
  • NOT READY means a release-blocking condition was found.
  • DEMO ONLY identifies code paths that still rely on explicit mock or demo behavior.
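
The four definitions can be read as a small decision rule. The sketch below is one possible reading, not the tool's implementation: the Findings fields are hypothetical, a P0 blocker is treated as the same thing as a release-blocking condition, and the order in which the checks run (blocker first, then demo behavior, then outstanding gaps) is an assumption, since the text does not state which verdict wins when several conditions hold.

```python
from dataclasses import dataclass


@dataclass
class Findings:
    """Hypothetical machine-readable fields; the real schema is not specified."""
    p0_blocker: bool = False              # a release-blocking condition was found
    relies_on_mock_or_demo: bool = False  # explicit mock or demo code paths remain
    failed_command: bool = False          # a declared command failed
    missing_required_proof: bool = False  # a required proof is still missing


def verdict(f: Findings) -> str:
    """Map findings to one of the four verdicts.

    Assumed precedence: NOT READY, then DEMO ONLY, then ALMOST READY.
    """
    if f.p0_blocker:
        return "NOT READY"
    if f.relies_on_mock_or_demo:
        return "DEMO ONLY"
    if f.failed_command or f.missing_required_proof:
        return "ALMOST READY"
    return "READY"


if __name__ == "__main__":
    print(verdict(Findings(failed_command=True)))
```

Computing the verdict from plain boolean fields means the summary and the verdict cannot drift apart, as long as both read the same record.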

Why scores are capped

Scores summarize evidence coverage; they are not a product grade. A UI app without visual proof cannot exceed 74. A high-stakes release without independent review cannot exceed 89. A secret candidate in release-relevant source caps the score at 59. These caps make unsupported confidence visible.

The report preserves every cap and its reason. The verdict is computed from the same machine-readable fields used by the summary.
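
The cap rules can be sketched as a function that clamps a raw score and keeps every triggered cap with its reason. The three limits (74, 89, 59) come from the text above; the function name, parameter names, and the CappedScore record are hypothetical. When several caps apply, the lowest one decides the final score, which follows from each cap being an upper bound.

```python
from dataclasses import dataclass, field


@dataclass
class CappedScore:
    raw: int    # score before caps
    final: int  # score after the lowest applicable cap
    caps: list[tuple[int, str]] = field(default_factory=list)  # (limit, reason)


def apply_caps(
    raw: int,
    *,
    ui_app: bool,
    has_visual_proof: bool,
    high_stakes: bool,
    has_independent_review: bool,
    secret_candidate: bool,
) -> CappedScore:
    """Clamp a raw score and record every cap that was triggered."""
    caps: list[tuple[int, str]] = []
    if ui_app and not has_visual_proof:
        caps.append((74, "UI app without visual proof"))
    if high_stakes and not has_independent_review:
        caps.append((89, "high-stakes release without independent review"))
    if secret_candidate:
        caps.append((59, "secret candidate in release-relevant source"))
    # Every cap is an upper bound, so the smallest limit wins.
    final = min([raw] + [limit for limit, _ in caps])
    return CappedScore(raw=raw, final=final, caps=caps)


if __name__ == "__main__":
    result = apply_caps(
        95,
        ui_app=True,
        has_visual_proof=False,
        high_stakes=True,
        has_independent_review=False,
        secret_candidate=False,
    )
    print(result.final)
    for limit, reason in result.caps:
        print(f"cap {limit}: {reason}")
```

A triggered cap is recorded even when the raw score is already below its limit, so the report still shows which evidence is missing.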

Known limits

  • Pattern matching can produce false positives.
  • A passing exit code shows that a command succeeded, not that every product promise holds.
  • Evidence notes can only be verified later if they point to durable artifacts.
  • A tool cannot independently certify itself.
  • The model does not measure product-market fit, revenue potential, or legal compliance.

Read the full evaluation method and scoring rubric.