How Tofu calculates its statistics

We use Bayesian inference instead of classical p-values because pre-launch experiments typically have small samples and need to be interpreted honestly. Here's exactly what we do.

The short version

We use Bayesian statistics to estimate the probability that one variant is truly better than another, based on the traffic and conversions your experiment has seen so far. We show you "probability to be best" instead of a classical p-value because it's both more intuitive and more robust at the small sample sizes that pre-launch experiments typically have.

The test we run

For every pair of variants we compare, we build a Beta-Binomial posterior with a uniform Beta(1, 1) prior (Laplace's rule of succession). Concretely, a variant with c conversions out of n visitors gets the posterior Beta(1 + c, 1 + n − c). This prior is deliberately non-informative: it's equivalent to "we're not assuming anything about your conversion rate until the data arrives."

The winning variant's probability shown in the UI is P(winner rate > runner-up rate), estimated from 5,000 Monte Carlo draws from the two variants' posteriors.
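
To make the calculation concrete, here is a minimal Python sketch of the Monte Carlo comparison described above. It is an illustration, not Tofu's actual code: the function name prob_a_beats_b, the seed parameter, and the example counts are all made up for this example. Only the Beta(1, 1) prior and the 5,000 draws come from the description above.

```python
import random


def prob_a_beats_b(conv_a, n_a, conv_b, n_b, draws=5000, seed=None):
    """Monte Carlo estimate of P(rate_a > rate_b) under uniform Beta(1, 1) priors.

    conv_* are conversion counts, n_* are visitor counts (hypothetical names).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each variant: Beta(1 + conversions, 1 + non-conversions).
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        if rate_a > rate_b:
            wins += 1
    return wins / draws


# Hypothetical counts: 30 of 200 visitors converted on A, 18 of 200 on B.
p = prob_a_beats_b(30, 200, 18, 200, seed=42)
print(f"P(A > B) is roughly {p:.3f}")
```

Because the estimate comes from random draws, it will wobble slightly from run to run unless the seed is fixed; with 5,000 draws the wobble is typically well under one percentage point.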

Credible intervals

The parenthetical range next to each conversion rate (e.g. 8.2% (4.1–14.6%)) is the 95% credible interval — the range within which we believe the true conversion rate lies, with 95% probability. Wide intervals mean your experiment needs more traffic.
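
As a rough illustration of how such an interval can be obtained, the sketch below draws samples from the Beta posterior and reads off the 2.5th and 97.5th percentiles. This assumes an equal-tailed interval estimated by sampling; the page does not say whether Tofu uses an equal-tailed or a highest-density interval, or whether it computes the interval analytically. The function name credible_interval and the example counts are hypothetical.

```python
import random


def credible_interval(conversions, visitors, level=0.95, draws=5000, seed=None):
    """Equal-tailed credible interval for a conversion rate (assumed method).

    Uses the Beta(1 + conversions, 1 + non-conversions) posterior that follows
    from a uniform Beta(1, 1) prior.
    """
    rng = random.Random(seed)
    samples = sorted(
        rng.betavariate(1 + conversions, 1 + visitors - conversions)
        for _ in range(draws)
    )
    tail = (1 - level) / 2
    lower = samples[int(tail * draws)]
    upper = samples[int((1 - tail) * draws) - 1]
    return lower, upper


# Hypothetical counts: the same 8% observed rate at two traffic levels.
small = credible_interval(8, 100, seed=7)
large = credible_interval(80, 1000, seed=7)
print(f"100 visitors:  {small[0]:.1%} to {small[1]:.1%}")
print(f"1000 visitors: {large[0]:.1%} to {large[1]:.1%}")
```

Running this shows the interval tightening as traffic grows, which is why a wide range is a sign that the experiment needs more visitors.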

When we call a winner

We only declare a clear winner when all three conditions hold:

  • Probability to be best ≥ 95%
  • At least 100 visitors per variant (enough to avoid false confidence from thin data)
  • Lift credible interval entirely above zero (we don't commit when "no effect" is still plausible)

When any of these fails, we'll still show you which variant is leading — but we'll be explicit that the data is preliminary.
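
The three conditions can be sketched as a single check. This is an illustrative reading of the rule, not Tofu's implementation: the function name call_winner and the example counts are hypothetical, and the lift is taken here as the absolute difference in conversion rates with an equal-tailed 95% interval. The page does not define lift precisely, but the sign of the lower bound is the same for absolute and relative lift, so the "entirely above zero" check is unaffected by that choice.

```python
import random


def call_winner(conv_w, n_w, conv_r, n_r, draws=5000, seed=None):
    """Apply the three 'clear winner' conditions to a leader and a runner-up.

    conv_w/n_w are the leader's conversions and visitors; conv_r/n_r are the
    runner-up's. Returns (is_clear_winner, per-condition results).
    """
    rng = random.Random(seed)
    # Posterior draws of (leader rate - runner-up rate), Beta(1, 1) priors.
    diffs = sorted(
        rng.betavariate(1 + conv_w, 1 + n_w - conv_w)
        - rng.betavariate(1 + conv_r, 1 + n_r - conv_r)
        for _ in range(draws)
    )
    prob_best = sum(d > 0 for d in diffs) / draws
    lift_lower = diffs[int(0.025 * draws)]  # lower end of the 95% interval
    checks = {
        "prob_best_at_least_95": prob_best >= 0.95,
        "min_100_visitors_each": min(n_w, n_r) >= 100,
        "lift_interval_above_zero": lift_lower > 0,
    }
    return all(checks.values()), checks


# Hypothetical counts: plenty of traffic and a large gap.
print(call_winner(60, 400, 30, 400, seed=0))
# Hypothetical counts: a large gap but too few visitors to commit.
print(call_winner(6, 40, 1, 40, seed=0))
```

Returning the per-condition results alongside the verdict makes it easy to say which condition blocked a call, which matches the idea of still showing a leader while marking the data as preliminary.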

Randomization check (SRM)

The badge in the header runs a chi-square test on the actual traffic split vs. the configured allocation. If traffic is landing unevenly (say, 320 visitors on variant A and 110 on variant B when they were supposed to get 50/50), the test fails and we flag the experiment. Broken randomization makes any comparison meaningless, so we refuse to recommend a winner until it's fixed.
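
For the 320 vs. 110 example, the check can be worked through with a few lines of Python. This is a sketch under stated assumptions: the function name srm_check is hypothetical, it handles only the two-variant case (one degree of freedom, where the chi-square p-value reduces to a complementary error function), and the 0.001 cutoff is a commonly used SRM threshold rather than a value stated on this page.

```python
import math


def srm_check(observed, allocation, alpha=0.001):
    """Chi-square goodness-of-fit test for a two-variant traffic split.

    observed:   visitor counts per variant, e.g. [320, 110]
    allocation: configured shares summing to 1, e.g. [0.5, 0.5]
    alpha:      assumed significance cutoff (not specified by the source)
    """
    if len(observed) != 2 or len(allocation) != 2:
        raise ValueError("this sketch only covers two variants")
    total = sum(observed)
    expected = [total * share for share in allocation]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With 1 degree of freedom, P(X >= chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value, p_value >= alpha


chi2, p_value, passed = srm_check([320, 110], [0.5, 0.5])
print(f"chi-square = {chi2:.2f}, p = {p_value:.2e}, passed = {passed}")
```

With 430 visitors and a 50/50 allocation, each variant should have received about 215. The observed split yields a chi-square statistic of roughly 102.6, far beyond what chance would produce, so the check fails.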

Want to go deeper?

Evan Miller's intro to Bayesian A/B testing

A canonical, plain-English reference for the same framework we use.