The Bayesian Average: Why Your Rating System Is Probably Lying to You

  • Posted on June 14, 2026
  • Author: Elosiuba Favour

You open an app and see a restaurant rated **9.8 out of 10**. Impressive. You scroll down and notice it only has **two reviews**. Suddenly, that 9.8 means nothing.

This is the central problem that the Bayesian average was designed to solve, and once you understand it, you will never look at a rating system the same way again.

---

## The Problem With Plain Averages

The arithmetic mean is the most natural way to calculate a score. Add everything up, divide by the count. Simple, honest, intuitive.

But it has a catastrophic weakness: **it treats a score from 2 people with the same authority as a score from 2,000 people.**

Consider two neighbourhoods on a livability app:

| Neighbourhood | Reviews | Avg Security Score |
|---|---|---|
| Maitama | 1 | 10 / 10 |
| Garki | 340 | 7.4 / 10 |

A plain average would rank Maitama above Garki. But any reasonable person knows that one person's opinion, however enthusiastic, should not outrank 340 residents' lived experience. The average is not wrong mathematically. It is just **statistically meaningless** at that sample size.

This small-sample issue is the rating-side face of the **cold start problem**, and it appears everywhere: new restaurants, new products, new neighbourhoods, new employees being rated on a platform.

---

## Enter the Bayesian Average

The Bayesian average is a technique borrowed from Bayesian statistics, a branch of probability theory that formalises the idea of **updating your beliefs as new evidence arrives.**

The core insight is this:

> Before you have enough data about a specific item, your best guess is the global average. As more data arrives, you gradually trust the item's own score more and the global average less.

It blends two things together:
- What the **world** looks like on average (the prior)
- What **this specific item's** reviewers are saying (the evidence)

As evidence accumulates, the item's own score takes over. With very little evidence, the score is pulled toward the safe, non-misleading global mean.

---

## The Formula

```
Bayesian Score = (C × m + Σx) / (C + n)
```

Where:

- **n**: the number of reviews for this item
- **Σx**: the sum of all scores for this item
- **m**: the global mean score across all items on your platform
- **C**: the confidence threshold: how many reviews you need before you trust an item's own score (typically set to the average number of reviews per item). It behaves like C imaginary reviews that each score exactly m, added to every item before its real reviews are counted.
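The formula translates almost directly into code. The following is a minimal sketch in Python; the function name `bayesian_average` and its parameter names are illustrative choices, not part of any library.

```python
def bayesian_average(n: int, sum_x: float, m: float, c: float) -> float:
    """Blend an item's own scores with the global mean.

    n      -- number of reviews for this item
    sum_x  -- sum of this item's scores
    m      -- global mean score across the platform
    c      -- confidence threshold (weight given to the global mean)
    """
    if n < 0 or c < 0:
        raise ValueError("n and c must be non-negative")
    if n + c == 0:
        raise ValueError("need at least one review or a positive c")
    return (c * m + sum_x) / (c + n)


# With no reviews at all, the score is exactly the global mean.
print(bayesian_average(n=0, sum_x=0.0, m=6.5, c=50))
```

The guard clauses are a design choice for this sketch: with `n = 0` and `c = 0` the formula would divide by zero, so the function refuses that input rather than guessing.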

Let us walk through a real example.

---

## Worked Example: Neighbourhood Livability Scores

Suppose your platform has the following global data for security scores:

- Global mean score (**m**) = **6.5**
- Average reviews per neighbourhood (**C**) = **50**

Now compare two neighbourhoods:

**Neighbourhood A: Wuse II**
- Reviews (n): 2
- Sum of scores (Σx): 19 (both reviewers gave 9.5)
- Plain average: **9.5**

```
Bayesian Score = (50 × 6.5 + 19) / (50 + 2)
= (325 + 19) / 52
= 344 / 52
≈ 6.62
```

**Neighbourhood B: Garki**
- Reviews (n): 340
- Sum of scores (Σx): 2,516
- Plain average: **7.4**

```
Bayesian Score = (50 × 6.5 + 2516) / (50 + 340)
= (325 + 2516) / 390
= 2841 / 390
≈ 7.28
```

**Final ranking:**

| Neighbourhood | Plain Avg | Bayesian Avg | Reviews |
|---|---|---|---|
| Wuse II | 9.5 | 6.62 | 2 |
| Garki | 7.4 | 7.28 | 340 |

Garki now ranks higher, which is almost certainly the more honest result. Wuse II's score is not dismissed; it is simply held in check until enough residents weigh in.
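Another way to read this result is to rewrite the formula as a weighted average of the item's plain average and the global mean, where the weight on the item's own reviews is n / (C + n). The sketch below uses the example's numbers; the helper names `shrinkage_weight` and `bayesian_average` are made up for illustration.

```python
def shrinkage_weight(n: int, c: float) -> float:
    """Fraction of the final score that comes from the item's own reviews."""
    return n / (c + n)


def bayesian_average(n: int, plain_avg: float, m: float, c: float) -> float:
    """Equivalent weighted form: w * item mean + (1 - w) * global mean."""
    w = shrinkage_weight(n, c)
    return w * plain_avg + (1 - w) * m


M, C = 6.5, 50  # global mean and confidence threshold from the example
for name, n, plain_avg in [("Wuse II", 2, 9.5), ("Garki", 340, 7.4)]:
    w = shrinkage_weight(n, C)
    score = bayesian_average(n, plain_avg, M, C)
    print(f"{name}: weight on own reviews = {w:.1%}, score = {score:.2f}")
```

With C = 50, Wuse II's two reviews account for roughly 4% of its score, while Garki's 340 reviews account for roughly 87% of its own. That is why Garki's score barely moves and Wuse II's is pulled most of the way to the global mean.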

---

## How C (The Confidence Threshold) Changes Everything

The value of **C** is the most important tuning parameter in the formula. It controls how aggressively you pull weak scores toward the mean.

Using Wuse II (n=2, Σx=19, m=6.5):

| C value | Bayesian Score | Interpretation |
|---|---|---|
| C = 10 | 7.00 | Trusts new items quickly |
| C = 50 | 6.62 | Balanced (recommended) |
| C = 200 | 6.53 | Very conservative, slow to diverge |

A **low C** is appropriate when your items are homogeneous and a few reviews are genuinely representative. A **high C** is appropriate when you expect high variance and manipulation attempts, such as on a public platform where bad actors might try to game scores.

A common default is to set **C equal to the average number of reviews per item**, because it scales naturally with how densely your platform is reviewed.
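To get a feel for the parameter, it helps to sweep C for both a sparsely reviewed and a heavily reviewed item. The sketch below reuses the two example neighbourhoods; the `sweep` helper and the chosen C values are illustrative assumptions.

```python
def bayesian_average(n: int, sum_x: float, m: float, c: float) -> float:
    return (c * m + sum_x) / (c + n)


M = 6.5  # global mean from the worked example
ITEMS = {"Wuse II": (2, 19.0), "Garki": (340, 2516.0)}  # name -> (n, sum of scores)


def sweep(c_values):
    """Return {name: [score at each C, rounded to 2 decimals]}."""
    return {
        name: [round(bayesian_average(n, sum_x, M, c), 2) for c in c_values]
        for name, (n, sum_x) in ITEMS.items()
    }


for name, scores in sweep([10, 50, 200]).items():
    print(name, scores)
```

Running it shows the asymmetry that makes C a tuning knob rather than a blunt penalty: the two-review item swings strongly with C, while the 340-review item moves much less.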

---

## Where Bayesian Averages Are Already Used

You encounter Bayesian (or Bayesian-adjacent) scoring constantly, even if it is not labelled as such:

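**IMDb Movie Rankings**: IMDb's famous Top 250 is the best-known example. IMDb has in the past published a weighted-rating formula of exactly this shape, blending each film's average with the mean across all films. Under such a formula, a film with 100,000 votes rating 8.1 will rank above a film with 50 votes rating 9.9.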

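**Amazon Product Ratings**: Amazon does not display a plain average. Its overall star rating comes from a model that weighs factors such as how recent a review is and whether it is a verified purchase. That is not a Bayesian average as such, but it reflects the same refusal to take a raw mean at face value.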

**Yelp**: Yelp's ranking and review-filtering systems are proprietary, so the details are not public, but they aim at the same goal: preventing new restaurants with one five-star review from topping neighbourhood lists.

**Chess and Gaming Elo**: The Elo rating system in chess and competitive gaming (named after Arpad Elo, so not an acronym) works in a similar spirit to a Bayesian update: after every match, both players' ratings move based on the expected vs. actual outcome. Many implementations also shrink the size of each update (the K-factor) as a player's rating becomes more established.
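For comparison, the standard Elo update can be sketched in a few lines. The logistic expected-score formula with a 400-point scale is the usual one; the default `k=20.0` below is an assumption for illustration, since K-factors vary between federations and games.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability-like expected score for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 20.0):
    """Return both players' new ratings.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Two equally rated players: the winner gains k / 2 points, the loser drops k / 2.
print(elo_update(1500, 1500, score_a=1.0))
```

Lowering `k` for established players is what makes their ratings move less per game, which mirrors how a larger n makes a Bayesian average harder to shift.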

---

## The Bayesian Average vs. Other Approaches

**Wilson Score Interval**: Used by Reddit's "best" comment ranking. Rather than blending with a global mean, it calculates a statistical confidence interval around the observed proportion of positive votes and ranks by the lower bound. It is more mathematically rigorous for up/down votes, but it does not map directly onto 1-to-10 scores and is harder to explain to stakeholders.
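As a rough illustration, the lower bound of the Wilson interval for up/down votes can be computed as follows. This is a sketch of the textbook formula, not Reddit's actual code; the function name and the default `z = 1.96` (about 95% confidence) are assumptions.

```python
import math


def wilson_lower_bound(positive: int, total: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a proportion of positive votes."""
    if total == 0:
        return 0.0  # no votes: rank at the bottom
    p = positive / total
    z2 = z * z
    centre = p + z2 / (2 * total)
    margin = z * math.sqrt((p * (1 - p) + z2 / (4 * total)) / total)
    return (centre - margin) / (1 + z2 / total)


# One upvote out of one versus ninety out of a hundred.
print(wilson_lower_bound(1, 1), wilson_lower_bound(90, 100))
```

The single perfect vote ends up with a much lower bound than the 90%-positive item with 100 votes, which is the same small-sample caution the Bayesian average applies, reached by a different route.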

**Time-Decay Weighting**: Reviews from 90 days ago matter less than reviews from last week. You can combine time-decay with Bayesian averaging:

```
weight = e^(-λ × days_ago)
```

Older reviews get an exponentially smaller weight before they enter the Bayesian formula. This is essential for livability scores, where a neighbourhood's security situation can change rapidly.
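One plausible way to combine the two, sketched below, is to replace n with the sum of the weights and Σx with the weighted sum of scores. That substitution is an assumption of this example rather than a canonical definition, and the function and parameter names are hypothetical.

```python
import math


def decayed_bayesian_average(reviews, m: float, c: float, decay_rate: float) -> float:
    """Bayesian average where each review counts as exp(-decay_rate * days_ago) reviews.

    reviews    -- iterable of (score, days_ago) pairs
    decay_rate -- the lambda in the weight formula; ln(2) / decay_rate is the half-life in days
    """
    weighted_sum = 0.0    # stands in for the sum of scores
    weighted_count = 0.0  # stands in for n (an "effective" review count)
    for score, days_ago in reviews:
        w = math.exp(-decay_rate * days_ago)
        weighted_sum += w * score
        weighted_count += w
    return (c * m + weighted_sum) / (c + weighted_count)


half_life_30_days = math.log(2) / 30
recent = decayed_bayesian_average([(10.0, 0)], m=6.5, c=5, decay_rate=half_life_30_days)
stale = decayed_bayesian_average([(10.0, 90)], m=6.5, c=5, decay_rate=half_life_30_days)
print(recent, stale)
```

Because old reviews shrink the effective count as well as the sum, an item whose reviews are all stale drifts back toward the global mean. Note that C is then compared against effective counts, so it may need retuning.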

**Minimum Threshold Display**: The simplest alternative: do not show a score at all until you have at least N reviews. This avoids misleading averages but creates a cold-start UX problem, because users see nothing instead of an honest estimate.

For most applications the Bayesian average is the better default among these because it is continuous (no hard cutoff), explainable ("we blend your reviews with the platform average until you have enough data"), and easy to compute in SQL.

---

## Implementing It in SQL

For a livability platform, the whole calculation fits in a single query (PostgreSQL syntax):

```sql
-- Step 1: Calculate global constants over the rolling window
WITH global_stats AS (
    SELECT
        AVG(security_score) AS m_security,
        AVG(power_score)    AS m_power,
        AVG(water_score)    AS m_water,
        -- C = average number of reviews per neighbourhood
        COUNT(*) * 1.0 / COUNT(DISTINCT neighbourhood) AS C
    FROM area_reviews
    WHERE created_at >= NOW() - INTERVAL '90 days'
),

-- Step 2: Aggregate per neighbourhood
neighbourhood_stats AS (
    SELECT
        neighbourhood,
        COUNT(*)            AS review_count,
        SUM(security_score) AS sum_security,
        SUM(power_score)    AS sum_power,
        SUM(water_score)    AS sum_water
    FROM area_reviews
    WHERE created_at >= NOW() - INTERVAL '90 days'
    GROUP BY neighbourhood
)

-- Step 3: Apply Bayesian formula
-- The ::numeric casts keep ROUND(value, 1) valid if the score columns are floating point.
SELECT
    n.neighbourhood,
    n.review_count,
    ROUND(
        ((g.C * g.m_security + n.sum_security) / (g.C + n.review_count))::numeric, 1
    ) AS security_score,
    ROUND(
        ((g.C * g.m_power + n.sum_power) / (g.C + n.review_count))::numeric, 1
    ) AS power_score,
    ROUND(
        ((g.C * g.m_water + n.sum_water) / (g.C + n.review_count))::numeric, 1
    ) AS water_score
FROM neighbourhood_stats n
CROSS JOIN global_stats g
ORDER BY security_score DESC;
```

No external libraries. No machine learning pipeline. Just arithmetic that is statistically honest.
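One way to sanity-check the query's arithmetic without a database server is to run a cut-down version against an in-memory SQLite database from Python. The sketch below makes several simplifying assumptions: it covers only the security category, drops the 90-day filter (the `NOW() - INTERVAL` syntax is PostgreSQL-specific), and uses invented sample rows, including a made-up third neighbourhood called "Sample Area".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE area_reviews (neighbourhood TEXT, security_score REAL)")

# Invented sample data: 2 reviews, 340 reviews, and 100 reviews respectively.
rows = (
    [("Wuse II", 9.5)] * 2
    + [("Garki", 7.4)] * 340
    + [("Sample Area", 5.0)] * 100
)
conn.executemany("INSERT INTO area_reviews VALUES (?, ?)", rows)

query = """
WITH global_stats AS (
    SELECT AVG(security_score) AS m,
           COUNT(*) * 1.0 / COUNT(DISTINCT neighbourhood) AS c
    FROM area_reviews
),
neighbourhood_stats AS (
    SELECT neighbourhood, COUNT(*) AS n, SUM(security_score) AS sum_x
    FROM area_reviews
    GROUP BY neighbourhood
)
SELECT s.neighbourhood, s.n, (g.c * g.m + s.sum_x) / (g.c + s.n) AS score
FROM neighbourhood_stats s
CROSS JOIN global_stats g
ORDER BY score DESC
"""
results = conn.execute(query).fetchall()
for neighbourhood, n, score in results:
    print(f"{neighbourhood}: n={n}, score={score:.2f}")
```

Here m and C are derived from the sample rows themselves, so the scores differ from the worked example above, which fixed m = 6.5 and C = 50 by hand. The ordering still comes out the same way: the heavily reviewed neighbourhood outranks the one with two glowing reviews.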

---

## Common Mistakes When Implementing

**Mistake 1: Hardcoding C.**
If your platform grows from 10 to 10,000 neighbourhoods, the typical number of reviews per neighbourhood will shift, and a hardcoded C will drift out of step with it. Compute it dynamically as shown in the SQL above.

**Mistake 2: Forgetting to gate the display.**
Even with Bayesian averaging, showing "6.5 / 10, based on 0 reviews" looks broken. Set a minimum display threshold of 1–3 reviews; below that, show "Not enough data yet."
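A display gate like this can be a few lines in the presentation layer. The sketch below is one possible shape; the function name, the default threshold of 3, and the label format are all assumptions.

```python
def display_score(score: float, review_count: int, min_reviews: int = 3) -> str:
    """Return the label to show users, hiding scores backed by too few reviews."""
    if review_count < min_reviews:
        return "Not enough data yet"
    noun = "review" if review_count == 1 else "reviews"
    return f"{score:.1f} / 10 (based on {review_count} {noun})"


print(display_score(6.5, 0))
print(display_score(7.28, 340))
```

The Bayesian score is still computed and can still be used for sorting; the gate only controls what is shown to the user.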

**Mistake 3: Not separating categories.**
Each score category (security, power, water) needs its own global mean **m**. Power outages are rated differently from security perceptions. Mixing them into one global mean distorts all scores.

**Mistake 4: Ignoring time windows.**
A neighbourhood that was dangerous two years ago but has improved should not be penalised forever. Always filter reviews to a rolling window (30, 60, or 90 days) before computing **m**, **n**, and **Σx**.

---

## The Deeper Principle

The Bayesian average is really a formalisation of something humans already do intuitively.

When a friend you have known for a decade says a restaurant is bad, you believe them almost immediately. When a stranger on the internet says the same thing, you want a second opinion, a third, a tenth. You are naturally applying a prior belief (people's food tastes vary), and you are updating it based on evidence (reviews), weighted by trust (familiarity, history, sample size).

Bayesian statistics simply makes this instinct precise and computable.

The formula tells your rating system: **"Be humble about what you don't know yet. Trust the crowd, not the individual. And update your beliefs as evidence grows."**

That is not just good mathematics. For any platform built on community trust, it is good ethics.

---

*The Bayesian average is most powerful not when you have millions of data points, but precisely when you don't: in the early, fragile, data-sparse moments when a wrong score could mislead a real person making a real decision.*