Outlier Detection in Python: Five Methods and Their Failure Modes

Half of the analytics bugs we see on client engagements trace to a single root cause: one bad data point that shouldn’t be in the analysis but is. A sensor that briefly returned −999 instead of NULL, a power user who logged 14,000 events in one session because their browser tab was stuck reloading, a typo that converted “44 mpg” to “212 mpg” before the row made it into the warehouse.

In This Article

  1. Method 1: describe() and the histogram
  2. Method 2: Box plot (Tukey’s fences)
  3. Method 3: Percentile cutoffs
  4. Method 4: Hampel filter
  5. Method 5: Grubbs / Rosner ESD tests
  6. Which method to actually use
  7. A note on scipy and modern alternatives
  8. Takeaway

The fix isn’t a single magic method. Each outlier-detection technique answers a slightly different question, and using the wrong one buys you a clean-looking dataset with the actual outlier still in it. This article walks through the five methods we reach for in Python: what they assume, how to call them, and where they quietly fail.

We’ll use the mpg dataset (fuel economy for 234 cars, one row per model configuration, available as a CSV on GitHub) throughout. It’s small, public, and has enough natural spread that the methods give different answers, which is the whole point.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv('mpg.csv')
df.head()

Method 1: describe() and the histogram

The cheapest outlier check is the one you should always run before anything else.

df.describe()

This prints count, mean, std, min, 25%, 50%, 75%, max for every numeric column. If max is wildly larger than the 75th percentile, or min is wildly smaller than the 25th percentile, you have a suspect. For hwy (highway mpg), the values run from 12 to 44 with a 75th percentile of 27 — nothing crazy.
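The "max is wildly larger than the 75th percentile" check can be turned into a rough screen across all numeric columns. This is a minimal sketch, not a standard test: the function name describe_screen, the ratio threshold of 3, and the demo data are all hypothetical choices for illustration.

```python
import numpy as np
import pandas as pd

def describe_screen(frame, ratio=3.0):
    """Flag numeric columns whose min or max sits far outside the quartiles.

    `ratio` is an arbitrary illustrative threshold: a column is suspect when
    (max - Q3) or (Q1 - min) exceeds ratio * IQR.
    """
    summary = frame.describe().T  # one row per numeric column
    iqr = summary["75%"] - summary["25%"]
    report = pd.DataFrame({
        "upper_gap": (summary["max"] - summary["75%"]) / iqr,
        "lower_gap": (summary["25%"] - summary["min"]) / iqr,
    })
    report["suspect"] = (report[["upper_gap", "lower_gap"]] > ratio).any(axis=1)
    return report

# Hypothetical data: one clean column and one with a typo-style value.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "clean": rng.normal(25, 5, 200),
    "typo": np.append(rng.normal(25, 5, 199), 212.0),
})
report = describe_screen(demo)
print(report)
```

A column with zero IQR produces infinite or NaN gaps, so treat this as a prompt to look closer rather than as a verdict.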

A histogram tells you the same story visually and catches a different failure mode: multi-modal data that describe() flattens into a single mean.

df.hwy.plot(kind='hist', bins=20, color='grey', alpha=0.5)
plt.show()

What it catches: Obvious outliers (the 212-mpg row from a typo), gross data-quality issues (a column of all zeros), bimodal distributions that should be split before analysis.

Where it fails: Subtle outliers inside the bulk of the distribution. A point at the 99th percentile of a normally distributed column is “extreme” in some sense, but it won’t show up as a spike in a 20-bin histogram, and describe() gives no hint of it because the max still looks plausible next to the 75th percentile.

Same hwy column, four methods applied to it. Box plot catches 6 points; percentile catches 11; Hampel catches 0; Grubbs catches the single planted 212-mpg outlier. Methods disagree by design.

Method 2: Box plot (Tukey’s fences)

The classic textbook method. A box plot identifies any point outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR] as an outlier, where IQR is Q3 − Q1.

_, bp = df.hwy.plot.box(return_type='both')
plt.show()

outliers = [flier.get_ydata() for flier in bp['fliers']][0]
df[df.hwy.isin(outliers)].head()

This grabs the actual outlier values from the box plot object and pulls the matching rows from the dataframe. For hwy, the 1.5·IQR rule flags six points — five hybrid sedans at 41–44 mpg and one at the bottom end.
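If you don't need the plot, the same fences can be computed directly from the quartiles. The sketch below uses hypothetical helper names (tukey_fences, tukey_outliers) and a made-up ten-value series. pandas' default linear interpolation for quantiles should agree with matplotlib's box plot in most cases, but we'd treat that as an assumption to verify on your own data.

```python
import pandas as pd

def tukey_fences(series, k=1.5):
    """Return (lower, upper) fences at Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def tukey_outliers(series, k=1.5):
    """Return the values that fall outside the fences."""
    lower, upper = tukey_fences(series, k)
    return series[(series < lower) | (series > upper)]

# Hypothetical values, not the mpg data.
values = pd.Series([18, 20, 21, 22, 24, 24, 26, 27, 29, 44])
lower, upper = tukey_fences(values)
print(lower, upper)            # fences
print(tukey_outliers(values))  # rows outside the fences
```

Working from the fences rather than the plot object also keeps the row index, so there is no need for the isin() lookup.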

What it catches: Anything heavier-tailed than a normal distribution would predict. The 1.5·IQR threshold is calibrated for normality; if your data is normal, it flags ~0.7% of points. If your data is skewed, it’ll flag more — but those points may not actually be wrong.

Where it fails: Heavily skewed data. Revenue per customer, session duration, log files — anything with a real heavy tail — will have the box-plot method flag dozens of legitimate large values as “outliers” because the IQR was calibrated against the bulk. We’ve seen analysts strip “outliers” from LTV data using .boxplot() and then wonder why the top-decile customers disappeared.

The fix when your data is skewed: apply a log() transformation first, then box-plot. Or use a different method.
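A rough illustration of why the log transform helps, using simulated log-normal "revenue" (hypothetical data, and iqr_flags is a name made up for this sketch). It assumes strictly positive values, since log() is undefined at zero and below.

```python
import numpy as np
import pandas as pd

def iqr_flags(series, k=1.5):
    """Boolean mask of points outside the 1.5*IQR fences."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Simulated heavy-tailed data: every value here is "legitimate".
rng = np.random.default_rng(42)
revenue = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=5000))

raw_flags = iqr_flags(revenue)          # fences on the raw scale
log_flags = iqr_flags(np.log(revenue))  # fences on the log scale

print(f"flagged on raw scale: {raw_flags.mean():.1%}")
print(f"flagged on log scale: {log_flags.mean():.1%}")
```

On the raw scale the rule flags a large slice of the upper tail. On the log scale the data is roughly normal again and the flag rate drops back toward the sub-1% range the rule was calibrated for.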

Method 3: Percentile cutoffs

A more honest version of the box plot for non-normal data: just pick a percentile and call everything outside it an outlier. No assumption about distribution shape, just a flat “the bottom and top 2.5% are suspect.”

lower = df.hwy.quantile(0.025)
upper = df.hwy.quantile(0.975)
df[(df.hwy < lower) | (df.hwy > upper)]

This is the method that survives on real-world dashboards. It’s not statistically principled — picking 2.5% / 97.5% is a convention, not a theorem — but it’s robust to whatever distribution you actually have and the cutoffs are reproducible across reruns.

What it catches: Anything at the tails of the empirical distribution, regardless of shape.

Where it fails: Two things. First, it always flags exactly the same fraction of your data — if 5% of your column is legitimately tail values, you’ll discard 5% of legitimate data. Second, it doesn’t tell you which of the flagged points is statistically anomalous. If you have one point that’s 40 standard deviations from the median and ten that are 2σ out, this method treats them identically.

In practice we use this for dashboards where the goal is “trim the worst of the noise to make the median visible,” not for hypothesis testing.
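Both failure modes show up in a small simulation. The data is synthetic and percentile_flags is a hypothetical helper. The point is only that the flag count is fixed by the cutoffs, not by how bad the data is.

```python
import numpy as np
import pandas as pd

def percentile_flags(series, lo=0.025, hi=0.975):
    """Boolean mask of points outside the [lo, hi] empirical quantiles."""
    lower, upper = series.quantile([lo, hi])
    return (series < lower) | (series > upper)

rng = np.random.default_rng(7)
clean = pd.Series(rng.normal(50, 10, 1000))  # no bad points at all
with_glitch = clean.copy()
with_glitch.iloc[0] = 5000.0                 # one absurd value

n_clean = int(percentile_flags(clean).sum())
n_glitch = int(percentile_flags(with_glitch).sum())
print(n_clean, n_glitch)  # same count either way
```

The clean series loses 5% of its rows even though nothing is wrong with it, and the glitch lands in the same flagged bucket as ordinary tail values with nothing to mark it as worse.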

Method 4: Hampel filter

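The Hampel filter compares each value to the median and the median absolute deviation (MAD). The MAD is scaled by 1.4826 so that it estimates the standard deviation for normal data, and anything more than 3 scaled MADs from the median gets replaced with NaN. It’s the median-based cousin of “drop points more than 3 standard deviations from the mean,” but more robust because the median doesn’t get pulled by extreme values the way the mean does. The version here uses one global median; the classic filter applies the same rule inside a sliding window.

def hampel(series, threshold=3, scale=1.4826):
    # Robust center and spread: median and median absolute deviation (MAD)
    median = series.median()
    diff = (series - median).abs()
    # scale * MAD estimates the standard deviation when the data is normal
    mad = scale * diff.median()
    mask = diff > threshold * mad
    # Cast to float so an integer column can hold NaN
    out = series.astype(float)
    out[mask] = np.nan
    return out

cleaned = hampel(df.hwy)
cleaned.isna().sum()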

On the unmodified mpg data this returns zero NaNs — the Hampel filter doesn’t see anything strange about the highway-mpg column at its native spread.

What it catches: True statistical outliers that survive a robust comparison. Good for time series where you want to drop sensor glitches without being fooled by the trend.

Where it fails: Datasets with very small spread (e.g., a sensor reading the same value to three decimal places) collapse the MAD to near zero, and then every deviation looks like 100·MAD. If more than half the values are identical, the MAD is exactly zero and every point that differs from the median gets flagged. It also fails on multi-modal data the same way the box plot does: it sees the bulk and flags the second mode.
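For time series with a trend, a single global median isn't enough, because the trend itself inflates the MAD. Below is a sketch of a windowed variant. The names rolling_hampel and min_scale, the window of 11, and the simulated signal are assumptions for illustration; min_scale is a hypothetical floor on the local scale to soften the near-zero-MAD problem described above.

```python
import numpy as np
import pandas as pd

def _mad(window_values):
    # Median absolute deviation of one window (raw NumPy array)
    return np.median(np.abs(window_values - np.median(window_values)))

def rolling_hampel(series, window=11, threshold=3.0, min_scale=0.0):
    """Windowed Hampel sketch: NaN out points far from their local median."""
    roll = series.rolling(window, center=True, min_periods=1)
    local_median = roll.median()
    # 1.4826 * MAD estimates the standard deviation for normal data
    local_scale = (1.4826 * roll.apply(_mad, raw=True)).clip(lower=min_scale)
    mask = (series - local_median).abs() > threshold * local_scale
    cleaned = series.astype(float)
    cleaned[mask] = np.nan
    return cleaned, mask

# Simulated sensor: steady upward trend, small noise, one glitch at t=100.
rng = np.random.default_rng(1)
t = np.arange(200)
signal = pd.Series(0.5 * t + rng.normal(0, 0.2, t.size))
signal.iloc[100] += 50

cleaned, mask = rolling_hampel(signal)

# Same rule with a single global median, for comparison
global_dev = (signal - signal.median()).abs()
global_mask = global_dev > 3 * 1.4826 * global_dev.median()

print("windowed flags glitch:", bool(mask.iloc[100]))
print("global flags glitch:  ", bool(global_mask.iloc[100]))
```

The global version misses the glitch because the trend stretches the overall spread. The windowed version only compares each point to its neighbors.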

Method 5: Grubbs / Rosner ESD tests

The previous methods all answer “which points are unusual?” Grubbs and Rosner answer a different question: “is there an outlier here, statistically?” They compute a test statistic, compare to a critical value, and return a yes/no.

Grubbs is for a single outlier. Rosner’s ESD generalizes to up to k outliers.

from scipy import stats

def grubbs_statistic(values):
    # Grubbs uses the sample standard deviation, hence ddof=1
    std = np.std(values, ddof=1)
    mean = np.mean(values)
    abs_dev = np.abs(values - mean)
    max_idx = np.argmax(abs_dev)
    g_calc = abs_dev[max_idx] / std
    return g_calc, max_idx

def grubbs_critical(n, alpha=0.05):
    # Two-sided critical value from the t distribution with n - 2 degrees of freedom
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    return (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))

def is_outlier(values, alpha=0.05):
    g, idx = grubbs_statistic(values)
    g_crit = grubbs_critical(len(values), alpha)
    return g > g_crit, values[idx], g, g_crit

# Plant a synthetic outlier and test
df.loc[34, 'hwy'] = 212

flag, val, g, g_crit = is_outlier(df.hwy.values)
print(f"Outlier: {flag} (value={val}, G={g:.2f}, critical={g_crit:.2f})")
# → Outlier: True (value=212, G=13.72, critical=3.65)

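The loop below is a simplified, sequential take on Rosner’s ESD: it runs Grubbs repeatedly, removing the worst point each round, up to a user-specified max_outliers, and stops at the first round that isn’t significant. The full generalized ESD test instead runs all max_outliers rounds and reports the largest round that is still significant, which protects against one outlier masking another.

def rosner_esd(values, alpha=0.05, max_outliers=3):
    series = values.copy()
    found = []
    for _ in range(max_outliers):
        flag, val, g, g_crit = is_outlier(series, alpha)
        print(f"G={g:.2f}, critical={g_crit:.2f}, value={val} → {'outlier' if flag else 'not'}")
        if not flag:
            break
        found.append(val)
        # Drop the point with the largest deviation before the next round
        series = np.delete(series, np.argmax(np.abs(series - np.mean(series))))
    return found

rosner_esd(df.hwy.values, max_outliers=3)
# Prints one line per round (abbreviated here):
# value=212 → outlier
# value=44  → not (the loop stops here)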

What it catches: A clean statistical confirmation that a specific point is unlikely under the null hypothesis of normality. Good for QA pipelines that need to log “this row tripped a 5% significance threshold.”

Where it fails: Assumes the underlying data is normal. If your data is log-normal, exponential, or just heavy-tailed for a real reason, Grubbs will keep flagging legitimate tail values as outliers test after test. The Rosner extension makes this worse because it’s iterative — strip the top point, the next point looks like an outlier, strip it, the next one looks like an outlier, and so on until the test is computing significance against an artificially compressed distribution.
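For comparison, here is a sketch of the generalized ESD procedure with its "largest significant round" rule, as we understand Rosner's formulation. The function name generalized_esd and the twelve-value sample are hypothetical, and the normality assumption applies just as it does for Grubbs.

```python
import numpy as np
from scipy import stats

def generalized_esd(values, max_outliers=3, alpha=0.05):
    """Return original indices of detected outliers, in removal order."""
    x = np.asarray(values, dtype=float)
    n = x.size
    remaining = np.arange(n)  # original indices still in play
    removed = []
    n_outliers = 0
    for i in range(1, max_outliers + 1):
        sub = x[remaining]
        dev = np.abs(sub - sub.mean())
        j = int(np.argmax(dev))
        r_i = dev[j] / sub.std(ddof=1)  # test statistic for round i
        # Critical value for round i (two-sided)
        p = 1 - alpha / (2 * (n - i + 1))
        t = stats.t.ppf(p, n - i - 1)
        lam_i = (n - i) * t / np.sqrt((n - i - 1 + t**2) * (n - i + 1))
        removed.append(int(remaining[j]))
        remaining = np.delete(remaining, j)
        if r_i > lam_i:
            n_outliers = i  # remember the LARGEST significant round
    return removed[:n_outliers]

# Hypothetical sample: two large values that mask each other in round 1.
sample = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10, 30, 31]
print(generalized_esd(sample, max_outliers=3))
# → [11, 10]
```

In this sample the first round is not significant on its own, because the two large values inflate the standard deviation together. A loop that stops at the first non-significant round would report nothing, while the ESD rule still flags both.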

Our rule on engagements: don’t apply Grubbs to anything you haven’t first checked with a Q-Q plot. If the Q-Q deviates from the diagonal at the tails, Grubbs is the wrong tool.
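One way to make the Q-Q check scriptable is to look at the correlation that scipy.stats.probplot reports for the fitted line. This is a sketch on simulated data, with qq_correlation as a hypothetical helper. Any numeric cutoff on that correlation would be a judgment call, so we still look at the plot itself (pass plot=plt to probplot to draw it).

```python
import numpy as np
from scipy import stats

def qq_correlation(values):
    """Correlation between sorted data and normal quantiles.

    Values near 1.0 mean the Q-Q points hug a straight line.
    """
    (_, _), (_, _, r) = stats.probplot(values, dist="norm")
    return r

rng = np.random.default_rng(3)
r_normal = qq_correlation(rng.normal(0, 1, 500))     # roughly normal data
r_skewed = qq_correlation(rng.lognormal(0, 1, 500))  # heavy right tail

print(f"normal: r={r_normal:.3f}")
print(f"skewed: r={r_skewed:.3f}")
```

A clearly lower correlation on the skewed sample is the numeric counterpart of the tails peeling away from the diagonal.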

Which method to actually use

We get asked this in every consulting kickoff. The decision tree we use:

Question | Method
First-pass data-quality check | describe() + histogram
Roughly-normal data, want a quick visual | Box plot
Skewed / unknown-distribution data, want a reproducible trim | Percentile cutoffs (2.5/97.5)
Time series with sensor glitches, want robust replacement | Hampel filter
Need a yes/no statistical answer on a single suspect point | Grubbs / Rosner ESD

When clients hand us a notebook full of df = df[df.value < df.value.quantile(0.99)], we usually leave it in place — that's the right call for dashboarding most of the time. Where we step in is when the question is causal ("did this customer's behavior change because of the campaign?") and the percentile trim has accidentally removed exactly the customers who responded most. The right method depends on what the analysis is for.

A note on scipy and modern alternatives

Since 2021, the Python outlier-detection toolbox has grown:

  • scipy.stats.iqr and scipy.stats.zscore ship with SciPy and cover the standard cases.
  • PyOD (pip install pyod) packages ~40 algorithms including LOF, Isolation Forest, and k-NN-based detection. Worth knowing about for multivariate outliers.
  • statsmodels.robust has Hampel-style robust estimators built in.
  • scikit-learn's IsolationForest is what we actually reach for when there's a multi-column "row looks suspicious" question. Single-variable methods miss it.

The five methods above still cover 90% of real-world cases. The packages above are for when you need to defend the method choice in a regulatory submission or when the data is genuinely multivariate. For an exploratory notebook on a single column, df.quantile() and df.boxplot() remain the right answers.
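As a quick orientation, here is how the SciPy helpers line up against the hand-rolled versions. The ten values are made up, and the threshold of 3 is the same convention used earlier, not a rule.

```python
import numpy as np
from scipy import stats

# Hypothetical values with one large point at the end.
values = np.array([18, 20, 21, 22, 24, 24, 26, 27, 29, 44], dtype=float)

iqr = stats.iqr(values)           # Q3 - Q1
z = stats.zscore(values, ddof=1)  # mean/std based z-scores

# MAD rescaled to estimate the standard deviation under normality
robust_scale = stats.median_abs_deviation(values, scale="normal")
robust_z = (values - np.median(values)) / robust_scale

print(f"IQR: {iqr}")
print(f"z-score of 44:        {z[-1]:.2f}")
print(f"robust z-score of 44: {robust_z[-1]:.2f}")
```

The large value inflates the standard deviation enough to keep its own z-score under 3, while the MAD-based score clears 3 comfortably. That gap is the practical argument for the median-based methods.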

Takeaway

There is no universal outlier detector. Each method bakes in assumptions about the distribution and answers a slightly different question:

  • describe() / histogram asks "do I see anything weird?"
  • Box plot asks "is this point outside Tukey's normality-calibrated fence?"
  • Percentile asks "is this point in the extreme tails of my data?"
  • Hampel asks "is this point far from the robust center?"
  • Grubbs / Rosner asks "is this point statistically improbable under normality?"

Get the question right first, then pick the method. Most analytics bugs we debug are not failed outlier detection — they are the right detector applied to the wrong question.
