Six Ways to Detect Outliers in R, and How to Pick Between Them

Outliers are the points that make summary statistics lie to you. A single trailing zero in a data entry turns the mean into a sales-floor decoration. A real but rare event (the millionaire in the income survey, the marathon in the daily-steps log) drags the regression line out of the cluster. Before you can decide what to do about them, you need to find them, and “find them” is harder than it sounds because there is no single correct definition of an outlier.

In This Article

  1. Setup
  2. Method 1: descriptive statistics and the histogram
  3. Method 2: the boxplot and IQR rule
  4. Method 3: percentile cutoffs
  5. Method 4: the Hampel filter
  6. Method 5: Grubbs test
  7. Method 6: Dixon test
  8. Method 7: Rosner test
  9. What you get when you run them all
  10. How to decide
  11. What to take from this

This is a tour of six detection methods I use in R, when each one is appropriate, and what the same dataset looks like under all of them. A preliminary step, the summary-and-histogram check, comes first and is numbered Method 1 below, which is why the headings run to seven. We will use the mpg dataset that ships with ggplot2, specifically the hwy column (highway miles per gallon across 234 cars), because it is small, public, and has a believable shape with a few real outliers at the top.

The methods split into two groups. The first four make no assumption about the distribution: you look at quantiles or medians and flag anything outside a range. The last three are formal statistical tests that assume approximate normality and return a p-value. They cost a real package import and a real interpretation, and they earn their keep when you need to defend the call in a stakeholder review.

Setup

library(ggplot2)   # ships the mpg dataset
library(outliers)  # grubbs.test(), dixon.test()
library(EnvStats)  # rosnerTest()
library(dplyr)     # %>% and slice()

dat <- ggplot2::mpg
summary(dat$hwy)
# Min. 1st Qu.  Median  Mean 3rd Qu.   Max.
# 12.00   18.00   24.00 23.44   27.00   44.00

234 rows. Min 12, max 44, median 24. Right-skewed because a handful of small efficient cars push the upper tail. That is the column we will run through every method below.

Method 1: descriptive statistics and the histogram

The first move with any continuous column is always the same: print summary() and plot a histogram.

hist(dat$hwy, breaks = sqrt(nrow(dat)), xlab = "hwy", main = "Histogram of hwy")

Square-root binning is a defensible default for an unfamiliar column. With 234 rows it asks for about 15 bins, noticeably finer than the 9 that Sturges’ rule (the default in hist()) would give, and hist() treats a single number as a suggestion that it rounds to pretty breakpoints. What you are looking for is the shape of the body and the presence of detached bumps in the tail. If the histogram is unimodal and the tail decays smoothly, there are probably no scary outliers. If the histogram shows a body and then a tiny isolated bar far from the body, that is your candidate.
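To see how much the binning rule matters, here is a minimal sketch that compares the bin counts several rules ask for. The vector x is a simulated stand-in of the same length as hwy (an assumption, so the block runs without ggplot2); the nclass.* helpers come from grDevices, which ships with base R.

```r
# Compare how many bins the common rules ask for on the same column.
# `x` is a hypothetical stand-in; swap in dat$hwy once ggplot2 is loaded.
set.seed(1)
x <- round(rnorm(234, mean = 23, sd = 6))

bin_counts <- c(
  sqrt    = round(sqrt(length(x))),
  sturges = grDevices::nclass.Sturges(x),
  scott   = grDevices::nclass.scott(x),
  fd      = grDevices::nclass.FD(x)
)
print(bin_counts)

# hist() passes a single number through pretty(), so the bar count you get
# can differ from the count you asked for.
h <- hist(x, breaks = bin_counts[["sqrt"]], plot = FALSE)
length(h$counts)
```

The Scott and Freedman-Diaconis counts depend on the spread of the data, so expect them to differ once you swap in the real column.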

For hwy, the histogram shows a clear mode around 24 to 28, a second bump in the high teens with values reaching down to 12, and a few detached values above 40. Those 40-plus values are the suspects.

This is not a detection method by itself; it frames the question that the other methods will answer. Always look at the shape before you flag anything.

Method 2: the boxplot and IQR rule

The boxplot is the visual default and the IQR rule is the math behind it. A point is flagged if it falls outside the interval [Q1 - 1.5 IQR, Q3 + 1.5 IQR], where IQR is Q3 - Q1.

boxplot(dat$hwy, ylab = "hwy")
boxplot.stats(dat$hwy)$out
# [1] 44 44 41

Three flagged values, all on the upper tail. The IQR rule has two virtues: it makes no assumption about the distribution, and it is the thing everyone already knows from a high-school stats class, so explaining it in a report is free. Its weakness is the 1.5 multiplier, which is a Tukey-era convention with no deep justification. With a heavier-tailed distribution it will flag too many; with a perfectly normal distribution it will flag about 0.7 percent of points by chance.

For exploratory work and dashboards, the IQR rule is the right first cut.
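The rule is short enough to write by hand, which helps when you want a multiplier other than 1.5. A minimal sketch, assuming a hypothetical helper name (iqr_flags) and a toy vector; the last lines reproduce the 0.7 percent figure for a normal distribution.

```r
# Flag values outside [Q1 - k * IQR, Q3 + k * IQR]; k = 1.5 is Tukey's convention.
iqr_flags <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  fence <- k * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Hypothetical toy vector: a tight body plus one detached value.
toy <- c(18, 20, 21, 22, 22, 23, 24, 25, 26, 27, 45)
toy[iqr_flags(toy)]

# Share of a perfectly normal distribution that lands outside the fences.
upper_fence_z <- qnorm(0.75) + 1.5 * (qnorm(0.75) - qnorm(0.25))
normal_share <- 2 * pnorm(-upper_fence_z)
round(100 * normal_share, 2)   # about 0.7 percent
```

One caveat: boxplot.stats() works from hinges rather than the default quantile() definition, so on small samples the two can disagree slightly at the fences.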

Method 3: percentile cutoffs

Pick the bottom and top percentile you are willing to call an outlier. The looser choice is 2.5 / 97.5, which flags about 5 percent of the data; the stricter one is 1 / 99.

lower <- quantile(dat$hwy, 0.025)  # 14
upper <- quantile(dat$hwy, 0.975)  # 35.175
sum(dat$hwy < lower | dat$hwy > upper)
# [1] 11

Tightening to 1 / 99 reduces the count to 3, and they are the same three points the IQR rule flags (41, 44, 44). The percentile method has the same distribution-free virtue as the IQR rule but trades the 1.5-IQR convention for a percentile choice that is at least transparent: if you said “I am flagging the most extreme 5 percent,” nobody can argue you did not.

It also makes outliers a proportion rather than a property of the data, which is the right mental model when the dataset can grow. If you flag the top 1 percent on a stream of incoming events, the threshold updates as new data arrives.
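The moving-threshold point is easy to demonstrate. A minimal sketch, assuming a hypothetical helper (percentile_flags) and two simulated batches standing in for a stream of incoming events:

```r
# Flag the most extreme share of values on each side.
percentile_flags <- function(x, lower_p = 0.01, upper_p = 0.99) {
  cut <- quantile(x, c(lower_p, upper_p), names = FALSE)
  x < cut[1] | x > cut[2]
}

# Hypothetical stream: the second batch shifts upward, so the cutoff moves.
set.seed(42)
batch_1 <- rnorm(500, mean = 20, sd = 5)
batch_2 <- rnorm(500, mean = 30, sd = 5)

cut_before <- quantile(batch_1, 0.99, names = FALSE)
cut_after  <- quantile(c(batch_1, batch_2), 0.99, names = FALSE)
c(before = cut_before, after = cut_after)

# The flagged share stays near 2 percent however the data is shaped.
flagged_share <- mean(percentile_flags(c(batch_1, batch_2)))
flagged_share
```

That constant share is the trade-off: the rule always flags something, even on a column with no unusual values at all.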

Method 4: the Hampel filter

The Hampel filter uses the median and the median absolute deviation (MAD) instead of the mean and standard deviation. Anything outside [median - 3 MAD, median + 3 MAD] is flagged.

lower <- median(dat$hwy) - 3 * mad(dat$hwy, constant = 1)  # 9
upper <- median(dat$hwy) + 3 * mad(dat$hwy, constant = 1)  # 39
sum(dat$hwy < lower | dat$hwy > upper)
# [1] 3

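The MAD is computed as the median of |x - median(x)|, which makes the whole rule resistant to extreme values in a way the standard deviation is not. One caveat on scale: with constant = 1 the raw MAD is used, and on normal data 3 raw MADs is only about 2 sigma. The 3-MAD bound becomes the analog of the 3-sigma bound when the MAD is rescaled by 1.4826, which is the default of mad() in R.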

The Hampel filter is the right choice when the data has a few wild values that would inflate the standard deviation enough to hide other outliers (a problem called “masking”). It is the default in signal-processing contexts for the same reason.
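Masking is easier to believe once you see it. A minimal sketch on a hypothetical toy vector, with helper names (hampel_flags, three_sigma_flags) that are mine rather than from any package:

```r
# Hampel rule: flag values more than k MADs from the median.
# constant = 1 keeps the raw MAD, as in the call above; R's default
# (1.4826) rescales it to estimate sigma under normality.
hampel_flags <- function(x, k = 3, constant = 1) {
  center <- median(x)
  spread <- mad(x, constant = constant)
  abs(x - center) > k * spread
}

# Mean/sd counterpart for comparison.
three_sigma_flags <- function(x, k = 3) {
  abs(x - mean(x)) > k * sd(x)
}

# Hypothetical toy vector: ten readings near 10 and two wild ones.
toy <- c(10, 11, 9, 10, 12, 11, 10, 9, 11, 10, 50, 55)

toy[three_sigma_flags(toy)]  # the wild values inflate sd and escape
toy[hampel_flags(toy)]       # the median and MAD barely move
```

The two wild readings push the standard deviation high enough that neither clears the 3-sigma bar, while the median-based bounds stay put around the body of the data.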

Method 5: Grubbs test

We move from rules-of-thumb to formal hypothesis tests. Grubbs assumes approximate normality and tests one extreme value at a time. The null is “all values come from the same normal distribution”; the alternative is “the most extreme value is from a different distribution.”

test <- grubbs.test(dat$hwy)
test$p.value
# [1] 0.05555

A p-value of 0.056 means we just barely fail to reject the null at the conventional 0.05 threshold. In plain English, the most extreme value (44) is suspicious but not formally significant. Grubbs is fussy by design.

Use Grubbs when you have one specific outlier candidate, the rest of the data is roughly normal, and the audience expects a p-value. Do not use it iteratively (remove the flagged point, test the next one) without a Bonferroni correction, or you will inflate your false-positive rate.
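If you want to see what the test is measuring, the statistic is one line. The sketch below uses the textbook critical-value formula based on the t distribution; the helper names and the toy vector are hypothetical, and this is not a reimplementation of grubbs.test(), which reports a p-value instead.

```r
# Grubbs statistic for the single most extreme value.
grubbs_stat <- function(x) {
  max(abs(x - mean(x))) / sd(x)
}

# Critical value from the t distribution (textbook formula).
# two_sided = FALSE tests one pre-chosen tail.
grubbs_crit <- function(n, alpha = 0.05, two_sided = TRUE) {
  p <- if (two_sided) alpha / (2 * n) else alpha / n
  t2 <- qt(p, df = n - 2)^2
  ((n - 1) / sqrt(n)) * sqrt(t2 / (n - 2 + t2))
}

# Hypothetical toy vector with one suspicious reading.
toy <- c(9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 14.0)
g <- grubbs_stat(toy)
c(G = g, critical = grubbs_crit(length(toy)))
g > grubbs_crit(length(toy))
```

Note that grubbs.test() defaults to a one-sided test (two.sided = FALSE), so its verdict near the threshold can differ from the two-sided cutoff used here.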

Method 6: Dixon test

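Dixon is the test for small samples (usually n up to about 25; dixon.test() accepts 3 to 30). In its basic form it compares the gap between the extreme value and its nearest neighbor against the total range, and for larger n it switches to ratio variants that skip a neighbor or two. Like Grubbs, it tests one extreme at a time.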

subdat <- dat %>% slice(1:20)
test <- dixon.test(subdat$hwy)
test$p.value
# [1] 0.006508

In the first 20 rows of mpg, the lowest value (15) is sharply separated from the rest, and Dixon flags it with a p-value of 0.007. With the full 234-row dataset, dixon.test() stops with an error because the sample size is outside the 3 to 30 range it supports; Grubbs or Rosner is the alternative there.

Dixon is the right choice for genuinely small samples (lab measurements, small surveys, early-stage A/B reads). For anything north of 30 rows it is the wrong tool.
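The ratio behind the test is simple enough to compute by hand. A minimal sketch for a suspiciously low value, assuming a hypothetical helper (dixon_ratio_low) and a toy vector of seven readings; the skip arguments are my own way of expressing the standard r10 and r22 variants.

```r
# Dixon's ratio for a suspiciously LOW value.
# skip_low / skip_high pick the variant: (0, 0) is the basic gap-over-range
# ratio r10; (1, 2) is r22, the variant usually recommended for larger n.
dixon_ratio_low <- function(x, skip_low = 0, skip_high = 0) {
  s <- sort(x)
  n <- length(s)
  (s[2 + skip_low] - s[1]) / (s[n - skip_high] - s[1])
}

# Hypothetical toy vector: seven lab readings, one low.
toy <- c(4.1, 5.0, 5.1, 5.1, 5.2, 5.3, 5.4)
r10 <- dixon_ratio_low(toy)                               # basic form
r22 <- dixon_ratio_low(toy, skip_low = 1, skip_high = 2)  # shown for contrast
c(r10 = r10, r22 = r22)
```

The ratio means nothing on its own; it has to be compared against tabulated critical values for the given n, which is the lookup dixon.test() does for you.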

Method 7: Rosner test

Rosner is the test when you suspect several outliers and the sample is large enough (n at least 20). You tell it how many outliers to look for (k) and it does the iterative testing for you, internally accounting for the multiple comparisons.

test <- rosnerTest(dat$hwy, k = 3)
test$all.stats

The output is a table with one row per candidate, the test statistic, the critical value, and a logical Outlier column. On mpg$hwy with k = 3, Rosner flags 0 of the top 3 values as statistically significant outliers at alpha 0.05.

Rosner is the most defensible test in a stakeholder context because it acknowledges that real data usually has more than one outlier, and it lets you set the maximum count you are willing to entertain.
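Rosner's procedure is also known as the generalized ESD test, and the iteration is short enough to sketch in base R. This is a teaching version under the standard textbook formula, with a hypothetical function name (gesd) and toy data; use rosnerTest() for real work.

```r
# Generalized ESD (the procedure behind Rosner's test), base R only.
gesd <- function(x, k, alpha = 0.05) {
  n <- length(x)
  stopifnot(k >= 1, k < n - 2)
  rows <- vector("list", k)
  remaining <- x
  for (i in seq_len(k)) {
    dev <- abs(remaining - mean(remaining))
    worst <- which.max(dev)
    r_i <- dev[worst] / sd(remaining)
    # Critical value for step i, adjusted for the shrinking sample.
    p <- 1 - alpha / (2 * (n - i + 1))
    t_crit <- qt(p, df = n - i - 1)
    lambda_i <- (n - i) * t_crit / sqrt((n - i - 1 + t_crit^2) * (n - i + 1))
    rows[[i]] <- data.frame(step = i, value = remaining[worst],
                            R = r_i, lambda = lambda_i)
    remaining <- remaining[-worst]
  }
  out <- do.call(rbind, rows)
  # Outlier count = the LAST step where R exceeds lambda, not the first failure.
  exceed <- which(out$R > out$lambda)
  n_out <- if (length(exceed) > 0) max(exceed) else 0
  out$outlier <- out$step <= n_out
  out
}

# Hypothetical toy vector: twenty readings near 10 and two wild ones.
toy <- c(9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 9.9,
         10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7, 10.0, 10.1, 25, 30)
res <- gesd(toy, k = 3)
res
```

Taking the last exceeding step rather than stopping at the first non-significant one is what protects the procedure against masking: a second outlier can hide the first, and the test still finds both.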

What you get when you run them all

Method | Flagged count | Threshold logic
Boxplot / IQR | 3 | distribution-free, 1.5 * IQR
Percentile (2.5/97.5) | 11 | distribution-free, fixed share
Percentile (1/99) | 3 | distribution-free, fixed share
Hampel | 3 | distribution-free, 3 * MAD
Grubbs | 0 at alpha 0.05 | normality assumed
Dixon (first 20 rows) | 1 at alpha 0.05 | normality, small n only
Rosner (k=3) | 0 at alpha 0.05 | normality, multiple candidates

The interesting thing about this table is that the methods do not agree. The IQR and Hampel rules flag three points, the loose percentile flags eleven, and Grubbs and Rosner flag zero on the full column. None of them is wrong. The variation is the point: an outlier is whatever you call an outlier, and the method you pick is a statement about your tolerance.

How to decide

If you are doing exploratory work and need a default, the boxplot/IQR rule is fine. It is fast, distribution-free, and easy to explain. The Hampel filter is a better fallback if you already know the data has a few wild values that distort the mean.

If you are about to remove outliers in production and need a defensible rule, pick percentile cutoffs and write the percentage in the documentation. “We drop the top and bottom 1 percent of column_x before computing the mean” is a sentence anyone can audit. The boxplot rule is harder to audit because the 1.5 multiplier feels arbitrary if you stop to look at it.

If you are reporting to a scientific or regulatory audience and need a p-value, Rosner is the test to reach for whenever the sample is large enough. Grubbs is the niche tool for “I want to test one specific candidate.” Dixon is the test for genuinely small samples.

The big decision is upstream of all this: do you remove the outlier, transform around it, or report it as a finding? Removing is the most common move and almost always wrong if the outlier is a real observation rather than a data-entry error. Transformations (log, winsorize, rank) let you keep the data and stop the outlier from dominating the analysis. Reporting it as a finding is the right move when the outlier is the interesting thing, which is more often than the textbook makes it sound.
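Of the transformations mentioned, winsorizing is the one people most often hand-roll. A minimal sketch in base R, with a hypothetical helper (winsorize) and a toy income vector; the 10 / 90 cutoffs are only there to make the effect visible on ten points.

```r
# Winsorizing: clamp values to chosen percentiles instead of dropping rows.
winsorize <- function(x, lower_p = 0.01, upper_p = 0.99) {
  cut <- quantile(x, c(lower_p, upper_p), names = FALSE)
  pmin(pmax(x, cut[1]), cut[2])
}

# Hypothetical toy vector: nine ordinary incomes (in thousands) and one millionaire.
income <- c(32, 35, 38, 40, 41, 44, 47, 52, 60, 1000)

capped <- winsorize(income, lower_p = 0.10, upper_p = 0.90)
c(raw_mean = mean(income), capped_mean = mean(capped), n_kept = length(capped))

# A log transform is the other common option for positive, right-skewed data.
round(log10(income), 2)
```

Every row survives, so downstream counts and joins are unaffected; only the influence of the extreme value on the mean is reduced.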

What to take from this

Six methods, one column, results from 0 flagged points to 11. The methods are tools, not arbiters. Pick the one that matches the question and the audience, document the choice, and remember that the right number of outliers to remove from a real dataset is usually zero.


About the author

Nick Valiotti is the founder of Valiotti Data. 15+ years building analytics infrastructure for SaaS, marketplaces, and consumer subscription. 50+ production deployments across BigQuery, Snowflake, dbt, Metabase, and modern BI stacks. Author of two books on data strategy.
