# Data normalization with SQL

Estimated read time – 5 min

According to GIGO (garbage in, garbage out) principle, errors in input data lead to erroneous analysis results. The results of our work directly depend on the quality of data preparation.

For instance, when we need to prepare data to use in ML algorithms (like k-NN, k-means, logistic regression, etc.), features in the original dataset may vary in scale like the age and height of a person. This may lead to the incorrect performance of the algorithm. Thus, such data needs to be rescaled first.

In this tutorial, we will consider the ways to scale the data using SQL query: min-max normalization, min-max normalization for an arbitrary range, and z-score normalization. For each of these methods we have prepared two SQL query options: one using a SELECT subquery and another using a window function OVER().

We will work with the simple table students that contains the data on the height of the students:

 name height Ivan 174 Peter 181 Dan 199 Kate 158 Mike 179 Silvia 165 Giulia 152 Robert 188 Steven 177 Sophia 165

## Min-max rescaling

Min-max scaling approach scales the data using the fixed range from 0 to 1. In this case, all the data is on the same scale which will exclude the impact of outliers on the conclusions.

The formula for the min-max scaling is given as:

We multiply the numerator by 1.0 in order to get a floating point number at the end.

SQL-query with a subquery:

``````SELECT height,
1.0 * (height-t1.min_height)/(t1.max_height - t1.min_height) AS scaled_minmax
FROM students,
(SELECT min(height) as min_height,
max(height) as max_height
FROM students
) as t1;``````

SQL-query with a window function:

``````SELECT height,
(height - MIN(height) OVER ()) * 1.0 / (MAX(height) OVER () - MIN(height) OVER ()) AS scaled_minmax
FROM students;``````

As a result, we get the values in [0, 1] range where 0 is the height of the shortest student and 1 is the height of the tallest one.

 name height scaled_minmax Ivan 174 0.46809 Peter 181 0.61702 Dan 199 1 Kate 158 0.12766 Mike 179 0.57447 Silvia 165 0.2766 Giulia 152 0 Robert 188 0.76596 Steven 177 0.53191 Sophia 165 0.2766

## Rescaling within a given range

This is an option of min-max normalization between an arbitrary set of values. When it comes to data scaling, the values do not always need to be in the range between 0 and 1. In these cases, the following formula is applied.

This allows us to scale the data to an arbitrary scale. In our example, let’s assume a=10.0 and b=20.0.

SQL-query with subquery:

``````SELECT height,
((height - min_height) * (20.0 - 10.0) / (max_height - min_height)) + 10 AS scaled_ab
FROM students,
(SELECT MAX(height) as max_height,
MIN(height) as min_height
FROM students
) t1;``````

SQL-query with a window function:

``````SELECT height,
((height - MIN(height) OVER() ) * (20.0 - 10.0) / (MAX(height) OVER() - MIN(height) OVER())) + 10.0 AS scaled_ab
FROM students;``````

We receive similar results as before, but with data spread between 10 and 20.

 name height scaled_ab Ivan 174 14.68085 Peter 181 16.17021 Dan 199 20 Kate 158 11.2766 Mike 179 15.74468 Silvia 165 12.76596 Giulia 152 10 Robert 188 17.65957 Steven 177 15.31915 Sophia 165 12.76596

## Z-score normalization

Using Z-score normalization, the data will be scaled so that it has the properties of a standard normal distribution where the mean (μ) is equal to 0 and the standard deviation (σ) to 1.

Z-score is calculated using the formula:

SQL-query with a subquery:

``````SELECT height,
(height - t1.mean) * 1.0 / t1.sigma AS zscore
FROM students,
(SELECT AVG(height) AS mean,
STDDEV(height) AS sigma
FROM students
) t1;``````

SQL-query with a window function:

``````SELECT height,
(height - AVG(height) OVER()) * 1.0 / STDDEV(height) OVER() AS z-score
FROM students;``````

As a result, we can easily notice the outliers that exceed the standard deviation.

 name height zscore Ivan 174 0.01488 Peter 181 0.53582 Dan 199 1.87538 Kate 158 -1.17583 Mike 179 0.38698 Silvia 165 -0.65489 Giulia 152 -1.62235 Robert 188 1.05676 Steven 177 0.23814 Sophia 165 -0.65489
Share
Send
166   4 mon   Analytics engineering   sql
Popular