# How to Calculate Retention Rate

7 minutes

422

For starters, we have to understand in which situations and for whom this metric is important. Imagine this situation: you have launched a particular product (an application or a web service) and, fortunately, people use it. Woo-hoo!

Should you be satisfied if 1000 people use your product daily? The answer is no, and here is a short explanation of why. If you have some loyal users and their number increases, that’s great. However, suppose these 1000 people are different people from day to day. In that case, none of them are truly interested in the service you offer; thus, you should thoroughly re-assess your product and implement some structural changes. So, you end up in desperate need of a metric that can help you to distinguish an optimal status-quo from a disappointing one.

The most common and effective metric to use in this case is the retention rate. In this article, we will show you how to create a retention report with the use of Redash and SQL. To begin with, we will get to the crux of the retention rate: why it is important for a company and how it should be calculated.

## Retention rate

The retention rate metric is quite widespread and is especially well known in the mobile industry because it allows businesses to understand how well certain products engage users in daily use. Firstly, let’s remember (or find out) how retention is calculated:

X day retention is the N% percentage of users who will return to the product on X day. In other words, if on a particular day (e.g. day 0) 100 new users came, and on the 1st day 15 returned, then the retention of the 1st day will be 15 / 100 = 15%.

Most often, we calculate retention for days 1, 3, 7 and 30 as they are the most descriptive product metrics. Moreover, it is useful to know how to analyse the retention curve and draw conclusions from it.

## Retention curve

Ultimately, we are interested in building a curve that shows user retention from day 0 to day 30.

Retention rate curve from day 0 to day 30

## Rolling retention (RR)

In addition to the classic retention rate, there is also rolling retention (RR). When calculating RR, in addition to day X, you should also take all subsequent days into account. Thus, the RR of the 1st day is the number of users who returned on the 1st and following days.

Let’s compare retention and rolling retention on the 10th day:

Retention10 – number of users who returned on the 10th day / number of users who installed the application 10 days ago * 100%.

Rolling retention10 – number of users who returned on the 10th day or later / number of users who installed the application 10 days ago * 100%.

## Granularity

In some industries and related tasks, it is useful to get the retention of a specific day (most often in the mobile industry). In other cases, it is useful to get the retention of a user at different time intervals: for example, weekly or monthly periods (often useful in e-commerce and retail).

An example of cohorts by month and their corresponding monthly retention

## How to create a retention report in SQL

We have already discussed how to calculate retention. Now let’s apply all our theoretical knowledge to SQL.

Let’s say that we have two tables: user, where we store data about user IDs and meta-information, and client_session, where we store information about user visits to the mobile application.

Only these two tables will appear in the query, so you can easily customise the query for yourself.

Note:  for this code, we are using Impala as the DBMS.

## Gather cohort size

``````
SELECT from_unixtime(user.installed_at, "yyyy-MM-dd") AS reg_date,
ndv(user.id) AS users
FROM USER
GROUP BY 1``````

Let’s analyse this rather simple query: for each day, we count the number of unique users for the interval [60 days ago; 31 days ago].

In order not to get into the documentation: the ndv() command in Impala is analogous to the count(distinct) command.

## Count the number of returning users for each cohort

``````
SELECT from_unixtime(user.installed_at, "yyyy-MM-dd") AS reg_date,
datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) AS date_diff,
ndv(user.id) AS ret_base
FROM USER
LEFT JOIN client_session cs ON cs.user_id=user.id
WHERE 1=1
AND datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) between 0 and 30
GROUP BY 1, 2``````

In this query, the key part is contained in the datediff command: now we count for each cohort and for each datediff the number of unique users with the same ndv() command (in fact, the number of users who returned during days 0 to 30).

Great, now we have the size of the cohorts and the number of returning users.

## Putting it all together

``````
SELECT reg.reg_date AS date_registration,
reg.users AS cohort_size,
cohort.date_diff AS day_difference,
cohort.ret_base AS retention_base,
cohort.ret_base/reg.users AS retention_rate
FROM
(SELECT from_unixtime(user.installed_at, "yyyy-MM-dd") AS reg_date,
ndv(user.id) AS users
FROM USER
GROUP BY 1) reg
LEFT JOIN
(SELECT from_unixtime(user.installed_at, "yyyy-MM-dd") AS reg_date,
datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) AS date_diff,
ndv(user.id) AS ret_base
FROM USER
LEFT JOIN client_session cs ON cs.user_id=user.id
WHERE 1=1
AND datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) between 0 and 30
GROUP BY 1, 2) cohort ON reg.reg_date=cohort.reg_date
ORDER BY 1.3``````

We received a query that counts retention for each cohort, as a result, the result can be displayed as follows:

Retention rate calculated for each user cohort

Plotting a single retention curve

Next, we will slightly modify our query to get the required data to build a retention curve:

``````
SELECT
cohort.date_diff AS day_difference,
avg(reg.users) AS cohort_size,
avg(cohort.ret_base) AS retention_base,
avg(cohort.ret_base)/avg(reg.users)*100 AS retention_rate
FROM
(SELECT from_unixtime(user.installed_at, "yyyy-MM-dd") AS reg_date,
ndv(user.id) AS users
FROM USER
AND from_unixtime(user.installed_at)<=date_add(now(), -31) GROUP BY 1) reg
LEFT JOIN
(SELECT from_unixtime(user.installed_at, "yyyy-MM-dd") AS reg_date, datediff(cast(cs.created_at AS TIMESTAMP),
cast(installed_at AS TIMESTAMP)) AS date_diff, ndv(user.id) AS ret_base
FROM USER
LEFT JOIN client_session cs ON cs.user_id=user.id
WHERE 1=1
AND datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) between 0 and 30