19.04.2022 | Nikolay Valiotti

How to calculate retention rate?

 

For starters, we have to understand in which situations and for whom this metric is important. Imagine this situation: you have launched a particular product (an application or a web service) and, fortunately, people use it. Woo-hoo!

Should you be satisfied if 1000 people use your product daily? The answer is no, and here is a short explanation of why. If you have some loyal users and their number increases, that’s great. However, suppose these 1000 people are different people from day to day. In that case, none of them are truly interested in the service you offer; thus, you should thoroughly re-assess your product and implement some structural changes. So, you end up in desperate need of a metric that can help you to distinguish an optimal status-quo from a disappointing one. 

The most common and effective metric to use in this case is the retention rate. In this article, we will show you how to create a retention report with the use of Redash and SQL language. To begin with, we will get to the crux of the retention rate: why it is important for a company and how it should be calculated.

 

Retention rate

The retention rate metric is quite widespread and is especially well known in the mobile industry because it allows businesses to understand how well certain products engage users in daily use. Firstly, let’s remember (or find out) how retention is calculated:

X day retention is the N% percentage of users who will return to the product on X day. In other words, if on a particular day (e.g. day 0) 100 new users came, and on the 1st day 15 returned, then the retention of the 1st day will be 15 / 100 = 15%.

Most often, we calculate retention for days 1, 3, 7 and 30 as they are the most descriptive product metrics. Moreover, it is useful to know how to analyse the retention curve and draw conclusions from it.

 

Retention curve

Ultimately, we are interested in building a curve that shows user retention from day 0 to day 30.

Retention rate curve from day 0 to day 30

 

Rolling retention (RR)

In addition to the classic retention rate, there is also rolling retention (RR). When calculating RR, in addition to day X, you should also take all subsequent days into account. Thus, the RR of the 1st day is the number of users who returned on the 1st and following days.

 

Let’s compare retention and rolling retention on the 10th day:

Retention10 – number of users who returned on the 10th day / number of users who installed the application 10 days ago * 100%.

Rolling retention10 – number of users who returned on the 10th day or later / number of users who installed the application 10 days ago * 100%.

 

Granularity

In some industries and related tasks, it is useful to get the retention of a specific day (most often in the mobile industry). In other cases, it is useful to get the retention of a user at different time intervals: for example, weekly or monthly periods (often useful in e-commerce and retail).

An example of cohorts by month and their corresponding monthly retention

 

How to create a retention report in SQL

We have already discussed how to calculate retention. Now let’s apply all our theoretical knowledge to SQL.

Let’s say that we have two tables: user, where we store data about user IDs and meta-information, and client_session, where we store information about user visits to the mobile application.

Only these two tables will appear in the query, so you can easily customise the query for yourself.

Note:  for this code, we are using Impala as the DBMS.

 

Gather cohort size


SELECT from_unixtime(user.installed_at, "yyyy-MM-dd") AS reg_date,
          ndv(user.id) AS users
   FROM USER
   WHERE from_unixtime(user.installed_at)>=date_add(now(), -60)
     AND from_unixtime(user.installed_at)<=date_add(now(), -31)
   GROUP BY 1

Let’s analyse this rather simple query: for each day, we count the number of unique users for the interval [60 days ago; 31 days ago].

In order not to get into the documentation: the ndv() command in Impala is analogous to the count(distinct) command.

 

Count the number of returning users for each cohort


SELECT from_unixtime(user.installed_at, "yyyy-MM-dd") AS reg_date,
          datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) AS date_diff,
          ndv(user.id) AS ret_base
   FROM USER
   LEFT JOIN client_session cs ON cs.user_id=user.id
   WHERE 1=1
     AND datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) between 0 and 30
     AND from_unixtime(user.installed_at)>=date_add(now(), -60)
     AND from_unixtime(user.installed_at)<=date_add(now(), -31)
   GROUP BY 1, 2

 

In this query, the key part is contained in the datediff command: now we count for each cohort and for each datediff the number of unique users with the same ndv() command (in fact, the number of users who returned during days 0 to 30).

 

Great, now we have the size of the cohorts and the number of returning users.

 

Putting it all together


SELECT reg.reg_date AS date_registration,
       reg.users AS cohort_size,
       cohort.date_diff AS day_difference,
       cohort.ret_base AS retention_base,
       cohort.ret_base/reg.users AS retention_rate
FROM
  (SELECT from_unixtime(user.installed_at, "yyyy-MM-dd") AS reg_date,
          ndv(user.id) AS users
   FROM USER
   WHERE from_unixtime(user.installed_at)>=date_add(now(), -60)
     AND from_unixtime(user.installed_at)<=date_add(now(), -31) 
   GROUP BY 1) reg 
LEFT JOIN 
  (SELECT from_unixtime(user.installed_at, "yyyy-MM-dd") AS reg_date, 
   datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) AS date_diff, 
   ndv(user.id) AS ret_base 
   FROM USER 
   LEFT JOIN client_session cs ON cs.user_id=user.id 
   WHERE 1=1 
    AND datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) between 0 and 30 
    AND from_unixtime(user.installed_at)>=date_add(now(), -60)
    AND from_unixtime(user.installed_at)<=date_add(now(), -31)
   GROUP BY 1, 2) cohort ON reg.reg_date=cohort.reg_date
    ORDER BY 1.3

 

We received a query that counts retention for each cohort, as a result, the result can be displayed as follows:

Retention rate calculated for each user cohort

 

Plotting a single retention curve

Next, we will slightly modify our query to get the required data to build a retention curve:


SELECT
       cohort.date_diff AS day_difference,
       avg(reg.users) AS cohort_size,
       avg(cohort.ret_base) AS retention_base,
       avg(cohort.ret_base)/avg(reg.users)*100 AS retention_rate
FROM
  (SELECT from_unixtime(user.installed_at, "yyyy-MM-dd") AS reg_date,
          ndv(user.id) AS users
   FROM USER
   WHERE from_unixtime(user.installed_at)>=date_add(now(), -60)
     AND from_unixtime(user.installed_at)<=date_add(now(), -31) GROUP BY 1) reg 
LEFT JOIN 
  (SELECT from_unixtime(user.installed_at, "yyyy-MM-dd") AS reg_date, datediff(cast(cs.created_at AS TIMESTAMP), 
  cast(installed_at AS TIMESTAMP)) AS date_diff, ndv(user.id) AS ret_base 
  FROM USER 
  LEFT JOIN client_session cs ON cs.user_id=user.id 
  WHERE 1=1 
   AND datediff(cast(cs.created_at AS TIMESTAMP), cast(installed_at AS TIMESTAMP)) between 0 and 30 
   AND from_unixtime(user.installed_at)>=date_add(now(), -60)
   AND from_unixtime(user.installed_at)<=date_add(now(), -31)
  GROUP BY 1,2) cohort ON reg.reg_date=cohort.reg_date
   GROUP BY 1
   ORDER BY 1

 

Now we have calculated the average retention rate for all cohorts for each day!

More on this topic:

– Top 3 ways to calculate user retention rate with formulas