Most gaming and mobile companies have a retention curve. If you missed __the last article__, we strongly recommend reading it as it discussed retention and how to calculate it.

But in a nutshell, retention is a metric that allows you to understand how well a product engages users daily. And with the help of retention and average revenue per daily active user (ARPDAU), you can calculate lifetime value (LTV): lifetime income per user.

In today’s article, we will analyse how to model LTV for 180 days using SQL and simple linear regression.

LTV is a prediction of the net profit attributed to an ongoing relationship between customer and product. In general, the LTV formula looks like ARPDAU multiplied by lifetime – the user’s lifetime in the project:

LTV = ARPDAU * Lifetime

Let’s look at the classic Retention chart:

So, lifetime is the area of the figure under Retention:

Note: Where the integrals and areas come from is explained in more detail in this article.

So, to calculate lifetime, you need to take the integral of the time hold function. The formula takes the following form:

LTV = ∫_{0}^{T}S(t)*ARPDAU

The exponential function *a*x^b* is best suited to describe the retention curve. Here’s how it looks compared to the retention curve:

In this case, x is the number of days, and a and b are the function parameters that we will build using linear regression. The appearance of regression was not by chance – this power function can be reduced to the form of a linear function by taking the logarithm:

y = a*x^{b}

ln(y) = ln(a) + ln(x^{b})

ln(y) = ln(a) + b*ln(x)

where ln(a) – intercept, b – slope.

It remains to find these parameters, and linear regression and the least-squares method are here to help you. Lifetime is the cumulative amount of the forecast for 180 days. After calculating it, it remains to multiply lifetime by ARPDAU to get the LTV for 180 days.

Let’s move on to some practice. We used the data of one gaming company and the PostgreSQL DBMS for all calculations since it already has the functions of searching for parameters for linear regression. Let’s start by building retention: let’s collect the total number of users from 1 March to 1 April 2021. We study the activity for one month:

```
--total number of users in the cohort
with cohort as (
select count(distinct id) as total_users_of_cohort
from users
where date(registration) between date '2021-03-01' and date '2021-03-30'
),
```

Now let’s see how these users behave in the next 90 days:

```
--number of active users on day 1, day 2, day 3, etc. from the cohort
active_users as (
select date_part('day', activity.date - users.registration) as activity_day,
count(distinct users.id) as active_users_of_day
from activity
join users on activity.user_id = users.id
where date(registration) between date '2021-03-01' and date '2021-03-30'
group by 1
having date_part('day', activity.date - users.registration) between 1 and 90
-- we take only the first 90 days, we predict the remaining days.
),
```

The retention curve is the ratio of the number of active users to the current day’s cohort size. In our case, it looks like this:

According to the curve, we calculate the parameters for linear regression: where regr_slope(x, y) is a function to calculate the slope of the regression and regr_intercept(x, y) is a function to calculate the intercept along the Y-axis. These functions are standard with aggregate functions in PostgreSQL and for known X and Y by least squares.

Let’s go back to our formula. We now have a linear equation, and next, we want to find the linear regression coefficients. Interception along the Y-axis and the slope coefficient can be found by the default functions for PostgreSQL. It turns out that:

ln(y) = ln(a) + b*ln(x)

ln(a) = intercept( ln(x), ln(y))

From the property of the natural logarithm, it follows that:

a = e^{intercept( ln(x), ln(y))}

The slope is calculated in the same way:

b = slope( ln(x), ln(y))

You can read more about how the intercept(x,y) and slope(x,y) functions work in __this manual__.

We will write the same calculations in a subquery for calculating the regression coefficients:

```
--calculate regression coefficients
coef as (
select exp(regr_intercept(ln(activity), ln(activity_day))) as a,
regr_slope(ln(activity), ln(activity_day)) as b
from(
select activity_day,
active_users_of_day::real / total_users_of_cohort as activity
from active_users
cross join cohort order by activity_day
)
),
```

We get a forecast for 180 days by substituting the parameters into the exponential function described earlier. At the same time, let’s calculate lifetime: the cumulative sum of the predicted data. In the coef subquery, we will get only two numbers: the slope and intercept parameter. To make these parameters available to each row of the lt subquery, we do a cross join to coef:

```
lt as(
select generate
_series as activity_day,
active_users_of_day::real/total_users_of_cohort as real_data,
a*power(generate_series,b) as pred_data,
sum(a*power(generate_series,b)) over(order by generate_series) as cumulative_lt
from generate_series(1,180,1)
cross join coefficient
join active_users on generate_series = activity_day::int
),
```

Now we can compare the forecast for 180 days with retention:

At this stage we consider LTV itself: lifetime multiplied by ARPDAU. In our case, ARPDAU is $83.7:

```
select cumulative_lt as LT,
cumulative_lt * 83.7 as LTV
from lt
```

Finally, let’s plot LTV for 180 days:

```
--total number of users in the cohort
with cohort as (
select count(*) as total_users_of_cohort
from users
where date(registration) between date '2021-03-01' and date '2021-03-30'
),
--number of active users on day 1, day 2, day 3, etc. from the cohort
active_users as (
select date_part('day', activity.date - users.registration) as activity_day,
count(distinct users.id) as active_users_of_day
from activity
join users on activity.user_id = users.id
where date(registration) between date '2021-03-01' and date '2021-03-30'
group by 1
having date_part('day', activity.date - users.registration) between 1 and 90
-- we take only the first 90 days, we predict the remaining days.
),
--calculate regression coefficients
coef as (
select exp(regr_intercept(ln(activity), ln(activity_day))) as a,
regr_slope(ln(activity), ln(activity_day)) as b
from(
select activity_day,
active_users_of_day::real / total_users_of_cohort as activity
from active_users
cross join cohort order by activity_day
)
),
lt as(
select generate_series as activity_day,
active_users_of_day::real/total_users_of_cohort as real_data,
a*power(generate_series,b) as pred_data,
sum(a*power(generate_series,b)) over(order by generate_series) as cumulative_lt
from generate_series(1,180,1)
cross join coefficient
join active_users on generate_series = activity_day::int
),
select cumulative_lt as LT,
cumulative_lt * 83.7 as LTV
from lt
```

You need this guide, because it includes:

- evolution of data stack
- reasons why many well-known tools do not meet the challenges of the new world
- vital structure of modern data stack
- the newest tools for every part of modern data stack

Turn it on to get exclusive guide on modern data stack

Emails suck. This newsletter doesn’t

Subscribe to the newsletter and get the most useful guide on modern data stack

The newsletter

You will also receive other useful materials on data analysis hacks with case examples from our company.