LEFT JOIN: blog on analytics, visualisation & data science

Kazakhstan Marketing Conference 2020

Yesterday I had a chance to address the largest marketing conference in Kazakhstan: Kazakhstan Marketing Conference 2020.

Almaty, as a city, has made a positive impression on me, whereas the conference itself turned out to be highly professional event, filled with plenty of smart, versatile and kind people.

A pleasant bonus for conference participants: presentation of my speech available on slideshare (careful, VPN!), so one can recall what it was about.

Apart from the speech, in the main forum’s section I was holding a masterclass on “How to construct a comprehensible technical specification on analytics?”.
And, within the framework of work with the audience, we managed to formulate points for a template of a technical specification.

Sharing the template, it will be helpful for those, who faced with difficulties in translating of a task from business language to the technical one.

 No comments    12   2 d   analysis   conference   marketing   template
 No comments    23   5 d   machine learning   python

Overview of Yandex DataLens

Let’s take our minds off of the project on receipt data collection for a while. We will speak about the project’s following steps a bit later.

Today we’ll be discussing a new service from Yandex – DataLens (the access to demo was kindly provided to me by my great friend Vasiliy Ozerov and the team Fevlake / Rebrain). Currently, the service is in Preview mode and is, in essence, a cloud BI. The main shtick of the service is that it can easily and handy work with clickhouse (Yandex Clickhouse).

Connection of data sources

Let’s review the major things: connection of a data source and dataset setting.
The selection of DBMS is not vast, nevertheless some main things are present. For the purpose of our testing, let’s take MySQL.

Selection of data sources DataLens

On the basis of the connection created, it is suggested to create a dataset:

Interface of dataset settings, definition of measurements and metrics

On this stage it’s defined which table’s attributes are becoming measurements, and which are turning into metrics. You can choose data aggregation type for the metrics.
Unfortunately, I didn’t manage to discover how it’s possible to state several interconnected tables (for example, attach a handbook for measurements) instead of a single table. Perhaps, on this stage developers suggest us to solve this issue by creating of required view.

Data visualization

Regarding the interface itself – everything is pretty easy and handy. It reminds of a cloud version of Tableau. If comparing to Redash, which is most frequently used in conjunction with Clickhouse, the opportunities of visualization are simply staggering.
Even pivot tables, in which one can use Measure Names as columns’ names are worth something:

Setting of pivot tables in DataLens

Obviously, there is an opportunity to make also basic charts in DataLens from Yandex:

Construction of a linear chart in DataLens

There are also area charts:

Construction of an area chart in DataLens

However, I didn’t manage to find out how data classification by months / quarters / weeks is carried out. According to an example of data, available in the demo version, developers are still solving this issue by creating additional attributes (DayMonth, DayWeek, etc).

Dashboards

For now, interface of dashboard blocks’ creation looks bulky, and interface windows are not always comprehensive. Here is, for instance, a window, allowing to state a parameter:

Not really apparent setting window for dashboard parameters

However, in the gallery of examples we can see highly functional and convenient dashboards with selectors, tabs and parameters:

An example of a working dashboard with parameters and tabs in DataLens

Looking forward to fixing of interface shortcomings, improving of Datalens and preparing to use it together with Clickhouse!

Building a funnel-report in redash

So, we’ve been planning to review Funnel-visualization of a report in Redash.
First and foremost, let’s build a request to the data source that we’ve created – Google Analytics.

The following text needs to be placed in the request console:

{
    "ids": "ga:128886640",
    "start_date": "30daysAgo",
    "end_date": "yesterday",
    "metrics": "ga:users,ga:goal1Completions,ga:goal2Completions,ga:goal3Completions"
}

In this request we are asking API Google Analytics to provide data for the last 30 days on the account GA: 128886640. We want to see the number of users and the number of completion of goals 1, 2 and 3.

As a result, our table will look as follows:

ga:users ga:goal1Completions ga:goal2Completions ga:goal3Completions
3,926 105 41 32

Great, that’s right what we need in order to build a funnel.
Now I will tell you about one very useful Redash feature: query-results. In order to connect tables with results of queries’ execution, we go to Data Sources and search for query-results (beta). Connecting new data source.
Now we have an opportunity to refer to results of Redash queries. Thus, for instance, we can use the results of a requests to Google Analytics API.

How to do it?
We need to choose a data source query-results on the left:

Drop down menu with selection of data sources (in the console – on the left)

Now we’ll learn to make funnel-visualization. For this purpose, we write the following SQL-query:

select 'Add a good to the shopping cart' as step_name, ga_goal1Completions as goalCompletion from query_8
union all
select 'View the shopping cart' as step_name, ga_goal2Completions from query_8
union all
select 'Order processing' as step_name, ga_goal3Completions from query_8

In this case query_8 – is the very table with results of request to the data source Google Analytics.

Let’s set visualization:

Carefully selecting parameters, in order to achieve the desired result

As a result, we receive the funnel of conversions from one goal to another:

You can display this funnel in the dashboard and add filters / parameters thereto.

Collecting data from hypermarket receipts on Python

Recently, once again buying products in a hypermarket, I recalled that, according to the Russian Federal Act FZ-54, any trade operator, that issues a receipt, is obliged to send the data thereof to the Tax Service.

Receipt from “Lenta” hypermarket. The QR-code of our interest is circled.

So, what does it mean for us, data analysts? It means that we can know ourselves and our needs better, and also acquire interesting data on own purchases.

Let’s try to assemble a small prototype of an app that will allow to make a dynamic of our purchases within the framework of blog posts’ series. So, we’ll start from the fact, that each receipt has a QR-code, and if you identify it, you’ll receive the following line:

t=20190320T2303&s=5803.00&fn=9251440300007971&i=141637&fp=4087570038&n=1

This line comprises:

t – timestamp, the time when you made a purchase
s – sum of the receipt
fn – code number of fss, will be needed further in a request to API
i – receipt number, will be needed further in a request to API
fp – fiscalsign parameter, will be needed further in a request to API

Within the solution of the first step, we will parse the receipt data and collect it in pandas dataframe, using Python modules.

We will use API, that provides data on the receipt from the Tax Service website.

Initially, we will receive authentication data:

import requests
your_phone = '+7XXXYYYZZZZ' #you need to state your phone number, SMS with password will arrive thereon
r = requests.post('https://proverkacheka.nalog.ru:9999/v1/mobile/users/signup', json = {"email":"email@email.com","name":"USERNAME","phone":your_phone})

As a result of performing POST request we receive a password in SMS to the indicated phone number. Further on, we will be using it in a variable pwd

Now we’ll parse our line with values from QR-code:

import re
qr_string='t=20190320T2303&s=5803.00&fn=9251440300007971&i=141637&fp=4087570038&n=1'
t=re.findall(r't=(\w+)', qr_string)[0]
s=re.findall(r's=(\w+)', qr_string)[0]
fn=re.findall(r'fn=(\w+)', qr_string)[0]
i=re.findall(r'i=(\w+)', qr_string)[0]
fp=re.findall(r'fp=(\w+)', qr_string)[0]

We’ll use the variables obtained in order to extract the data.
One Habr post pretty thoroughly examines status of errors at formation of API request, therefore I won’t repeat this information.

In the beginning, we need to verify the presence of data on this receipt, so we form a GET request.

headers = {'Device-Id':'', 'Device-OS':''}
payload = {'fiscalSign': fp, 'date': t,'sum':s}
check_request=requests.get('https://proverkacheka.nalog.ru:9999/v1/ofds/*/inns/*/fss/'+fn+'/operations/1/tickets/'+i,params=payload, headers=headers,auth=(your_phone, pwd))
print(check_request.status_code)

In the request one needs to indicate headers, at least empty ones. In my case, GET request returns error 406, thus I get that such receipt is found (why GET request returns 406 remains a mystery to me, so I will be glad to receive some clues in comments). If not indicating sum or date, GET request returns error 400 – bad request.

Let’s move on to the most interesting part, receiving data of the receipt:

request_info=requests.get('https://proverkacheka.nalog.ru:9999/v1/inns/*/kkts/*/fss/'+fn+'/tickets/'+i+'?fiscalSign='+fp+'&sendToEmail=no',headers=headers,auth=(your_phone, pwd))
print(request_info.status_code)
products=request_info.json()

We should receive code 200 (successful execution of the request), and in the variable products – everything, that applies to our receipt.

In order to further work with this data, let’s use pandas and transform everything in dataframe.

import pandas as pd
from datetime import datetime
my_products=pd.DataFrame(products['document']['receipt']['items'])
my_products['price']=my_products['price']/100
my_products['sum']=my_products['sum']/100
datetime_check = datetime.strptime(t, '%Y%m%dT%H%M') #((https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior formate the date))
my_products['date']=datetime_check
my_products.set_index('date',inplace=True)

Now we have working pandas.dataframe with receipts, visually it looks as follows:

“Header” of receipt data

You can construct a bar chart of purchases or observe everything as a box plot:

import matplotlib.pyplot as plt
%matplotlib inline
my_products['sum'].plot(kind='hist', bins=20)
plt.show()
my_products['sum'].plot(kind='box')
plt.show()
boxplot_cheques.png

In conclusion, we will simply get descriptive statistics as text, using a command .describe():

my_products.describe()

It’s convenient to write down data as .csv file, so that the next time you can amend the statistics:

with open('hyper_receipts.csv', 'a') as f:
             my_products.to_csv(f, header=True)
Earlier Ctrl + ↓