9 posts tagged


QR code recognition for sales receipts with Skimage

When we want to scan a QR code the image quality matters, and oftentimes the image may look blurred and defocused. To address this problem and suppress unwanted distortions we can use image pre-processing. In today’s article we will discover how to improve QR code recognition with the help of scikit-image library.

from matplotlib import pyplot as plt
import skimage
from skimage import util, exposure, io, measure, feature
from scipy import ndimage as ndi
import numpy as np
import cv2


Let’s try to scan this till receipt from our preceding article Collecting data from hypermarket receipts on Python. Use the  imread() function to read our image and then display it.

img = plt.imread('чек.jpg')

It seems hardly possible to read any letter from this blurred image. Let’s try to do this again with a predefined function from the  opencv library:

def qr_reader(img):
    detector = cv2.QRCodeDetector()
    data, bbox, _ = detector.detectAndDecode(img)
    if data:
        print('Ooops! Nothing here...')

Scan our image once again:

Ooops! Nothing here...

That’s not surprising, the abundance of pixels makes it difficult for the scanner to recognize the QR code. Nevertheless, we can simplify the task by specifying the edges of the QR image.


First, let us remove all the unnecessary pixels, find the coordinates of the QR code and pass it to the qr_reader function. First off, remove noise in the image using the median filter and convert our RGB image to grayscale, as QR-codes are composed of only two colors.

image = ndi.median_filter(util.img_as_float(img), size=9)
image = skimage.color.rgb2gray(image)
plt.imshow(image, cmap='gray')

The median filter blurred our image, and the scattered pixels have become less clear, while the QR code looks much better now. Apply the  adjust_gamma function to our image. This function exponentiates the gamma value of each pixel, less gamma means that the pixel will be closer to white color. We will set gamma to 0.5.

pores_gamma = exposure.adjust_gamma(image, gamma=0.5)
plt.imshow(pores_gamma, cmap='gray')

We can see clear improvements, the QR code is now much distinct than previously. Let’s take advantage of it and set all pixels with a value of less than 0.3 to 0, while others to 1.

thresholded = (pores_gamma <= 0.3)
plt.imshow(thresholded, cmap='gray')

Now, let’s apply the Canny filter to our thresholded image. This filter smoothes the image and calculate the gradients, the edges are where the gradient at maximum. With the increasing sigma parameter, the canny filter stops discerning less clear edges.

edge = feature.canny(thresholded, sigma=6)

Outline the QR code with the coordinates of the edges. We can calculate them with the find_contours method and draw them atop the image. Coordinates are stored in the contours array.

contours = measure.find_contours(edge, 0.5)
for contour in contours:
    plt.plot(contour[:,1], contour[:,0], linewidth=2)

We will take minimum and maximum coordinates for X and Y axes, thus drawing a visible rectangle.

positions = np.concatenate(contours, axis=0)
min_pos_x = int(min(positions[:,1]))
max_pos_x = int(max(positions[:,1]))
min_pos_y = int(min(positions[:,0]))
max_pos_y = int(max(positions[:,0]))

Having the coordinates, we can ensquare the code area:

start = (min_pos_x, min_pos_y)
end = (max_pos_x, max_pos_y)
cv2.rectangle(img, start, end, (255, 0, 0), 5)

Let’s try to cut this area according to our coordinates:

new_img = img[min_pos_y:max_pos_y, min_pos_x:max_pos_x]

Pass the new image to the qr_reader function:


And it returns this:


That’s exactly what we need! Of course, the script is not universal and every image is unique, some may have too much noise or low contrast, while others may not. The sequence of actions depends on the case. Next time, we will show the subsequent stage of image processing using a well-established python library.

 No comments    5   16 h   data analytics   python   skimage

Predicting category of products by name from Russian Food Stores

This article is a continuation of our series about analyzing data on consumer products: «Collecting data from hypermarket receipts on Python» and «Parsing the data of site’s catalog, using Beautiful Soup and Selenium». We are going to build a model that would classify products by name in a till receipt. Till receipts contain data for each product bought, but it doesn’t provide us a summary of how much were spent on Sweets or Dairy Foods in total.

Data Wrangling

Load data from our .csv file to a Pandas DataFrame and see how it looks:

Did you know that we can emulate human behavior to parse data from a web-catalog? More details about it are in this article: «Parsing the data of site’s catalog, using Beautiful Soup and Selenium»

import pandas as pd
sku = pd.read_csv('SKU_igoods.csv',sep=';')

As you can see, the DataFrame contains even more than we need for predicting the category of products by name. So we can drop() columns with prices and weights, and rename() the remaining ones:

sku.drop(columns=['Unnamed: 0', 'Weight','Price'],inplace=True)
sku.rename(columns={"SKU": "SKU", "Category": "Group"},inplace=True)

Group the products by its category and count them up with the following methods:


We will train our predictive model on this data so that it could identify the product category by name. Since the DataFrame includes product names mainly in Russian, the model won’t make predictions properly. The Russian language contains a lot of prepositions, conjunctions, and specific speech patterns. We want our model to distinguish that «Мангал с ребрами жесткости» («Brazier with strengthening ribs» ) and «Мангал с 6 шампурами» («Brazier with 6 skewers») belongs to the same category. With this is my we need to clean up all the product names, removing conjunctions, preposition, interjections, particles and keep only word bases with the help of stemming.

A stemmer is a tool that operates on the principle of recognizing “stem” words embedded in other words.

import nltk
from nltk.corpus import stopwords
from pymystem3 import Mystem
from string import punctuation

In our case will be using the pymystem3 library developed by Yandex. Product names in our DataFrame may vary from those ones you could find in supermarkets today. So first, let’s improve the list of stop words that our predictive model will ignore.

mystem = Mystem() 
russian_stopwords = stopwords.words("russian")
russian_stopwords.extend(['лента','ассорт','разм','арт','что', 'это', 'так', 'вот', 'быть', 'как', 'в', '—', 'к', 'на'])

Write a function that would preprocess our data and extract the word base, remove punctuation, numerical signs, and stop words. The following code snippet belongs to one Kaggle kernel.

def preprocess_text(text):
    text = str(text)
    tokens = mystem.lemmatize(text.lower())
    tokens = [token for token in tokens if token not in russian_stopwords\
              and token != " " \
              and len(token)>=3 \
              and token.strip() not in punctuation \
              and token.isdigit()==False]
    text = " ".join(tokens)
    return text

See how it works:

An extract from Borodino (Russian: Бородино), a poem by Russian poet Mikhail Lermontov which describes the Battle of Borodino.

preprocess_text("Мой дядя самых честных правил, Когда не в шутку занемог, Он уважать себя заставил И лучше выдумать не мог.")

Transformed into:

'дядя самый честный правило шутка занемогать уважать заставлять выдумывать мочь'

Everything works as expected – the result includes only word stems in lower case with no punctuation, prepositions or conjunctions. Let’s apply this function to a product name from our DataFrame:

print(‘Before:’, sku['SKU'][0])
print(‘After:’, preprocess_text(sku['SKU'][0]))

Preprocessed text:

Before: Фисташки соленые жареные ТМ 365 дней
After: фисташка соленый жареный день

The function works fine and now we can apply it to the whole column, and create a new one with processed names:


Building our Predictive Model

We will be using CountVectorizer to predict the product category, and Naive Bayes Classifier.
CountVectorizer will tokenize our text and build a vocabulary of known words, while Naive Bayes Classifier allows us to train our model on a DataFrame with multiple classes. We will also need TfidfTransformer for computing words count (term frequency). As we want to chain these steps, let’s import the Pipeline library:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from imblearn.pipeline import Pipeline

Separate our targets, Y (categories) from the predictors, X (processed product names). And split the DataFrame into Test and Training sets, allocating 33% of samples for testing.

x = sku.processed
y = sku.Group
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33)

Add the following methods to our pipeline:

  • CountVectorizer() – returns a matrix of token counts
  • TfidfTransformer() – transforms a matrix into a normalized tf-idf representation
  • MultinomialNB() – an algorithm for predicting product category
text_clf = Pipeline([('vect', CountVectorizer(ngram_range=(1,2))),
                     ('tfidf', TfidfTransformer()), 
                    ('clf', MultinomialNB())])

Fit our model to the Training Dataset and make predictions for the Test Dataset:

text_clf = text_clf.fit(X_train, y_train)
y_pred = text_clf.predict(X_test)

Evaluate our predictive model:

print('Score:', text_clf.score(X_test, y_test))

The model predicts correctly 90% of the time.

Score: 0.923949864498645

Validate our model with the real-world data

Let’s test how good our model performs on real-world data. We’ll refer to the DataFrame from our previous article, «Collecting data from hypermarket receipts on Python», and preprocess the product names:


Pass the processed text to the model and create a new column that would hold our predictions:

prediction = text_clf.predict(my_products['processed'])
my_products[['name', 'prediction']]

Now, the DataFrame looks the following:

Calculate the spendings for each product category:


Overall, the model seems to be robust in predicting that sausages fall under meat products, quark is a dairy product, baguette belongs to bread and pastries. But sometimes it misclassifies kiwi as a dairy product and pear as an eco-product. This is probably because these categories include many products are «with the taste of pear» or «with the taste of kiwi», and the algorithm makes predictions based on the prevailing group of products. This is a well-known issue of unbalanced classes, but it can be addressed by resampling the DataSet or choosing proper weights for our model.

 No comments    66   4 d   data analytics   machine learning   python

Beautiful Bar Charts with Python and Matplotlib

The Matplotlib library provides a wide range of tools for Data Visualisation, allowing us to create compelling, expressive visualizations. But why then so many plots look so bland and boring? Back in 2011 we built a simple yet decent diagram for a telecommunication company report and named it ‘Thermometer’. Later this type of bars was exposed to a wide audience on  Chandoo, which was a popular blog on Excel. By the way, here’s what it looks like:

Times change, and today we’ll recall the way to plot this type of diagrams with the help of Matplotlib

When should one use this type of diagram?
The best way to plot this type of diagrams is when comparing target values with actual values because it reflects underfulfilment and overfulfilment of planned targets. A diagram may reflect data in percentages as well as in real figures. Let’s view an example using the latter.

We’ll use data stored in an excel file and already familiar python libraries for data analysis:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Read our file as a DataFrame:

df = pd.read_excel('data.xlsx')

That’s what it looks like:

Now, we need to extract columns from the table. The first column called «Sales» will be displayed under each bar. Some values may be of a string type if there is a comma between two values. We need to convert these type of values by replacing a comma with a dot and converting them to floats.

xticks = df.iloc[:,0]
    bars2 = df.iloc[:,1].str.replace(',','.').astype('float')
except AttributeError:
    bars2 = df.iloc[:,1].astype('float')
    bars1 = df.iloc[:,2].str.replace(',','.').astype('float')
except AttributeError:
    bars1 = df.iloc[:,2].astype('float')

As we don’t know for sure if the table includes such values, our actions may cause an  AttributeError . Fortunatelly for us, Python has a built-in try – except
method for handling such errors.

Let’s plot a simple side-by-side bar graph, setting a distance between two related values using a NumPy array:

barWidth = 0.2
r1 = np.arange(len(bars1))
r2 = [x + barWidth for x in r1]
plt.bar(r1, bars1, width=barWidth)
plt.bar(r2, bars2, width=barWidth)

And see what happens:

Obviously, this is not what we expected. Let’s try to set a different bar width to make bars overlapping each other.

barWidth1 = 0.065
barWidth2 = 0.032
x_range = np.arange(len(bars1) / 8, step=0.125)

We can plot the bars and set its coordinates, color, width, legend and signatures in advance:

plt.bar(x_range, bars1, color='#dce6f2', width=barWidth1/2, edgecolor='#c3d5e8', label='Target')
plt.bar(x_range, bars2, color='#ffc001', width=barWidth2/2, edgecolor='#c3d5e8', label='Actual Value')
for i, bar in enumerate(bars2):
    plt.text(i / 8 - 0.015, bar + 1, bar, fontsize=14)

Add some final touches – remove the frames, ticks, add a grey line under the bars, adjust font size and layout, make a plot a bit wider and save it as a .png file.

plt.xticks(x_range, xticks)
plt.rcParams['figure.figsize'] = [25, 7]
plt.axhline(y=0, color='gray')
plt.legend(frameon=False, loc='lower center', bbox_to_anchor=(0.25, -0.3, 0.5, 0.5), prop={'size':20})
plt.savefig('plt', bbox_inches = "tight")

And here’s the final result:

 No comments    204   8 d   data analytics   matplotlib   python   visualisation

Working with Materialized Views in Clickhouse

This time we’ll illustrate how you can pass data on Facebook ad campaigns to Clickhouse tables with Python and implement Materialized Views. What is materialized views, you may ask. Oftentimes Clickhouse is used to handle large amounts of data and the time spent waiting for a response from a table with raw data is constantly increasing. Usually, we would use ETL-process to address this task efficiently or create aggregate tables, which are not that useful because we have to regularly update them. Clickhouse system offers a new way to meet the challenge using materialized views.
Materialized Views allow us to store and update data on a hard drive in line with the SELECT query that was used to get a view. When we need to insert data into a table, the SELECT method transforms our data and populates a materialized view.

Setting Up Amazon EC2 instance
We need to connect our Python script that we created in this article to Cickhouse. The script will make queries, so let’s open several ports. In your AWS Dashboard go to Network & Security — Security Groups. Our instance belongs to the launch-wizard-1 group. Сlick it and pay attention to the Inbound rules, you need to set them as shown in this screenshot:

Setting up Clickhouse
It’s time to set up Clickhouse. Let’s edit the config.xml file using nano text editor:

cd /etc/clickhouse-server
sudo nano config.xml

Learn more about the shortcuts here if you didn’t get how to exit nano too :)

Uncomment this line:


to access your database from any IP-address:

Create a table and its materialized view
Open a terminal window to create our database with tables:

USE db1

We’ll refer to the same example of data collection from Facebook. The data on Ad Campaigns may often change and be updated, with this in mind we want to create a materialized view that would automatically update aggregate tables containing the costs data. Our Clickhouse table will look almost the same as the DataFrame used in the previous post. We picked ReplacingMergeTree as an engine for our table, it will remove duplicates by sorting key:

CREATE TABLE facebook_insights(
	campaign_id UInt64,
	clicks UInt32,
	spend Float32,
	impressions UInt32,
	date_start Date,
	date_stop	 Date,
	sign Int8
) ENGINE = ReplacingMergeTree
ORDER BY (date_start, date_stop)

And then, create a materialized view:

ENGINE = SummingMergeTree()
ORDER BY date_start
	SELECT campaign_id,
		      sum(spend * sign) as spent,
		      sum(impressions * sign) as impressions,
		      sum(clicks * sign) as clicks
	FROM facebook_insights
	GROUP BY date_start, campaign_id

More details are available in the Clickhouse blog.

Unfortunately for us, Clikhouse system doesn’t include a familiar UPDATE method. So we need to find a workaround. Thanks to the Yandex team, these guys offered to insert rows with a negative sign first, and then use sign for reversing. According to this principle, the old data will be ignored when summing.

Let’s start writing the script and import a new library, which is called clickhouse_driver. It allows to make queries to Clickhouse in Python:

We are using the updated version of the script from “Collecting Data on Facebook Ad Campaigns”. But it will work fine if you just combine this code with the previous one.

from datetime import datetime, timedelta
from clickhouse_driver import Client
from clickhouse_driver import errors

An object of the Clientclass enables us to make queries with the execute() method. Type in your public DNS in the host field, port – 9000, specify default as a user, and a databasefor the connection.

client = Client(host='ec1-2-34-56-78.us-east-2.compute.amazonaws.com', user='default', password=' ', port='9000', database='db1')

To ensure that everything works as expected, we need to write the following query that will print out names of all databases stored on the server:

client.execute('SHOW DATABASES')

In case of success the query will return this list:

[('_temporary_and_external_tables',), ('db1',), ('default',), ('system',)]

For example, we want to get data for the past three days. Create several datetime objects with the datetime library and convert them to strings using thestrftime()

date_start = datetime.now() - timedelta(days=3)
date_end = datetime.now() - timedelta(days=1)
date_start_str = date_start.strftime("%Y-%m-%d")
date_end_str = date_end.strftime("%Y-%m-%d")

This query returns all table columns for a certain period:

SQL_select = f"select campaign_id, clicks, spend, impressions, date_start, date_stop, sign from facebook_insights where date_start > '{date_start_str}' AND date_start < '{date_end_str}'"

Make a query and pass the data to the old_data_list. And then, replace their signfor -1 and append elements to the new_data_list:

new_data_list = []
old_data_list = []
old_data_list = client.execute(SQL_select)

for elem in old_data_list:
    elem = list(elem)
    elem[len(elem) - 1] = -1

Finally, write our algorithm: insert the data with the sign =-1, optimize it with ReplacingMergeTree, remove duplicates, and INSERT new data with the sign = 1.

SQL_query = 'INSERT INTO facebook_insights VALUES'
client.execute(SQL_query, new_data_list)
SQL_optimize = "OPTIMIZE TABLE facebook_insights"
for i in range(len(insight_campaign_id_list)):
    client.execute(SQL_query, [[insight_campaign_id_list[i],
                                datetime.strptime(insight_date_start_list[i], '%Y-%m-%d').date(),
                                datetime.strptime(insight_date_start_list[i], '%Y-%m-%d').date(),

Get back to Clickhouse and make the next query to view the first 20 rows:
SELECT * FROM facebook_insights LIMIT 20

And  SELECT * FROM fb_aggregated LIMIT 20 to compare our materialized view:

Nice work! Now we have a materialized view that will be updated each time when the data in the facebook_insights table changes. The trick with the sign operator allows to differ already processed data and prevent its summation, while ReplacingMergeTree engine helps us to remove duplicates.

Collecting Data on Facebook Ad Campaigns

Let’s find out how to obtain data on costs, ad clicks, and impressions on Facebook.

from facebook_business.api import FacebookAdsApi
from facebook_business.exceptions import FacebookRequestError
from facebook_business.adobjects.adaccount import AdAccount
from facebook_business.adobjects.adreportrun import AdReportRun
from facebook_business.adobjects.adsinsights import AdsInsights
from facebook_business.adobjects.campaign import Campaign
from facebook_business.adobjects.adset import AdSet
from facebook_business.adobjects.adaccountuser import AdAccountUser as AdUser
from facebook_business import adobjects
from matplotlib import pyplot as plt
from pandas import DataFrame
import time

Getting the API Key
First thing we need to do before working with Facebook API is to create a Facebook App. Go to https://developers.facebook.com — My Apps — Create App. Type in your app name, your email, and click «Create App ID».

After all the steps have done right you will see the App Dashboard. Choose Settings — Basic, copy your App ID and your App Secret. We will need this data for authentication process later.

Now click Tools — Graph API Explorer and we will find ourselves in the token creation menu.

These tokens can be generated for different needs, we need to set access rights for our token. We’ll need ads_management – this right allows to get information about ad campaigns for your Facebook account. Add it and click «Generate Access Token».

Be careful – the user access token provides only a temporary access and expires after 1-2 hours. If you want to get a long-lived token click the blue button to open Access Token Info -- Open in Access Token Tool. You will see a new page with the «Extend Access Token» button, click it and a new long-lived Access Token will be generated for a period of 60 days.

Writing the Script

Now we need to create 3 variables, assign them to our Access Token, App ID and App Secret respectively. Log in using the init() method of FacebooksAdsApi Class and add a user. The get_ad_accounts() method returns data on all our ad accounts as a dictionary. We can get the same data using the get_campaigns() method instead.

my_access_token = ''
my_app_id = ''
my_app_secret = ''
FacebookAdsApi.init(my_app_id, my_app_secret, my_access_token)

me = AdUser(fbid='me')
my_accounts = list(me.get_ad_accounts())

my_account = my_accounts[0]
campaigns = my_account.get_campaigns()

Let’s retrieve the amount spent using my_account. We’ll use the api_get() method and pass AdAccount.Field.amount_spent in the fields parameter. Now, to receive the data we need some math:
divide my_account by 100 to see a whole number. The expenses will be displayed in your account currency, in our case it’s RUB. The main purpose of our actions – obtaining data on the costs of ad campaigns for further analysis:


The next step will be creating two functions. The first one will send asynchronous queries to Facebook and return results. The second function forms these queries and passes them in the first function. As a result, we will get a list of dictionaries.

count = 0

def wait_for_async_job(async_job):
    global count
    async_job = async_job.api_get()
    while async_job[AdReportRun.Field.async_status] != 'Job Completed' or async_job[
        AdReportRun.Field.async_percent_completion] < 100:
        async_job = async_job.api_get()
        print("Job " + str(count) + " completed")
        count += 1
    return async_job.get_result(params={"limit": 1000})

def get_insights(account, date_preset='last_3d'):
    account = AdAccount(account["id"])
    i_async_job = account.get_insights(
            'level': 'ad',
            'date_preset': date_preset,
            'time_increment': 1},
    results = [dict(item) for item in wait_for_async_job(i_async_job)]
    return results

The following step is to get data on costs. We need the data for all time, therefore create the variable date_preset, set its value to lifetime. Call the get_insights() function for every account, and assign the output to insights_lists.
Create a DataFrame and extract the relevant data from the insights_lists, that’s campaign’s id, number of clicks, costs and impressions.

elem_insights = []
insights_lists = []
date_preset = 'last_year'
for elem in my_accounts:
            elem_insights = get_insights(elem, date_preset)

insight_campaign_id_list = []
insight_clicks_list = []
insight_spend_list = []
insight_impressions_list = []
insight_date_start_list = []
insight_date_stop_list = []
for elem1 in insights_lists:
    for elem2 in elem1:

df = DataFrame()
df['campaign_id'] = insight_campaign_id_list
df['clicks'] = insight_clicks_list
df['spend'] = insight_spend_list
df['impressions'] = insight_impressions_list
df['date_start'] = insight_date_start_list
df['date_stop'] = insight_date_stop_list

We’ll get the following DataFrame:

Let’s summarize this data – group it by campaigns and calculate the sum for each group. Pandas have build-in group() and sum() methods for it, just keep in mind which column to group by.


Now, plot two graphs  – by the number of impressions and clicks relative to the dates. We’ll use the rcParams attribute to set parameters for the graphs.

plt.rcParams['figure.figsize'] = [20, 5]
plt.plot(df.date_start.str.replace('2019-', ''), df.clicks)
plt.rcParams['figure.figsize'] = [20, 5]
plt.plot(df.date_start.str.replace('2019-', ''), df.impressions)

That’s it! Now we have the data on costs, ad impressions and clicks. In the next article, we will explain how to upload it in JSON format and pass it to Redash for further analysis and visualization.

 No comments    614   24 d   data analytics   facebook   python
Earlier Ctrl + ↓