Later Ctrl + ↑

QR code recognition for sales receipts with Skimage

When we want to scan a QR code the image quality matters, and oftentimes the image may look blurred and defocused. To address this problem and suppress unwanted distortions we can use image pre-processing. In today’s article we will discover how to improve QR code recognition with the help of scikit-image library.

from matplotlib import pyplot as plt
import skimage
from skimage import util, exposure, io, measure, feature
from scipy import ndimage as ndi
import numpy as np
import cv2

Subject

Let’s try to scan this till receipt from our preceding article Collecting data from hypermarket receipts on Python. Use the  imread() function to read our image and then display it.

img = plt.imread('чек.jpg')
plt.imshow(img)

It seems hardly possible to read any letter from this blurred image. Let’s try to do this again with a predefined function from the  opencv library:

def qr_reader(img):
    detector = cv2.QRCodeDetector()
    data, bbox, _ = detector.detectAndDecode(img)
    if data:
        print(data)
    else:
        print('Ooops! Nothing here...')

Scan our image once again:

qr_reader(img)
Ooops! Nothing here...

That’s not surprising, the abundance of pixels makes it difficult for the scanner to recognize the QR code. Nevertheless, we can simplify the task by specifying the edges of the QR image.

Solution

First, let us remove all the unnecessary pixels, find the coordinates of the QR code and pass it to the qr_reader function. First off, remove noise in the image using the median filter and convert our RGB image to grayscale, as QR-codes are composed of only two colors.

image = ndi.median_filter(util.img_as_float(img), size=9)
image = skimage.color.rgb2gray(image)
plt.imshow(image, cmap='gray')

The median filter blurred our image, and the scattered pixels have become less clear, while the QR code looks much better now. Apply the  adjust_gamma function to our image. This function exponentiates the gamma value of each pixel, less gamma means that the pixel will be closer to white color. We will set gamma to 0.5.

pores_gamma = exposure.adjust_gamma(image, gamma=0.5)
plt.imshow(pores_gamma, cmap='gray')

We can see clear improvements, the QR code is now much distinct than previously. Let’s take advantage of it and set all pixels with a value of less than 0.3 to 0, while others to 1.

thresholded = (pores_gamma <= 0.3)
plt.imshow(thresholded, cmap='gray')

Now, let’s apply the Canny filter to our thresholded image. This filter smoothes the image and calculate the gradients, the edges are where the gradient at maximum. With the increasing sigma parameter, the canny filter stops discerning less clear edges.

edge = feature.canny(thresholded, sigma=6)
plt.imshow(edge)

Outline the QR code with the coordinates of the edges. We can calculate them with the find_contours method and draw them atop the image. Coordinates are stored in the contours array.

contours = measure.find_contours(edge, 0.5)
plt.imshow(edge)
for contour in contours:
    plt.plot(contour[:,1], contour[:,0], linewidth=2)

We will take minimum and maximum coordinates for X and Y axes, thus drawing a visible rectangle.

positions = np.concatenate(contours, axis=0)
min_pos_x = int(min(positions[:,1]))
max_pos_x = int(max(positions[:,1]))
min_pos_y = int(min(positions[:,0]))
max_pos_y = int(max(positions[:,0]))

Having the coordinates, we can ensquare the code area:

start = (min_pos_x, min_pos_y)
end = (max_pos_x, max_pos_y)
cv2.rectangle(img, start, end, (255, 0, 0), 5)
io.imshow(img)

Let’s try to cut this area according to our coordinates:

new_img = img[min_pos_y:max_pos_y, min_pos_x:max_pos_x]
plt.imshow(new_img)

Pass the new image to the qr_reader function:

qr_reader(new_img)

And it returns this:

t=20190320T2303&s=5803.00&fn=9251440300007971&i=141637&fp=4087570038&n=1

That’s exactly what we need! Of course, the script is not universal and every image is unique, some may have too much noise or low contrast, while others may not. The sequence of actions depends on the case. Next time, we will show the subsequent stage of image processing using a well-established python library.

 No comments    79   1 mon   data analytics   python   skimage
 No comments    59   1 mon   Data engineering   sql

Predicting category of products by name from Russian Food Stores

This article is a continuation of our series about analyzing data on consumer products: «Collecting data from hypermarket receipts on Python» and «Parsing the data of site’s catalog, using Beautiful Soup and Selenium». We are going to build a model that would classify products by name in a till receipt. Till receipts contain data for each product bought, but it doesn’t provide us a summary of how much were spent on Sweets or Dairy Foods in total.

Data Wrangling

Load data from our .csv file to a Pandas DataFrame and see how it looks:

Did you know that we can emulate human behavior to parse data from a web-catalog? More details about it are in this article: «Parsing the data of site’s catalog, using Beautiful Soup and Selenium»

import pandas as pd
sku = pd.read_csv('SKU_igoods.csv',sep=';')
sku.head()

As you can see, the DataFrame contains even more than we need for predicting the category of products by name. So we can drop() columns with prices and weights, and rename() the remaining ones:

sku.drop(columns=['Unnamed: 0', 'Weight','Price'],inplace=True)
sku.rename(columns={"SKU": "SKU", "Category": "Group"},inplace=True)
sku.head()

Group the products by its category and count them up with the following methods:

sku.groupby('Group').agg(['count'])

We will train our predictive model on this data so that it could identify the product category by name. Since the DataFrame includes product names mainly in Russian, the model won’t make predictions properly. The Russian language contains a lot of prepositions, conjunctions, and specific speech patterns. We want our model to distinguish that «Мангал с ребрами жесткости» («Brazier with strengthening ribs» ) and «Мангал с 6 шампурами» («Brazier with 6 skewers») belongs to the same category. With this is my we need to clean up all the product names, removing conjunctions, preposition, interjections, particles and keep only word bases with the help of stemming.

A stemmer is a tool that operates on the principle of recognizing “stem” words embedded in other words.

import nltk
from nltk.corpus import stopwords
from pymystem3 import Mystem
from string import punctuation
nltk.download('stopwords')

In our case will be using the pymystem3 library developed by Yandex. Product names in our DataFrame may vary from those ones you could find in supermarkets today. So first, let’s improve the list of stop words that our predictive model will ignore.

mystem = Mystem() 
russian_stopwords = stopwords.words("russian")
russian_stopwords.extend(['лента','ассорт','разм','арт','что', 'это', 'так', 'вот', 'быть', 'как', 'в', '—', 'к', 'на'])

Write a function that would preprocess our data and extract the word base, remove punctuation, numerical signs, and stop words. The following code snippet belongs to one Kaggle kernel.

def preprocess_text(text):
    text = str(text)
    tokens = mystem.lemmatize(text.lower())
    tokens = [token for token in tokens if token not in russian_stopwords\
              and token != " " \
              and len(token)>=3 \
              and token.strip() not in punctuation \
              and token.isdigit()==False]
    text = " ".join(tokens)
    return text

See how it works:

An extract from Borodino (Russian: Бородино), a poem by Russian poet Mikhail Lermontov which describes the Battle of Borodino.

preprocess_text("Мой дядя самых честных правил, Когда не в шутку занемог, Он уважать себя заставил И лучше выдумать не мог.")

Transformed into:

'дядя самый честный правило шутка занемогать уважать заставлять выдумывать мочь'

Everything works as expected – the result includes only word stems in lower case with no punctuation, prepositions or conjunctions. Let’s apply this function to a product name from our DataFrame:

print(‘Before:’, sku['SKU'][0])
print(‘After:’, preprocess_text(sku['SKU'][0]))

Preprocessed text:

Before: Фисташки соленые жареные ТМ 365 дней
After: фисташка соленый жареный день

The function works fine and now we can apply it to the whole column, and create a new one with processed names:

sku['processed']=sku['SKU'].apply(preprocess_text)
sku.head()

Building our Predictive Model

We will be using CountVectorizer to predict the product category, and Naive Bayes Classifier.
CountVectorizer will tokenize our text and build a vocabulary of known words, while Naive Bayes Classifier allows us to train our model on a DataFrame with multiple classes. We will also need TfidfTransformer for computing words count (term frequency). As we want to chain these steps, let’s import the Pipeline library:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from imblearn.pipeline import Pipeline

Separate our targets, Y (categories) from the predictors, X (processed product names). And split the DataFrame into Test and Training sets, allocating 33% of samples for testing.

x = sku.processed
y = sku.Group
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33)

Add the following methods to our pipeline:

  • CountVectorizer() – returns a matrix of token counts
  • TfidfTransformer() – transforms a matrix into a normalized tf-idf representation
  • MultinomialNB() – an algorithm for predicting product category
text_clf = Pipeline([('vect', CountVectorizer(ngram_range=(1,2))),
                     ('tfidf', TfidfTransformer()), 
                    ('clf', MultinomialNB())])

Fit our model to the Training Dataset and make predictions for the Test Dataset:

text_clf = text_clf.fit(X_train, y_train)
y_pred = text_clf.predict(X_test)

Evaluate our predictive model:

print('Score:', text_clf.score(X_test, y_test))

The model predicts correctly 90% of the time.

Score: 0.923949864498645

Validate our model with the real-world data

Let’s test how good our model performs on real-world data. We’ll refer to the DataFrame from our previous article, «Collecting data from hypermarket receipts on Python», and preprocess the product names:

my_products['processed']=my_products['name'].apply(preprocess_text)
my_products.head()

Pass the processed text to the model and create a new column that would hold our predictions:

prediction = text_clf.predict(my_products['processed'])
my_products['prediction']=prediction
my_products[['name', 'prediction']]

Now, the DataFrame looks the following:

Calculate the spendings for each product category:

my_products.groupby('prediction').sum()

Overall, the model seems to be robust in predicting that sausages fall under meat products, quark is a dairy product, baguette belongs to bread and pastries. But sometimes it misclassifies kiwi as a dairy product and pear as an eco-product. This is probably because these categories include many products are «with the taste of pear» or «with the taste of kiwi», and the algorithm makes predictions based on the prevailing group of products. This is a well-known issue of unbalanced classes, but it can be addressed by resampling the DataSet or choosing proper weights for our model.

Beautiful Bar Charts with Python and Matplotlib

The Matplotlib library provides a wide range of tools for Data Visualisation, allowing us to create compelling, expressive visualizations. But why then so many plots look so bland and boring? Back in 2011 we built a simple yet decent diagram for a telecommunication company report and named it ‘Thermometer’. Later this type of bars was exposed to a wide audience on  Chandoo, which was a popular blog on Excel. By the way, here’s what it looks like:

Times change, and today we’ll recall the way to plot this type of diagrams with the help of Matplotlib

When should one use this type of diagram?
The best way to plot this type of diagrams is when comparing target values with actual values because it reflects underfulfilment and overfulfilment of planned targets. A diagram may reflect data in percentages as well as in real figures. Let’s view an example using the latter.

We’ll use data stored in an excel file and already familiar python libraries for data analysis:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Read our file as a DataFrame:

df = pd.read_excel('data.xlsx')

That’s what it looks like:

Now, we need to extract columns from the table. The first column called «Sales» will be displayed under each bar. Some values may be of a string type if there is a comma between two values. We need to convert these type of values by replacing a comma with a dot and converting them to floats.

xticks = df.iloc[:,0]
try:
    bars2 = df.iloc[:,1].str.replace(',','.').astype('float')
except AttributeError:
    bars2 = df.iloc[:,1].astype('float')
try:
    bars1 = df.iloc[:,2].str.replace(',','.').astype('float')
except AttributeError:
    bars1 = df.iloc[:,2].astype('float')

As we don’t know for sure if the table includes such values, our actions may cause an  AttributeError . Fortunatelly for us, Python has a built-in try – except
method for handling such errors.

Let’s plot a simple side-by-side bar graph, setting a distance between two related values using a NumPy array:

barWidth = 0.2
r1 = np.arange(len(bars1))
r2 = [x + barWidth for x in r1]
 
plt.bar(r1, bars1, width=barWidth)
plt.bar(r2, bars2, width=barWidth)

And see what happens:

Obviously, this is not what we expected. Let’s try to set a different bar width to make bars overlapping each other.

barWidth1 = 0.065
barWidth2 = 0.032
x_range = np.arange(len(bars1) / 8, step=0.125)

We can plot the bars and set its coordinates, color, width, legend and signatures in advance:

plt.bar(x_range, bars1, color='#dce6f2', width=barWidth1/2, edgecolor='#c3d5e8', label='Target')
plt.bar(x_range, bars2, color='#ffc001', width=barWidth2/2, edgecolor='#c3d5e8', label='Actual Value')
for i, bar in enumerate(bars2):
    plt.text(i / 8 - 0.015, bar + 1, bar, fontsize=14)

Add some final touches – remove the frames, ticks, add a grey line under the bars, adjust font size and layout, make a plot a bit wider and save it as a .png file.

plt.xticks(x_range, xticks)
plt.tick_params(
    bottom=False,
    left=False,
    labelsize=15
)
plt.rcParams['figure.figsize'] = [25, 7]
plt.axhline(y=0, color='gray')
plt.legend(frameon=False, loc='lower center', bbox_to_anchor=(0.25, -0.3, 0.5, 0.5), prop={'size':20})
plt.box(False)
plt.savefig('plt', bbox_inches = "tight")
plt.show()

And here’s the final result:

Collecting Data on Ad Campaigns from VK.com

We have a lot to share in today’s longread: we’ll retrieve data on ad campaigns from Vkontakte (widely popular social network in Russia and CIS countries) and compare them to Google Analytics data in Redash. This time we don’t need to create a server, as our data will be transferred to Google Docs via Google Sheets API.

Getting an Access Token
We need to create an app to receive our access token. Follow this link https://vk.com/apps?act=manage and click “Create app” on the developer’s page. Choose a name for your app and check it as a “Standalone app”. Then, click Settings in the left menu and save your app ID.

More details on access tokens can be found here: Getting an access token

Copy this link:

https://oauth.vk.com/authorize?client_id=YourClientID&scope=ads&response_type=token

And change YourClientID to your app ID, this will allow you to get information about your advertising account. Open this link in your browser and you will be redirected to another page, which URL address holds your generated access token.

Access token expires in 86400 seconds or 24 hours. If you want to generate a token with an unlimited lifetime period, just pass scope to the offline parameter. In case if you need to generate a new token – change your password account or terminate all active sessions in security settings.

You will also need your advertising account ID to make API requests. It can be found via this link, just copy it:  https://vk.com/ads?act=settings

Using APIs to collect data
Let’s write a script that would allow us to retrieve information on all user’s ad campaigns: number of impressions, сlicks and costs. The script will pass this data to a DataFrame and send it to Google Docs.

from oauth2client.service_account import ServiceAccountCredentials
from pandas import DataFrame
import requests
import gspread
import time

We have several constant variables: access token, advertising account ID and Vkontakte API Version. Here we are using the most recent API version, which is 5.103.

token = 'fa258683fd418fafcab1fb1d41da4ec6cc62f60e152a63140c130a730829b1e0bc'
version = 5.103
id_rk = 123456789

To get advertising stats you need to use the  ads.getStatistics method and pass your ad campaign ID to it. Since we don’t run any advertisements yet, we’ll use the  ads.getAds method that returns IDs of ads and campaigns.

Learn more about the API methods available for Vkontakte here

Use the requests library to send a request and convert the response to JSON.


campaign_ids = []
ads_ids = []
r = requests.get('https://api.vk.com/method/ads.getAds', params={
    'access_token': token,
    'v': version,
    'account_id': id_rk
})
data = r.json()['response']

We have a familiar list of dictionaries returned, similar to the one we have reviewed in the previous article, “Analysing data on Facebook Ad Campaigns with Redash”.

Fill in the ad_campaign_dict dictionary as follows: specify ad ID as a key, and campaign ID as a value, where this ad belongs to.

ad_campaign_dict = {}
for i in range(len(data)):
    ad_campaign_dict[data[i]['id']] = data[i]['campaign_id']

Having ID for every ad needed we can invoke the  ads.getStatistics method to collect data on the number of impressions, clicks, costs, and dates for a particular ad, so create several empty lists in advance.

ads_campaign_list = []
ads_id_list = []
ads_impressions_list = []
ads_clicks_list = []
ads_spent_list = []
ads_day_start_list = []
ads_day_end_list = []

We need to invoke the getStatistics method for each ad separately, let’s refer to the ad_campaign_dict and iterate our requests. Retrieve all-time data by calling the ‘period’ method with the  ‘overall’ value. Some ads may not have impression or clicks if they haven’t been launched yet, this may cause a  KeyError. Let’s recall to the try — except approach to handle this error.

for ad_id in ad_campaign_dict:
        r = requests.get('https://api.vk.com/method/ads.getStatistics', params={
            'access_token': token,
            'v': version,
            'account_id': id_rk,
            'ids_type': 'ad',
            'ids': ad_id,
            'period': 'overall',
            'date_from': '0',
            'date_to': '0'
        })
        try:
            data_stats = r.json()['response']
            for i in range(len(data_stats)):
                for j in range(len(data_stats[i]['stats'])):
                    ads_impressions_list.append(data_stats[i]['stats'][j]['impressions'])
                    ads_clicks_list.append(data_stats[i]['stats'][j]['clicks'])
                    ads_spent_list.append(data_stats[i]['stats'][j]['spent'])
                    ads_day_start_list.append(data_stats[i]['stats'][j]['day_from'])
                    ads_day_end_list.append(data_stats[i]['stats'][j]['day_to'])
                    ads_id_list.append(data_stats[i]['id'])
                    ads_campaign_list.append(ad_campaign_dict[ad_id])
        except KeyError:
            continue

Now, create a DataFrame and print out the first 5 data points

df = DataFrame()
df['campaign_id'] = ads_campaign_list
df['ad_id'] = ads_id_list
df['impressions'] = ads_impressions_list
df['clicks'] = ads_clicks_list
df['spent'] = ads_spent_list
df['day_start'] = ads_day_start_list
df['day_end'] = ads_day_end_list
print(df.head())

Exporting Data to Google Docs
We’ll need a Google API access token, navigate to https://console.developers.google.com and create one. Choose any name you like, then go to your Dashboard and click “Enable APIs and Services”. Choose Google Drive API from the list, enable it and do exactly the same for Google Sheets API.

After activation you will be redirected to the API control panel. Click Credentials – Create Credentials, click choose data type and create an account. Choosing a role is optional, just proceed and specify JSON as a key type.

After these steps you can download a JSON file with your credentials, we’ll rename it to «credentials.json». On the main page you’ll find the email field – copy your email address.

Go to https://docs.google.com/spreadsheets and create a new file named data, we’ll pass data from our DataFrame to it. Put the  credentials.json file in one directory with the script and continue coding. Add these links to the scope list:

scope = ['https://spreadsheets.google.com/feeds', 'https://www.googleapis.com/auth/drive']

We will use the  ServiceAccountCredentials.from_json_keyfile_name and  gspread.authorize methods available in the  oauth2client and  gspread libraries for authenticaion process. Specify your file name and the scope variable in the ServiceAccountCredentials.from_json_keyfile_name method. The  sheet variable will allow us to send requests to our file in Google Docs.

creds = ServiceAccountCredentials.from_json_keyfile_name('credentials.json', scope)
client = gspread.authorize(creds)
sheet = client.open('data').sheet1

Apply the update_cell method to enter new value in a table cell. It’s worth mentioning that the indexing starts at 0, not 1. With the first loop we’ll move the column names of our DataFrame. And with the following loops we’ll move the rest of our data points. The default limits allow us to make 100 loops for 100 seconds. These restrictions may cause errors and stop our script, that’s why we need to use time.sleep and make the script sleep for 1 second after each loop.

count_of_rows = len(df)
count_of_columns = len(df.columns)
for i in range(count_of_columns):
    sheet.update_cell(1, i + 1, list(df.columns)[i])
for i in range(1, count_of_rows + 1):
    for j in range(count_of_columns):
        sheet.update_cell(i + 1, j + 1, str(df.iloc[i, j]))
        time.sleep(1)

In case of success, you’ll get the same table:

Exporting data to Redash

See how you can connect Google Analytics to Redash in this article «How to connect Google Analytics to Redash?».

Having a table with Google Analytics and ad campaigns from Vkontakte exported we can compare them by writing the following query:

SELECT
    query_50.day_start,
    CASE WHEN ga_source LIKE '%vk%' THEN 'vk.com' END AS source,
    query_50.spent,
    query_50.impressions,
    query_50.clicks,
    SUM(query_49.ga_sessions) AS sessions,
    SUM(query_49.ga_newUsers) AS users
FROM query_49
JOIN query_50
ON query_49.ga_date = query_50.day_start
WHERE query_49.ga_source LIKE '%vk%' AND DATE(query_49.ga_date) BETWEEN '2020-05-16' AND '2020-05-20'
GROUP BY query_49.ga_date, source

ga_source — the traffic source, from which a user was redirected. Use the  CASE method to combine everything that contains “vk” in one column called «vk.com». With the help of JOIN operator we can add the table with the data on ad campaigns, merging by date. Let’s take the day of the last ad campaign and a couple of days after, this will result in the following output:

Takeaways
Now we have a table that reflects how much were spent in ad costs on a certain day, the number of users who viewed this ad, were engaged and redirected to our website, and then completed the sign-up process.

Earlier Ctrl + ↓