Python and lyrics of Zemfira’s new album: capturing the spirit of her songs

Estimated read time – 15 min

header.jpg

Zemfira’s latest studio album, Borderline, was released in February, 8 years after the previous one. Many people collaborated with her on this album, including her relatives – the riff for the song “Таблетки” was written by her nephew from London. The album turned out to be diverse: for instance, the song “Остин” is dedicated to the main character of the Homescapes game by the Russian studio Playrix (by the way, check out the latest Business Secrets episode with the Bukhman brothers, where the game is also mentioned). Zemfira liked the game so much that she reached out to Playrix to create this song. The song “Крым”, in turn, was written as a soundtrack to a new film by Zemfira’s colleague Renata Litvinova.

Listen to the new album on Apple Music / Яндекс.Музыка / Spotify

Nevertheless, the overall spirit of the album is rather gloomy – the songs keep repeating words like ‘боль’, ‘ад’, ‘бесишь’ and other similarly dark words. We decided to conduct an exploratory analysis of the album and then, using a Word2Vec model and cosine similarity, look at the semantic closeness of the songs and estimate the overall mood of the album.

If you would rather skip the data preparation and analysis steps, you can jump straight to the results.

Data preparation

For starters, we write a data processing script. Its purpose is to assemble a single CSV table from a set of text files, each containing one song. Along the way, we strip all punctuation marks and unnecessary words, since we only care about the meaningful content.

import pandas as pd
import re
import string
import pymorphy2
from nltk.corpus import stopwords

Then we create a morphological analyzer and expand the list of everything that needs to be discarded:

morph = pymorphy2.MorphAnalyzer()
stopwords_list = stopwords.words('russian')
stopwords_list.extend(['куплет', 'это', 'я', 'мы', 'ты', 'припев', 'аутро', 'предприпев', 'lyrics', '1', '2', '3', 'то'])
string.punctuation += '—'

The song files are named in English, so we create a dictionary for translating the names into Russian, as well as a dictionary from which we will later build the table:

result_dict = dict()

songs_dict = {
    'snow':'снег идёт',
    'crimea':'крым',
    'mother':'мама',
    'ostin':'остин',
    'abuse':'абьюз',
    'wait_for_me':'жди меня',
    'tom':'том',
    'come_on':'камон',
    'coat':'пальто',
    'this_summer':'этим летом',
    'ok':'ок',
    'pills':'таблетки'
}

Let’s define several necessary functions. The first one reads the entire song from its file and removes line breaks, the second removes unnecessary characters and words from the text, and the third converts words to their normal form using the pymorphy2 morphological analyzer. pymorphy2 does not always resolve ambiguity well, so the words ‘ад’ and ‘рай’ need additional handling.

def read_song(filename):
    with open(f'{filename}.txt', 'r') as f:
        text = f.read()
    return text.replace('\n', ' ')

def clean_string(text):
    text = re.split(' |:|\.|\(|\)|,|"|;|/|\n|\t|-|\?|\[|\]|!', text)
    text = ' '.join([word for word in text if word not in string.punctuation])
    text = text.lower()
    text = ' '.join([word for word in text.split() if word not in stopwords_list])
    return text

def string_to_normal_form(string):
    string_lst = string.split()
    for i in range(len(string_lst)):
        string_lst[i] = morph.parse(string_lst[i])[0].normal_form
        if (string_lst[i] == 'аду'):
            string_lst[i] = 'ад'
        if (string_lst[i] == 'рая'):
            string_lst[i] = 'рай'
    string = ' '.join(string_lst)
    return string

After all this preparation, we can get back to the data: for each song, we read the file with the corresponding name and process its text:

name_list = []
text_list = []
for song, name in songs_dict.items():
    text = string_to_normal_form(clean_string(read_song(song)))
    name_list.append(name)
    text_list.append(text)

Then we combine everything into a DataFrame and save it as a CSV file; the time column holds track durations in seconds.

df = pd.DataFrame()
df['name'] = name_list
df['text'] = text_list
df['time'] = [290, 220, 187, 270, 330, 196, 207, 188, 269, 189, 245, 244]
df.to_csv('borderline.csv', index=False)

Result:
2-table.png

Word cloud for the whole album

We start the analysis by building a word cloud, since it shows the most common words across the songs at a glance. Import the required libraries, read the CSV file and set the configuration:

import nltk
from wordcloud import WordCloud
import pandas as pd
import matplotlib.pyplot as plt
from nltk import word_tokenize, ngrams

%matplotlib inline
nltk.download('punkt')
df = pd.read_csv('borderline.csv')

Now we create a new figure, set the design parameters and, using the wordcloud library, display each word with a size proportional to its frequency. We also put the name of the song above the corresponding subplot.

fig = plt.figure()
fig.patch.set_facecolor('white')
plt.subplots_adjust(wspace=0.3, hspace=0.2)
i = 1
for name, text in zip(df.name, df.text):
    tokens = word_tokenize(text)
    text_raw = " ".join(tokens)
    wordcloud = WordCloud(colormap='PuBu', background_color='white', contour_width=10).generate(text_raw)
    plt.subplot(4, 3, i, label=name,frame_on=True)
    plt.tick_params(labelsize=10)
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.title(name,fontdict={'fontsize':7,'color':'grey'},y=0.93)
    plt.tick_params(labelsize=10)
    i += 1

3-wordcloud.jpg

EDA of the lyrics

Let’s move on to the next part and analyze the lyrics. To do this, we import the libraries needed for data handling and visualization:

import plotly.graph_objects as go
import plotly.figure_factory as ff
from scipy import spatial
import collections
import pymorphy2
import gensim

morph = pymorphy2.MorphAnalyzer()

First, let’s count the total number of words in each song, the number of unique words, and the share of unique words:

songs = []
total = []
uniq = []
percent = []

for song, text in zip(df.name, df.text):
    songs.append(song)
    total.append(len(text.split()))
    uniq.append(len(set(text.split())))
    percent.append(round(len(set(text.split())) / len(text.split()), 2) * 100)

We write all this information into a DataFrame and additionally compute the number of words per minute for each song:

df_words = pd.DataFrame()
df_words['song'] = songs
df_words['total words'] = total
df_words['uniq words'] = uniq
df_words['percent'] = percent
df_words['time'] = df['time']
df_words['words per minute'] = round(total / (df['time'] // 60))
df_words = df_words[::-1]

4-table.png

It would be great to visualize the data, so let’s build two bar charts: one for the number of words in each song, and the other for the number of words per minute.

colors_1 = ['rgba(101,181,205,255)'] * 12
colors_2 = ['rgba(62,142,231,255)'] * 12

fig = go.Figure(data=[
    go.Bar(name='📝 Total number of words',
           text=df_words['total words'],
           textposition='auto',
           x=df_words.song,
           y=df_words['total words'],
           marker_color=colors_1,
           marker=dict(line=dict(width=0)),),
    go.Bar(name='🌀 Unique words',
           text=df_words['uniq words'].astype(str) + '<br>'+ df_words.percent.astype(int).astype(str) + '%' ,
           textposition='inside',
           x=df_words.song,
           y=df_words['uniq words'],
           textfont_color='white',
           marker_color=colors_2,
           marker=dict(line=dict(width=0)),),
])

fig.update_layout(barmode='group')

fig.update_layout(
    title = 
        {'text':'<b>The ratio of the number of unique words to the total</b><br><span style="color:#666666"></span>'},
    showlegend = True,
    height=650,
    font={
        'family':'Open Sans, light',
        'color':'black',
        'size':14
    },
    plot_bgcolor='rgba(0,0,0,0)',
)
fig.update_layout(legend=dict(
    yanchor="top",
    xanchor="right",
))

fig.show()
colors_1 = ['rgba(101,181,205,255)'] * 12
colors_2 = ['rgba(238,85,59,255)'] * 12

fig = go.Figure(data=[
    go.Bar(name='⏱️ Track length, min.',
           text=round(df_words['time'] / 60, 1),
           textposition='auto',
           x=df_words.song,
           y=-df_words['time'] // 60,
           marker_color=colors_1,
           marker=dict(line=dict(width=0)),
          ),
    go.Bar(name='🔄 Words per minute',
           text=df_words['words per minute'],
           textposition='auto',
           x=df_words.song,
           y=df_words['words per minute'],
           marker_color=colors_2,
           textfont_color='white',
           marker=dict(line=dict(width=0)),
          ),
])

fig.update_layout(barmode='overlay')

fig.update_layout(
    title = 
        {'text':'<b>Track length and words per minute</b><br><span style="color:#666666"></span>'},
    showlegend = True,
    height=650,
    font={
        'family':'Open Sans, light',
        'color':'black',
        'size':14
    },
    plot_bgcolor='rgba(0,0,0,0)'
)


fig.show()

Working with Word2Vec model

Using the gensim module, load the model pointing to a binary file:

model = gensim.models.KeyedVectors.load_word2vec_format('model.bin', binary=True)

For this material, we used a ready-made model from the RusVectōrēs community, pre-trained on the Russian National Corpus.

The Word2Vec model is based on neural networks and allows representing words as vectors that capture their semantics. It means that if we take two words – for instance, “mom” and “dad” – represent them as two vectors and calculate the cosine, the value will be close to 1. Conversely, two words that have nothing in common in their meaning have a cosine measure close to 0.
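
To make the cosine measure tangible, here is a minimal sketch that compares a couple of word pairs with the loaded model. It follows the RusVectōrēs convention of word_POS keys (the same convention the get_vector function below relies on); whether these particular words are present in a given model's vocabulary is an assumption, hence the KeyError guard.

from scipy import spatial

def cosine_sim(word_a, word_b):
    # cosine similarity between two words from the model vocabulary
    return 1 - spatial.distance.cosine(model[word_a], model[word_b])

try:
    print(cosine_sim('мама_NOUN', 'папа_NOUN'))      # related words – value closer to 1
    print(cosine_sim('мама_NOUN', 'таблетка_NOUN'))  # unrelated words – value closer to 0
except KeyError:
    print('word is missing from the model vocabulary')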

Now we will define the get_vector function: it takes a list of words, determines the part of speech of each word, looks up the corresponding vectors in the model and sums them up, so that we can obtain a vector even for a whole sentence or text.

def get_vector(word_list):
    vector = 0
    for word in word_list:
        pos = morph.parse(word)[0].tag.POS
        if pos == 'INFN':
            pos = 'VERB'
        if pos in ['ADJF', 'PRCL', 'ADVB', 'NPRO']:
            pos = 'NOUN'
        if word and pos:
            try:
                word_pos = word + '_' + pos
                this_vector = model.word_vec(word_pos)
                vector += this_vector
            except KeyError:
                continue
    return vector

For each song, compute a vector and store it in a new column of the DataFrame:

vec_list = []
for word in df['text']:
    vec_list.append(get_vector(word.split()))
df['vector'] = vec_list

Now let’s compare these vectors with one another by computing their cosine similarity. Pairs of songs with a similarity above 0.5 will be saved separately – this way we get the closest pairs of songs. The full comparison matrix is written into the two-dimensional list result.

similar = dict()
result = []
for song_1, vector_1 in zip(df.name, df.vector):
    sub_list = []
    for song_2, vector_2 in zip(df.name.iloc[::-1], df.vector.iloc[::-1]):
        res = 1 - spatial.distance.cosine(vector_1, vector_2)
        if res > 0.5 and song_1 != song_2 and (song_1 + ' / ' + song_2 not in similar.keys() and song_2 + ' / ' + song_1 not in similar.keys()):
            similar[song_1 + ' / ' + song_2] = round(res, 2)
        sub_list.append(round(res, 2))
    result.append(sub_list)

Let’s collect the closest pairs into a DataFrame and sort them by similarity:

df_top_sim = pd.DataFrame()
df_top_sim['name'] = list(similar.keys())
df_top_sim['value'] = list(similar.values())
df_top_sim = df_top_sim.sort_values(by='value', ascending=False)

And build a bar chart for them:

colors = ['rgba(101,181,205,255)'] * 5

fig = go.Figure([go.Bar(x=df_top_sim['name'],
                        y=df_top_sim['value'],
                        marker_color=colors,
                        width=[0.4,0.4,0.4,0.4,0.4],
                        text=df_top_sim['value'],
                        textfont_color='white',
                        textposition='auto')])

fig.update_layout(
    title = 
        {'text':'<b>Top 5 closest songs</b><br><span style="color:#666666"></span>'},
    showlegend = False,
    height=650,
    font={
        'family':'Open Sans, light',
        'color':'black',
        'size':14
    },
    plot_bgcolor='rgba(0,0,0,0)',
    xaxis={'categoryorder':'total descending'}
)

fig.show()

Given the vector of each song, let’s calculate the vector of the entire album by summing the song vectors. Then, using the model, we find the words closest to this vector in spirit and meaning.

def get_word_from_tlist(lst):
    for word in lst:
        word = word[0].split('_')[0]
        print(word, end=' ')

vec_sum = 0
for vec in df.vector:
    vec_sum += vec
sim_word = model.similar_by_vector(vec_sum)
get_word_from_tlist(sim_word)

небо тоска тьма пламень плакать горе печаль сердце солнце мрак

This is probably the key result and the description of Zemfira’s album in just 10 words.

Finally, we build a general heat map, where each cell shows the cosine similarity between the lyrics of two tracks.

colorscale=[[0.0, "rgba(255,255,255,255)"],
            [0.1, "rgba(229,232,237,255)"],
            [0.2, "rgba(216,222,232,255)"],
            [0.3, "rgba(205,214,228,255)"],
            [0.4, "rgba(182,195,218,255)"],
            [0.5, "rgba(159,178,209,255)"],
            [0.6, "rgba(137,161,200,255)"],
            [0.7, "rgba(107,137,188,255)"],
            [0.8, "rgba(96,129,184,255)"],
            [1.0, "rgba(76,114,176,255)"]]

font_colors = ['black']
x = list(df.name.iloc[::-1])
y = list(df.name)
fig = ff.create_annotated_heatmap(result, x=x, y=y, colorscale=colorscale, font_colors=font_colors)
fig.show()

Results and data interpretation

To draw meaningful conclusions, let us take another look at everything we got, starting with the word cloud. It is easy to see that the words ‘боль’, ‘невозможно’, ‘сорваться’, ‘растерзаны’, ‘сложно’, ‘терпеть’, ‘любить’ are rendered quite large, because they occur frequently throughout the lyrics:

9-wordcloud.jpg

The song “Крым” turned out to be one of the most lexically diverse – 74% of its words are unique. The song “Снег идёт” contains very few words overall, so most of them, 82%, are unique. The wordiest track on the album is “Таблетки”, with about 150 words in total.

As the last chart shows, the most dynamic track is “Таблетки”, with as many as 37 words per minute – nearly a word every two seconds. The longest track is “Абьюз”, which, according to the previous chart, also has the lowest share of unique words – 46%.

Top 5 most semantically similar text pairs:

We also got the vector of the entire album and found the closest words. Just take a look at them – ‘тьма’, ‘тоска’, ‘плакать’, ‘горе’, ‘печаль’, ‘сердце’ – this is the list of words that characterizes Zemfira’s lyrics!

небо тоска тьма пламень плакать горе печаль сердце солнце мрак

The final result is a heat map. From the visualization, it is noticeable that almost all songs are quite similar to each other – the cosine measure for many pairs exceeds the value of 0.4.

Conclusions

In this material, we carried out an EDA of the entire text of the new album and, using a pre-trained Word2Vec model, confirmed the hypothesis that most of the “Borderline” songs are permeated with rather dark lyrics. However, this is normal, because we love Zemfira precisely for her sincerity and straightforwardness.


Building frequency counts and bigrams using the posts of traders

Estimated read time – 8 min

Stocktwits is the largest social network for investors and traders of all levels, which allows us to see what is happening in the financial markets. Today we will build frequency counts and bigrams of users’ posts, grouping the users by their number of followers. This will show the difference between the posts of different types of traders.

This is how the feed on the CCIV security looks at Stocktwits:
--2021-04-27-155729.png

Some users have the status of officials:
--2021-04-27-160235.png

Scraping the posts

Stocktwits has an API that returns 30 posts at a time. The API request returns a JSON object, so we will write a get_30_messages function that reads the JSON and writes all the entries into a list called rows. The post data already contains the user information, so we will not create separate tables and will save everything in one DataFrame. For this purpose, we create a list with the column names and initialize an empty list called rows where we will append all the scraped posts.

Some posts don’t have a “likes” key in the JSON, which results in a KeyError. To avoid the error, we assign 0 likes to such posts.

import requests
import json
import time
import pandas as pd

cols = ['post_id', 'text', 'created_at', 'user_id', 'likes', 'sentiment', 'symbol', 'identity', 'followers', 'following', 'ideas', 'watchlist_stocks_count', 'like_count', 'plus_tier']
rows = []
 
def get_30_messages(data):
    for p in data['messages']:
        try:
            likes = p['likes']['total']
        except KeyError:
            likes = 0
        rows.append({'post_id': p['id'], 
                    'text': p['body'], 
                    'created_at': p['created_at'], 
                    'user_id': p['user']['id'], 
                    'likes': likes,
                    'sentiment': p['entities']['sentiment'], 
                    'symbol': symbol,
                    'identity': p['user']['identity'],
                    'followers': p['user']['followers'], 
                    'following': p['user']['following'], 
                    'ideas': p['user']['ideas'], 
                    'watchlist_stocks_count': p['user']['watchlist_stocks_count'], 
                    'like_count': p['user']['like_count'], 
                    'plus_tier': p['user'].get('plus_tier')  # plus tier info, if the API provides it
                    })

We will scrape the posts from the pages of the 16 most trending securities.

symbols = ['DIA', 'SPY', 'QQQ', 'INO', 'OCGN', 'BTC.X', 'SNAP', 'INTC', 'VXX', 'ASTS', 'SKLZ', 'RIOT', 'DJIA', 'GOLD', 'GGII', 'COIN']

As the API request returns only the 30 most recent posts, to get older ones we need to save the id of the last post into a dictionary and pass it as the max parameter in the next request. Unfortunately, the API allows only 200 requests per hour, so to stay within the limit we run the loop for each security only 11 times.

last_id_values = dict()
        
for symbol in symbols:
    file = requests.get(f"https://api.stocktwits.com/api/2/streams/symbol/{symbol}.json")
    data = json.loads(file.content)
    
    for i in range(10):
        get_30_messages(data)
            
        last_id = data['cursor']['max']
        last_id_values[symbol] = last_id
        
        file = requests.get(f"https://api.stocktwits.com/api/2/streams/symbol/{symbol}.json?max={last_id}")
        data = json.loads(file.content)
    
    get_30_messages(data)

So far we have collected only about 6,000 posts, which is not enough for the analysis. That’s why we will set a timer that reruns the same code after 1 hour and 5 minutes, for 11 cycles.

def get_older_posts():
    for symbol in symbols:
        for i in range(12):
            file = requests.get(f"https://api.stocktwits.com/api/2/streams/symbol/{symbol}.json?max={last_id_values[symbol]}")
            data = json.loads(file.content)        
            get_30_messages(data)
 
            last_id = data['cursor']['max']
            last_id_values[symbol] = last_id
 
for i in range(11):
    time.sleep(3900)
    get_older_posts()

After all the data is collected, let’s create a DataFrame.

df = pd.DataFrame(rows, columns = cols)

The resulting table will look like this:
--2021-04-27-183708.png

It is important to check that post_id doesn’t have duplicate values. Comparing the number of unique values with the total number of values in post_id shows about 10,000 duplicates.

df.post_id.nunique(), len(df.post_id)

This happened because some posts appear on several tickers’ pages. So the last step is dropping the duplicate values.

df.drop_duplicates(subset="post_id", inplace=True)

Frequency counts and bigrams

First of all, let’s create a frequency count for posts without dividing them into groups.

df.text.str.split(expand=True).stack().value_counts()

We can see that articles, conjunctions, and prepositions prevail over the other words:
--2021-04-27-174112.png

Thus, we need to remove them from the dataset. However, even after cleaning, the results look like this: apart from the odd fact that 39 comes out as the most frequent token, the data is not very informative, and it’s hard to draw any conclusions from it.
--2021-04-27-174622.png
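
A likely culprit for the mysterious 39 is HTML-encoded apostrophes: &#39; in contractions like let&#39;s later gets split into a bare 39 token. If that is the case, an optional pre-cleaning step is to decode HTML entities before tokenization, using only the standard library:

import html

# decode HTML entities such as &#39; back into characters ("let&#39;s" -> "let's"),
# so they do not leave stray numeric tokens behind after tokenization
df.text = df.text.apply(html.unescape)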

In this case, we need to build bigrams. A bigram is a sequence of two adjacent elements, that is, two words standing next to each other. There are many algorithms for building n-grams with different levels of optimization; we will use the built-in nltk function to create bigrams for each group, as the toy example right below shows. First, let’s import the additional libraries, download the English stop words, and clean the data. Then we add more stop words, including the stock tickers that appear in nearly every post.
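
For intuition, this is what nltk.bigrams produces for a short, made-up list of tokens:

import nltk

tokens = ['buy', 'the', 'dip', 'next', 'week']
print(list(nltk.bigrams(tokens)))
# [('buy', 'the'), ('the', 'dip'), ('dip', 'next'), ('next', 'week')]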

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from string import punctuation
import unicodedata
import collections
 
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

english_stopwords = stopwords.words("english")
symbols = ['DIA', 'SPY', 'QQQ', 'INO', 'OCGN', 'BTC.X', 'SNAP', 'INTC', 'VXX', 'ASTS', 'SKLZ', 'RIOT', 'DJIA', 'GOLD', 'GGII', 'COIN']
symbols_lower = [sym.lower() for sym in symbols]
append_stopword = ['https', 'www', 'btc', 'x', 's', 't', 'p', 'amp', 'utm', 'm', 'gon', 'na', '’', '2021', '04', 'stocktwits', 'com', 'spx', 'ndx', 'gld', 'slv', 'es', 'f', '...', '--', 'cqginc', 'cqgthom', 'gt']
english_stopwords.extend(symbols_lower)
english_stopwords.extend(append_stopword)

Let’s define a text preparation function that converts all words to lowercase, lemmatizes them to their base form and removes stop words and punctuation.

wordnet_lemmatizer = WordNetLemmatizer()
 
def preprocess_text(text):
    tokens = nltk.word_tokenize(text.lower())
    tokens = [wordnet_lemmatizer.lemmatize(token) for token in tokens if token not in english_stopwords\
              and token != " " \
              and token.strip() not in punctuation]
    
    text = " ".join(tokens)
    
    return text
    
df.text = df.text.apply(preprocess_text)

For example, let’s take the group of the least popular users with less than 300 followers, build bigrams and output the most frequent ones.

non_pop_df = df[(df['followers'] < 300)]
 
non_pop_counts = collections.Counter()
for sent in non_pop_df.text:
    words = nltk.word_tokenize(sent)
    non_pop_counts.update(nltk.bigrams(words))
non_pop_counts.most_common()

--2021-04-27-184020.png

Results of the bigrams study

Users with fewer than 300 followers mostly write about their personal plans for making money, as shown by collocations like short term, long term, and make money.
Less than 300 followers:
1. look like, 439
2. next week, 422
3. let 39, 364
4. capital gain, 306
5. long term, 274
6. let go, 261
7. stock market, 252
8. buy dip, 252
9. gain tax, 221
10. make money, 203
11. short term, 201
12. buy buy, 192

More popular users with 300 to 3000 followers discuss more abstract issues like sweep premium, stock price and artificial intelligence.
From 300 to 3000 followers:
1. sweep premium, 166
2. price target, 165
3. total day, 140
4. stock market, 139
5. ask premium, 132
6. stock price, 129
7. current stock, 117
8. money trade, 114
9. trade option, 114
10. activity alert, 113
11. trade volume, 113
12. artificial intelligence, 113

Popular users with 3,000 to 30,000 followers share their observations as well as promote their accounts or articles.
From 3000 to 30000 followers:
1. unusual option, 632
2. print size, 613
3. option activity, 563
4. large print, 559
5. activity alerted, 355
6. observed unusual, 347
7. sweepcast observed, 343
8. |🎯 see, 311
9. see profile, 253
10. profile link, 241
11. call expiring, 235
12. new article, 226

Very popular traders with more than 30,000 followers mostly act as information sources and post about changes in the stock market, as indicated by the frequent up and down arrows and collocations like “stock x-day” or “moving average”.
Users with more than 30000 followers:
1. dow stock, 69
2. elliottwave trading, 53
3. ⇩ indexindicators.com, 51
4. ⇧ indexindicators.com, 50
5. u stock, 47
6. stock 5-day, 36
7. moving average, 29
8. stock moving, 28
9. stock x-day, 27
10. ⇧ 10-day, 26
11. stock daily, 25
12. daily rsi, 25

We also built bigrams for official accounts, but the results turned out to be very similar to those of the most popular users.


How to build Animated Charts like Hans Rosling in Plotly

Estimated read time – 11 min

Hans Rosling’s presentation on the economic growth of world countries, given at TED in 2007, is one of the most iconic data visualizations ever created. Just check out the video in case you don’t know what we’re talking about:

Sometimes we want to compare standards of living in different countries. One way to do this is the Big Mac index, which The Economist has been tracking since 1986. The idea behind the index is to measure purchasing power parity (PPP) across countries based on the cost of domestic production. A standard burger needs the same ingredients everywhere: cheese, meat, bread and vegetables. Since all of them can be produced locally, we can compare the cost of one Big Mac in different countries and use it as a proxy for purchasing power. Plus, McDonald’s is the world’s most widespread franchise, with restaurants almost everywhere around the globe.
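
As a toy illustration (the numbers below are hypothetical, not taken from the dataset), this is roughly how the index is read: convert the local Big Mac price to dollars at the market rate, or compare the implied PPP exchange rate with the actual one.

# hypothetical numbers for illustration only
local_price = 135.0   # Big Mac price in local currency
dollar_ex = 60.0      # market exchange rate, local currency units per USD
us_price = 5.0        # Big Mac price in the US, in USD

dollar_price = local_price / dollar_ex        # 2.25 USD – the value computed later in this post
implied_ppp_rate = local_price / us_price     # 27.0 – rate at which the burgers would cost the same
valuation = implied_ppp_rate / dollar_ex - 1  # -0.55 – the local currency looks ~55% undervalued
print(dollar_price, implied_ppp_rate, round(valuation, 2))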

In today’s material, we will build a motion chart for the Big Mac index using Plotly. Following Hans Rosling’s idea, the chart will display country population along the X-axis and GDP per capita in US dollars along the Y-axis. The size of the dots will be proportional to the Big Mac index for a given country, and their color will represent the continent the country is located on.

Preparing Data

Even though The Economist has been updating the index for over 30 years and sharing its observations publicly, the dataset contains many missing values. It also lacks continent names, but we can handle that by supplementing the data with a few more datasets that can be found in our repo.

Let’s start by importing the libraries:

import pandas as pd
from pandas.errors import ParserError
import plotly.graph_objects as go
import numpy as np
import requests
import io

We can access the datasets directly from GitHub: the following function sends a GET request for a CSV file and creates a pandas DataFrame. In some cases this raises a ParserError because of the caption at the top of the file, so we add a try block:

def read_raw_file(link):
    raw_csv = requests.get(link).content
    try:
        df = pd.read_csv(io.StringIO(raw_csv.decode('utf-8')))
    except ParserError:
        df = pd.read_csv(io.StringIO(raw_csv.decode('utf-8')), skiprows=3)
    return df

bigmac_df = read_raw_file('https://github.com/valiotti/leftjoin/raw/master/motion-chart-big-mac/big-mac.csv')
population_df = read_raw_file('https://github.com/valiotti/leftjoin/raw/master/motion-chart-big-mac/population.csv')
gdp_df = read_raw_file('https://github.com/valiotti/leftjoin/raw/master/motion-chart-big-mac/gdp.csv')
continents_df = read_raw_file('https://github.com/valiotti/leftjoin/raw/master/motion-chart-big-mac/continents.csv')

From The Economist dataset we will need these columns: country name, local price, dollar exchange rate, country code (iso_a3) and record date. We take the period from 2005 to 2020, as the records are most complete for this span, and divide the local price by the exchange rate to get the price of one Big Mac in US dollars.

bigmac_df = bigmac_df[['name', 'local_price', 'dollar_ex', 'iso_a3', 'date']]
bigmac_df = bigmac_df[bigmac_df['date'] >= '2005-01-01']
bigmac_df = bigmac_df[bigmac_df['date'] < '2020-01-01']
bigmac_df['date'] = pd.DatetimeIndex(bigmac_df['date']).year
bigmac_df = bigmac_df.drop_duplicates(['date', 'name'])
bigmac_df = bigmac_df.reset_index(drop=True)
bigmac_df['dollar_price'] = bigmac_df['local_price'] / bigmac_df['dollar_ex']

Take a look at the result:

1_1.png

Next, let’s add a new column called continent. To ease the task, keep only two columns of the continents dataset: country code and continent name. Then we iterate through the bigmac_df['iso_a3'] column, adding a continent name for each value. Some cases raise an error because it’s not really clear whether a country belongs to Europe or Asia; we consider such cases as Europe by default.

continents_df = continents_df[['Continent_Name', 'Three_Letter_Country_Code']]
continents_list = []
for country in bigmac_df['iso_a3']:
    try:
        continents_list.append(continents_df.loc[continents_df['Three_Letter_Country_Code'] == country]['Continent_Name'].item())
    except ValueError:
        continents_list.append('Europe')
bigmac_df['continent'] = continents_list

Now we can drop unnecessary columns, apply sorting by country names and date, convert values in the date column into integers, and view the current result:

bigmac_df = bigmac_df.drop(['local_price', 'iso_a3', 'dollar_ex'], axis=1)
bigmac_df = bigmac_df.sort_values(by=['name', 'date'])
bigmac_df['date'] = bigmac_df['date'].astype(int)

2-20.png

Then we need to fill in the missing Big Mac index values with zeros and remove Taiwan (the Republic of China), since this partially recognized state is not included in the World Bank datasets. The UAE appears several times, which can lead to issues, so we drop it as well.

countries_list = list(bigmac_df['name'].unique())
years_set = {i for i in range(2005, 2020)}
for country in countries_list:
    if len(bigmac_df[bigmac_df['name'] == country]) < 15:
        this_continent = bigmac_df[bigmac_df['name'] == country].continent.iloc[0]
        years_of_country = set(bigmac_df[bigmac_df['name'] == country]['date'])
        diff = years_set - years_of_country
        dict_to_df = pd.DataFrame({
                      'name':[country] * len(diff),
                      'date':list(diff),
                      'dollar_price':[0] * len(diff),
                      'continent': [this_continent] * len(diff)
                     })
        bigmac_df = bigmac_df.append(dict_to_df)
bigmac_df = bigmac_df[bigmac_df['name'] != 'Taiwan']
bigmac_df = bigmac_df[bigmac_df['name'] != 'United Arab Emirates']

Next, let’s augment the data with GDP per capita and population from the other datasets. They name some countries differently, so we need to list such cases explicitly and replace them.

years = [str(i) for i in range(2005, 2020)]

countries_replace_dict = {
    'Russian Federation': 'Russia',
    'Egypt, Arab Rep.': 'Egypt',
    'Hong Kong SAR, China': 'Hong Kong',
    'United Kingdom': 'Britain',
    'Korea, Rep.': 'South Korea',
    'United Arab Emirates': 'UAE',
    'Venezuela, RB': 'Venezuela'
}
for key, value in countries_replace_dict.items():
    population_df['Country Name'] = population_df['Country Name'].replace(key, value)
    gdp_df['Country Name'] = gdp_df['Country Name'].replace(key, value)

Finally, extract population data and GDP for the given years, adding the data to the bigmac_df DataFrame:

countries_list = list(bigmac_df['name'].unique())

population_list = []
gdp_list = []
for country in countries_list:
    population_for_country_df = population_df[population_df['Country Name'] == country][years]
    population_list.extend(list(population_for_country_df.values[0]))
    gdp_for_country_df = gdp_df[gdp_df['Country Name'] == country][years]
    gdp_list.extend(list(gdp_for_country_df.values[0]))
    
bigmac_df['population'] = population_list
bigmac_df['gdp'] = gdp_list
bigmac_df['gdp_per_capita'] = bigmac_df['gdp'] / bigmac_df['population']

And here is our final dataset:

3-16.png

Creating a chart in Plotly

The population of China or India is more than ten times that of most other countries. That’s why we switch the X-axis to a log scale to make the chart easier to interpret – the log transformation is a common way to address skewness in data.

fig_dict = {
    "data": [],
    "layout": {},
    "frames": []
}

fig_dict["layout"]["xaxis"] = {"title": "Population", "type": "log"}
fig_dict["layout"]["yaxis"] = {"title": "GDP per capita (in $)", "range":[-10000, 120000]}
fig_dict["layout"]["hovermode"] = "closest"
fig_dict["layout"]["updatemenus"] = [
    {
        "buttons": [
            {
                "args": [None, {"frame": {"duration": 500, "redraw": False},
                                "fromcurrent": True, "transition": {"duration": 300,
                                                                    "easing": "quadratic-in-out"}}],
                "label": "Play",
                "method": "animate"
            },
            {
                "args": [[None], {"frame": {"duration": 0, "redraw": False},
                                  "mode": "immediate",
                                  "transition": {"duration": 0}}],
                "label": "Pause",
                "method": "animate"
            }
        ],
        "direction": "left",
        "pad": {"r": 10, "t": 87},
        "showactive": False,
        "type": "buttons",
        "x": 0.1,
        "xanchor": "right",
        "y": 0,
        "yanchor": "top"
    }
]

We will also add a slider to filter data within a certain range:

sliders_dict = {
    "active": 0,
    "yanchor": "top",
    "xanchor": "left",
    "currentvalue": {
        "font": {"size": 20},
        "prefix": "Year: ",
        "visible": True,
        "xanchor": "right"
    },
    "transition": {"duration": 300, "easing": "cubic-in-out"},
    "pad": {"b": 10, "t": 50},
    "len": 0.9,
    "x": 0.1,
    "y": 0,
    "steps": []
}

By default, the chart will display data for 2005 before we click on the “Play” button.

continents_list_from_df = list(bigmac_df['continent'].unique())
year = 2005
for continent in continents_list_from_df:
    dataset_by_year = bigmac_df[bigmac_df["date"] == year]
    dataset_by_year_and_cont = dataset_by_year[dataset_by_year["continent"] == continent]
    
    data_dict = {
        "x": dataset_by_year_and_cont["population"],
        "y": dataset_by_year_and_cont["gdp_per_capita"],
        "mode": "markers",
        "text": dataset_by_year_and_cont["name"],
        "marker": {
            "sizemode": "area",
            "sizeref": 200000,
            "size":  np.array(dataset_by_year_and_cont["dollar_price"]) * 20000000
        },
        "name": continent,
        "customdata": np.array(dataset_by_year_and_cont["dollar_price"]).round(1),
        "hovertemplate": '<b>%{text}</b>' + '<br>' +
                         'GDP per capita: %{y}' + '<br>' +
                         'Population: %{x}' + '<br>' +
                         'Big Mac price: %{customdata}$' +
                         '<extra></extra>'
    }
    fig_dict["data"].append(data_dict)

Next, we need to fill in the frames field, which will be used for animating the data. Each frame holds the data for one year from 2005 to 2019.

for year in years:
    frame = {"data": [], "name": str(year)}
    for continent in continents_list_from_df:
        dataset_by_year = bigmac_df[bigmac_df["date"] == int(year)]
        dataset_by_year_and_cont = dataset_by_year[dataset_by_year["continent"] == continent]

        data_dict = {
            "x": list(dataset_by_year_and_cont["population"]),
            "y": list(dataset_by_year_and_cont["gdp_per_capita"]),
            "mode": "markers",
            "text": list(dataset_by_year_and_cont["name"]),
            "marker": {
                "sizemode": "area",
                "sizeref": 200000,
                "size": np.array(dataset_by_year_and_cont["dollar_price"]) * 20000000
            },
            "name": continent,
            "customdata": np.array(dataset_by_year_and_cont["dollar_price"]).round(1),
            "hovertemplate": '<b>%{text}</b>' + '<br>' +
                             'GDP per capita: %{y}' + '<br>' +
                             'Population: %{x}' + '<br>' +
                             'Big Mac price: %{customdata}$' +
                             '<extra></extra>'
        }
        frame["data"].append(data_dict)

    fig_dict["frames"].append(frame)
    slider_step = {"args": [
        [year],
        {"frame": {"duration": 300, "redraw": False},
         "mode": "immediate",
         "transition": {"duration": 300}}
    ],
        "label": year,
        "method": "animate"}
    sliders_dict["steps"].append(slider_step)

Just a few finishing touches left: instantiate the chart and set the colors, fonts and title.

fig_dict["layout"]["sliders"] = [sliders_dict]

fig = go.Figure(fig_dict)

fig.update_layout(
    title = 
        {'text':'<b>Motion chart</b><br><span style="color:#666666">The Big Mac index from 2005 to 2019</span>'},
    font={
        'family':'Open Sans, light',
        'color':'black',
        'size':14
    },
    plot_bgcolor='rgba(0,0,0,0)'
)
fig.update_yaxes(nticks=4)
fig.update_xaxes(tickfont=dict(family='Open Sans, light', color='black', size=12), nticks=4, gridcolor='lightgray', gridwidth=0.5)
fig.update_yaxes(tickfont=dict(family='Open Sans, light', color='black', size=12), nticks=4, gridcolor='lightgray', gridwidth=0.5)

fig.show()

Bingo! The Motion Chart is done:

View the code on GitHub


Collecting Social Media Data for Top ML, AI & Data Science related accounts on Instagram

Estimated read time – 9 min

Instagram is among the top 5 most visited websites in the world, though perhaps not within our industry. Nevertheless, we are going to explore it using Python and our data analytics skills. In this post, we will share how to collect social media data via the Instagram API.

Data collection method
The official Instagram API won’t simply let us collect data about other users, but there is always a workaround. Try sending the following request:

https://instagram.com/leftjoin/?__a=1

The request returns a JSON object with detailed user information: for instance, we can easily get the account name, the number of posts, followers and subscriptions, as well as the first ten posts with their like and comment counts. The pyInstagram library allows sending such requests.
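
As a minimal sketch of reading that response by hand (the nested key names below, graphql / user / edge_followed_by / count, reflect the JSON layout this endpoint returned at the time and are an assumption – this fragility is exactly why we rely on the library):

import requests

resp = requests.get('https://instagram.com/leftjoin/?__a=1',
                    headers={'User-Agent': 'Mozilla/5.0'})
data = resp.json()
# the key names are an assumption based on the layout returned at the time of writing
user = data.get('graphql', {}).get('user', {})
print(user.get('username'),
      user.get('edge_followed_by', {}).get('count'),
      user.get('edge_owner_to_timeline_media', {}).get('count'))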

SQL schema
Data will be collected into three ClickHouse tables: users, posts, comments. The users table will contain user data, such as the user id, username, first and last name, account description, number of followers, subscriptions, posts, comments, and likes, whether the account is verified, and so on.

CREATE TABLE instagram.users
(
    `added_at` DateTime,
    `user_id` UInt64,
    `user_name` String,
    `full_name` String,
    `base_url` String,
    `biography` String,
    `followers_count` UInt64,
    `follows_count` UInt64,
    `media_count` UInt64,
    `total_comments` UInt64,
    `total_likes` UInt64,
    `is_verified` UInt8,
    `country_block` UInt8,
    `profile_pic_url` Nullable(String),
    `profile_pic_url_hd` Nullable(String),
    `fb_page` Nullable(String)
)
ENGINE = ReplacingMergeTree
ORDER BY added_at

The posts table will be populated with the post owner, post id, caption, comment count, and so on. To check whether a post is an advertisement, an Instagram carousel, or a video, we can use the is_ad, is_album and is_video fields (see the query sketch after the table definition).

CREATE TABLE instagram.posts
(
    `added_at` DateTime,
    `owner` String,
    `post_id` UInt64,
    `caption` Nullable(String),
    `code` String,
    `comments_count` UInt64,
    `comments_disabled` UInt8,
    `created_at` DateTime,
    `display_url` String,
    `is_ad` UInt8,
    `is_album` UInt8,
    `is_video` UInt8,
    `likes_count` UInt64,
    `location` Nullable(String),
    `resources` Array(String),
    `video_url` Nullable(String)
)
ENGINE = ReplacingMergeTree
ORDER BY added_at
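
As a small usage sketch of those flags (assuming the same clickhouse_driver connection that the script below sets up), we could later count how many of the collected posts are videos, carousels or ads per account:

from clickhouse_driver import Client

client = Client(host='12.34.56.789', user='default', password='', port='9000', database='instagram')
media_type_counts = client.execute('''
    SELECT owner,
           countIf(is_video = 1) AS videos,
           countIf(is_album = 1) AS albums,
           countIf(is_ad = 1)    AS ads
    FROM posts
    GROUP BY owner
    ORDER BY videos DESC
    LIMIT 10
''')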

In the comments table, we store each comment separately with the comment owner and text.

CREATE TABLE instagram.comments
(
    `added_at` DateTime,
    `comment_id` UInt64,
    `post_id` UInt64,
    `comment_owner` String,
    `comment_text` String
)
ENGINE = ReplacingMergeTree
ORDER BY added_at

Writing the script
Import the following classes from the library: Account, Media, WebAgent and Comment.

from instagram import Account, Media, WebAgent, Comment
from datetime import datetime
from clickhouse_driver import Client
import requests
import pandas as pd

Next, create an instance of the WebAgent class, which is required for some library methods and for updating data. To collect any meaningful information we need at least account names. Since we don’t have them yet, we send the following request to search for profiles by the keywords specified in queries_list. The search results will consist of Instagram pages that match any keyword in the list.

agent = WebAgent()
queries_list = ['machine learning', 'data science', 'data analytics', 'analytics', 'business intelligence',
                'data engineering', 'computer science', 'big data', 'artificial intelligence',
                'deep learning', 'data scientist','machine learning engineer', 'data engineer']
client = Client(host='12.34.56.789', user='default', password='', port='9000', database='instagram')
url = 'https://www.instagram.com/web/search/topsearch/?context=user&count=0'

Let’s iterate over the keywords, collecting all matching accounts, and then remove duplicates from the resulting list by converting it to a set and back.

response_list = []
for query in queries_list:
    response = requests.get(url, params={
        'query': query
    }).json()
    response_list.extend(response['users'])
instagram_pages_list = []
for item in response_list:
    instagram_pages_list.append(item['user']['username'])
instagram_pages_list = list(set(instagram_pages_list))

Now we need to loop through the list of pages and request detailed information about each account that is not in the table yet. We create an instance of the Account class, passing the username as a parameter, and then update the account information with the agent.update() method. To keep things moving, we collect only the first 100 posts. Next, the agent.get_media() method returns the post ids, which we store in a list named media_list.


Collecting user media data

all_posts_list = []
username_count = 0
for username in instagram_pages_list:
    if client.execute(f"SELECT count(1) FROM users WHERE user_name='{username}'")[0][0] == 0:
        print('username:', username_count, '/', len(instagram_pages_list))
        username_count += 1
        account_total_likes = 0
        account_total_comments = 0
        try:
            account = Account(username)
        except Exception as E:
            print(E)
            continue
        try:
            agent.update(account)
        except Exception as E:
            print(E)
            continue
        if account.media_count < 100:
            post_count = account.media_count
        else:
            post_count = 100
        print(account, post_count)
        media_list, _ = agent.get_media(account, count=post_count, delay=1)
        count = 0

Because we need to count the total number of likes and comments before adding a new user to the database, we process the posts first. Almost all the required fields belong to the Media class:


Collecting user posts

        for media_code in media_list:
            if client.execute(f"SELECT count(1) FROM posts WHERE code='{media_code}'")[0][0] == 0:
                print('posts:', count, '/', len(media_list))
                count += 1

                post_insert_list = []
                post = Media(media_code)
                agent.update(post)
                post_insert_list.append(datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
                post_insert_list.append(str(post.owner))
                post_insert_list.append(post.id)
                if post.caption is not None:
                    post_insert_list.append(post.caption.replace("'","").replace('"', ''))
                else:
                    post_insert_list.append("")
                post_insert_list.append(post.code)
                post_insert_list.append(post.comments_count)
                post_insert_list.append(int(post.comments_disabled))
                post_insert_list.append(datetime.fromtimestamp(post.date).strftime('%Y-%m-%d %H:%M:%S'))
                post_insert_list.append(post.display_url)
                try:
                    post_insert_list.append(int(post.is_ad))
                except TypeError:
                    post_insert_list.append('cast(Null as Nullable(UInt8))')
                post_insert_list.append(int(post.is_album))
                post_insert_list.append(int(post.is_video))
                post_insert_list.append(post.likes_count)
                if post.location is not None:
                    post_insert_list.append(post.location)
                else:
                    post_insert_list.append('')
                post_insert_list.append(post.resources)
                if post.video_url is not None:
                    post_insert_list.append(post.video_url)
                else:
                    post_insert_list.append('')
                account_total_likes += post.likes_count
                account_total_comments += post.comments_count
                try:
                    client.execute(f'''
                        INSERT INTO posts VALUES {tuple(post_insert_list)}
                    ''')
                except Exception as E:
                    print('posts:')
                    print(E)
                    print(post_insert_list)

Store comments in the variable with the same name after calling the get_comments() method:


Collecting post comments

                comments = agent.get_comments(media=post)
                for comment_id in comments[0]:
                    comment_insert_list = []
                    comment = Comment(comment_id)
                    comment_insert_list.append(datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
                    comment_insert_list.append(comment.id)
                    comment_insert_list.append(post.id)
                    comment_insert_list.append(str(comment.owner))
                    comment_insert_list.append(comment.text.replace("'","").replace('"', ''))
                    try:
                        client.execute(f'''
                            INSERT INTO comments VALUES {tuple(comment_insert_list)}
                        ''')
                    except Exception as E:
                        print('comments:')
                        print(E)
                        print(comment_insert_list)

Now that we have collected the user’s posts and comments, the user record itself can be added to the users table.


Collecting user data

        user_insert_list = []
        user_insert_list.append(datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
        user_insert_list.append(account.id)
        user_insert_list.append(account.username)
        user_insert_list.append(account.full_name)
        user_insert_list.append(account.base_url)
        user_insert_list.append(account.biography)
        user_insert_list.append(account.followers_count)
        user_insert_list.append(account.follows_count)
        user_insert_list.append(account.media_count)
        user_insert_list.append(account_total_comments)
        user_insert_list.append(account_total_likes)
        user_insert_list.append(int(account.is_verified))
        user_insert_list.append(int(account.country_block))
        user_insert_list.append(account.profile_pic_url)
        user_insert_list.append(account.profile_pic_url_hd)
        if account.fb_page is not None:
            user_insert_list.append(account.fb_page)
        else:
            user_insert_list.append('')
        try:
            client.execute(f'''
                INSERT INTO users VALUES {tuple(user_insert_list)}
            ''')
        except Exception as E:
            print('users:')
            print(E)
            print(user_insert_list)

Conclusion
To sum up, we have collected data on 500 users, with nearly 20K posts and 40K comments. As the database keeps being updated, we can write a simple query to get today’s top 10 most followed ML, AI & Data Science related accounts.

SELECT *
FROM users
ORDER BY followers_count DESC
LIMIT 10

And as a bonus, here is a list of the most interesting Instagram accounts on this topic:

  1. @ai_machine_learning
  2. @neuralnine
  3. @datascienceinfo
  4. @compscistuff
  5. @computersciencelife
  6. @welcome.ai
  7. @papa_programmer
  8. @data_science_learn
  9. @neuralnet.ai
  10. @techno_thinkers

View the code on GitHub


Future Data Conference Review

Estimated read time – 10 min

The Future Data conference, which I happened to attend, took place on September 8-9, and in today’s post I’d like to share my observations and thoughts about the ideas presented. Before we get started, I apologize for the poor quality of some images – I took the most meaningful screenshots straight from the video.

Featured Keynote: Automating Analysis
Speaker: Pat Hanrahan
The talk was given by the Stanford professor and Tableau co-founder and mostly touched on the use of AI in analytics. Pat discussed where we are now and today’s AI use cases; while the talk itself was somewhat repetitive, the Q&A part turned out to be interesting.

The Modern Data Stack: Past, Present, and Future
Speaker: Tristan Handy
The founder of the company behind dbt and author of the well-known post that serves as a data analytics guide for startup founders spoke about how the modern data stack has changed from 2012 to 2020. Personally, I think it was one of the best talks of the conference, since Tristan made predictions about growing trends and the future of the data stack.

[Slides from Tristan Handy’s talk]

Making Enterprise Data Timelier and More Reliable with Lakehouse Technology
Speaker: Matei Zaharia
This talk came from the CTO of Databricks. Unfortunately, the audio had sound issues, but Matei covered the problems of the modern data lake while promoting Databricks’ new technology, Delta Lake. The talk was rather promotional but still interesting to listen to.

[Slides from Matei Zaharia’s talk]

How to Close the Analytic Divide
Speaker: Alan Jacobson
The Chief Data Officer of Alteryx talked about the data scientist job and salary statistics, noting that the average salary of a data scientist is significantly higher than that of other roles in the field. By the way, our recent research with Roman Bunin also confirms this. Alan then discussed the revenue of companies at different stages of analytical maturity: companies with more advanced analytical approaches grow faster (a surprising fact). A separate part was devoted to changes in modern approaches to working with data. Overall, a great talk that was easy to listen to.

[Slides from Alan Jacobson’s talk]

Hot Analytics — Handle with Care
Speaker: Gian Merlino
The co-founder and CTO of Imply compared hot and cold data (a nod to Snowflake?). Then he demonstrated a BI tool with a simple drag-and-drop interface. Gian went on to talk about possible analytics architectures and gave an overview of some Apache Druid features.

[Slides from Gian Merlino’s talk]
