
Free math courses for analysts and data engineers

Estimated read time – 4 min


Nowadays, the Internet offers a huge number of paid courses that promise to make you a data analyst. Some of them are really great and you get a valuable set of skills upon completion. However, most of them don’t focus on fundamental math and programming skills that are crucial to make it in the field.

Some people believe that an analyst doesn’t need SQL or Python. Others argue that an analyst can solve problems with hard skills alone, without deep knowledge of math. In my opinion, that’s a big misconception. Apart from hard skills, a good data analyst needs a strong background in math and computer science. If that sounds daunting, read on: I have a solution for you.

In my opinion, it’s difficult to reason about churn probability without understanding probability theory. It’s difficult to discuss the mean and the normal distribution without understanding statistics. It’s impossible to grasp SVD without knowledge of linear algebra, or to find a gradient without understanding calculus. Some people may argue that an analyst doesn’t need all this: tools like Python / R / Matlab let you build models without worrying about the math. In the beginning, this might even work. You can take a ready-made algorithm, add a couple of commands, and voila, you have built a regression model. But what do you do next? How do you tune specific parameters of the model without understanding the math behind them?
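To make this concrete, here is a minimal sketch (with synthetic data, purely for illustration) of how a ready-made library turns regression into a couple of commands, while interpreting or tuning the result still requires the math:

import numpy as np
from sklearn.linear_model import Ridge

# synthetic data: 100 observations, 3 features, known coefficients plus noise
X = np.random.rand(100, 3)
y = X @ np.array([1.5, -2.0, 0.7]) + np.random.normal(scale=0.1, size=100)

model = Ridge(alpha=1.0)              # what does alpha control? that is where the math starts
model.fit(X, y)
print(model.coef_, model.intercept_)  # reading these requires linear algebra and statistics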

Nowadays, the Internet gives us an incredible opportunity to get an Ivy League-level education free of charge. A beginner data analyst should take advantage of this before buying online data analysis courses. Just recently, I completed several university-level math and programming courses and want to share them with you. Although I took advanced math at university 15 years ago, it was still worth revising (we tend to forget a great deal in 15 years). An additional benefit of such courses is developing the highly desired analytical thinking skills.

Here is the list of free online courses from eminent US universities that I want to share with you. These courses will definitely help you start your learning journey in data analytics.

Calculus (M.I.T.)

This is an amazing set of courses, both in content and in presentation, offered by MIT in three parts:

  1. Differentiation
  2. Integration
  3. Coordinate systems and infinite series

Linear Algebra (Georgia Tech)

A course in four parts from one of the world’s leading universities in Computer Science: Georgia Tech.

  1. Linear equations
  2. Matrix algebra
  3. Determinants and eigenvalues
  4. Orthogonality, symmetric matrices and SVD

Probability theory and mathematical statistics (Georgia Tech)

A course in four parts from one of the world’s leading universities in Computer Science: Georgia Tech.

  1. A gentle introduction to probability
  2. Random variables
  3. A gentle introduction to statistics
  4. Confidence intervals and hypothesis tests

Data Science (Harvard)

A course in 9 parts from a Harvard professor:

  1. R basics
  2. Visualization
  3. Probability theory
  4. Inference and modeling
  5. Productivity tools
  6. Wrangling
  7. Linear regression
  8. Machine learning
  9. Capstone

Building frequency counts and bigrams using the posts of traders

Estimated read time – 8 min

Stocktwits is the largest social network for investors and traders of all levels, and it lets us see what is happening in the financial markets. Today we will build a frequency dictionary and bigrams from users’ posts and split them into groups by the authors’ number of followers. This will let us see the difference between the posts of different types of traders.

This is how the feed for the CCIV ticker looks on Stocktwits:

Some users have official status:

Scraping the posts

Stocktwits has an API that returns 30 posts per request. The response is JSON, so we will write a get_30_messages function that reads the JSON and appends all the entries to a list called rows. The post data already includes the user data, so we will not create separate tables and will keep everything in one DataFrame. For this purpose, we will create a list with the column names and initialize an empty list called rows, to which we will append all the scraped posts.

Some posts don’t have a “likes” key in the JSON, which results in a KeyError. To avoid the error, we will assign 0 to the likes of such posts.

import requests
import json
import time
import pandas as pd

cols = ['post_id', 'text', 'created_at', 'user_id', 'likes', 'sentiment', 'symbol', 'identity',
        'followers', 'following', 'ideas', 'watchlist_stocks_count', 'like_count', 'plus_tier']
rows = []
 
def get_30_messages(data):
    for p in data['messages']:
        try:
            likes = p['likes']['total']
        except KeyError:
            # some posts have no "likes" key at all
            likes = 0
        rows.append({'post_id': p['id'], 
                    'text': p['body'], 
                    'created_at': p['created_at'], 
                    'user_id': p['user']['id'], 
                    'likes': likes,
                    'sentiment': p['entities']['sentiment'], 
                    'symbol': symbol,
                    'identity': p['user']['identity'],
                    'followers': p['user']['followers'], 
                    'following': p['user']['following'], 
                    'ideas': p['user']['ideas'], 
                    'watchlist_stocks_count': p['user']['watchlist_stocks_count'], 
                    'like_count': p['user']['like_count'], 
                    'plus_tier': p['user']['plus_tier']
                    })

We will scrape the posts from the pages of the 16 most trending securities.

symbols = ['DIA', 'SPY', 'QQQ', 'INO', 'OCGN', 'BTC.X', 'SNAP', 'INTC', 'VXX', 'ASTS', 'SKLZ', 'RIOT', 'DJIA', 'GOLD', 'GGII', 'COIN']

As the API request returns only the 30 most recent posts, to get older ones we need to save the id of the last post into a dictionary and pass it as the max parameter in the next request. Unfortunately, the API allows only 200 requests per hour, so to stay within the limit we will run the loop for each security only 11 times.

last_id_values = dict()
        
for symbol in symbols:
    # the first request returns the 30 most recent posts for the ticker
    file = requests.get(f"https://api.stocktwits.com/api/2/streams/symbol/{symbol}.json")
    data = json.loads(file.content)
    
    for i in range(10):
        get_30_messages(data)
            
        # remember the id of the oldest post seen so far to paginate further
        last_id = data['cursor']['max']
        last_id_values[symbol] = last_id
        
        file = requests.get(f"https://api.stocktwits.com/api/2/streams/symbol/{symbol}.json?max={last_id}")
        data = json.loads(file.content)
    
    get_30_messages(data)

So far we have collected only about 6,000 posts, which is not enough for the analysis. That’s why we will set a timer to run the same code every 1 hour and 5 minutes, for 11 cycles.

def get_older_posts():
    for symbol in symbols:
        for i in range(12):
            file = requests.get(f"https://api.stocktwits.com/api/2/streams/symbol/{symbol}.json?max={last_id_values[symbol]}")
            data = json.loads(file.content)        
            get_30_messages(data)
 
            last_id = data['cursor']['max']
            last_id_values[symbol] = last_id
 
for i in range(11):
    time.sleep(3900)
    get_older_posts()

After all the data is collected, let’s create a DataFrame.

df = pd.DataFrame(rows, columns = cols)

The resulting table will look like this:

It is important to check that post_id has no duplicate values. Comparing the number of unique values with the total number of values in post_id, we can see that there are about 10,000 duplicates.

df.post_id.nunique(), len(df.post_id)

This happened because the same post can appear in the feeds of several tickers. So the last step will be dropping the duplicates.

df.drop_duplicates(subset="post_id", inplace=True)

Frequency counts and bigrams

First of all, let’s create a frequency count for posts without dividing them into groups.

df.text.str.split(expand=True).stack().value_counts()

We can see that articles, conjunctions, and prepositions prevail over the other words:

Thus, we need to remove them from the dataset. However, even after the dataset is cleaned, the results look like this. Apart from the fact that “39” (a leftover of HTML-encoded apostrophes, &#39;) is the most frequent word, the data is not very informative and it’s difficult to draw any conclusions from it.

In this case, we need to build bigrams. A bigram is a sequence of two adjacent elements, in our case two words standing next to each other. There are many algorithms for building n-grams with different levels of optimization; we will use the built-in nltk function to build bigrams for each group. First, let’s import the additional libraries, download the stop words for English, and clean the data. Then we will add more stop words, including the names of the stock tickers that appear in every post.

import collections
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from string import punctuation
 
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

english_stopwords = stopwords.words("english")
symbols = ['DIA', 'SPY', 'QQQ', 'INO', 'OCGN', 'BTC.X', 'SNAP', 'INTC', 'VXX', 'ASTS', 'SKLZ', 'RIOT', 'DJIA', 'GOLD', 'GGII', 'COIN']
symbols_lower = [sym.lower() for sym in symbols]
append_stopword = ['https', 'www', 'btc', 'x', 's', 't', 'p', 'amp', 'utm', 'm', 'gon', 'na', '’', '2021', '04', 'stocktwits', 'com', 'spx', 'ndx', 'gld', 'slv', 'es', 'f', '...', '--', 'cqginc', 'cqgthom', 'gt']
english_stopwords.extend(symbols_lower)
english_stopwords.extend(append_stopword)

Let’s define a function to prepare the text: it converts all words to lowercase, lemmatizes them to their base form, and removes stop words and punctuation.

wordnet_lemmatizer = WordNetLemmatizer()
 
def preprocess_text(text):
    tokens = nltk.word_tokenize(text.lower())
    tokens = [wordnet_lemmatizer.lemmatize(token) for token in tokens if token not in english_stopwords\
              and token != " " \
              and token.strip() not in punctuation]
    
    return " ".join(tokens)

df.text = df.text.apply(preprocess_text)

For example, let’s take the group of the least popular users, those with fewer than 300 followers, build the bigrams and output the most frequent ones.

non_pop_df = df[(df['followers'] < 300)]
 
non_pop_counts = collections.Counter()
for sent in non_pop_df.text:
    words = nltk.word_tokenize(sent)
    non_pop_counts.update(nltk.bigrams(words))
non_pop_counts.most_common()
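The other follower buckets can be built the same way; here is a minimal sketch with a small helper function of our own (the boundaries match the groups discussed below):

def bigram_counts(frame):
    # count bigrams over the preprocessed post texts of one group
    counts = collections.Counter()
    for sent in frame.text:
        counts.update(nltk.bigrams(nltk.word_tokenize(sent)))
    return counts

mid_counts = bigram_counts(df[(df['followers'] >= 300) & (df['followers'] < 3000)])
pop_counts = bigram_counts(df[(df['followers'] >= 3000) & (df['followers'] < 30000)])
top_counts = bigram_counts(df[df['followers'] >= 30000])
mid_counts.most_common(12)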


Results of the bigram study

Users with fewer than 300 followers mostly write about their personal plans for making money, as shown by collocations like short term, long term, and make money.
Fewer than 300 followers:
1. look like, 439
2. next week, 422
3. let 39, 364
4. capital gain, 306
5. long term, 274
6. let go, 261
7. stock market, 252
8. buy dip, 252
9. gain tax, 221
10. make money, 203
11. short term, 201
12. buy buy, 192

More popular users, with 300 to 3000 followers, discuss more abstract topics like sweep premium, stock price, and artificial intelligence.
From 300 to 3000 followers:
1. sweep premium, 166
2. price target, 165
3. total day, 140
4. stock market, 139
5. ask premium, 132
6. stock price, 129
7. current stock, 117
8. money trade, 114
9. trade option, 114
10. activity alert, 113
11. trade volume, 113
12. artificial intelligence, 113

Popular users with 3000 to 30000 followers share their observations and also promote their accounts or articles.
From 3000 to 30000 followers:
1. unusual option, 632
2. print size, 613
3. option activity, 563
4. large print, 559
5. activity alerted, 355
6. observed unusual, 347
7. sweepcast observed, 343
8. |🎯 see, 311
9. see profile, 253
10. profile link, 241
11. call expiring, 235
12. new article, 226

Very popular traders with more than 30000 followers mostly act as information sources and post about changes in the stock market. This is indicated by the frequent up and down arrows and collocations like “stock x-day” or “moving average”.
Users with more than 30000 followers:
1. dow stock, 69
2. elliottwave trading, 53
3. ⇩ indexindicators.com, 51
4. ⇧ indexindicators.com, 50
5. u stock, 47
6. stock 5-day, 36
7. moving average, 29
8. stock moving, 28
9. stock x-day, 27
10. ⇧ 10-day, 26
11. stock daily, 25
12. daily rsi, 25

We also built bigrams for the official accounts, but the results turned out to be very similar to those of the most popular users.
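For reference, here is a minimal sketch of how the official accounts could be selected using the identity column we scraped earlier; the exact label stored in that field is an assumption and should be checked with df.identity.unique() first:

# 'Official' is an assumed value of the identity field; verify it before filtering
official_df = df[df['identity'] == 'Official']

official_counts = collections.Counter()
for sent in official_df.text:
    official_counts.update(nltk.bigrams(nltk.word_tokenize(sent)))
official_counts.most_common(12)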


Comparing Tableau and PowerBI training programs

Estimated read time – 7 min

This year I succeeded in becoming a Tableau Desktop Certified Associate. When I was thinking about how to prepare for the exam, I came across Tableau’s e-learning courses, which turned out to be free for 90 days.

I decided not to waste such an opportunity and completed all three Fundamentals modules at a fast pace. Once I got certified, I wondered what training programs other BI tool vendors offer. First things first, I decided to study the PowerBI training materials. In this short article, I would like to compare the Tableau and PowerBI training programs.

Disclaimer: in the end, I formed an openly biased, positive attitude towards Tableau, so PowerBI supporters may not like this article and find it one-sided (in all fairness, there are also words of praise for PowerBI).

After studying the training materials, I can finally state the reasons why I am firmly in favor of Tableau as a tool for data analysis and visualization.

First of all, there is a huge gap in the approach to the materials and to assessing how well they are understood. Although Tableau’s training materials are more technical and pay less attention to design, by working through their videos you learn to build excellent visualizations. After completing all three steps of the Tableau training, you come away with a strong desire to create stunning new reports using LOD Expressions and Filter Actions and to build convenient interfaces. After watching all the Power BI materials, however, the only question that remains is: why did I waste my time?

Emotions aside, there are several key points that turned out to be important after having studied the material.

This is a good dashboard according to Microsoft

The quality of content and training examples

If you look at the way the training videos are presented in Tableau, and at the quiz-style questions posed at the end of each topic, you start to understand the ideas behind the software. In the case of Power BI, you will be totally disappointed. Have a look, for instance, at the material on identifying outliers: here Microsoft suggests building a scatter plot and visually spotting all the outliers.

Design of reports and dashboards

There is some fair criticism of the Tableau training materials on the topic of chart design and control elements, but they are still neatly and beautifully made. Now have a look at the dreadful thing that Microsoft presents as the result of an analyst’s work. And this is a well-built dashboard according to Microsoft.

Assessment of the knowledge gained during the training

During the Tableau training, immediately after a short lecture, you apply part of the studied material in practice: you need to click specific buttons in the interface to solve a problem. Power BI offers “labs” that are supposed to be launched on a remote machine. I didn’t manage to start a single lab; I wrote to support three times, they couldn’t solve my problem, so I never got to experiment with the PowerBI tasks.

The results of the analyst’s work according to Microsoft.

Other points are mostly related to the software rather than the training program.

Cross-platform support

I have been working with Tableau for a long time, and 4 years ago I switched to a Mac. After the transition from Windows, my experience of using Tableau did not change. Tableau kept evolving and I evolved with it, but the team did not change the key elements of the interface. I have also experimented with building reports in PowerBI, but I was put off by various Microsoft archaisms, like publishing through some share portal where you need an MS account and have to configure things through an administrator. All of this was a terrible headache.

However, what struck me most was that I could not use PowerBI on a Mac at all. There is absolutely no way to do it, and this is a matter of principle for Microsoft that is not expected to change. From my point of view, such software belongs to the B2B analytics segment and assumes connections to all kinds of DBMSs, yet it ignores the existence of an alternative operating system used by many potential consultants who could adopt and promote PowerBI as an analytical tool.

Most certainly, there are rational reasons why Microsoft software doesn’t work well on a Mac, but the simple truth is that for me the software remains inaccessible. Nevertheless, I wasn’t looking for an easy way out and installed PowerBI through Parallels in order to honestly reconsider the tools in light of the training materials.

Visualization options

Both Tableau and PowerBI offer stunning visualization options. If anything, PowerBI’s video on this topic contains a little more information than usual. So on this matter the tools are presented equally well.

Functionality

Here I want to give credit to PowerBI’s functionality. The variety of tools is extremely wide even without connecting third-party libraries: for example, automatic clustering, the decomposition tree, the data profiler, and setting filters on a chart.

Internal language syntax

To work with PowerBI you need to learn DAX. It is not a general-purpose programming language but a functional expression language. You won’t be able to write arbitrary code, but you won’t need to: all the functions are already implemented, so you only have to learn how to use them. Microsoft explains DAX quite well in the documentation. The definition of a new measure in DAX looks like this:

Revenue YoY % =
DIVIDE(
	[Revenue]
		- CALCULATE(
			[Revenue],
			SAMEPERIODLASTYEAR('Date'[Date])
	),
	CALCULATE(
		[Revenue],
		SAMEPERIODLASTYEAR('Date'[Date])
	)
)

Preparing data for the analysis

PowerBI has a built-in Unpivot feature that transforms data spread across columns with time periods into a form that is convenient for pivot tables.


However, Tableau Prep, the ETL tool for data cleaning and wrangling, also has this feature.
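For comparison, here is the same unpivot idea expressed in pandas, as a minimal sketch with made-up column names:

import pandas as pd

# hypothetical wide table: one column per time period, as described above
wide = pd.DataFrame({
    'country': ['US', 'DE'],
    'Jan': [100, 80],
    'Feb': [120, 90],
})

# unpivot: period columns become rows, which is convenient for pivot tables and BI tools
long_df = wide.melt(id_vars='country', var_name='month', value_name='revenue')
print(long_df)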

Conclusions:

1) The training programs are built in completely different ways; the methodology of immersion into the Tableau tools is more elaborate and efficient. There is an opportunity to get practical experience solving problems and to get feedback (albeit automatic).
2) Report and dashboard design in the training materials from Microsoft hardly looks professional, while Tableau’s implementation is much better.
3) Knowledge assessment at Microsoft is implemented at an abysmal level (absolutely perfunctory tests, like in a bad school), while at Tableau it’s much better implemented: you dive into the problem, think about the answer, and solve it.
4) Cross-platform support is not PowerBI’s strongest point, while for Tableau it’s an excellent competitive advantage.
5) The functionality and capabilities of both tools are certainly at the highest level, and on some points PowerBI wins.

Have a look at our dashboard reviews in Tableau and other BI tools.


Target audience parsing in VK

Estimated read time – 5 min

When posting ads, some platforms allow you to upload, in the audience settings, a list of people who will see the ad. There are special tools to parse ids from public pages, but it’s much more interesting (and cheaper) to do it manually with Python and the VK API. Today we will show how we parsed the target audience for the LEFTJOIN promotional campaign and uploaded it to the advertising account.

Parsing of users

To send requests we will need a user token and the list of VK groups whose members we want to get. We collected about 30 groups related to analytics, BI tools, and Data Science.

import requests 
import time 

group_list = ['datacampus', '185023286', 'data_mining_in_action', '223456', '187222444', 'nta_ds_ai', 'business__intelligence', 'club1981711', 'datascience', 'ozonmasters', 'businessanalysts', 'datamining.team', 'club.shad', '174278716', 'sqlex', 'sql_helper', 'odssib', 'sapbi', 'sql_learn', 'hsespbcareer', 'smartdata', 'pomoshch_s_spss', 'dwhexpert', 'k0d_ds', 'sql_ex_ru', 'datascience_ai', 'data_club', 'mashinnoe_obuchenie_ai_big_data', 'womeninbigdata', 'introstats', 'smartdata', 'data_mining_in_action', 'dlschool_mipt'] 

token = 'your_token'

A request for getting the members of a VK group returns at most 1000 people at a time; to fetch the next 1000 we need to shift the offset by another 1000. To know when to stop, we will write a function that accepts a group id, requests the total number of members, and returns the maximum number of 1000-person pages: the total count divided by 1000.

def get_offset(group_id): 
    count = requests.get('https://api.vk.com/method/groups.getMembers', 
    params={ 
           'access_token':token, 
           'v':5.103, 
           'group_id': group_id, 
           'sort':'id_desc', 
           'offset':0, 
           'fields':'last_seen' 
    }).json()['response']['count'] 
    return count // 1000

In the next step, we will write a function that accepts a group’s ID, collects all the subscribers into a list, and returns it. To do this, we will keep requesting 1000 people at a time until we reach the maximum offset, adding the data to the list. For each person we will additionally check the date of their last visit, and if they have not logged in since the middle of November, we won’t add them. The time is given in unixtime format.

def get_users(group_id): 
    good_id_list = [] 
    offset = 0 
    max_offset = get_offset(group_id) 
    while offset <= max_offset: 
        response = requests.get('https://api.vk.com/method/groups.getMembers', 
        params={
        'access_token':token, 
        'v':5.103, 
        'group_id': group_id, 
        'sort':'id_desc', 
        # the API offset is measured in members, so shift by a full page of 1000
        'offset':offset * 1000, 
        'fields':'last_seen' }).json()['response'] 
        offset += 1 
        for item in response['items']: 
            try: 
                # keep only users who logged in after mid-November 2020 (unixtime)
                if item['last_seen']['time'] >= 1605571200:
                    good_id_list.append(item['id']) 
            except KeyError: 
                continue 
    return good_id_list

Now we will parse all the groups from the list, collect the members, and add them to the all_users list. In the end, we will convert the list into a set and back into a list to get rid of duplicates, since the same people may be members of several groups. After parsing each group, we will pause the program for a second to avoid hitting the request limit.

all_users = [] 

for group in group_list: 
    print(group) 
    try: 
        users = get_users(group) 
        all_users.extend(users) 
        time.sleep(1) 
    except KeyError as E: 
        print(group, E) 
        continue 

all_users = list(set(all_users))

The last step is writing each user id to a file, one per line.

with open('users.txt', 'w') as f: 
    for item in all_users: 
        f.write("%s\n" % item)

Uploading the audience from a file to the advertising account

Let’s open our VK advertising account and choose the “Retargeting” tab. There we will find the “Create audience” button:


After clicking it, a new window will pop up where we will be able to choose a file as a source and indicate the name of the audience.


The audience becomes available a few seconds after uploading. For the first 10 minutes the panel may say that the audience is too small; this is not true, and the status will refresh shortly if your audience really contains more than 100 people.


Results

Let’s compare the average cost of an attracted subscriber to our group for an ad with automatic audience targeting and for an ad with the audience that we scraped. In the first case, the average cost is 52.4 rubles, in the second case 33.2 rubles, a reduction of roughly 37% ((52.4 - 33.2) / 52.4 ≈ 0.37). Selecting a quality audience by parsing data from VK helped us reduce the average cost by 37%.
We prepared this post for our advertising campaign:
Hey! You see this ad because we parsed your id and uploaded it as file targeting in the VK advertising account. Do you want to know how to do this?
LEFTJOIN is a blog about analytics, visualization, Data Science, and BI. The blog contains a lot of material on various BI and SQL tools, data visualization and dashboards, working with different APIs (from Google Docs and social networks to APIs for beer lovers), and interesting Python libraries.


Dbt Coalesce conference: best talks to watch

Estimated read time – 3 min


The Coalesce 2020 conference, which I’ve mentioned before, took place from 7 to 11 December 2020. This year, the organizers decided to hold the conference over 5 days with a great number of talks.

On the one hand, this is an advantage: with such an abundance of talks, you feel you can choose what is most interesting to watch. On the other hand, this amount of information is tiring, since it’s often impossible to tell from the title alone whether a presentation will be interesting and useful. In my opinion, more than 3 days is too long for a conference, as the audience loses interest. Moreover, personal and professional obligations don’t disappear because of an event that, although online, still takes up your time.

However, I managed to watch most of the talks, sometimes skimming through them. First of all, my overall impression: it is great to study presentations from conferences like Coalesce, as they mostly cover modern BI tools and cloud solutions. Almost every talk mentions Redshift / BigQuery / Snowflake or BI tools like Mode / Tableau / Looker / Metabase. Obviously, dbt is at the center of everything.

The shortlist of talks that I recommend for studying:

  1. dbt 101 — an introductory talk on what dbt is and how to use it.
  2. Kimball in the context of the modern data warehouse: what’s worth keeping, and what’s not — an interesting but extremely controversial video that raised a lot of questions in the dbt community. In short, the author suggests using wide analytical tables and giving up normal forms everywhere.
  3. Building a robust data pipeline with dbt, Airflow, and Great Expectations — a talk about a rather interesting tool called greatexpectations which is used for data validation.
  4. Orchestrating dbt with Dagster — a video seemed a bit boring for me, but if you want to learn about Dagster, you’ll like it.
  5. Supercharging your data team — the guys created a wrapper for dbt called dbt executor 9000 and presented it.
  6. Presenting: SQLFluff — a video about a really cool tool called SQLFluff that automatically lints and fixes SQL code according to style rules.
  7. Quickstart your analytics with Fivetran dbt packages — from this video, you’ll learn about Fivetran and how to use it with dbt.
  8. Perfect complements: Using dbt with Looker for effective data governance — about the interaction of dbt with Looker, and the differences and similarities between the tools.