27 posts tagged

python

Mean VS median: how to choose a target metric?

Estimated read time – 7 min

In today’s article, we would like to highlight a simple, but important topic – how to choose a simple metric to evaluate a particular dataset. Everyone has been familiar with the arithmetic mean for a long time, almost every student knows very well that you should sum up all the available values, divide by their number and get the average value. However, school knowledge does not include any alternative options, of which, in fact, there are many in statistics – almost for every occasion. In solving research and marketing problems, people often take mean as a target. Is this legitimate or is there a better option? Let’s figure it out.

To begin with, it is worth remembering the definitions of the two metrics that we will talk about today.
Mean is the most popular statistic used to calculate a data center. What is the median? Median is a value that splits data, sorted in order of increasing values, into two equal parts. This means that the median shows the central value in the sample if the number of cases is odd and the arithmetic mean of the two values ​​if the number of cases in the sample is even.

Research tasks

So, the estimation of the sample mean is often important in many research questions. For instance, specialists studying demography often study changes in the number of regions in Russia in order to track the dynamics and reflect it in reports. Let’s try to calculate the average size of the Russian city, as well as the median, and then compare the results.
First, you need to find and load data by connecting the pandas library for this.

import pandas as pd
city = pd.read_csv('city.csv', sep = ';')

Then, you need to calculate the mean and median of the sample.

mean_pop = round (city.population_2020.mean (), 0)
median_pop = round (city.population_2020.median (), 0)

The values, of course, are different, since the distribution of observations in the sample is different from the normal one. In order to understand whether they are very different, let’s build a distribution graph and display the mean and median.

import matplotlib.pyplot as plt
import seaborn as sns

sns.set_palette('rainbow')
fig = plt.figure(figsize = (20, 15))
ax = fig.add_subplot(1, 1, 1)
g = sns.histplot(data = city, x= 'population_2020', alpha=0.6, bins = 100, ax=ax)

g.axvline(mean_pop, linewidth=2, color='r', alpha=0.9, linestyle='--', label = 'Mean = {:,.0f}'.format(mean_pop).replace(',', ' '))
g.axvline(median_pop, linewidth=2, color='darkgreen', alpha=0.9, linestyle='--', label = 'Median = {:,.0f}'.format(median_pop).replace(',', ' '))

plt.ticklabel_format(axis='x', style='plain')
plt.xlabel("Population", fontsize=20)
plt.ylabel("Number of cities", fontsize=20)
plt.title("Distribution of population of russian cities", fontsize=20)
plt.legend(fontsize="xx-large")
plt.show()

first.jpg

Also, on this data it is worth building a boxplot for more accurate visualization with the main distribution quantiles, median, mean and outliers.

fig = plt.figure(figsize = (10, 10))
sns.set_theme(style="whitegrid")
sns.set_palette(palette="pastel")

sns.boxplot(y = city['population_2020'], showfliers = False)

plt.scatter(0, 550100, marker='*', s=100, color = 'black', label = 'Outlier')
plt.scatter(0, 560200, marker='*', s=100, color = 'black')
plt.scatter(0, 570300, marker='*', s=100, color = 'black')
plt.scatter(0, mean_pop, marker='o', s=100, color = 'red', edgecolors = 'black', label = 'Mean')
plt.legend()

plt.ylabel("Population", fontsize=15)
plt.ticklabel_format(axis='y', style='plain')
plt.title("Boxplot of population of russian cities", fontsize=15)
plt.show()

second.jpg

It follows from the graphs that the median is significantly less than the average, and it is also clear that this is a consequence of the presence of outliers which are Moscow and St. Petersburg. Since the arithmetic mean is an extremely sensitive metric to outliers, it is not worth relying on conclusions about the mean if they are present in the sample. An increase or decrease in the population of Moscow can greatly change the average population in Russia, but this will not affect the real regional trend.
Using the arithmetic mean, we say that the number of a typical (average) city in the Russian Federation is 268 thousand people. However, this misleads us, since the mean value is significantly higher than the median solely due to the size of the population of Moscow and St. Petersburg. In fact, the number of a typical Russian city is significantly less (2 times less!) and is approximately 104 thousand citizens.

Marketing tasks

In a business tasks, the difference between the mean and the median is also important, as using the wrong metric can seriously affect the results of the advertising campaign or make it difficult to achieve the goal. Let’s take a look at a real example of the difficulties an entrepreneur can face in retail if he chooses the wrong target metric.
To begin with, as in the previous example, let’s load a dataset about supermarket purchases. Let’s select the dataset columns necessary for analysis and rename them to simplify the code in the future. Since this data is not as well prepared as the previous ones, it is necessary to group all purchased items by receipts. In this case, it is necessary to group by two variables: by the customer’s id and by the date of purchase (the date and time is determined by the moment of closing the bill, therefore, all purchases within one bill coincide by date variable). Then, let’s name the resulting column “total_bill”, that is, the check amount and calculate the average and median.

df = pd.read_excel ('invoice_data.xlsx')
df.columns = ['user', 'total_price', 'date']
groupped_df = pd.DataFrame (df.groupby (['user', 'date']). total_price.sum ())
groupped_df.columns = ['total_bill']
mean_bill = groupped_df.total_bill.mean ()
median_bill = groupped_df.total_bill.median ()

Now, as in the previous example, you need to plot the distribution of customer checks and boxplot, and also display the median and mean on each of them.

sns.set_palette('rainbow')
fig = plt.figure(figsize = (20, 15))
ax = fig.add_subplot(1, 1, 1)
sns.histplot(groupped_df, x = 'total_bill', binwidth=200, alpha=0.6, ax=ax)
plt.xlabel("Purchases", fontsize=20)
plt.ylabel("Total bill", fontsize=20)
plt.title("Distribution of total bills", fontsize=20)
plt.axvline(mean_bill, linewidth=2, color='r', alpha=1, linestyle='--', label = 'Mean = {:.0f}'.format(mean_bill))
plt.axvline(median_bill, linewidth=2, color='darkgreen', alpha=1, linestyle='--', label = 'Median = {:.0f}'.format(median_bill))
plt.legend(fontsize="xx-large")
plt.show()

third.jpg

fig = plt.figure(figsize = (10, 10))
sns.set_theme(style="whitegrid")
sns.set_palette(palette="pastel")

sns.boxplot(y = groupped_df['total_bill'], showfliers = False)

plt.scatter(0, 1800, marker='*', s=100, color = 'black', label = 'Outlier')
plt.scatter(0, 1850, marker='*', s=100, color = 'black')
plt.scatter(0, 1900, marker='*', s=100, color = 'black')
plt.scatter(0, mean_bill, marker='o', s=100, color = 'red', edgecolors = 'black', label = 'Mean')
plt.legend()

plt.ticklabel_format(axis='y', style='plain')
plt.ylabel("Total bill", fontsize=15)
plt.title("Boxplot of total bills", fontsize=15)
plt.show()

forth.jpg

The graphs show that the distribution is different from normal, which means that the median and mean are not equal. The median value is smaller than the average by about 220 rubles.
Now, imagine that marketers have a task to increase the average buyer’s bill. A marketer may decide that since the average check is 601 rubles, the following promotion can be offered: “All buyers who make a purchase over 600 rubles, will get 20% discount on any good for 100 rubles.” In general, it is a reasonable offer, however, in reality, the average check is lower – 378 rubles. Thus, the majority of buyers will not be interested in the offer, since their purchase usually does not reach the proposed threshold. This means, that they will not take advantage of the offer and will not receive a discount, and the company will not be able to achieve its goal and increase the profit. The point is that the initial assumptions were wrong.

Conclusions

As you already understood, the mean often shows a more pleasant result, both for business and for research tasks, because it is always nicer to imagine the situation with the average check or the demographic situation in the country better than it really is. However, one must always remember about the shortcomings of mean in order to be able to correctly choose the appropriate analogue for assessing a particular situation.

 No comments    23   1 mon   analysis   matplotlib   pandas   python

Graph of telegram channels related to analytics

Estimated read time – 3 min

The authors of various Telegram-blogs often publish a selection of their favorite channels so as to share their professional choice with their audience. The idea, of course, is not new, but I decided not just to make a rating of interesting analytical telegram blogs, but to solve this problem analytically.

As part of the current course of my studies, I am learning many modern approaches to data analysis and visualization. At the very beginning of the course, there was a warm-up exercise: object-oriented programming in Python for collecting and iterative building a graph with TMDB API. Usually, this method is used to construct a connection graph of actors, where the connection is a role in the same film. However, I decided that I could apply it to another problem: building a graph of connections for the analytic community.

Since my time has been particularly limited recently, and I have already completed a similar task for the course, I decided to transfer this knowledge to someone else who is interested in analytics. Fortunately, at that moment, a candidate for the vacancy of a junior data analyst, Andrey, texted me in direct messages. He is now in the process of comprehending all the intricacies of analytics, so we agreed on an internship, in which Andrey parsed data from telegram channels.

Andrey’s main task was to collect all texts from the Internet analyst’s telegram channel, select the channels, which Aleksey Nikushin linked, collect texts from these telegram channels and links on these channels. “Link” means any mention of the channel: through @, through a link or repost. As a result of parsing, Andrey got two files: nodes and edges.
Now I will present to you the graph that I got based on this data and comment on the results.

I would like to take this opportunity to express my compliments to the karpov.courses team, as Andrey has excellent knowledge of the Python language!

As a result, the top 10 channels in terms of degree (number of connections) looks like this:

  1. Интернет-аналитика
  2. Reveal The Data
  3. Инжиниринг Данных
  4. Data Events
  5. Datalytics
  6. Чартомойка
  7. LEFT JOIN
  8. Epic Growth
  9. RTD: ссылки и репосты
  10. Дашбордец

In my opinion, it turned out extremely exciting and visually interesting, and Andrey is a great fellow! By the way, he also started his own channel ”Это разве аналитика?”, where analytics news is published.

Looking ahead, this problem has a continuation. With the help of the Markov chain, we modeled where the user ends up if he iteratively navigates through all the mentions in the channels. It turned out unexpectedly, but we will tell about it next time!

 No comments    26   1 mon   data analytics   python

Bubble charts basics: area vs radius

Estimated read time – 4 min

Data visualization is a skill used in any industry where data is present, because tables are only good for storing information. When there is a need to present data, or rather certain conclusions derived from them, the data must be presented on graphs of a suitable type. So, here you are faced with two tasks: the first is to choose the right type of graph, the second is to display the results in a plausible way. Today we will tell you about one mistake that designers sometimes make when visualizing data on bubble-charts and how this mistake can be avoided.

The crux of building a bubble-chart

Firstly, let us tell you a bit of boring theory before we start analyzing the data. Bubble-chart is a convenient way to show three numerical variables without building a 3D model. The usual X and Y axes indicate the values ​​of two parameters, and the third is shown by the size of the circle that corresponds to each observation. This is what makes it possible to avoid the need to build a complex 3D chart, that is, anyone who sees a bubble-chart will be able to draw conclusions about the data much faster.

A mistake that a designer, but not a data analyst, can make

With the metrics that are displayed on the axes of the graph, no questions arise. This is the usual way of visualizing data, but with the data shown by the size of the circles there is some difficulty: how to correctly and accurately display changes in the values ​​of a variable, if the control is not a point on the axis, but the size of this point?
The fact is that when building such a graph without using analytical tools, for example, in a graphics editor, the author can draw circles, taking the radius of the circle as its size. At first glance, everything seems to be absolutely correct – the larger the value of the variable, the larger the radius of the circle. However, in this case, the area of ​​the circle will increase not as a linear, but as a power function, because S = π × r2. For instance, the figure below shows that if you double the radius of a circle, the area will quadruple.


Draw a circle in Matplotlib  

fig = plt.figure (figsize = (10, 10))
ax = fig.add_subplot (1, 1, 1)
s = 4 * 10e3


ax.scatter (100, 100, s = s, c = 'r')
ax.scatter (100, 100, s = s / 4, c = 'b')
ax.scatter (100, 100, s = 10, c = 'g')
plt.axvline (99, c = 'black')
plt.axvline (101, c = 'black')
plt.axvline (98, c = 'black')
plt.axvline (102, c = 'black')


ax.set_xticks (np.arange (95, 106, 1))
ax.grid (alpha = 1)

plt.show ()

example.png

This means that the graph will look implausible, because the dimensions will not reflect the real change in the variable, and the viewer pays attention and compares exactly the area of ​​the circles on the graph.

How to build such a graph correctly?

Fortunately, if you build bubble-charts using Python libraries (Matplotlib and Seaborn), then the size of the circle will be determined by the area, which is absolutely correct from in terms of visualization.
Now, using the example of real data found on Kaggle, we will show how to build a bubble-chart. The data contains the following variables: country, population size, percentage of literate population. For the chart to be readable, let’s take a subsample of the top 10 countries after sorting all the data in order of increasing GDP.

First, let’s load all the necessary libraries:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Then, load the data, clear it from all rows with missing values ​​and transform the population of countries to millions of people:

data = pd.read_csv ('countries of the world.csv', sep = ',')
data = data.dropna ()
data = data.sort_values ​​(by = 'Population', ascending = False)
data = data.head (10)
data ['Population'] = data ['Population']. apply (lambda x: x / 1000000)

Now that all the preparations are complete, you can build a bubble-chart:

sns.set (style = "darkgrid")
fig, ax = plt.subplots (figsize = (10, 10))
g = sns.scatterplot (data = data, x = "Literacy (%)", y = "GDP ($ per capita)", size = "Population", sizes = (10,1500), alpha = 0.5)
plt.xlabel ("Literacy (Percentage of literate citizens)")
plt.ylabel ("GDP per Capita")
plt.title ('Chart with bubbles as area', fontdict = {'fontsize': 'x-large'})

def label_point (x, y, val, ax):
    a = pd.concat ({'x': x, 'y': y, 'val': val}, axis = 1)
    for i, point in a.iterrows ():
        ax.text (point ['x'], point ['y'] + 500, str (point ['val']))

label_point (data ['Literacy (%)'], data ['GDP ($ per capita)'], data ['Country'], plt.gca ())

ax.legend (loc = 'upper left', fontsize = 'medium', title = 'Population (in mln)', title_fontsize = 'large', labelspacing = 1)

plt.show ()

_32.png

This graph displays three metrics in an understandable way: the level of GDP per capita on the Y axis, the percentage of the literate population on the X axis, and the population – by the area of ​​the circle.

We recommend using size of the circle to show one of the variables, if there is a need to show 3 or more variables on one chart.

 No comments    25   1 mon   python   visualisation

How and why should you export reports from Jupyter Notebook to PDF

Estimated read time – 5 min

If you are a data analyst and you need to present a report to a client, if you are looking for a job and do not know how to draw up a test task in such a way that people will pay attention to you, if you have a lot of educational projects related to data analytics and visualization, this post will be very, very useful to you.
Looking at someone else’s code in a Jupyter Notebook can be problematic, because the result is often lost between lines of code with data preparation, importing the necessary libraries and a series of attempts to implement the idea. That is why a method such as exporting results to a PDF file in LaTeX format is a great option for final visualization. It will save time and look presentable. In scientific circles, articles and reports are very often formatted using LaTeX, since it has a number of advantages:

  • Math equations and formulas look neater.
  • The bibliography is automatically generated based on all references used in the document.
  • The author can focus on the content (not on the appearance of the document), since the layout of the text and other data is set automatically by specifying the necessary parameters in the code.

Today we will talk in detail about how to export such beautiful reports from Jupyter Notebook to PDF using LaTeX.

Installing LaTeX

The most important point in generating a report from a Jupyter Notebook in Python is exporting it to the final file. The main library you need to install is – nbconvert – which converts your notebook into any convenient document format: pdf (as in our case), html, latex, etc. This library needs not only to be installed, but some preinstalling of several other packages as well: Pandoc, TeX, and Chromium. According to the link to the library, the whole process is described in detail for each software, so we will not dwell on it here.
Once you have completed all the preliminary steps, you need to install and import the library into your Jupyter Notebook.

! pip install nbconvert
import nbconvert

Export tables to Markdown format

Usually, tables look a bit odd in reports, as they can be difficult to read quickly, but sometimes it is still necessary to add a small table to the final document. In order for the table to look neat, you need to save it in Markdown format. This can be done manually, but if there is a lot of data in the table, it is better to come up with a more convenient method. We suggest using the following simple pandas_df_to_markdown_table () function, which converts any dataframe to a markdown-table. Note: after the conversion, indices disappear, therefore, if they are important (as in our example), it is worth saving them into a variable in the first column of the dataframe.

data_g = px.data.gapminder ()
summary = round (data_g.describe (), 2)
summary.insert (0, 'metric', summary.index)

# Function to convert dataframe to Markdown Table
def pandas_df_to_markdown_table (df):
    from IPython.display import Markdown, display
    fmt = ['---' for i in range (len (df.columns))]
    df_fmt = pd.DataFrame ([fmt], columns = df.columns)
    df_formatted = pd.concat ([df_fmt, df])
    display (Markdown (df_formatted.to_csv (sep = "|", index = False)))

pandas_df_to_markdown_table (summary)

Export image to report

In this example, we will build a bubble-chart, the construction method of which was described in a recent post. Previously we used the Seaborn library, which shown that the display of data with the size of circles on the graph is correct. The same graphs can be created using the Plotly library.
In order to display the plot in the final report, you also need to complete an additional step. The point is that plt.show () will not help to display the graph when exporting. Therefore, you need to save the graph in the working directory, and then, using the iPython.display library, display it using the Image () function.

from IPython.display import Image
import plotly.express as px
fig = px.scatter (data_g.query ("year == 2007"), x = "gdpPercap", y = "lifeExp",
                 size = "pop", color = "continent",
                 log_x = True, size_max = 70)
fig.write_image ('figure_1.jpg')
Image (data = 'figure_1.jpg', width = 1000)

Formation and export of the report

When all stages of data analysis are completed, the report can be exported. If you need headings or text in the report, then write them in the cells of the notebook, changing the format from Code to Markdown. For export, you can use the terminal, running the second line there without an exclamation mark, or you can run the code written below in the cell of the Jupiter Notebook. We advise you not to load the report with code and use TemplateExporter.exclude_input = True parameter so that the cells with the code are not exported. Also, when you run this cell in your notebook, the code produces a standard output, and you need to write %% capture at the beginning of the cell not to export it.

%% capture
! jupyter nbconvert --to pdf --TemplateExporter.exclude_input = True ~ / Desktop / VALIOTTI / Reports / Sample \ LaTeX \ Report.ipynb
! open ~ / Desktop / VALIOTTI / Reports / Sample \ LaTeX \ Report.pdf

If you did everything correctly and methodically, then you will end up with a report similar to this one!
Present your data nicely :)

 No comments    58   1 mon   data analytics   python

Dashboard for the first 8 months of a child’s life

Estimated read time – 4 min

In December 2020, I became a dad, which means that our family life with my wife has changed drastically. Of course, I am sharing this news with you for a reason, but in the context of the data that we will study and research today. They are very personal for me, and therefore have some special magic and value. Today I want to show how dramatically the life of a family is changing by the example of my own analysis of the life data of the first 8 months of a baby.

Data collection

Initial data: tracking the main elements of caring for a baby in the first 8 months: sleep, nursing, changing a diaper. The data was collected using BabyTracker app.
My wife is a great fellow, because during the first 7 months she carefully and regularly monitored all the important points. She forgot to turn off the timer for nursing the baby at night only a couple of times, but I quickly saw noticeable outliers in the data, and the dataset was cleared of them.
Initially, I had several data visualization ideas in my head, and I tried to immediately implement them into the projected dashboard. I wanted to show the baby’s sleep intervals in the form of a vertical Gantt chart, but the night’s sleep went through the midnight (0:00), and it was completely incomprehensible how this could be corrected in Tableau. After a number of unsuccessful independent attempts to find a solution to this problem, I decided to consult with Roman Bunin. Unfortunately, we came to the conclusion together that there is no way to solve this. Then I had to write a little Python code that splits such time intervals and adds new lines to the dataset.
However, while we were texting, Roma sent an example identical to my idea! This example claims that a woman collected data on her child’s sleep and wakefulness in the first year of child’s life, and then wrote the code with which it turned out to be embroidered towel with pattern datavis baby sleep. For me, this was surprising, since it turned out that this way of visualization is the main method that allows you to show how difficult life and sleep of parents is in the first months of the birth of a child.
In my dashboard on Tableau Public I got three semantic blocks and several “KPIs” about which I would like to tell you in detail and share the basic everyday wisdom. At the top of the dashboard, you can see the key averages of the daytime and nighttime sleep hours, nursing hours, frequency of nursing, and the number of diaper changes in the first three months. I have allocated exactly three months, because I think this is the most difficult period, because significant changes that require serious adaptation are taking place in your life .

Dream

The left diagram – called “Towel” – illustrates the baby’s sleeping periods. In this diagram, it is important to pay attention to white gaps, especially at night. These are the hours when the baby is awake, which means that the parents are also awake. Look at how the chart changes, especially in the early months, when we gave up the habit of going to bed at 1 or 2 in the morning and fell asleep earlier. Roughly speaking, in the first three months (until March 2021), the child could fall asleep at 2 or 3 in the morning, but we were lucky that our child’s night sleep was quite long.
The right graph clearly illustrates how the baby’s day and night sleep length changes over time, and the boxplots show the distribution of the hours of daytime and nighttime sleep. The graph confirms the conclusion: “This is temporary and will definitely get better soon!”

Nursing

From the left diagram, you can see how the number and duration of nursing change. This number is gradually decreasing, and the duration of nursing periods is shortened. Since mid-July, we have changed the way we track nursing periods, so they are not valid for this analysis.
From my point of view, the findings are a great opportunity for couples planning a pregnancy, not to create illusions about the opportunity to work or do any other business in the first months after giving birth. Pay attention to the frequency and duration of nursing, all this time the parent is completely busy with the child. However, do not be overly alarmed: over time, the number of nursing periods will decrease.

Diaper change

The left graph is the highlight of this dashboard. As you can imagine, this is a map of the most fun moments – changing a diaper. The stars represent the moments of the day when you need to change the diaper, and the light gray color below shows the number of changes per day. The graph on the right shows diaper changes counted by part of the day. In general, the diagram does not show any interesting dependencies, however, it prepares you for the fact that this process is frequent, regular and happens at any time of the day.

Conclusions

It seems to me that the use of real personal data and such visualization is sometimes much more revealing than a lot of videos or books about what this period will be like. That is why I decided to share my findings and observations with you here. The main conclusion that I wanted you to draw from the dataviz: children are great! ❤️

 No comments    43   2 mon   data analytics   python
Earlier Ctrl + ↓