All content here is under a Creative Commons Attribution CC-BY 4.0 and all source code is released under a BSD-2 clause license.
Please reuse, remix, revise, and reshare this content in any way, keeping this notice.
# Run this cell once, at the start, to load the notebook's style sheet.
from IPython.display import HTML
css_file = './images/style.css'
HTML(open(css_file, "r").read())
In the prior module (module 9) you learned an approach to follow for any data analysis project, as well as some basic plots and statistics. In this module you will learn about the 5 objectives we see for data analysis, as well as some further plots and statistical concepts.
git clone git@github.com:kgdunn/python-basic-notebooks.git # If you already have the repo cloned: git pull
to update it to the latest version.
You should have `matplotlib` installed; if not, follow the matplotlib installation tutorial.
In the prior module we covered topics 1 to 3, while in this module we will cover:

4. Data tables
5. Time-series, or sequence plots
6. Scatter plots
7. Creating better box plots
In between, throughout the notes, we will also introduce statistical and data science concepts. This way you will learn how to interpret the plots and also communicate your results with the correct language.
In the prior module I described my approach for any data analysis project. The first step is to define the goals. When I take a look at various projects I have worked on, the goals always fall into one or more of these categories, or 'application domains'.
I will describe these goals shortly. But why look at this? The reason is that certain goals can be solved with a subset of tools. The number of tools available to you is large; knowing which one to use for which type of goal helps you make progress faster.
Goal 3 is about making predictions from the system: e.g. predicting what quality is being produced by the system, or how much longer a batch should be run before it is completed. The prediction is typically required to support other decisions, or to apply real-time control on the system.
Goal 4 also can take place on-line, and is used to ensure the system is operating in a stable manner, and if not, using the data to figure out what is going wrong, or is about to go wrong.
Goal 5 is typically off-line, and here we use the data to make longer term improvements. For example, we try to move the system to a different state of operation that is more optimal/profitable. This can also be done in real-time, where systems are continuously shifted around to track an optimum target.
This is just one way to categorize data science problems. There are of course other ways to do this: such as whether you are dealing with one variable (a vector) or many variables (matrices), or which type of technique you are using: *supervised* or *unsupervised*.
We will encounter these terms along the way. But for now, you should be able to see any problem where you have used data as fitting into one of these five categories above.
For example: your manager asks you to use data (whatever is available) to discover why we are seeing increased number of customers returning our most profitable product to the store. Your objective: Find reason(s) for increased returns of product.
Which of the 5 goals above are used?: Number 2 "Troubleshoot a problem that is occurring" is the most direct. But along the way to achieving that goal, you will almost certainly apply number 1: "Learn more about your system".
Following up: in the future, after you have found the reasons for returned product, you might do number 5: "optimizing the system" to find settings for the machines, so that fewer low-quality products are produced. Then, in a different data science project, based on number 4, you "monitor the system in real-time" to prevent producing bad-quality products. This might be done by applying number 3: "making predictions of the product quality" in real-time, while the system is operating.
As you can see, these 5 goals are generally very broad. Why do we mention them?
You might learn, in other courses and later in your career, about different tools to implement. Then you can interchange the tools in your toolbox. For example, linear regression is one type of prediction tool to achieve goal 3, but so is a neural network. If one tool does not work so well, you can swap it for another one in your pipeline.
Try breaking down the data-based project you are currently working on. Check which one or more of the five goals apply.
Data tables are an effective form of data visualization. Some tips:
Here's an example of the Blender Efficiency data set. It was a designed experiment to see how the blending efficiency can be improved, using 18 experiments.
import pandas as pd
blender = pd.read_csv('http://openmv.net/file/blender-efficiency.csv')
blender.sort_values('BlendingEfficiency', inplace=True)
blender
Click on the column header for `BlendingEfficiency` and you can sort from low-to-high, or high-to-low. You can now instantly see that `ParticleSize` has the greatest effect on blending efficiency. No plotting required.
In terms of the 5 goals above - here we have used the table to learn more about our process: what direction is the *correlation* between particle size and blending efficiency? Positive or negative correlation?
Create a box plot of blending efficiency against particle size. This will achieve the goal of learning even more about our system, because then we can quantify the negative correlation:
blender.boxplot('BlendingEfficiency', by='ParticleSize')
In Pandas data tables, especially for calculated variables, you might see too many decimals (the default is 6). If you want to adjust that, run this command: `pd.set_option('display.precision', 2)` for 2 decimals. See the code in the next section for an example.
Run the code below to convince yourself that pie charts should not be used instead of a table. If you are pressured to use a pie chart instead of a table, use the example below (and some of the links) to help argue your case.
from matplotlib import pyplot
%matplotlib inline
import pandas as pd
website = pd.read_csv('http://openmv.net/file/website-traffic.csv')
website.drop(columns=['MonthDay', 'Year'], inplace=True)
average_visits_per_day = website.groupby('DayOfWeek').mean()
percentage = average_visits_per_day / average_visits_per_day.sum() * 100
fig = pyplot.figure(figsize=(15, 4));
axes = pyplot.subplot(1, 2, 1)
percentage.plot.pie(y='Visits', ax=axes, legend=False)
axes.set_aspect('equal')
# Right plot: subplot(1,2,2) means: create 1 row, with 2 columns, and draw in the 2nd box
# Take the same grouped data from before, except sort it now:
percentage.sort_values('Visits', ascending=True, inplace=True)
percentage.plot.barh(ax=pyplot.subplot(1, 2, 2), legend=False)
pd.set_option('display.precision', 2)
percentage
The superiority of tables is not surprising here. The human eye excels at finding differences in 2 dimensions with respect to length and location, but it is not good at estimating areas and angles; yet a pie chart encodes its information only in terms of area and angle.
Need more convincing evidence?
Related to data tables is the concept of colour-coding the entries in the data table according to their values. High values get a specific colour (e.g. red), and low values another colour (e.g. blue) and then the in-between values are shaded in a transition. This is also related to a colour map: each value is mapped to a certain colour (more on that below, in the section on scatter plots).
This is helpful for emphasizing trends in the data which are not easy to pick up with the numbers alone.
These colour-coded tables are called heatmaps.
In the prior module we created box plots for the taste ratings given to various samples of Peas, based on their flavour attributes: flavour, sweetness, fruity flavour, off-flavour, mealiness and hardness.
The judges give scores on a scale of 1 to 10.
# Load the data
import pandas as pd
peas = pd.read_csv('https://openmv.net/file/peas.csv')
judges = peas.loc[:, 'Flavour': 'Hardness']
judges.head()
# Now visualize the table trends:
import seaborn as sns
%matplotlib inline
# Change the default figure size
sns.set(rc={'figure.figsize':(15, 5)})
# Look at the transpose of the heatmap instead
sns.heatmap(judges.T);
That visualization is somewhat helpful, because we already get an idea that some of the attributes move together: see that when `Flavour`, `Sweet` and `Fruity` are low (dark colours), they are jointly low, and that they move opposite to the other 3 flavour characteristics.
Now let's sort the data set and try again:
judges.sort_values(by='Hardness', inplace=True)
sns.heatmap(judges.T);
What a difference! Now the visualization is greatly improved, and actually tells a story. That's the purpose of any visualization.
Now we quickly see the opposite trends occurring which took us much longer to realize in the prior plot. *How would you describe the trends to someone*?
Note also that you could not have seen these trends from a box plot!
Next we can calculate the *correlation* value, which is a number between $-1$ and $+1$ that shows how strongly variables are related. A value of 0 is no correlation. A value of $-1$ is a perfect negative relationship, and $+1$ is a perfect positive relationship.
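A quick, self-contained sketch of these extremes, using made-up columns (any column that is an exact straight-line function of `x` has correlation exactly $\pm 1$ with it):

```python
import numpy as np
import pandas as pd

x = np.arange(10.0)
df = pd.DataFrame({'x': x,
                   'up': 2 * x + 1,    # perfect positive relationship
                   'down': -0.5 * x})  # perfect negative relationship

# corr(x, up) is +1 and corr(x, down) is -1
print(df.corr())
```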
We will visualize what a strong or a weak correlation is in the next section on scatter plots. Here we already see how the columns are correlated to each other: both in a table, and in a heat map. Heat maps are a great way to visualize correlations.
from IPython.display import display
import numpy as np
import pandas as pd
pd.set_option('display.precision', 3)
display(judges.corr())
corr = judges.corr()
# Create a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
# Generate a colormap for the correlations
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap,
square=True, linewidths=.2,
cbar_kws={"shrink": 0.5});
Instantly you can confirm your expectation of the trends in that data set:

* The `Flavour`, `Sweet` and `Fruity` attributes are correlated together: as one goes up, the others also go up.
* The `Off-flavour`, `Mealiness` and `Hardness` attributes are also correlated together: as one goes up, the others also go up.

The goal of this challenge is to discover how the columns in the cheese taste data set are related to each other. In this data set the concentrations of three chemicals are given: `Acetic` acid, `H2S` (hydrogen sulfide) and `Lactic` acid.
A subjective taste value is also provided as the 4th column.
*If you want to cheat* scroll down to see a partial solution.
If you have a single column of data, you may see interesting trends in the sequence of numbers when plotting it. These trends are not always visible when just looking at the numbers, and they definitely cannot be seen in a box plot.
An effective way of plotting these columns is horizontally, as a series plot, or a trace. We also call them time-series plots if there is a second column of information indicating the corresponding time of each data point.
As promised in the prior notebook, we will now look at the time-based trends of the website visits data set.
Below we import the data.
import pandas as pd
website = pd.read_csv('http://openmv.net/file/website-traffic.csv')
dates = pd.to_datetime(website['MonthDay'], format='%B %d')
website.set_index(dates, inplace=True, drop=True)
website.plot(y='Visits', figsize=(15,5))
# Smooth it a bit, with a rolling mean
website['Visits'].rolling(5).mean().plot(linewidth=5);
Notice the common problem with smoothed rolling average data: it introduces a 'delay' into the time-series. The smoothed peaks are shifted to the right in time.
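If that delay matters, one option (an illustrative sketch, not part of the original code) is a centred window: `rolling(..., center=True)` averages points on both sides of each sample, so the smoothed peaks stay aligned with the raw data:

```python
import numpy as np
import pandas as pd

# A smooth signal with a known peak
s = pd.Series(np.sin(np.linspace(0, 6, 100)))

trailing = s.rolling(5).mean()               # lags behind the raw data
centred = s.rolling(5, center=True).mean()   # stays aligned with it

# The centred peak matches the raw peak; the trailing peak is shifted right
print(s.idxmax(), centred.idxmax(), trailing.idxmax())
```

The trade-off: a centred window uses "future" points, so it cannot be computed in real-time at the most recent sample.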
Copy and paste the above code, and try this again for the Ammonia dataset. Note in the code below:

* We create a date-time index with `pd.date_range(...)`. We were told the data were collected every 6 hours.
* Modify the rolling window in the code below: try `'12H'` (12 hours), `'2D'` (2 days), `'30D'`, etc.

ammonia = pd.read_csv('http://openmv.net/file/ammonia.csv')
datetimes = pd.date_range('1/1/2020', periods=ammonia.shape[0], freq='6H')
ammonia.set_index(datetimes, inplace=True)
ammonia['Ammonia'].plot(figsize=(15,5), color='lightblue')
ammonia['Ammonia'].rolling('2D').mean().plot(color='black', linewidth=3);
The goal of this challenge is to understand what a random walk looks like, visually, as a time-series.
In the prior module you created the numbers that represent a random walk. Then you looked only at the distribution. Here's the prior code:
from scipy.stats import norm
# 20 steps for a regular person, showing the deviation to the
# left (negative) or to the right (positive) when they are
# walking straight. Values are in centimeters.
regular_steps = norm.rvs(loc=0, scale=5, size = 20)
print('Regular walking: \n{}'.format(regular_steps))
# Consumed too much? Standard deviation (scale) is larger:
deviating_steps = norm.rvs(loc=0, scale=12, size = 20)
print('Someone who has consumed too much: \n{}'.format(deviating_steps))
In the space below, start with the code given above, then modify it to:

* use `size=400` steps
* capture the plot axis: `ax = df.plot(...)  # the output of the plot function is an axis`
* add a horizontal reference line: `ax.axhline(y=0, color='k')`
* use `ax` to set labels: `ax.set_xlabel(...)` or `ax.set_ylabel(...)`
Here's how my plot looked. Run your code several times to see how different the random walks appear.
import pandas as pd
from scipy.stats import norm
from matplotlib import style
# print(style.available)
style.use('ggplot')
N = 400
regular_steps = norm.rvs(loc=0, scale=5, size = N)
deviating_steps = norm.rvs(loc=0, scale=12, size = N)
datetimes = pd.date_range('1/1/2020', periods=N, freq='1S')
regular = pd.Series(regular_steps.cumsum(), index = datetimes)
regular.plot(figsize=(15,5))
deviating = pd.Series(deviating_steps.cumsum(), index = datetimes)
ax = deviating.plot()
ax.axhline(y=0, color='k', linestyle='-', linewidth=2)
ax.set_ylabel('Deviation from the starting point [cm]')
ax.legend(['Regular steps', 'Deviating steps']);
The goal of this challenge is to understand the growth rate (reaction kinetics) of bacteria. To see what the growth looks like visually, but also to discover when the growth rate is the fastest, the most productive.
Back in module 3 we integrated an equation for bacteria growing on a plate:
$$ \dfrac{dP}{dt} = rP $$where $P$ is the number of bacteria in the population, and $r$ is their exponential rate of growth [number of bacteria/minute]. This is not realistic. Eventually the bacteria will run out of space and their food source. So the equation is modified:
$$ \dfrac{dP}{dt} = rP - aP^2$$where they are limited by the factor $a$ in the equation.
The differential equation can be re-written as: $$P_{i+1} = P_i + \left[\,rP_i -a\,P_i^2\,\right]\delta t$$
which shows how the population at time point $i+1$ (one step in the future) is related to the population size now, at time $i$ over a short interval of time $\delta t$ minutes.
Starting from 500 cells initially with a rate $r=0.032$ and the coefficient $a = 1.6 \times 10^{-7}$ we can generate the growth curves and plot them.
import numpy as np
import pandas as pd
from IPython.display import display
p_initial = 500
r = 0.032
a = 1.6E-7
delta_t = 1 # minutes
time_final = 8*60
# Create the two outputs of interest
time = np.arange(start=0.0, stop=time_final, step=delta_t)
population = np.zeros(time.shape)
population[0] = p_initial
for idx, t_value in enumerate(time[1:]):
    population[idx + 1] = population[idx] + (r*population[idx] - a*population[idx]**2) * delta_t
# Now plot the data
bugs = pd.DataFrame(data = {'bacteria': population}, index=time)
display(bugs.head())
ax = bugs.plot(figsize=(15,5))
ax.grid(True)
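As a sanity check on the simulation (a quick aside, not in the original notebook): the plateau in the plot is where $dP/dt = 0$, i.e. $rP - aP^2 = 0$, giving a carrying capacity of $P = r/a$:

```python
r = 0.032
a = 1.6e-7

carrying_capacity = r / a
print(carrying_capacity)  # ~200000 bacteria: the plateau of the curve
```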
Scatter plots are widely used and easy to understand. *When should you use a scatter plot?* When your goal is to draw the reader's attention to the relationship between 2 (or more) variables.
In a scatter plot we use 2 sets of axes, at 90 degrees to each other. We place a marker at the intersection of the values shown on the horizontal (x) axis and vertical (y) axis.
Most often variable 1 and 2 (also called the dimensions) will be continuous variables. Or at least *ordinal variables*. You will seldom use categorical data on the $x$ and $y$ axes.
You can add a 3rd dimension: the marker's size indicates the value of a 3rd variable. It makes sense to use a numeric variable here, not a categorical variable.
You can add a 4th dimension: the marker's colour indicates the value of a 4th variable. Usually this will be a categorical variable, e.g. red = category 1, blue = category 2, green = category 3. Continuous numeric transitions are hard to map onto colour; however it is possible to use transitions, e.g. values from low to high shown on a sliding gray scale.
You can add a 5th dimension: the marker's shape can indicate the discrete values of a 5th categorical variable. E.g. circles = category 1, squares = category 2, triangles = category 3, etc.
In summary:
Let's get started with some examples. We will start off with the example from the prior module where we considered the grades of students, and how long it took to write the exam.
# Standard imports required to show plots and tables
from matplotlib import pyplot
from IPython.display import display
%matplotlib inline
import pandas as pd
# Modify the code if you are behind a proxy server
grades = pd.read_csv('https://openmv.net/file/unlimited-time-test.csv')
ax = grades.plot.scatter(x = 'Time', y = 'Grade',
figsize = (8, 8),
# These remaining inputs are optional, but
# specified below so you can explicitly see them
# Size of the dots: change this to get a feeling
# for the range of values you should use
s = 50,
# Specify the colour
c = 'darkgreen',
# The shape of the marker
# See https://matplotlib.org/3.1.1/api/markers_api.html
marker = 'D'
)
Remember our objective from the prior notebook? Do students score a higher `Grade` if they have a longer `Time` to finish the exam? The idea was that students would have less stress with unlimited time, because they had all their books and notes with them. In theory these are fairly ideal exam conditions.
The scatter plot however shows there isn't anything conclusive in the data to believe that there is a relationship. Let us also quantify it with the correlation value we introduced above.
display(grades.corr())
The correlation value is $r=-0.044$, essentially zero. So now you get an idea of what a zero correlation means.
Think of the implication of that: since for a straight-line fit the $R^2$ value is simply the square of the correlation coefficient, $R^2 = r^2$, you can calculate the $R^2$ value - the value often used to judge how good a linear regression is - without calculating the linear regression model! Further, it shows that for linear regression it does not matter which variable is on your $x$-axis, or your $y$-axis: the $R^2$ value is the same.
If you understand these 2 points, you will understand why $R^2$ is not a great number at all to judge a linear regression model.
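A numerical sketch of both points, on synthetic data (all names here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 2 * x + rng.normal(size=50)  # linear trend plus noise

r = np.corrcoef(x, y)[0, 1]

def r_squared(x, y):
    """R^2 of a straight-line least-squares fit of y on x."""
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    return 1 - residuals.var() / y.var()

# r^2 equals R^2, no matter which variable is on which axis
print(round(r ** 2, 8), round(r_squared(x, y), 8), round(r_squared(y, x), 8))
```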
Let's look at some other correlations. If you completed the Cheese Challenge above, you have already seen what the correlation values are for that dataset:

* There is a strong correlation between `Taste` and the amount of `H2S` present (correlation of 0.756), while
* the amount of `Lactic` acid present is also quite strongly correlated with the amount of `H2S` (0.644).

cheese = pd.read_csv('http://openmv.net/file/cheddar-cheese.csv')
cheese.set_index('Case', inplace=True)
pd.set_option('display.precision', 3)
from IPython.display import display
display(cheese.corr())
Now we would like to visualize these pairwise relationships. We can draw 6 scatter plots to show all the pairwise combinations of `Acetic`, `H2S`, `Lactic` and `Taste`.

The Seaborn library, based on matplotlib, does this in a single line of code, using its `sns.pairplot(...)` function.
Visually relate the scatter plots below, with the numeric correlations in the table above. Get a feeling for what a correlation of $r=0.6$ or in other words $R^2 = 0.36$ is. It is fairly strong! You can see trends and relationships.
sns.set(rc={'figure.figsize':(15, 5)})
sns.pairplot(cheese);
We saw that we can alter the size (`s=...`), colour (`c=...`) and shape (`marker=...`) of the marker to indicate a 3rd, 4th or 5th dimension.

In the plots above you saw how to specify `s`, `c` and `marker` if all the values are the same. Below you see how to do that if they are different: you specify a vector for `s` and `c`, the same length as the data.

The vector for the size, `s`, is often a function of the variable being plotted. Remember that matplotlib's `s` argument scales the marker's *area*, so the perceived radius grows only with the square root of `s`.

The colour, `c`, is often a categorical variable. In the example below we use red for "Yes" (baffles are present) and black for "No".
We consider changing the markers' shape in the next piece of code.
import numpy as np
from matplotlib import pyplot
yields = pd.read_csv('http://openmv.net/file/bioreactor-yields.csv')
baffles = yields['baffles'].values
# Idea: [f(x) if condition else g(x) for x in sequence]
colour = ['red' if b == 'Yes' else 'black' for b in baffles]
size = (np.sqrt(yields['speed'] - 3200) - 4) * 10
ax = yields.plot.scatter(x='temperature', y='yield', figsize=(10,8),
s = size,
c = colour)
ax.set_xlabel('Temperature [°C]')
ax.set_ylabel('Yield [%]');
ax.set_title('Yield as a function of temperature [location], baffles [marker colour] and speed [marker size]');
From the above visualization we quickly see how the red points (baffles are present in the reactor) have a reducing effect on the yield. The yield also drops off with temperature.
What can you say about the marker size, which represents the speed of the impeller in the bioreactor?
We don't actually have a 5th dimension to visualize in this data set, to also change the marker shape. Marker shapes must be associated with a categorical variable. We will show how you could do it, based on the `baffles` column. The idea is to iterate over each unique category, taking the colour and shape from a dictionary.
markers = {'No': 's', # square
'Yes': 'o'} # circle
colours = {'No': 'black',
'Yes': 'red'}
# Create an empty axis to plot in
ax = pyplot.subplot(1,1,1)
for baffle_type in yields['baffles'].unique():
    subset = yields[yields['baffles'] == baffle_type]
    subset.plot.scatter(ax = ax,
                        figsize = (10, 8),
                        x = 'temperature', y = 'yield',
                        s = (np.sqrt(subset['speed'] - 3200) - 4) * 10,
                        c = colours[baffle_type],
                        marker = markers[baffle_type]
                        )
ax.set_xlabel('Temperature [°C]')
ax.set_ylabel('Yield [%]');
ax.set_title('Yield as a function of temperature [location], baffles [colour and marker shape] and speed [marker size]');
If you have a sliding scale for colour, then you need to use a colour map. See the matplotlib colormap reference.
A colour map relates a normalized input value between 0 and 1 (or an integer index from 0 to 255) to a particular colour. On that webpage you see the various colour maps available.
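A minimal sketch of a continuous colour mapping in a scatter plot (the data here are invented): pass the numeric values to `c=` and name a colormap with `cmap=`; matplotlib normalizes the values and looks up the colours for you:

```python
import numpy as np
from matplotlib import pyplot

x = np.linspace(0, 10, 50)
y = np.sin(x)

# Colour encodes the y-value itself, on a continuous sliding scale
sc = pyplot.scatter(x, y, c=y, cmap='viridis')
pyplot.colorbar(sc, label='value of y');
```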
The goal of this challenge is to see how the judges' values for the 6 taste attributes of the peas are (co)related.
Generate a pairplot set of scatter plots for all pairwise combinations of the 6 attributes.
This is done intentionally for you to get a visual idea of what the correlation coefficient $(r)$ means, as well as $R^2$.
import pandas as pd
import seaborn as sns
peas = pd.read_csv('https://openmv.net/file/peas.csv')
judges = peas.loc[:, 'Flavour': 'Hardness']
sns.pairplot(judges);
#judges.corr()
Here we will consider alternatives, or additions, to the box plot, which we saw in the prior module.
The additions, in the order of progressively adding more information, are: violin plots, swarm plots and raincloud plots.
All 3 options improve the box plot, by showing the distribution of the underlying data and raw values from the column being visualized.
# Get the data
import pandas as pd
ammonia = pd.read_csv('http://openmv.net/file/ammonia.csv')
# You might need the proxy server settings
from matplotlib import pyplot
%matplotlib inline
import seaborn as sns
# Change the default figure size
#sns.set(rc={'figure.figsize':(15, 5)})
fig = pyplot.figure(figsize=(15, 5));
axis1 = pyplot.subplot(1, 2, 1)
axis2 = pyplot.subplot(1, 2, 2)
sns.boxplot(data=ammonia, ax = axis1)
axis1.set_ylim(0, 70)
axis1.grid(True)
sns.violinplot(y='Ammonia', data=ammonia, ax=axis2,
# Play with these settings
inner = "box", # the default
# inner = "quartile"
linewidth=3)
axis2.set_ylim(0, 70)
axis2.grid(True);
Swarm plots can complement a violin plot, as they show all the raw underlying data, not just the distribution.
from matplotlib import pyplot
%matplotlib inline
import seaborn as sns
# Change the default figure size
#sns.set(rc={'figure.figsize':(15, 5)})
fig = pyplot.figure(figsize=(15, 5));
axis1 = pyplot.subplot(1, 2, 1)
axis2 = pyplot.subplot(1, 2, 2)
sns.boxplot(data=ammonia, ax = axis1)
axis1.set_ylim(0, 70)
axis1.grid(True)
sns.swarmplot(y='Ammonia', data=ammonia, ax=axis2,
# Play with these settings
#inner = "box", # the default
# inner = "quartile"
#linewidth=3
)
axis2.set_ylim(0, 70);
axis2.grid(True)
import pandas as pd
import ptitprince as pt
%matplotlib inline
fig = pyplot.figure(figsize=(15, 5));
axis1 = pyplot.subplot(1, 2, 1)
axis2 = pyplot.subplot(1, 2, 2)
sns.boxplot(data=ammonia, ax = axis1, orient='h')
axis1.set_xlim(0, 70)
axis1.grid(True)
pt.RainCloud(y = 'Ammonia',
ax=axis2,
data = ammonia,
width_viol = .8,
width_box = .4,
figsize = (12, 8), orient = 'h',
move = .0)
axis2.set_xlim(0, 70)
axis2.grid(True)
But where a raincloud plot really works well is with comparison of multiple variables. Let's go back to an earlier worksheet case study, where we compared the thickness of plastic film.
The thickness was measured at the 4 corners.
import pandas as pd
import ptitprince as pt
%matplotlib inline
films = pd.read_csv('http://openmv.net/file/film-thickness.csv')
films.set_index('Number', inplace=True)
ax = pt.RainCloud(#y = 'Ammonia',
data = films,
width_viol = .8,
width_box = .4,
figsize = (12, 8), orient = 'h',
move = .0)
ax.grid(True)
Below we give some challenges that go beyond, but build on, the topics covered in this worksheet.
The Seaborn library wraps up Matplotlib, and provides easy-to-use functions for common data visualization steps. The goal of this challenge is to become more familiar with one of the most useful visualization libraries in Python.
Take a look at the Seaborn Gallery to see some examples. Which ones can you use in your next project?
Great blog post about 9 different visualizations. We have looked at all of these, but it is nice to see them all on one page.
Seaborn tutorial: a structured tutorial is always worthwhile. Look at this to understand topics we have not covered in detail:

* colour maps
* adding and rotating text
* 'contexts': adjusting the plot settings for different use cases: talks, posters, on paper or in notebooks
* changing the plot style
* adding a title
Interactively create Seaborn visualizations within this webpage (if you can handle all the advertising!)
Though not related to Seaborn, it is worth giving a link to the Visualizations from the Pandas library.
The goal of this challenge is to visualize data that is coming from a real-time live stream. The plots above are all static. But what if you want to monitor your process in real-time? See goal number 4 above.
Let's give it a try. We will monitor the CPU usage of your computer. You can install a small Python package to get the CPU percentage used. You will need the non-built-in package called `psutil`. Install it with `python3 -m pip install psutil` or with your package manager (e.g. Anaconda).
import psutil
# Measure the CPU used in a 0.9 second interval
psutil.cpu_percent(interval=0.9)
Run that code multiple times and check that the values change.
Now you would like to watch that value on a graph, changing in real-time. See these pages for inspiration on how to visualize that with Python:
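To get started, here is a minimal sketch of the data-acquisition side only (the `poll` helper is a hypothetical name invented here, and the `psutil` usage assumes you installed it as above): collect readings in a fixed-length rolling window, which is what you would redraw on each update of a live plot:

```python
import time
from collections import deque

def poll(n_samples, read_value, interval=0.0):
    """Call read_value() n_samples times, every `interval` seconds,
    keeping the readings in a fixed-length rolling window."""
    window = deque(maxlen=n_samples)
    for _ in range(n_samples):
        window.append(read_value())
        time.sleep(interval)
    return list(window)

# Usage (assuming psutil is installed):
# import psutil
# readings = poll(20, lambda: psutil.cpu_percent(interval=0.5))
```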
You can apply this challenge directly for acquiring and plotting data from your own sensors. For example, you can inexpensively buy a Raspberry Pi board, add some sensors and create a home monitoring system for temperature, humidity and noise.
*Feedback and comments about this worksheet?* Please provide any anonymous comments, feedback and tips.