All content here is under a Creative Commons Attribution CC-BY 4.0 and all source code is released under a BSD-2 clause license.
Please reuse, remix, revise, and reshare this content in any way, keeping this notice.
# Run this cell once, at the start, to load the notebook's style sheet.
from IPython.display import HTML
css_file = './images/style.css'
HTML(open(css_file, "r").read())
In the prior module (module 9) you learned an approach to follow for any data analysis project, as well as some basic plots and statistics. In this module you will learn about the 5 objectives we see for data analysis, as well as some further plots and statistical concepts.
git clone git@github.com:kgdunn/python-basic-notebooks.git # If you already have the repo cloned: git pull
to update it to the latest version.
You should have `matplotlib` installed; if not, follow the matplotlib installation tutorial.
In the prior module we covered topics 1 to 3, while in this module we will cover:

4. Data tables
5. Time-series, or sequence plots
6. Scatter plots
7. Creating better box plots
In between, throughout the notes, we will also introduce statistical and data science concepts. This way you will learn how to interpret the plots and also communicate your results with the correct language.
In the prior module I described my approach for any data analysis project. The first step is to define the goals. When I take a look at various projects I have worked on, the goals always fall into one or more of these categories, or 'application domains'.
I will describe these goals shortly. But why look at this? The reason is that certain goals can be solved with a subset of tools. The number of tools available to you is large; knowing which one to use for which type of goal helps you make progress faster.
Goal 3 is about making predictions from the system: e.g. predicting what quality is being produced by the system, or how much longer a batch should be run before it is completed. The prediction is typically required to support other decisions, or to apply real-time control on the system.
Goal 4 also can take place on-line, and is used to ensure the system is operating in a stable manner, and if not, using the data to figure out what is going wrong, or is about to go wrong.
Goal 5 is typically off-line, and here we use the data to make longer term improvements. For example, we try to move the system to a different state of operation that is more optimal/profitable. This can also be done in real-time, where systems are continuously shifted around to track an optimum target.
This is just one way to categorize data science problems. There are of course other ways to do this: such as whether you are dealing with one variable (a vector) or many variables (matrices), or which type of technique you are using: *supervised* or *unsupervised*.
We will encounter these terms along the way. But for now, you should be able to see any problem where you have used data as fitting into one of these five categories above.
For example: your manager asks you to use data (whatever is available) to discover why we are seeing increased number of customers returning our most profitable product to the store. Your objective: Find reason(s) for increased returns of product.
Which of the 5 goals above are used?: Number 2 "Troubleshoot a problem that is occurring" is the most direct. But along the way to achieving that goal, you will almost certainly apply number 1: "Learn more about your system".
Following up: in the future, after you have found the reasons for returned product, you might do number 5: "optimizing the system" to find settings for the machines, so that fewer low-quality products are produced. Then, in a different data science project, based on number 4, you "monitor the system in real-time" to prevent producing bad-quality products. This might be done by applying number 3: "making predictions of the product quality" in real-time, while the system is operating.
As you can see, these 5 goals are generally very broad. Why do we mention them?
You might learn, in other courses and later in your career, about different tools to implement. Then you can interchange the tools in your toolbox. For example, linear regression is one type of prediction tool to achieve goal 3, but so is a neural network. If one tool does not work so well, you can swap it for another one in your pipeline.
Try breaking down the data-based project you are currently working on. Check which one or more of the five goals apply.
Data tables are an effective form of data visualization. Some tips:
Here's an example of the Blender Efficiency data set. It was a designed experiment to see how the blending efficiency can be improved, using 18 experiments.
import pandas as pd
blender = pd.read_csv('http://openmv.net/file/blender-efficiency.csv')
blender.sort_values('BlendingEfficiency', inplace=True)
blender
Click on the column header for `BlendingEfficiency` and you can sort from low-to-high, or high-to-low. You can now instantly see that `ParticleSize` has the greatest effect on blending efficiency. No plotting required.
In terms of the 5 goals above - here we have used the table to learn more about our process: what direction is the *correlation* between particle size and blending efficiency? Positive or negative correlation?
Create a box plot of blending efficiency against particle size. This will achieve the goal of learning even more about our system, because then we can quantify the negative correlation:
blender.boxplot('BlendingEfficiency', by='ParticleSize')
In Pandas data tables, especially for calculated variables, you might see too many decimals (the default is 6). If you want to adjust that, run this command: `pd.set_option('display.precision', 2)` for 2 decimals. See the code in the next section for an example.
Run the code below to convince yourself that pie charts should not be used instead of a table. If you are pressured to use a pie chart instead of a table, use the example below (and some of the links) to help argue your case.
from matplotlib import pyplot
%matplotlib inline
import pandas as pd
website = pd.read_csv('http://openmv.net/file/website-traffic.csv')
website.drop(columns=['MonthDay', 'Year'], inplace=True)
average_visits_per_day = website.groupby('DayOfWeek').mean()
percentage = average_visits_per_day / average_visits_per_day.sum() * 100
fig = pyplot.figure(figsize=(15, 4));
axes = pyplot.subplot(1, 2, 1)
percentage.plot.pie(y='Visits', ax=axes, legend=False)
axes.set_aspect('equal')
# Right plot: subplot(1,2,2) means: create 1 row, with 2 columns, and draw in the 2nd box
# Take the same grouped data from before, except sort it now:
percentage.sort_values('Visits', ascending=True, inplace=True)
percentage.plot.barh(ax=pyplot.subplot(1, 2, 2), legend=False)
pd.set_option('display.precision', 2)
percentage
The superiority of tables is not surprising here. The human eye excels at finding differences in 2 dimensions with respect to length and location, but it is not good at estimating areas and angles; yet a pie chart encodes its information only in terms of area and angle.
Need more convincing evidence?
Related to data tables is the concept of colour-coding the entries in the data table according to their values. High values get a specific colour (e.g. red), and low values another colour (e.g. blue) and then the in-between values are shaded in a transition. This is also related to a colour map: each value is mapped to a certain colour (more on that below, in the section on scatter plots).
This is helpful for emphasizing trends in the data which are not easy to pick up with the numbers alone.
These colour-coded tables are called heatmaps.
In the prior module we created box plots for the taste ratings given to various samples of Peas, based on their flavour attributes: flavour, sweetness, fruity flavour, off-flavour, mealiness and hardness.
The judges give scores on a scale of 1 to 10.
# Load the data
import pandas as pd
peas = pd.read_csv('https://openmv.net/file/peas.csv')
judges = peas.loc[:, 'Flavour': 'Hardness']
judges.head()
# Now visualize the table trends:
import seaborn as sns
%matplotlib inline
# Change the default figure size
sns.set(rc={'figure.figsize':(15, 5)})
# Look at the transpose of the heatmap instead
sns.heatmap(judges.T);
That visualization is somewhat helpful, because we already get an idea that some of the attributes move together: see that when `Flavour`, `Sweet` and `Fruity` are low (dark colours), they are jointly low, and that they move opposite to the other 3 flavour characteristics.
Now let's sort the data set and try again:
judges.sort_values(by='Hardness', inplace=True)
sns.heatmap(judges.T);
What a difference! Now the visualization is greatly improved, and actually tells a story. That's the purpose of any visualization.
Now we quickly see the opposite trends occurring which took us much longer to realize in the prior plot. *How would you describe the trends to someone*?
Note also that you could not have seen these trends from a box plot!
Next we can calculate the *correlation* value, which is a number between $-1$ and $+1$ that shows how strongly variables are related. A value of 0 is no correlation. A value of $-1$ is a perfect negative relationship, and $+1$ is a perfect positive relationship.
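A quick, self-contained sketch of these extremes, using made-up columns (any column that is an exact straight-line function of `x` has correlation exactly $\pm 1$ with it):

```python
import numpy as np
import pandas as pd

x = np.arange(10.0)
df = pd.DataFrame({'x': x,
                   'up': 2 * x + 1,    # perfect positive relationship
                   'down': -0.5 * x})  # perfect negative relationship

# corr(x, up) is +1 and corr(x, down) is -1
print(df.corr())
```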
We will visualize what a strong or a weak correlation is in the next section on scatter plots. Here we already see how the columns are correlated to each other: both in a table, and in a heat map. Heat maps are a great way to visualize correlations.
from IPython.display import display
import numpy as np
import pandas as pd
pd.set_option('display.precision', 3)
display(judges.corr())
corr = judges.corr()
# Create a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
# Generate a colormap for the correlations
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap,
square=True, linewidths=.2,
cbar_kws={"shrink": 0.5});
Instantly you can confirm your expectation of the trends in that data set:

* The `Flavour`, `Sweet` and `Fruity` attributes are correlated together: as one goes up, the others also go up.
* The `Off-flavour`, `Mealiness` and `Hardness` attributes are also correlated together: as one goes up, the others also go up.

The goal of this challenge is to discover how the columns in the cheese taste data set are related to each other. In this data set the concentrations of three chemicals are given: `Acetic` acid, `H2S` (hydrogen sulfide) and `Lactic` acid.
A subjective taste value is also provided as the 4th column.
*If you want to cheat* scroll down to see a partial solution.
If you have a single column of data, you may see interesting trends in the sequence of numbers when plotting it. These trends are not always visible when just looking at the numbers, and they definitely cannot be seen in a box plot.
An effective way of plotting these columns is horizontally, as a series plot, or a trace. We also call them time-series plots if there is a second column of information indicating the corresponding time of each data point.
As promised in the prior notebook, we will now look at the time-based trends of the website visits data set.
Below we import the data.
import pandas as pd
website = pd.read_csv('http://openmv.net/file/website-traffic.csv')
dates = pd.to_datetime(website['MonthDay'], format='%B %d')
website.set_index(dates, inplace=True, drop=True)
website.plot(y='Visits', figsize=(15,5))
# Smooth it a bit, with a rolling mean
website['Visits'].rolling(5).mean().plot(linewidth=5);
Notice the common problem with smoothed rolling average data: it introduces a 'delay' into the time-series. The smoothed peaks are shifted to the right in time.
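If that delay matters, one option (an illustrative sketch, not part of the original code) is a centred window: `rolling(..., center=True)` averages points on both sides of each sample, so the smoothed peaks stay aligned with the raw data:

```python
import numpy as np
import pandas as pd

# A smooth signal with a known peak
s = pd.Series(np.sin(np.linspace(0, 6, 100)))

trailing = s.rolling(5).mean()               # lags behind the raw data
centred = s.rolling(5, center=True).mean()   # stays aligned with it

# The centred peak matches the raw peak; the trailing peak is shifted right
print(s.idxmax(), centred.idxmax(), trailing.idxmax())
```

The trade-off: a centred window uses "future" points, so it cannot be computed in real-time at the most recent sample.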
Copy and paste the above code, and try this again for the Ammonia dataset. Note in the code below:

* We create a date-time index with `pd.date_range(...)`. We were told the data were collected every 6 hours.
* Modify the rolling window in the code below: try `'12H'` (12 hours), `'2D'` (2 days), `'30D'`, etc.

ammonia = pd.read_csv('http://openmv.net/file/ammonia.csv')
datetimes = pd.date_range('1/1/2020', periods=ammonia.shape[0], freq='6H')
ammonia.set_index(datetimes, inplace=True)
ammonia['Ammonia'].plot(figsize=(15,5), color='lightblue')
ammonia['Ammonia'].rolling('2D').mean().plot(color='black', linewidth=3);
The goal of this challenge is to understand what a random walk looks like, visually, as a time-series.
In the prior module you created the numbers that represent a random walk. Then you looked only at the distribution. Here's the prior code:
from scipy.stats import norm
# 20 steps for a regular person, showing the deviation to the
# left (negative) or to the right (positive) when they are
# walking straight. Values are in centimeters.
regular_steps = norm.rvs(loc=0, scale=5, size = 20)
print('Regular walking: \n{}'.format(regular_steps))
# Consumed too much? Standard deviation (scale) is larger:
deviating_steps = norm.rvs(loc=0, scale=12, size = 20)
print('Someone who has consumed too much: \n{}'.format(deviating_steps))
In the space below, start with the code given above, then modify it to:

* use `size=400` steps
* capture the plot axis: `ax = df.plot(...)  # the output of the plot function is an axis`
* add a horizontal reference line: `ax.axhline(y=0, color='k')`
* use `ax` to set labels: `ax.set_xlabel(...)` or `ax.set_ylabel(...)`
Here's how my plot looked. Run your code several times to see how different the random walks appear.
import pandas as pd
from scipy.stats import norm
from matplotlib import style
# print(style.available)
style.use('ggplot')
N = 400
regular_steps = norm.rvs(loc=0, scale=5, size = N)
deviating_steps = norm.rvs(loc=0, scale=12, size = N)
datetimes = pd.date_range('1/1/2020', periods=N, freq='1S')
regular = pd.Series(regular_steps.cumsum(), index = datetimes)
regular.plot(figsize=(15,5))
deviating = pd.Series(deviating_steps.cumsum(), index = datetimes)
ax = deviating.plot()
ax.axhline(y=0, color='k', linestyle='-', linewidth=2)
ax.set_ylabel('Deviation from the starting point [cm]')
ax.legend(['Regular steps', 'Deviating steps']);
The goal of this challenge is to understand the growth rate (reaction kinetics) of bacteria. To see what the growth looks like visually, but also to discover when the growth rate is the fastest, the most productive.
Back in module 3 we integrated an equation for bacteria growing on a plate:
$$ \dfrac{dP}{dt} = rP $$where $P$ is the number of bacteria in the population, and $r$ is their exponential rate of growth [number of bacteria/minute]. This is not realistic. Eventually the bacteria will run out of space and their food source. So the equation is modified:
$$ \dfrac{dP}{dt} = rP - aP^2$$where they are limited by the factor $a$ in the equation.
The differential equation can be re-written as: $$P_{i+1} = P_i + \left[\,rP_i -a\,P_i^2\,\right]\delta t$$
which shows how the population at time point $i+1$ (one step in the future) is related to the population size now, at time $i$ over a short interval of time $\delta t$ minutes.
Starting from 500 cells initially with a rate $r=0.032$ and the coefficient $a = 1.6 \times 10^{-7}$ we can generate the growth curves and plot them.
import numpy as np
import pandas as pd
from IPython.display import display
p_initial = 500
r = 0.032
a = 1.6E-7
delta_t = 1 # minutes
time_final = 8*60
# Create the two outputs of interest
time = np.arange(start=0.0, stop=time_final, step=delta_t)
population = np.zeros(time.shape)
population[0] = p_initial
for idx, t_value in enumerate(time[1:]):
    population[idx + 1] = population[idx] + (r*population[idx] - a*population[idx]**2) * delta_t
# Now plot the data
bugs = pd.DataFrame(data = {'bacteria': population}, index=time)
display(bugs.head())
ax = bugs.plot(figsize=(15,5))
ax.grid(True)
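As a sanity check on the simulation (a quick aside, not in the original notebook): the plateau in the plot is where $dP/dt = 0$, i.e. $rP - aP^2 = 0$, giving a carrying capacity of $P = r/a$:

```python
r = 0.032
a = 1.6e-7

carrying_capacity = r / a
print(carrying_capacity)  # ~200000 bacteria: the plateau of the curve
```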
Scatter plots are widely used and easy to understand. *When should you use a scatter plot?* When your goal is to draw the reader's attention to the relationship between 2 (or more) variables.
In a scatter plot we use 2 sets of axes, at 90 degrees to each other. We place a marker at the intersection of the values shown on the horizontal (x) axis and vertical (y) axis.
Most often variable 1 and 2 (also called the dimensions) will be continuous variables. Or at least *ordinal variables*. You will seldom use categorical data on the $x$ and $y$ axes.
You can add a 3rd dimension: the marker's size indicates the value of a 3rd variable. It makes sense to use a numeric variable here, not a categorical variable.
You can add a 4th dimension: the marker's colour indicates the value of a 4th variable. Usually this will be a categorical variable, e.g. red = category 1, blue = category 2, green = category 3. Continuous numeric transitions are hard to map onto colour; however it is possible to use transitions, e.g. values from low to high shown on a sliding gray scale.
You can add a 5th dimension: the marker's shape can indicate the discrete values of a 5th categorical variable. E.g. circles = category 1, squares = category 2, triangles = category 3, etc.
In summary:
Let's get started with some examples. We will start off with the example from the prior module where we considered the grades of students, and how long it took to write the exam.
# Standard imports required to show plots and tables
from matplotlib import pyplot
from IPython.display import display
%matplotlib inline
import pandas as pd
# Modify the code if you are behind a proxy server
grades = pd.read_csv('https://openmv.net/file/unlimited-time-test.csv')
ax = grades.plot.scatter(x = 'Time', y = 'Grade',
figsize = (8, 8),
# These remaining inputs are optional, but
# specified below so you can explicitly see them
# Size of the dots: change this to get a feeling
# for the range of values you should use
s = 50,
# Specify the colour
c = 'darkgreen',
# The shape of the marker
# See https://matplotlib.org/3.1.1/api/markers_api.html
marker = 'D'
)
Remember our objective from the prior notebook? Do students score a higher `Grade` if they have a longer `Time` to finish the exam? The idea was that students would have less stress with unlimited time, because they had all their books and notes with them. In theory these are fairly ideal exam conditions.
The scatter plot however shows there isn't anything conclusive in the data to believe that there is a relationship. Let us also quantify it with the correlation value we introduced above.
display(grades.corr())
The correlation value is $r=-0.044$, essentially zero. So now you get an idea of what a zero correlation means.
Think of the implication of that: since for a straight-line fit the $R^2$ value is simply the square of the correlation coefficient, $R^2 = r^2$, you can calculate the $R^2$ value - the value often used to judge how good a linear regression is - without calculating the linear regression model! Further, it shows that for linear regression it does not matter which variable is on your $x$-axis, or your $y$-axis: the $R^2$ value is the same.
If you understand these 2 points, you will understand why $R^2$ is not a great number at all to judge a linear regression model.
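A numerical sketch of both points, on synthetic data (all names here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 2 * x + rng.normal(size=50)  # linear trend plus noise

r = np.corrcoef(x, y)[0, 1]

def r_squared(x, y):
    """R^2 of a straight-line least-squares fit of y on x."""
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    return 1 - residuals.var() / y.var()

# r^2 equals R^2, no matter which variable is on which axis
print(round(r ** 2, 8), round(r_squared(x, y), 8), round(r_squared(y, x), 8))
```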
Let's look at some other correlations. If you completed the Cheese Challenge above, you have already seen what the correlation values are for that dataset:

* There is a strong correlation between `Taste` and the amount of `H2S` present (correlation of 0.756), while
* the amount of `Lactic` acid present is also quite strongly correlated with the amount of `H2S` (0.644).

cheese = pd.read_csv('http://openmv.net/file/cheddar-cheese.csv')
cheese.set_index('Case', inplace=True)
pd.set_option('display.precision', 3)
from IPython.display import display
display(cheese.corr())
Now we would like to visualize these pairwise relationships. We can draw 6 scatter plots to show all the pairwise combinations of `Acetic`, `H2S`, `Lactic` and `Taste`.

The Seaborn library, based on matplotlib, does this in a single line of code, using its `sns.pairplot(...)` function.
Visually relate the scatter plots below, with the numeric correlations in the table above. Get a feeling for what a correlation of $r=0.6$ or in other words $R^2 = 0.36$ is. It is fairly strong! You can see trends and relationships.
sns.set(rc={'figure.figsize':(15, 5)})
sns.pairplot(cheese);
We saw that we can alter the size (`s=...`), colour (`c=...`) and shape (`marker=...`) of the marker to indicate a 3rd, 4th or 5th dimension.

In the plots above you saw how to specify `s`, `c` and `marker` if all the values are the same. Below you see how to do that if they are different: you specify a vector for `s` and `c`, the same length as the data.

The vector for the size, `s`, is often a function of the variable being plotted. Remember that matplotlib's `s` argument scales the marker's *area*, so the perceived radius grows only with the square root of `s`.

The colour, `c`, is often a categorical variable. In the example below we use red for "Yes" (baffles are present) and black for "No".
We consider changing the markers' shape in the next piece of code.
import numpy as np
from matplotlib import pyplot
yields = pd.read_csv('http://openmv.net/file/bioreactor-yields.csv')
baffles = yields['baffles'].values
# Idea: [f(x) if condition else g(x) for x in sequence]
colour = ['red' if b == 'Yes' else 'black' for b in baffles]
size = (np.sqrt(yields['speed'] - 3200) - 4) * 10
ax = yields.plot.scatter(x='temperature', y='yield', figsize=(10,8),
s = size,
c = colour)
ax.set_xlabel('Temperature [°C]')
ax.set_ylabel('Yield [%]');
ax.set_title('Yield as a function of temperature [location], baffles [marker colour] and speed [marker size]');
From the above visualization we quickly see how the red points (baffles are present in the reactor) have a reducing effect on the yield. The yield also drops off with temperature.
What can you say about the marker size, which represents the speed of the impeller in the bioreactor?
We don't actually have a 5th dimension to visualize in this data set, to also change the marker shape. Marker shapes must be associated with a categorical variable. We will show how you could do it, based on the `baffles` column. The idea is to iterate over each unique category, taking the colour and shape from a dictionary.
markers = {'No': 's', # square
'Yes': 'o'} # circle
colours = {'No': 'black',
'Yes': 'red'}
# Create an empty axis to plot in
ax = pyplot.subplot(1,1,1)
for baffle_type in yields['baffles'].unique():
    subset = yields[yields['baffles'] == baffle_type]
    subset.plot.scatter(ax = ax,
                        figsize = (10, 8),
                        x = 'temperature', y = 'yield',
                        s = (np.sqrt(subset['speed'] - 3200) - 4) * 10,
                        c = colours[baffle_type],
                        marker = markers[baffle_type]
                        )
ax.set_xlabel('Temperature [°C]')
ax.set_ylabel('Yield [%]');
ax.set_title('Yield as a function of temperature [location], baffles [colour and marker shape] and speed [marker size]');
If you have a sliding scale for colour, then you need to use a colour map. See the matplotlib colormap reference.
A colour map relates a normalized input value between 0 and 1 (or an integer index from 0 to 255) to a particular colour. On that webpage you see the various colour maps available.
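A minimal sketch of a continuous colour mapping in a scatter plot (the data here are invented): pass the numeric values to `c=` and name a colormap with `cmap=`; matplotlib normalizes the values and looks up the colours for you:

```python
import numpy as np
from matplotlib import pyplot

x = np.linspace(0, 10, 50)
y = np.sin(x)

# Colour encodes the y-value itself, on a continuous sliding scale
sc = pyplot.scatter(x, y, c=y, cmap='viridis')
pyplot.colorbar(sc, label='value of y');
```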
The goal of this challenge is to see how the judges' values for the 6 taste attributes of the peas are (co)related.
Generate a pairplot set of scatter plots for all pairwise combinations of the 6 attributes.
This is done intentionally for you to get a visual idea of what the correlation coefficient $(r)$ means, as well as $R^2$.
import pandas as pd
import seaborn as sns
peas = pd.read_csv('https://openmv.net/file/peas.csv')
judges = peas.loc[:, 'Flavour': 'Hardness']
sns.pairplot(judges);
#judges.corr()
Here we will consider alternatives, or additions, to the box plot, which we saw in the prior module.
The additions, in the order of progressively adding more information, are: violin plots, swarm plots and raincloud plots.
All 3 options improve the box plot, by showing the distribution of the underlying data and raw values from the column being visualized.
# Get the data
import pandas as pd
ammonia = pd.read_csv('http://openmv.net/file/ammonia.csv')
# You might need the proxy server settings
from matplotlib import pyplot
%matplotlib inline
import seaborn as sns
# Change the default figure size
#sns.set(rc={'figure.figsize':(15, 5)})
fig = pyplot.figure(figsize=(15, 5));
axis1 = pyplot.subplot(1, 2, 1)
axis2 = pyplot.subplot(1, 2, 2)
sns.boxplot(data=ammonia, ax = axis1)
axis1.set_ylim(0, 70)
axis1.grid(True)
sns.violinplot(y='Ammonia', data=ammonia, ax=axis2,
# Play with these settings
inner = "box", # the default
# inner = "quartile"
linewidth=3)
axis2.set_ylim(0, 70)
axis2.grid(True);
Swarm plots can complement a violin plot, as they show all the raw underlying data, not just the distribution.
from matplotlib import pyplot
%matplotlib inline
import seaborn as sns
# Change the default figure size
#sns.set(rc={'figure.figsize':(15, 5)})
fig = pyplot.figure(figsize=(15, 5));
axis1 = pyplot.subplot(1, 2, 1)
axis2 = pyplot.subplot(1, 2, 2)
sns.boxplot(data=ammonia, ax = axis1)
axis1.set_ylim(0, 70)
axis1.grid(True)
sns.swarmplot(y='Ammonia', data=ammonia, ax=axis2,
# Play with these settings
#inner = "box", # the default
# inner = "quartile"
#linewidth=3
)
axis2.set_ylim(0, 70);
axis2.grid(True)
import pandas as pd
import ptitprince as pt
%matplotlib inline
fig = pyplot.figure(figsize=(15, 5));
axis1 = pyplot.subplot(1, 2, 1)
axis2 = pyplot.subplot(1, 2, 2)
sns.boxplot(data=ammonia, ax = axis1, orient='h')
axis1.set_xlim(0, 70)
axis1.grid(True)
pt.RainCloud(y = 'Ammonia',
ax=axis2,
data = ammonia,
width_viol = .8,
width_box = .4,
figsize = (12, 8), orient = 'h',
move = .0)
axis2.set_xlim(0, 70)
axis2.grid(True)
But where a raincloud plot really works well is with comparison of multiple variables. Let's go back to an earlier worksheet case study, where we compared the thickness of plastic film.
The thickness was measured at the 4 corners.
import pandas as pd
import ptitprince as pt
%matplotlib inline
films = pd.read_csv('http://openmv.net/file/film-thickness.csv')
films.set_index('Number', inplace=True)
ax = pt.RainCloud(#y = 'Ammonia',
data = films,
width_viol = .8,
width_box = .4,
figsize = (12, 8), orient = 'h',
move = .0)
ax.grid(True)
Below we give some challenges that go beyond, but build on, the topics covered in this worksheet.
The Seaborn library wraps up Matplotlib, and provides easy-to-use functions for common data visualization steps. The goal of this challenge is to become more familiar with one of the most useful visualization libraries in Python.
Take a look at the Seaborn Gallery to see some examples. Which ones can you use in your next project?
Great blog post about 9 different visualizations. We have looked at all of these, but it is nice to see them all on one page.
Seaborn tutorial: a structured tutorial is always worthwhile. Look at this to understand topics we have not covered in detail:

* colour maps
* adding and rotating text
* 'contexts': adjusting the plot settings for different use cases: talks, posters, on paper or in notebooks
* changing the plot style
* adding a title
Interactively create Seaborn visualizations within this webpage (if you can handle all the advertising!)
Though not related to Seaborn, it is worth giving a link to the Visualizations from the Pandas library.
The goal of this challenge is to visualize data that is coming from a real-time live stream. The plots above are all static. But what if you want to monitor your process in real-time? See goal number 4 above.
Let's give it a try. We will monitor the CPU usage of your computer. You can install a small Python package to get the CPU percentage used. You will need the non-built-in package called `psutil`. Install it with `python3 -m pip install psutil` or with your package manager (e.g. Anaconda).
import psutil
# Measure the CPU used in a 0.9 second interval
psutil.cpu_percent(interval=0.9)
Run that code multiple times and check that the values change.
Now you would like to watch that value on a graph, changing in real-time. See these pages for inspiration on how to visualize that with Python:
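To get started, here is a minimal sketch of the data-acquisition side only (the `poll` helper is a hypothetical name invented here, and the `psutil` usage assumes you installed it as above): collect readings in a fixed-length rolling window, which is what you would redraw on each update of a live plot:

```python
import time
from collections import deque

def poll(n_samples, read_value, interval=0.0):
    """Call read_value() n_samples times, every `interval` seconds,
    keeping the readings in a fixed-length rolling window."""
    window = deque(maxlen=n_samples)
    for _ in range(n_samples):
        window.append(read_value())
        time.sleep(interval)
    return list(window)

# Usage (assuming psutil is installed):
# import psutil
# readings = poll(20, lambda: psutil.cpu_percent(interval=0.5))
```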
You can apply this challenge directly for acquiring and plotting data from your own sensors. For example, you can inexpensively buy a Raspberry Pi board, add some sensors and create a home monitoring system for temperature, humidity and noise.
*Feedback and comments about this worksheet?* Please provide any anonymous comments, feedback and tips.