All content here is under a Creative Commons Attribution CC-BY 4.0 and all source code is released under a BSD-2 clause license.
Please reuse, remix, revise, and reshare this content in any way, keeping this notice.
This is the fourth of several modules (11, 12, 13, 14, 15 and 16) which refocus the course material from the prior 10 modules in a slightly different way. They place more emphasis on one thing.
In short: *how to extract value from your data*.
In this module we will cover
Requirements before starting
# General imports. Ensure that Plotly, and not matplotlib, is the default plotting engine
import pandas as pd
import plotly.graph_objects as go
import plotly.io as pio
pio.renderers.default = "iframe"  # alternatives: "notebook", "jupyterlab"
pd.options.plotting.backend = "plotly"
In the prior module you learned about box plots, histograms, time-series (or sequence) plots, and scatter plots. We will revise some of those, and build on that knowledge a bit further.
Start with the data from an actual plant, where we have 5 columns of measurements from a flotation cell. Read the link if you need a quick overview of what flotation is.
flot = pd.read_csv("https://openmv.net/file/flotation-cell.csv")
Some things to do with a new data set called `df`:

- `df.head()` and `df.tail()` to check you have the right data
- `df.describe()` to get some basic statistics
- `df.info()` to see the data types

In the space below, apply these to the data you just read in:
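For instance, on a small made-up data frame (a stand-in for the flotation data; the column names and values here are invented for illustration):

```python
import pandas as pd

# A tiny invented data frame standing in for the flotation data
df = pd.DataFrame({
    "Feed rate [t/h]": [10.2, 10.4, 10.1, 10.3],
    "Upstream pH": [9.1, 9.0, 9.2, 9.1],
})

print(df.head(2))      # first two rows: a quick sanity check
print(df.tail(2))      # last two rows
print(df.describe())   # count, mean, std, min, quartiles, max per column
df.info()              # column names, non-null counts, and dtypes
```

The same four calls work unchanged on the real `flot` data frame.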
Next, plot sequence plots of all the data columns, using this command:
ax = flot.plot()
Notice that the x-axis is not time-based, even though there is a column in the data frame called "Date and time". So what went wrong?
When reading in a new data frame you might need to first convert that column to the right `type` of date and time, so Pandas can use it in the plots; then you can proceed with your plotting and data analysis.
To set a column to the right type, you can use the `pd.to_datetime(...)` function. Many times Pandas will get it right, but if it doesn't you can give it some help.
So try this first below. If it works, you are lucky, and can continue.
flot["Timestamp"] = pd.to_datetime(flot["Date and time"])
Note that we created a new column. Check it with `flot.info()` again, to see if it is of the right type. You can of course simply overwrite your previous column.
If the conversion did not work, you could have given it some guidance.
For example:
pd.to_datetime("20/12/21", yearfirst=True) # it is supposed to be 21 December 2020
pd.to_datetime("20/12/21", dayfirst=True) # it is supposed to be 20 December 2021
pd.to_datetime("20/12/21", format="%d/%m", exact=False)
For the `format` specifier, you can see all the options available from this page: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior
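A quick check of those behaviours, using the same ambiguous string "20/12/21" from the examples above (with the year written out in full for the explicit-format case):

```python
import pandas as pd

# The same ambiguous string, parsed in different ways
a = pd.to_datetime("20/12/21", yearfirst=True)  # interpreted as 21 December 2020
b = pd.to_datetime("20/12/21", dayfirst=True)   # interpreted as 20 December 2021

# An explicit format string removes all ambiguity:
c = pd.to_datetime("20/12/2021", format="%d/%m/%Y")

print(a, b, c)
```

Whenever you know the exact layout of your timestamps, the explicit `format=` route is the safest.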
Once you have the column correctly as a date and time stamp, you probably want this to be your data frame index.
flot = flot.set_index("Timestamp")
# and drop the original "Date and time" column, since we don't need it anymore
flot.drop(columns="Date and time", inplace=True)
flot.plot()
Now you will see a short break in the data around 09:00 on 16 December 2004, which was not visible before.
From the provided Excel file, read in the data. Convert the date and time column to the desired format:
A box plot can be shown per column in one simple line for a data frame `df`:
df.plot.box()
Does it make sense to plot box plots for all columns, especially when units and orders of magnitude are so different?
So now plot only the box plot for "Upstream pH":
Notice that there are many outliers beyond the whiskers. What is going on? Look at the time-based plot of that column:
df["name of column"].plot.line()
Similar to `df.plot.line()` and `df.plot.box()` to get a line and box plot, you can also use `df.plot.hist()` to get a histogram.
This tries to put all the histograms in one plot, though, which is not so useful. We will see a better way below.
If you use this code, you will get all the line plots in the same plot:
flot["Timestamp"] = pd.to_datetime(flot["Date and time"])
flot = flot.set_index("Timestamp")
flot.plot()
But if you want each plot in its own axis instead, you need to use a loop to create multiple plots:
print(flot.columns)
for column in flot.columns:
    print(column)
    display(flot[column].plot())
Pandas can only plot columns of numeric data. If a column is non-numeric, it will raise an error. So to ensure the loop only goes through numeric columns, you can filter on that. Change the first lines to:
flot["artificial column"] = "abc"
flot.head()
for column in flot.select_dtypes("number"):
    # add the loop content here, indented appropriately
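One possible completion of that loop, sketched on a small invented data frame (in the notebook you would call `.plot()` inside the loop; here we only collect the numeric column names, to show that the text column is skipped):

```python
import pandas as pd

# Invented frame: two numeric columns plus one text column
flot = pd.DataFrame({
    "Air flow": [3.1, 3.3, 3.2],
    "Upstream pH": [9.1, 9.0, 9.2],
    "artificial column": ["abc", "abc", "abc"],
})

numeric_cols = []
for column in flot.select_dtypes("number"):
    numeric_cols.append(column)
    # In the notebook you would plot here, for example:
    # display(flot[column].plot())

print(numeric_cols)  # the non-numeric "artificial column" is not in the list
```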
We saw the correlation matrix can be calculated with this handy one-liner:
df.corr()
Do this below for the flotation data. Any interesting leads to investigate?
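As a reminder of what `df.corr()` returns, here is a sketch on made-up data: two perfectly related columns and one unrelated column.

```python
import pandas as pd

# Invented data: y = 2x, so the correlation between x and y is exactly 1
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],
    "z": [5.0, 1.0, 4.0, 2.0],
})

corr = df.corr()   # a square matrix: one row/column per numeric column
print(corr)
```

Large absolute values (close to +1 or -1) off the diagonal are the leads worth investigating.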
The scatter plot matrix is a visual tool that creates a scatter plot of each combination of columns. The plots on the diagonal would not be interesting scatter plots (each variable against itself), so they are often replaced with a histogram or a kernel density estimate (KDE) plot.
Use the code below to try creating both types of plots on the diagonal:
from pandas.plotting import scatter_matrix
scatter_matrix(df, alpha = 0.2, figsize=(10, 8), diagonal = 'kde');
scatter_matrix(df, alpha = 0.2, figsize=(10, 8), diagonal = 'hist');
Filtering and grouping data is part of the daily work of anyone working with data: once you have filtered or grouped the data, you can calculate some statistics or create a visualization on the result. So your workflow becomes:
- Import all your data;
- Filter or group to get a subset of the data;
- Do calculations and create visualizations on the subset of the data.
Some typical examples of filtering and grouping:
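For instance, on a hypothetical data frame (the `Shift` and `Yield` columns here are invented for illustration):

```python
import pandas as pd

# Invented example data: a categorical column and a numeric outcome
df = pd.DataFrame({
    "Shift": ["day", "night", "day", "night", "day"],
    "Yield": [91.0, 88.0, 93.0, 87.0, 92.0],
})

# Filtering: keep only the day-shift rows, then compute a statistic on the subset
day = df[df["Shift"] == "day"]
print(day["Yield"].mean())           # average yield on the day shift only

# Grouping: the same idea, but for every shift at once
print(df.groupby("Shift")["Yield"].mean())
```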
The "Blender Efficiency" data set is related to a set of designed experiments. There are 4 factors being changed to affect the blending efficiency:

- particle size
- mixer diameter
- mixer rotational speed, and
- blending time.

Last time we mentioned 6 steps in a data workflow:
Step 2 and 3: get your data and explore it
import pandas as pd
blender = pd.read_csv('http://openmv.net/file/blender-efficiency.csv')
Tips to explore your data:
Sort the table by the outcome value (the `BlendingEfficiency` column), with values from low to high. Visually, in the table, which columns appear to be related to it?
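A sketch of that sorting step, on an invented miniature stand-in for the blender table:

```python
import pandas as pd

# Invented miniature stand-in, using the same column names as the real data set
blender = pd.DataFrame({
    "ParticleSize": [5, 2, 8, 2],
    "BlendingEfficiency": [80.0, 95.0, 70.0, 92.0],
})

# Sort from low to high on the outcome column
ranked = blender.sort_values("BlendingEfficiency")
print(ranked)
```

Scanning down the sorted table, you can look for columns whose values rise or fall along with the outcome.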
Is a box plot useful?
Now move on to calculations and other visualizations inspired by those calculations:
Some more models/calculations: the particle size (discrete values at 2, 5 and 8) seems to have an interesting relationship to the outcome variable.
Let's look at this a bit more. Start with the scatter plot of just these 2 variables:
blender.plot.scatter(x="ParticleSize", y="BlendingEfficiency")
Next, we will create a subset of the data set showing just the results when the particle size is "2":
blender["ParticleSize"] == 2
will create a Boolean 'indicator' variable with `True` values where the condition is met. We only want the rows where the condition is true.
In module 12, in the sub-section on "Accessing entries", you saw how you can do this.
blender[blender["ParticleSize"] == 2]
returns just the 4 rows where this condition is true.
Now try it yourself:

- Create a subset where `ParticleSize` ≤ 5.
- Create a subset where `ParticleSize` > 5. How many rows is that?
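A sketch of those two subsets, on an invented miniature version of the blender table (the real data set will give different row counts):

```python
import pandas as pd

# Invented miniature stand-in for the blender data
blender = pd.DataFrame({
    "ParticleSize": [2, 2, 5, 5, 8, 8],
    "BlendingEfficiency": [95.0, 92.0, 85.0, 83.0, 70.0, 72.0],
})

small = blender[blender["ParticleSize"] <= 5]   # rows with particle size at most 5
large = blender[blender["ParticleSize"] > 5]    # the remaining rows

print(len(small), len(large))
```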
Now you can do interesting things on this subset. The subset is just a regular data frame, so you can plot them or do further calculations with them.
blender[blender["ParticleSize"] == 2].mean()
will calculate the average of only these rows.
Next, calculate the average of only the "BlendingEfficiency" column when particle size is 2, 5 and 8. In other words, calculate 3 averages.
You probably end up with something like this:
print(blender[blender["ParticleSize"] == 2]["BlendingEfficiency"].mean())
print(blender[blender["ParticleSize"] == 5]["BlendingEfficiency"].mean())
print(blender[blender["ParticleSize"] == 8]["BlendingEfficiency"].mean())
Can it be done more cleanly? Perhaps you could do it in a loop?
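One way to tidy the three near-identical lines into a loop, sketched here on an invented miniature table with the same column names:

```python
import pandas as pd

# Invented miniature stand-in for the blender data
blender = pd.DataFrame({
    "ParticleSize": [2, 2, 5, 5, 8, 8],
    "BlendingEfficiency": [95.0, 93.0, 85.0, 83.0, 71.0, 73.0],
})

means = {}
for size in [2, 5, 8]:
    subset = blender[blender["ParticleSize"] == size]
    means[size] = subset["BlendingEfficiency"].mean()
    print(size, means[size])
```

This avoids repeating the filtering expression, but still hard-codes the list of particle sizes.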
The `df.groupby()` function in Pandas is a way to do that in a single line.
blender.groupby(by="ParticleSize").mean() # simplify it: leave out the "by="
Now go wild. Try it with different types of functions:
blender.groupby("ParticleSize").std()
blender.groupby("ParticleSize").max()
blender.groupby("ParticleSize").plot()
# what do you think this does? Guess before testing it!
blender.groupby("ParticleSize").plot.scatter(x="BlendingTime", y="BlendingEfficiency")
You will find that if you use the Plotly backend for plotting, the code above does not display any plots, while the matplotlib backend does show them. So, to use Plotly, you will need to call `groupby` in a loop instead:
import time
for psize, subset in blender.groupby(by="ParticleSize"):
    print(psize)
    display(subset)
    subset.plot()
    # Then add code here to do something with the "subset" plot,
    # for example changing the axis titles or the figure size.
    time.sleep(0.2)  # pause for 200 milliseconds
Newer versions of packages are released frequently. You can update your packages (libraries) with these commands:
conda update -n base conda   # update conda itself
conda update --all           # update all packages in the current environment
You will come across people recommending different packages in Python for all sorts of interesting applications. For example, the library `seaborn` is often recommended for visualization. But you might not have it installed yet. This is how you can install the package called `seaborn` in your virtual environment called `myenv`:
conda activate myenv   # <-- change "myenv" to the name of your actual environment
conda install seaborn
Or in one line:
conda install -n myenv seaborn
Similar to the above, you can update a package to the latest version. Just change `install` to `update`.
Or in one line:
conda update -n myenv seaborn
There is another data set, about the taste of Cheddar cheese: https://openmv.net/info/cheddar-cheese
Read the data set in:
cheese = pd.read_csv("https://openmv.net/file/cheddar-cheese.csv")
Hint: look at the documentation for `scatter_matrix` to see how to do this. You can look at the documentation inside Jupyter in several ways:

- `help(scatter_matrix)`
- type `scatter_matrix?` and then hit Ctrl-Enter.
The pulp digester is an industrial unit operating in the pulp and paper industry. You can find the data on this page: https://openmv.net/info/kamyr-digester
Some things to try when exploring the data:

- Look at the outcome variable, `'Y-Kappa'`.
- Create a scatter plot matrix, with `Y-Kappa` as the 7th column.
- Use `kde` for the diagonal plots.