All content here is under a Creative Commons Attribution CC-BY 4.0 and all source code is released under a BSD-2 clause license.
Please reuse, remix, revise, and reshare this content in any way, keeping this notice.
This is the third module of several (11, 12, 13, 14, 15 and 16), which refocuses the course material in the prior 10 modules in a slightly different way. It places more emphasis on
In short: *how to extract value from your data*.
This is the third of 6 modules. In this module we will cover
Requirements before starting
*After years of experience, and working with data you will find your own approach.*
Here is my 6-step approach (not linear, but iterative): Define, Get, Explore, Clean, Manipulate, Communicate
Then your next steps become clear.
Look for and get your data (or it will be given to you by a colleague). Since you have your objective clarified, it is clearer now which data, and how much data you need.
Then start looking at the data. Are the data what we expect? This is the explore step. Use plots and table summaries.
Clean up your data. This step and the prior step are iterative. As you explore your data you notice problems, bad data, you ask questions, you gain a bit of insight into the data. You clean, and re-explore, but always with the goal(s) in mind. Or perhaps you realize already this isn't the right data to reach your objective. You need other data, so you iterate.
Modifying, making calculations from, and manipulate the data. This step is also called modeling, if you are building models, but sometimes you are simply summarizing your data to get the objective solved.
From the data models and summaries and plots you start extracting the insights and conclusions you were looking for. Again, you can go back to any of the prior steps if you realize you need that to better achieve your goal(s). You communicate clear visualizations to your colleagues, with crisp, short text explanations that meet the objectives.
The above work flow (also called a 'pipeline') is not new or unique to this course. Other people have written about similar approaches:
You can read their article, as well as this view on it, which is bit more lighthearted.
What has been your approach so far?
In this module we want to show how you can quickly create visualizations of your data with Pandas.
But first, you should check if you are using the appropriate visualization tool. This website helps you select: https://www.data-to-viz.com
In this module we will consider:
We will implement the 6-step workflow suggested above.
Our (1) objective is to
Describe what time-based trends we see in the ammonia concentration of a wastewater stream. We have a single measurement, taken every six hours.
We will first see how we can summarize the data.
The next step is to (2) get the data. We have a data file from this website where there is 1 column of numbers and several rows of ammonia measurements.
Step 3 and 4 of exploring the data are often iterative and can happen interchangeably. We will (3) explore the data and see if our knowledge that ammonia concentrations should be in the range of 15 to 50 mmol/L is true. We might have to sometimes (4) clean up the data if there are problems.
We will also summarize the data by doing various calculations, also called (5) manipulations, and we will (6) communicate what we see with plots.
Let's get started. There are 3 ways to get the data:
# Import the plotly library
import plotly.graph_objects as go
import plotly.io as pio
pio.renderers.default = "iframe" # "notebook" # jupyterlab
import pandas as pd
pd.options.plotting.backend = "plotly"
# Loading the data from a local file, if you have it saved to your own computer
data_file = r'C:\location\of\file\ammonia.csv'
waste = pd.read_csv(data_file)
# Read the CSV file directly from a web server:
import pandas as pd
waste = pd.read_csv('https://openmv.net/file/ammonia.csv')
# If you are on a work computer behind a proxy server, you
# have to take a few more steps. Uncomment these lines of code.
#
# import io
# import requests
# proxyDict = {"http" : "http://replace.with.proxy.address:port"}
# url = "http://openmv.net/file/ammonia.csv"
# s = requests.get(url, proxies=proxyDict).content
# web_dataset = io.StringIO(s.decode('utf-8'))
# waste = pd.read_csv(web_dataset)
Show only the first few lines of the data table (by default it will show 5 lines)
waste
Print the last 10 rows of the data to the screen:
Once we have opened the data we check with the .head(...)
command if our data are within the expected range. At least the first few values. Similar for the .tail(...)
values.
Those two commands are always good to check first.
Now we are ready to move on, to explore further with the .describe(...)
command.
# Run this single line of code, and answer the questions below
waste.describe()
There are ______ rows of data. Measured at 6 hours apart, this represents ______ days of sensor readings.
We expected ammonia concentrations to typically be in the range of 15 to 50 mmol/L. Is that the case from the description?
What is the average ammonia concentration?
Sort the ammonia values from low to how, and store the result in a new variable called ammonia_sorted
.
What does the 25th percentile mean? Below the 25th percentile value we will find ____% of the values, and above the 25th percentile we find ____% of the values. In this case that means the 25th percentile will be close to value of the 360th entry in the sorted vector of data. Try it:
ammonia_sorted[358:362]
What does the 75th percentile mean? Below the 75th percentile value we will find ____% of the values, and above the 75th percentile we find ____% of the values. In this case that means the 75th percentile will be close to value of the 1080th entry in the sorted vector of data. Try it:
ammonia_sorted[1078:1082]
So therefore: between the 25th percentile and the 75th percentile, we will find ____% of the values in our vector.
Given this knowledge, does this match with the expectation we have that our Ammonia concentration values should lie between 15 to 50 mmol/L?
And there is the key reason why you are given the 25th and 75th percentile values. Half of the data in the sorted data vector lie between these two values. 25% of the data lie below the 25th percentile, and the other 25% lie above the 75th percentile, and the bulk of the data lie between these two values.
# Add your code here to answer the above questions.
We have looked at the extremes with .head()
and .tail()
, and we have learned about the mean and the median.
What about the typical values? What do we even mean by typical or usual or common values? Could we use the 25th and 75th percentiles to help guide us?
One way to get a feel for that is to plot these numbers: 25th, 50th and 75th percentiles. Let's see how, by using a boxplot.
waste.boxplot()
In general, it is worth taking a look at the documentation for the function you are using here. This is available on the Pandas website:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.boxplot.html
which you can quickly find by searching for pandas boxplot
, and the first link as result will likely be the one above.
The boxplot gives you an idea of the distribution, the spread, of the data.
The key point is the orange center line, the line that splits the centre square (actually it is a rectangle, but it looks squarish). That horizontal line is the median.
It is surprising to see that middle chunk, that middle 50% of the sorted data values fall in such a narrow range of the rectangle.
The bottom 25% of the data falls below the box, and the top 25% of the data falls above the box. That is indicated to some extent by the whiskers, the lines leaving the middle square/rectangle shape. The whiskers tell how much spread there is in our data. We we see 2 single circles below the bottom whisker. These are likely outliers, data which are unusual, given the context of the rest of the data. More about outliers later.
Now let us try plotting a histogram of these same data.
waste.hist()
# Search for the documentation for this function. Adjust the number of bins to 30.
# Add code here for a histogram with 30 bins.
# Make the figure less wide and less high:
waste.hist(width=400, height=400)
# AIM: create a histogram, with extra annotations
# `fig` is a Plotly figure:
fig = waste.hist(nbins=30)
# With this variable you can further manipulate the plot. Use it, for example, to add an x-axis label:
fig.update_layout(xaxis_title_text='Ammonia concentration [mmol/L]')
#
# Update the y-axis label here:
# Superimpose the 25th and the 75th percentiles as vertical lines (vlines) on the histogram
fig.add_vline(x=waste['Ammonia'].quantile(0.25), line_color="purple", line_width=1, line_dash='solid')
fig.add_vline(x=waste['Ammonia'].quantile(0.75), line_color="purple", line_width=1, line_dash='solid' )
fig.add_hline(y= 80, line_dash='longdash', annotation_text="Cutoff")
fig.add_vline(
x=waste['Ammonia'].quantile(0.50),
line_color="orange",
line_width=3,
annotation_text="Median",
line_dash='longdash') # 'dot', 'dash', 'longdash', 'dashdot', 'longdashdot'
# Make the range bigger (e.g. to match requirements in a technical report)
fig.update_layout(xaxis_range=[0, 70],
yaxis_range=[0, 200])
display(fig)
# NOTE: the 0.5 quantile, is the same as the 50th percentile, is the same as the median.
print(f'The 50th percentile (also called the median) is at: {waste["Ammonia"].quantile(0.5)}')
All of this you can get from this single table which you can create with .describe()
:
Which brings us to two important points:
.describe()
function to get a (tabular) feel for your data.We have not solved our complete objective yet. Scroll up, and recall what we needed to do: "describe what time-based* trends we see in the ammonia concentration of a wastewater stream*". We will look at that next.
We have learned quite a bit in this section so far:
If you have a single column of data, you may see interesting trends in the sequence of numbers when plotting it. These trends are not always visible when just looking at the numbers, and they definitely cannot be seen in a box plot.
An effective way of plotting these columns is horizontally, as a series plot, or a trace. We also call them time-series plots, if there is a second column of information indicating the corresponding time of each data point.
Below we import the data.
pd.date_range(...)
.waste.shape[0]
waste = pd.read_csv('http://openmv.net/file/ammonia.csv')
datetimes = pd.date_range('1/1/2020', periods=waste.shape[0], freq='6H')
print(datetimes)
# What is this "datetimes" variable we have just created?
waste.set_index(datetimes, inplace=True)
# Why "inplace" ?
# The code to plot the data as a time-series sequence:
fig = waste.plot.line() # you can also say: waste.plot()
fig
waste.plot()
# Make the plot look a bit different:
fig = waste.plot.line(
markers=True,
log_y=True,
range_y=[5, 85],
color_discrete_sequence=["orange"],
width=800,
height=400,
)
fig.update_layout(showlegend=False)
fig.update_layout(plot_bgcolor="black")
fig = go.Figure()
fig.add_trace(
go.Scatter(
x=waste.index,
y=waste['Ammonia'],
name="Ammonia",
)
)
fig.update_layout(showlegend=True)
# The above code is the same as
waste.plot.line()
# Now calculation the the 5-day moving average:
waste['Ammonia'].rolling('5D').mean()
# OK, now plot that moving average:
# Add a rolling average on top of the raw data, to help see the signal through the noise.
# First plot the original data
fig = go.Figure()
fig.add_trace(
go.Scatter(
x=waste.index,
y=waste['Ammonia'],
name="Ammonia concentration",
line_color="rosybrown",
)
)
# Then add the moving average on top
fig.add_trace(
go.Scatter(
x=waste.index,
y=waste['Ammonia'].rolling('5D', center=True).mean(),
name="5 day moving average",
line_color="black",
)
)
fig.update_layout(showlegend=True)
fig.show()
# Later on this code will make more sense; for now, hopefully it is useful in your daily work.
'12H'
(12 hours), '2D'
(2 days), '30D'
, etc.Last time we looked at some basic data frame operations. Let's recap some important ones.
We will use this made-up data set, showing how much food is used by each country. You can replace these data with numbers and columns and rows which make sense to your application.
import pandas as pd data = {'Herring': [27, 13, 52, 54, 5, 19], 'Coffee': [90, 94, 96, 97, 30, 73], 'Tea': [88, 48, 98, 93, 99, 88]} countries = ['Germany', 'Belgium', 'Netherlands', 'Sweden', 'Ireland', 'Switzerland'] food_consumed = pd.DataFrame(data, index=countries) print(data) print(countries) print(type(data)) print(type(countries)) print(type(food_consumed)) food_consumed
import pandas as pd
data = {'Herring': [27, 13, 52, 54, 5, 19],
'Coffee': [90, 94, 96, 97, 30, 73],
'Tea': [88, 48, 98, 93, 99, 88]}
countries = ['Germany', 'Belgium', 'Netherlands', 'Sweden', 'Ireland', 'Switzerland']
food_consumed = pd.DataFrame(data, index=countries)
print(data)
print(countries)
print(type(data))
print(type(countries))
print(type(food_consumed))
food_consumed
# The first rows:
food_consumed.head()
# The last rows:
food_consumed.tail()
# Some basic statistics
food_consumed.describe()
# Some information about the data structure: missing values, memory usage, etc
food_consumed.info()
# There were 6 countries, and 3 food types. Verify:
food_consumed.shape
# Transposed and then shape:
food_consumed.T.shape
# Interesting: what shapes do summary vectors have?
food_consumed.mean().shape
food_consumed
food_consumed['Tea'].unique()
# Unique names of the rows: (not so useful in this example, because they are already unique)
food_consumed.index.unique()
# Get counts (n) of the unique entries:
food_consumed.nunique() # in each column
food_consumed.nunique(axis=1) # in each row
# Works just like a dictionary!
# If the data are in the same row order
food_consumed['Yoghurt'] = [30, 20, 53, 2, 3, 48]
print(food_consumed)
display(food_consumed)
# Note the row order is different this time:
more_foods = pd.DataFrame(index=['Belgium', 'Germany', 'Ireland', 'Netherlands', 'Sweden', 'Switzerland'],
data={'Garlic': [29, 22, 5, 15, 9, 64]})
print(food_consumed)
print(more_foods)
# Merge 'more_foods' into the 'food_consumed' data frame. Merging works, even if row order is not the same!
food_consumed = food_consumed.join(more_foods)
food_consumed
# Collect the new data in a Series. Note that 'Tea' is (intentionally) missing!
portugal = pd.Series({'Coffee': 72, 'Herring': 20, 'Yoghurt': 6, 'Garlic': 89},
name = 'Portugal')
food_consumed = food_consumed.append(portugal)
# See the missing value created?
print(food_consumed)
# What happens if you run the above commands more than once?
# Drop a column, and returns its values to you
coffee_column = food_consumed.pop('Coffee')
print(coffee_column)
print(food_consumed)
# Leaves the original data untouched; returns only
# a copy, with those columns removed
food_consumed.drop(['Garlic', 'Yoghurt'], axis=1)
print(food_consumed)
# Leaves the original data untouched; returns only
# a copy, with those rows removed.
non_EU_consumption = food_consumed.drop(['Switzerland', ], axis=0)
# Returns a COPY of the array, with no missing values:
cleaned_data = food_consumed.dropna()
# Makes the deletion inplace; you do not not have to assign the output to a new variable.
# Inplace is not always faster!
food_consumed.dropna(inplace=True)
# Remove only rows where all values are missing:
food_consumed.dropna(how='all')
food_consumed.sort_values(by="Garlic")
food_consumed.sort_values(by="Garlic", inplace=True)
food_consumed.sort_values(by="Garlic", inplace=True, ascending=False)
Scatter plots are widely used and easy to understand. *When should you use a scatter plot?* When your goal is to draw the reader's attention between the relationship of 2 (or more) variables.
In a scatter plot we use 2 sets of axes, at 90 degrees to each other. We place a marker at the intersection of the values shown on the horizontal (x) axis and vertical (y) axis.
Most often variable 1 and 2 (also called the dimensions) will be continuous variables. Or at least *ordinal variables*. You will seldom use categorical data on the $x$ and $y$ axes.
You can add a 3rd dimension: the marker's size indicates the value of a 3rd variable. It makes sense to use a numeric variable here, not a categorical variable.
You can add a 4th dimension: the marker's colour indicates the value of a 4th variable: usually this will be a categorical variable. E.g. red = category 1, blue = category 2, green = category 3. Continuous numeric transitions are hard to map onto colour. However it is possible to use transitions, e.g. values from low to high are shown on a sliding gray scale
You can add a 5th dimension: the marker's shape can indicate the discrete values of a 5th categorical variable. E.g. circles = category 1, squares = category 2, triangles = category 3, etc.
In summary:
Let's use the Bioreactor yields data set. There is information about it here:
http://openmv.net/info/bioreactor-yields
Read in the data into a Pandas data frame, and use the .describe
function to check it:
# Standard imports required to show plots and tables
import pandas as pd
yields = pd.read_csv('http://openmv.net/file/bioreactor-yields.csv')
Are all 5 columns shown in the summary? Modify the .describe
function call to show information on all 5 columns.
yields
Now plot the data as a scatter plot, using this code as a guide. We want to see if there is a relationship between temperature and yield.
fig = yields.plot.scatter(
x='temperature',
y='yield',
width=500,
height=400,
title='Yield [%] as a function of temperature [°C]',
)
fig.update_layout(xaxis_title_text='Temperature [°C]')
fig.update_layout(yaxis_title_text='Yield [%]')
The objective of this data file was to check if there is a relationship between Temperature
and Yield
. Visually that is confirmed.
Let us also quantify it with the correlation value we introduced above. Calculate the correlation with this code:
display(yields.corr())
display(yields.corr())
The correlation value is $r=-0.746$, essentially negative 75%.
Think of the implication of that: you can calculate the $R^2$ value - the value often used to judge how good a linear regression is - without calculating the linear regression model!! Further, it shows that for linear regression it does not matter which variable is on your $x$-axis, or your $y$-axis: the $R^2$ value is the same.
If you understand these 2 points, you will understand why $R^2$ is not a great number at all to judge a linear regression model.
We saw that we can alter the size, colour, and shape of the marker to indicate a 3rd, 4th or 5th dimension.
We consider changing the markers' colour and shape in the next piece of code.
Colour and shape are perfect for categorical variables, but unfortunately in this data set we only have 1 categorical variable. So use it for both the marker's shape (symbol) and colour.
yields['baffles'].unique()
# So we see the "baffles" column is actually text: "Yes" or "No".
# We call this a categorical variable.
# Use the "baffles" column to pick colours
fig = yields.plot.scatter(
x='temperature',
y='yield',
width=500,
height=400,
title='Yield [%] as a function of temperature [°C]; colours indicate presence/absense of baffles',
color="baffles",
symbol="baffles",
size="speed",
)
fig.update_layout(xaxis_title_text='Temperature [°C]')
fig.update_layout(yaxis_title_text='Yield [%]')
In the code below we want to make the marker size proportional to the speed of the impeller.
The size
input of the .scatter
function can be specified as the name of the column to determine the marker size:
# Modify the speed column to be more "spread out" in size
yields['Speed(plot)'] = (yields['speed']-3200).pow(2) / 1000
yields
fig = yields.plot.scatter(
x='temperature',
y='yield',
width=500,
height=400,
title='Yield [%] as a function of temperature [°C]; colours for baffles; size related to impeller speed',
color="baffles",
size='Speed(plot)',
)
fig.update_layout(xaxis_title_text='Temperature [°C]')
fig.update_layout(yaxis_title_text='Yield [%]')
Once you have created your plot you can of course include it in a document. Click on the "camera" icon at the top right hover area.
Some examples will be shown on what you can do with data frames.
Try the following in the space below:
from datetime import datetime
now = datetime.now()
# Do some things with `now`:
print(now)
print(now.year)
print(f"Which weekday is it today? It is day: {now.isoweekday()} in the week")
print(now.second)
print(now.seconds) # use singular
After trying the above, try these lines below. Comment out the lines that cause errors.
later = datetime.now()
print(later)
print(type(later))
print(later - now)
print(now - later)
print(now + later)
delta = later - now
print(delta)
print(type(delta))
print(f"There were this many seconds between 'now' and 'later': {delta.total_seconds()}")
print(later + delta)
sometime_in_the_future = later + delta*1000
print(sometime_in_the_future)
print(sometime_in_the_future - now)