All content here is under a Creative Commons Attribution CC-BY 4.0 and all source code is released under a BSD-2 clause license.
Please reuse, remix, revise, and reshare this content in any way, keeping this notice.
This is the sixth, and final, module of several (11, 12, 13, 14, 15 and 16) which refocus the course material from the first 10 modules in a slightly different way. In short, it places more emphasis on *how to extract value from your data*.
In module 11 we learned about the `type` function and the `math` library. In module 12 we took this a step further: `Series` and `DataFrame` objects, and the `dict`ionary. In module 13 we introduced a general workflow for data processing, and how to visualize data with Pandas. In module 14 we saw how to use the `groupby` function, which does actions repeatedly on sub-groups of your data. Then in module 15 we saw the `LinearRegression` tool from a new library, scikit-learn, and used `seaborn` to visualize these regression models.

In this module we will cover a collection of last loose ends: things you will use regularly in your work, including more of the `groupby` capability of Pandas. Most of them come from this list, with some modifications: https://towardsdatascience.com/30-examples-to-master-pandas-f8a2da751fa4
We will use a data set that is related to food consumption. It shows, in a relative way, the food consumption habits of European (and soon-to-be former EU) countries.
import time
import pandas as pd
import numpy as np
import seaborn as sns
from scipy.interpolate import UnivariateSpline
from scipy.signal import savgol_filter
import plotly.graph_objects as go
import plotly.io as pio
pio.renderers.default = "iframe" # "notebook" # jupyterlab
pd.options.plotting.backend = "plotly"
df = pd.read_csv("https://openmv.net/file/food-consumption.csv")
display(df.info())
display(df)
Visualizing the correlation matrix is essential to help understand relationships. Use the code and the plot below to help answer questions about the data:
sns.set(rc={'figure.figsize': (15, 15)})
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(df.corr(), cmap=cmap, square=True, linewidths=0.5, cbar_kws={"shrink": 0.8});
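If you also want the strongest relationships as numbers rather than colours, one possible approach (an addition to the original material, so treat it as a sketch) is to flatten the correlation matrix and sort it:
# Sketch: list the most strongly correlated food pairs.
# Keep only the numeric columns before computing correlations.
corr = df.select_dtypes("number").corr()
pairs = corr.unstack()                                   # flatten to (row, column) pairs
pairs = pairs[pairs < 1.0].sort_values(ascending=False)  # drop the self-correlations of 1.0
print(pairs.iloc[::2].head(5))                           # each pair appears twice (A-B and B-A)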
"List comprehensions" are a quick way to make a list. You can read more, and see some examples here: https://realpython.com/list-comprehension-python/#using-list-comprehensions
print( [i for i in range(10)] )
print( [i*2+1 for i in range(10)] )
print( [i*2 for i in range(10) if i > 4] )
print( [i for i in range(10) if i % 2 == 1] )
print( [i for i in range(10) if i % 2 == 0] )
print( [i for i in range(10) if i % 4 == 1] )
Imagine you had a large data set, and only needed certain rows for your calculations or visualizations later on. You can use the `nrows` and `skiprows` arguments to read only a subset of the data.
df_subset = pd.read_csv("https://openmv.net/file/food-consumption.csv", nrows=5)
display(df_subset)
df_partial = pd.read_csv("https://openmv.net/file/food-consumption.csv", skiprows=[2, 3, 4])
display(df_partial)
# Requires an extra `engine` input
df_bottom = pd.read_csv("https://openmv.net/file/food-consumption.csv", skipfooter=12) #, engine='python') <- intentionally left out for demo
display(df_bottom)
# Skipping every 3rd row, using a list comprehension...
print([i for i in range(40) if i%3 ==1])
df_partial = pd.read_csv("https://openmv.net/file/food-consumption.csv",
skiprows=[i for i in range(40) if i%3 ==1])
display(df_partial)
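As an aside (a small extra, not in the original notes): `skiprows` also accepts a callable, which avoids having to guess an upper bound such as `range(40)`:
# Same result, using a callable instead of a precomputed list:
df_partial = pd.read_csv("https://openmv.net/file/food-consumption.csv",
                         skiprows=lambda i: i % 3 == 1)
display(df_partial)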
If you know the names of the columns you need, you can use the `usecols` input.
Note: this also works for Excel files! You can say, for example, `usecols="F,G,BQ"` if you need columns F, G and BQ only.
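For example, a quick sketch (the Excel file name here is made up for illustration):
# Hypothetical example: read only columns F, G and BQ from an Excel file.
# "my-data.xlsx" is a made-up file name for illustration.
df_excel = pd.read_excel("my-data.xlsx", usecols="F,G,BQ")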
df_subset = pd.read_csv("https://openmv.net/file/food-consumption.csv",
usecols=["Country", "Sweetener", "Biscuits", "Powder soup", "Tin soup"])
display(df_subset)
Conversely, you can read in the whole data set, and drop away the columns or rows you do not need.
df = (
pd.read_csv("https://openmv.net/file/food-consumption.csv")
.drop(["Sweetener", "Biscuits", "Powder soup", "Tin soup"], axis=1)
)
display(df)
df.shape
# Also drop some rows: drop away every 3rd row.
# You can also leave away 'axis=0' (because that's the default)
df_subset = df.drop([i for i in range( df.shape[0] ) if i%3 ==1] , axis=0)
display(df_subset)
You can always make one of the columns of your data frame the index, using the `set_index` function.
df = pd.read_csv("https://openmv.net/file/food-consumption.csv")
df = df.set_index('Country')
display(df)
# Or, in a single line, in a chained operation
df = (
pd.read_csv("https://openmv.net/file/food-consumption.csv")
.drop([i for i in range( df.shape[0] ) if i%3 ==1] , axis=0)
.set_index('Country')
)
display(df)
Pandas generally handles missing values well: for example, the `df.mean()` function will work even if there are missing values. But some mathematical tools cannot handle missing values, such as when performing a linear regression, so deleting missing data first is an option. It is therefore helpful that you can:
# Which columns have missing values:
df = pd.read_csv("https://openmv.net/file/food-consumption.csv").set_index('Country')
display(df.isna().sum())
# Which rows have missing values:
df.isna().sum(axis=1)
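To see the offending rows themselves, rather than just the counts, you can filter with a boolean mask (a small addition to the original material):
# Keep only rows that contain at least one missing value:
display(df[df.isna().any(axis=1)])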
# Display missing values in a heat map
sns.set(rc={'figure.figsize': (10, 10)})
sns.heatmap(df.isna(), square=True, cbar_kws={"shrink": 0.5});
Confirm that the "Sweetener", "Biscuits", and "Yoghurt" columns are not present after running this command (these columns had missing values in them):
# Delete columns with missing values
df.dropna(axis=1)
Confirm that the rows for "Sweden", "Finland", and "Spain", which had missing entries, are not present after this:
# Delete rows with missing values
df.dropna(axis=0)
Dropping rows with missing values, while only checking a subset of the columns, is also possible. For example, drop only the rows with missing values in the "Sweetener" and "Yoghurt" columns (ignoring the "Biscuits" column):
display(df.dropna(subset=["Sweetener", "Yoghurt"], axis=0))
# Note: you can also flip this around. Specify a subset of row names
# in `subset` and delete from all columns, using `axis=1`.
df.dropna(subset=["Sweden"], axis=1)
We learned about `.iloc` in the prior module. Let's look at this again, and emphasize the difference between `.iloc` (selecting by integer position) and `.loc` (selecting by label). This article gives more details about the two if you want some more explanation.
df = pd.read_csv("https://openmv.net/file/food-consumption.csv").set_index('Country')
# "Instant coffee" is column 1: make all these values missing
df.iloc[:, 1] = np.nan
display(df)
# But what if we don't know, or don't care, which column index it is?
# When we know the column's name, then use ".loc"
df.loc[:, "Tea"] = np.nan
df
# Or you can use a list of column names:
df.loc[:, ["Potatoes","Frozen fish"]] = 98.76
df
# You can use a mixture of .iloc and .loc:
df.iloc[[0, 1, 2], :].loc[:, "Tin soup"]
# but this is less code:
df.iloc[[0, 1, 2], :]["Tin soup"]
# or even less this way:
df.iloc[[0, 1, 2]]["Tin soup"]
# Or using .loc. Note: unlike integer positions, label slices with .loc
# include BOTH endpoints, so "Germany" and "France" are each returned.
df.loc["Germany":"France", "Tin soup"]
If you want to delete a column only if there are more than a certain number of missing values:
# Read the data, and make every 3rd row a missing value for column "Tea"
df = pd.read_csv("https://openmv.net/file/food-consumption.csv").set_index('Country')
df.iloc[[i for i in range(16) if i%3 == 1]]["Tea"] = np.nan
# The above code generates a warning. Why?
display(df)
# How to make this warning go away? As suggested by the warning, use ".loc" instead.
# df.loc[row_indexer, col_indexer] = np.nan
# Create a variable containing all row names:
row_indexer = df.index
# Now take every third row name:
row_indexer = df.index[ [i for i in range(16) if i%3 ==1] ]
row_indexer
# Then, set these rows to have missing values:
df.loc[row_indexer, "Tea"] = np.nan
display(df)
df.isna().sum()
# Finally, we can now delete columns with a threshold (degree) of missing values
# What value should you fill in here?
display(df.dropna(thresh=11, axis=1))
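A note on `thresh` (my explanation, worth verifying against the Pandas documentation): it is the minimum number of *non-missing* values a column must have to be kept when using `axis=1`. Our data frame has 16 rows and the "Tea" column now has 5 missing values, so `thresh=11` just barely keeps it, while `thresh=12` drops it:
# thresh counts NON-missing entries: a column survives only if it has
# at least `thresh` real values. 16 rows - 5 missing = 11 for "Tea".
display(df.dropna(thresh=12, axis=1))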
"Olive oil"
consumption of more than 50?df = pd.read_csv("https://openmv.net/file/food-consumption.csv").set_index('Country')
df[ df["Olive oil"] > 50 ]
"Olive oil"
more than 50, and "Garlic"
more than 40?df[ (df["Olive oil"] > 50) & (df["Garlic"] > 40) ]
"Tea"
more than 80, or "Oranges"
more than 90?df[(df["Tea"] > 80) | (df["Oranges"] > 90)]
The .query function
It is sometimes more natural to filter with the `.query` function:
display(df.query("30 < Tea < 80"))
# or use backticks if the column name has a space:
df.query("10 < `Tin soup` < 20")
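One more `.query` feature worth knowing (an addition to the original list): you can reference ordinary Python variables inside the query string by prefixing them with `@`:
# Reference an outside variable with "@":
threshold = 80
display(df.query("Tea > @threshold"))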
You can combine multiple conditions in one query. Find the countries which have "Real coffee" or "Tea" consumption above 70.
df.query("(`Real coffee` > 70) or (Tea > 70)")
Really powerful is the ability to reference one column against another.
Find all countries where more "Instant coffee" is drunk than "Real coffee". These are countries to avoid visiting. What else do you notice about these countries' eating habits?
df.query("`Instant coffee` > `Real coffee`")
For the rest of the notebook we will switch to a new data set, where we characterize the properties of a raw material. As each batch of raw material is acquired, there are 6 measurements taken. There is also an indicator variable (categorical variable) on whether the raw material's outcome was adequate (`Adequate`) or not (`Poor`).
df = pd.read_csv("https://openmv.net/file/raw-material-characterization.csv").set_index("Lot number")
display(df)
# Note that the Outcome column is an object. We can explicitly convert it to a categorical variable:
df["Outcome"] = df["Outcome"].astype('category')
display(df.info())
groupby
Recall the `groupby` function from two modules ago, which we applied as follows in a loop to create a plot for each group:
# Groupby: for plotting
for outcome, subset in df.groupby("Outcome"):
    fig = subset.plot.scatter(x='Size5', y="Size15")
    fig.update_layout(
        xaxis_range=[10, 16],
        yaxis_range=[18, 45],
        width=500,
    )
    fig.show()
    time.sleep(0.5)
Sometimes you want the plots shown in a matrix, or a grid. This is called a subplot. You can read the Plotly documentation about subplots on their site.
Sometimes you want to split your (continuous) data into smaller groups (or categories). This can be done with the cut function in Pandas.
# Cut the "TMA" variable. These values lie between 46.2 and 68.0, so let's create 3 bins, using the 4 bin edges [40, 50, 55, 100]
display(pd.cut(df['TMA'], bins=[40, 50, 55, 100]))
df['TMA categories'] = pd.cut(df['TMA'], bins=[40, 50, 55, 100])
print(f'Using groupby you can loop; there would be {df.groupby("TMA categories").ngroups} groups. But we will use subplots instead ...')
# You can use this code below as a general recipe for creating subplots.
nrows = 2
ncols = 2
from plotly.subplots import make_subplots
fig = make_subplots(
    rows=nrows,
    cols=ncols,
    shared_xaxes=False,
    shared_yaxes=False,
    vertical_spacing=0.15,
    horizontal_spacing=0.10,
    subplot_titles=[str(val) for val in df.groupby("TMA categories").groups.keys()],
    start_cell="top-left",  # or "bottom-left"
)

# In a loop, create each subplot
row = col = 1
for category, subset in df.groupby("TMA categories"):
    fig.add_trace(
        go.Scatter(
            x=subset['TMA'],
            y=subset['Size15'],
            mode="markers",
            name=str(category),
            showlegend=True,
        ),
        row=row,
        col=col,
    )
    # Bump up the counters for the next plot ...
    col += 1
    if col > ncols:
        col = 1
        row += 1

fig.update_layout(width=1000)
fig.show()
# Or using groupby for a single summary
display(df.groupby("Outcome").mean())
display(df.groupby("Outcome").std())
# Or, call all the summaries together. We will explain the .agg function below.
df.groupby("Outcome").agg(["mean", "std"]).round(2)
We can also use `groupby` for multiple levels. Imagine we have a second categorical variable, or some other variable with few discrete values:
# Using what you learned above, we can quickly create a new column with pd.cut(...)
df["Size"] = pd.cut(df['Size5'], bins=[0, 13, np.inf]) # --> intentionally left out for now: labels=["Small", "Large"]
df
# Now you can use a multi-level groupby.
# .count() reports the number of non-missing values per column (redundant here);
# .size() gives a single count per group, which is what we want:
display(df.groupby(["Outcome", "Size"]).count()) # <-- redundant, use the next one instead
display(df.groupby(["Outcome", "Size"]).size())
In the above, we had to write 2 lines with the `groupby` function: once for `size` and once for `mean`. But you can get them both in 1 table, using the `agg` function. `agg` is shorthand for aggregation (which means to form things into a cluster).
# These 2 lines do exactly the same:
display( df.groupby(["Outcome", "Size"]).mean() )
display( df.groupby(["Outcome", "Size"]).agg('mean') )
# Now extend it: we have 2 groups (vertical index axis) and 2 .agg functions (horizontal column axis):
display( df.groupby(["Outcome", "Size"]).agg(["mean", "std"]) )
# You can specify an entire collection of aggregations, and on which columns you want to do that:
agg_func_math = ['count', 'mean', 'median', 'min', 'max', 'std']
df.groupby(['Outcome'])[["Size5", "TGA"]].agg(agg_func_math).round(2)
We saw above that we can create a new column, but that it automatically gets added on the right-hand side of the data frame. If you would like the column elsewhere, use the `.insert()` function.
df.insert(0, 'EmptyColumn', np.nan)
df.insert(3, 'Column of ones', [1] * df.shape[0])
df
We can do a "search and replace" on the values in a data frame.
Imagine we wanted to change the `Outcome` column, and instead of `Adequate` and `Poor` we would rather have `Good` and `Bad`.
df["Outcome-newname"] = df['Outcome'].replace({"Adequate": "Good", "Poor": "Bad"})
df
# Try setting the `Outcome` column to numeric values: Adequate -> 1 and Poor -> 0
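One possible answer to the exercise in the comment above (a sketch; the new column name is my choice):
# Since "Outcome" was converted to a categorical earlier, convert it to
# string first, then replace the text labels with numbers:
df["Outcome-numeric"] = df["Outcome"].astype(str).replace({"Adequate": 1, "Poor": 0})
display(df["Outcome-numeric"])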
To help emphasize your message in a table, you might want to colour your table appropriately.
You can read about all the options on this page in the Pandas documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html
df = pd.read_csv("https://openmv.net/file/raw-material-characterization.csv").set_index("Lot number")
df.style.bar(color=['lightblue'])
# How to style only a subset of the columns:
(df.style
.hide_index() # if you don't need your index column, you can drop it away
.bar(color='green', subset=['Size5', 'Size10', 'Size15'])
.set_caption('Raw material outcomes')
)
import seaborn as sns
cmap = sns.diverging_palette(0, 50, as_cmap=True)
# Double sort: first on `Outcome`, then on `Size5`
df.sort_values(["Outcome", "Size5"], inplace=True)
(df[["Outcome", "Size5", "Size10", "Size15"]].style
.background_gradient(cmap)
.format(precision=2) # number of places after the decimal
)
# Show missing values with a colour. First, create an artificial missing value:
df.iloc[4, 3] = np.nan
df.head(7).style.format(precision=2).highlight_null('red')
# Show the minimum and maximum values with different colours:
(df.style
.format(precision=2)
.highlight_min(axis=0, color="lightblue")
.highlight_max(axis=0, color='orange')
.highlight_null('red')
)
We often want a smoother version of the raw data. One option is to use the Savitzky-Golay filter, though there are a number of other options.
seq = [1.87, 1.88, 1.89, 1.9, 1.92, 1.96, 2.0, 2.1, 2.12, 2.27, 2.29, 2.28, 2.44, 2.48, 2.52, 2.53, 2.54, 2.55, 2.56, 2.57]
absorbances = pd.Series(seq)
time_points = [ 81 + i*9 for i in range(len(seq)) ]
df = pd.DataFrame(dict(absorbances=absorbances, time_points=time_points))
fig=df.plot.line(x="time_points", y="absorbances", title="Raw data")
fig.add_hline(y=2.25, line_color="purple", line_dash="dash")
from scipy.interpolate import UnivariateSpline
from scipy.signal import savgol_filter
filtered_series = savgol_filter(
x=seq,
window_length=5,
polyorder=3,
)
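As mentioned, there are other smoothing options besides Savitzky-Golay. One simple alternative (an illustration only; the rest of the notebook continues with the Savitzky-Golay result) is a centered moving average with Pandas:
# Alternative smoother: a centered rolling (moving) average.
# min_periods=1 avoids missing values at the two edges of the series.
rolling_smooth = df["absorbances"].rolling(window=5, center=True, min_periods=1).mean()
print(rolling_smooth.head())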
# Create a data frame of this and plot it:
df_smoothed = pd.DataFrame(
dict(
time_points=time_points,
filtered_series=filtered_series
)
)
# Plot the raw data and smoothed data:
fig = go.Figure()
fig.add_trace(
go.Scatter(
x= df['time_points'],
y= df["absorbances"],
mode="markers",
name="Raw data",
)
)
fig.add_trace(
go.Scatter(
x=df_smoothed['time_points'],
y=df_smoothed["filtered_series"],
mode="lines",
name="Smoothed fit",
)
)
fig.add_hline(y=2.25, line_color="purple", line_dash="dash")
# Note! Here we flip x and y around: we want to know the expected
# value of time, given an absorbance value of 2.25.
spline = UnivariateSpline(
x = df_smoothed["filtered_series"],
y = df_smoothed['time_points'],
#bc_type='not-a-knot',
#extrapolate=None
)
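One caveat worth checking (my note, not from the original recipe): `UnivariateSpline` expects its `x` values to be increasing, so flipping the axes only works because the smoothed absorbances rise monotonically. A quick check:
# The flipped-axes trick requires the smoothed series to be increasing:
is_increasing = np.all(np.diff(df_smoothed["filtered_series"]) > 0)
print(f"Smoothed series strictly increasing? {is_increasing}")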
# Interpolate a new x-axis (absorbance axis) on a very fine scale, between 2.1 and 2.4
interpolated_abs = np.arange(2.1, 2.4, 0.01)
predicted_time = spline(interpolated_abs)
# Create a data frame of this and plot it:
df_interpolated = pd.DataFrame(
dict(
predicted_time=predicted_time,
interpolated_abs=interpolated_abs
)
)
# At what timepoint does the line cross 2.25? Answer is shown in the table: 166.357 seconds
display(df_interpolated.iloc[(df_interpolated["interpolated_abs"] - 2.25).abs().argmin()])
# Plot the raw data and smoothed data:
fig = go.Figure()
fig.add_trace(
go.Scatter(
x= df['time_points'],
y= df["absorbances"],
mode="markers",
name="Raw data",
)
)
fig.add_trace(
go.Scatter(
x=df_smoothed['time_points'],
y=df_smoothed["filtered_series"],
mode="lines",
name="Smoothed fit",
)
)
fig.add_trace(
go.Scatter(
x= df_interpolated['predicted_time'],
y= df_interpolated["interpolated_abs"],
mode="lines",
name="Interpolated spline fit",
)
)
fig.add_hline(y=2.25, line_color="purple", line_dash="dash")
fig.show()
End of this notebook.