1 Review so far
2 Data set import and basic checks
3 List comprehensions
4 Reading only certain rows
5 Import specific columns only
6 Dropping columns or rows
7 Setting an index
8 Visualizing and deleting missing values
9 Using iloc and loc
10 Dropping missing values, specifying a threshold
11 Filtering rows
12 Filtering with the .query function
13 New data set: raw material properties
14 Single level groupby
15 Two new concepts: cutting and subplots
16 Extending groupby to show multiple outputs
17 Multilevel groupby
18 Multiple groupby summaries
19 Add a new column, not at the end (right-hand side)
20 Replace values in a column
21 Styling table displays
22 Smoothing a curve
23 Challenge

All content here is under a Creative Commons Attribution CC-BY 4.0 and all source code is released under a BSD-2 clause license.

Please reuse, remix, revise, and reshare this content in any way, keeping this notice.

Course overview¶

This is the sixth, and final, module of several (11, 12, 13, 14, 15 and 16), which refocuses the course material in the first 10 modules in a slightly different way. It places more emphasis on

dealing with data: importing, merging, filtering;
calculations from the data;
visualization of it.

In short: *how to extract value from your data*.

Review so far¶

In module 11 we learned about

creating variables, and showing their type,
performing basic calculations, and the math library,
lists, as one of the most fundamental Python objects.

In the module 12 we took this a step further:

and introduced the Pandas library, for Series and DataFrame objects,
learned how to import and write Excel files,
do basic operations on DataFrames, and
learned about another fundamental Python type, the dictionary.

Module 13 we introduced:

a general workflow for data processing
and how to visualize data with Pandas:
- box plot,
- time series (sequence) plot, and
- scatter plots [including showing how you can visualize 5 dimensions!]

Module 14 we saw how to create:

for loops, for when we need to do things over and over,
but we also saw the groupby function, which does actions repeatedly on sub-groups of your data.
We also introduced the correlation matrix.

Then in module 15 we saw:

that we could visualize the correlation matrix (2D histogram), to find candidates for regression,
using the LinearRegression tool from a new library, scikit-learn.
We also used another new library, seaborn, to visualize these regression models.

Module 16 Overview¶

In this module we will cover a collection of last loose ends. Things you will use regularly in your work.

Reading in subsets of data
Handling missing values with Pandas
Filtering data, and the multi-level groupby capability of Panda
Effective table display in Pandas

Most of them come from this list, with some modifications: https://towardsdatascience.com/30-examples-to-master-pandas-f8a2da751fa4

Data set import and basic checks¶

We will use a data set that is related to food consumption. It shows, in a relative way, the food consumption habits of European (and soon to be former EU) countries.

In [ ]:

import time 
import pandas as pd
import numpy as np
import seaborn as sns
from scipy.interpolate import UnivariateSpline
from scipy.signal import savgol_filter
import plotly.graph_objects as go
import plotly.io as pio

In [ ]:

pio.renderers.default = "iframe" # "notebook" # jupyterlab
pd.options.plotting.backend = "plotly"

In [ ]:

df = pd.read_csv("https://openmv.net/file/food-consumption.csv")
display(df.info())
display(df)

Visualizing the correlation matrix is essential to help understanding relationships. Use the code and the plot below to help answer:

Countries which consume garlic more than average, also seem to consume a higher amount of ...
Which variables are negatively correlated with "Real coffee" consumption?
Countries with higher consumption of "Crisp bread" also show high consumption of which other products?

In [ ]:

sns.set(rc={'figure.figsize': (15, 15)})
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(df.corr(), cmap=cmap,  square=True, linewidths=0.5, cbar_kws={"shrink": 0.8});

List comprehensions¶

"List comprehensions" are a quick way to make a list. You can read more, and see some examples here: https://realpython.com/list-comprehension-python/#using-list-comprehensions

In [ ]:

print(  [i     for i in range(10)]  )
print(  [i*2+1 for i in range(10)]  )
print(  [i*2   for i in range(10) if i > 4]  )
print(  [i     for i in range(10) if i % 2 == 1]  )
print(  [i     for i in range(10) if i % 2 == 0]  )
print(  [i     for i in range(10) if i % 4 == 1]  )

Reading only certain rows¶

Imagine you had a large data set, and only needed certain rows for your calculations/visualization later on. You can use the nrows and skiprows arguments to read only a subset of the data.

In [ ]:

df_subset = pd.read_csv("https://openmv.net/file/food-consumption.csv", nrows=5)
display(df_subset)

df_partial = pd.read_csv("https://openmv.net/file/food-consumption.csv", skiprows=[2, 3, 4])
display(df_partial)

# Requires an extra `engine` input
df_bottom = pd.read_csv("https://openmv.net/file/food-consumption.csv", skipfooter=12) #, engine='python') <- intentionally left out for demo
display(df_bottom)

In [ ]:

# Skipping every 3rd row, using a list comprehension...
print([i for i in range(40) if i%3 ==1])
df_partial = pd.read_csv("https://openmv.net/file/food-consumption.csv", 
                         skiprows=[i for i in range(40) if i%3 ==1])
display(df_partial)

Import specific columns only¶

If you know the names of the columns you need, you can use the usecols input.

Note: this also works for Excel files! You can say, for example, usecols="F,G,BQ" if you need columns F, G and BQ only.

In [ ]:

df_subset = pd.read_csv("https://openmv.net/file/food-consumption.csv", 
                        usecols=["Country", "Sweetener", "Biscuits", "Powder soup", "Tin soup"])
display(df_subset)

Dropping columns or rows¶

Conversely, you can read in the whole data set, and drop away the columns or rows you do not need.

In [ ]:

df = (
    pd.read_csv("https://openmv.net/file/food-consumption.csv")
    .drop(["Sweetener", "Biscuits", "Powder soup", "Tin soup"], axis=1)
)
display(df)
df.shape

# Also drop some rows: drop away every 3rd row.
# You can also leave away 'axis=0' (because that's the default)
df_subset = df.drop([i for i in range( df.shape[0] ) if i%3 ==1] , axis=0)  
display(df_subset)

Setting an index¶

You can always make a column from your dataframe to be your index, using the set_index function.

In [ ]:

df = pd.read_csv("https://openmv.net/file/food-consumption.csv")
df = df.set_index('Country')
display(df)

# Or, in a single line, in a chained operation
df = (
    pd.read_csv("https://openmv.net/file/food-consumption.csv")
    .drop([i for i in range( df.shape[0] ) if i%3 ==1] , axis=0)
    .set_index('Country')
)
display(df)

Visualizing and deleting missing values¶

Pandas generally handles missing values well: for example, the df.mean() function will work even if there are missing values. But some mathematical tools cannot have missing values, such as when performing a linear regression. So deleting missing data first is an option. It is therefore helpful that you can:

Find how many missing values are there per column? Or per row?
Delete columns with missing values.
Deleting rows with any missing values.

In [ ]:

# Which columns have missing values:
df = pd.read_csv("https://openmv.net/file/food-consumption.csv").set_index('Country')
display(df.isna().sum())

# Which rows have missing values:
df.isna().sum(axis=1)

# Display missing values in a heat map
sns.set(rc={'figure.figsize': (10, 10)})
sns.heatmap(df.isna(), square=True, cbar_kws={"shrink": 0.5});

Confirm that the "Sweetener", "Biscuits", and "Yoghurt" columns are not present after running this command (these columns had missing values in them):

In [ ]:

# Delete columns with missing values
df.dropna(axis=1)

Confirm that the rows for "Sweden", "Finland", and "Spain", which had missing entries, are not present after this:

In [ ]:

# Delete rows with missing values
df.dropna(axis=0)

Dropping missing values in all rows, but only for a subset of the columns is possible. For example, drop only rows in the columns for "Sweetener" and "Yoghurt" (ignore the column for "Biscuits"):

In [ ]:

display(df.dropna(subset=["Sweetener", "Yoghurt"], axis=0))

# Note: you can also flip this around. Specify a subset of row names
#       in `subset` and delete from all columns, using `axis=1`.
df.dropna(subset=["Sweden"], axis=1)

Using iloc and loc¶

We learned about .iloc in the prior module. Let's look at this again, and emphasize the difference between .iloc and .loc. This article gives more details about the two if you want some more explanation.

In [ ]:

df = pd.read_csv("https://openmv.net/file/food-consumption.csv").set_index('Country')

# "Instant coffee" is column 1: make all these values missing
df.iloc[:, 1] = np.nan
display(df)

In [ ]:

# But what if don't know, or care, which column index it is? 
# When we know the column's name, then use ".loc" 
df.loc[:, "Tea"] = np.nan
df

In [ ]:

# Or you can use a list of column names:
df.loc[:, ["Potatoes","Frozen fish"]] = 98.76
df

In [ ]:

# You can use a mixture of .iloc and .loc:
df.iloc[[0, 1, 2], :].loc[:, "Tin soup"]

In [ ]:

# but this is less code:
df.iloc[[0, 1, 2], :]["Tin soup"]

In [ ]:

# or even less this way:
df.iloc[[0, 1, 2]]["Tin soup"]

In [ ]:

# Or using .loc
df.loc["Germany":"France", "Tin soup"]

Dropping missing values, specifying a threshold¶

If you want to delete a column only if there are more than a certain number of missing values:

Read the data
Make a column have a high number of missing values (for demonstration purposes; normally the column is already problematic)
Remove that column, because it has a high degree of missing values.

In [ ]:

# Read the data, and make every 3rd row a missing value for column "Tea"
df = pd.read_csv("https://openmv.net/file/food-consumption.csv").set_index('Country')

df.iloc[[i for i in range(16) if i%3 == 1]]["Tea"] = np.nan

# The above code generates a warning. Why?
display(df)

In [ ]:

# How to make this warning go away? As suggested by the warning, use ".loc" instead.
#     df.loc[row_indexer, col_indexer] = np.nan

# Create a variable containing all row names:
row_indexer = df.index

# Now take every third row name:
row_indexer = df.index[  [i for i in range(16) if i%3 ==1]  ]
row_indexer

In [ ]:

# Then, set these rows to have missing values:
df.loc[row_indexer, "Tea"] = np.nan
display(df)
df.isna().sum()

In [ ]:

# Finally, we can now delete columns with a threshold (degree) of missing values
# What value should you fill in here?
display(df.dropna(thresh=11, axis=1))

Filtering rows¶

Find which countries have "Olive oil" consumption of more than 50?

In [ ]:

df = pd.read_csv("https://openmv.net/file/food-consumption.csv").set_index('Country')
df[ df["Olive oil"] > 50  ]

Which countries have "Olive oil" more than 50, and "Garlic" more than 40?

In [ ]:

df[   (df["Olive oil"] > 50) & (df["Garlic"] > 40)   ]

Which countries have "Tea" more than 80, or "Oranges" more than 90?

In [ ]:

df[(df["Tea"] > 80) | (df["Oranges"] > 90)]

Filtering with the `.query` function¶

It is sometimes more natural to filter with the .query function:

In [ ]:

display(df.query("30 < Tea < 80"))

# or use backticks if the column name has a space:
df.query("10 < `Tin soup` < 20")

You can have multiple queries:

Find the countries which have "Real coffee" and "Tea" consumption above 70.

In [ ]:

df.query("(`Real coffee` > 70) or (Tea > 70)")

Really powerful is the ability to reference one column against another.

Find all countries where more "Instant coffee" is drunk more than "Real coffee". These are countries to avoid visiting. What else do you notice about these countries eating habits?

In [ ]:

df.query("`Instant coffee` > `Real coffee`")

New data set: raw material properties¶

For the rest of the notebook we will switch to a new data set, where we characterize the properties of a raw material. As each batch of raw material is acquired, there are 6 measurements taken. There is also an indicator variable (categorical variable) on whether the raw materials outcome was (Adequate), or not (Poor).

In [ ]:

df = pd.read_csv("https://openmv.net/file/raw-material-characterization.csv").set_index("Lot number")
display(df)

# Note that the Outcome column is an object. We can explicitly convert it to a categorical variable:
df["Outcome"] = df["Outcome"].astype('category')
display(df.info())

Single level `groupby`¶

Recall the groupby function from two modules ago, which we applied as follows in a loop to create a plot for each group:

In [ ]:

# Groupby: for plotting
for outcome, subset in df.groupby("Outcome"):
    fig=subset.plot.scatter(x='Size5', y="Size15")
    fig.update_layout(
        xaxis_range=[10, 16], 
        yaxis_range=[18, 45],
        width=500,
    )
    fig.show()
    time.sleep(0.5)

Two new concepts: cutting and subplots¶

Sometimes you want the plots shown in a matrix, or a grid. This is called a subplot. You can read the Plotly documentation about subplots on their site.

Sometimes you want to split your (continuous) data into smaller groups (or categories). This can be done with the cut function in Pandas.

In [ ]:

# Cut the "TMA" variable. These values lie between 46.2 and 68.0, so let's create 3 bins, with 4 cuts at [40, 50, 55, 100]
display(pd.cut(df['TMA'], bins=[40, 50, 55, 100]))
df['TMA categories'] = pd.cut(df['TMA'], bins=[40, 50, 55, 100])

In [ ]:

print(f'Using groupby you can loop; there would be {df.groupby("TMA categories").ngroups} groups. But we will use subplots instead ...')

# You can use this code below as a general recipe for creating subplots.
nrows = 2
ncols = 2
from plotly.subplots import make_subplots
fig = make_subplots(
    rows=nrows, 
    cols=ncols, 
    shared_xaxes=False,
    shared_yaxes=False,
    vertical_spacing=0.15,
    horizontal_spacing=0.10,
    subplot_titles=[str(val) for val in df.groupby("TMA categories").groups.keys()],
    start_cell="top-left") # "bottom-left"

# In a loop, create each subplot 
row = col = 1
for category, subset in df.groupby("TMA categories"):
    fig.add_trace(
        go.Scatter(
            x=subset['TMA'], 
            y=subset['Size15'],
            mode="markers",
            name=str(category),
            showlegend=True,
        ),
        row=row, 
        col=col,       
    )
    
    # Bump up the counters for the next plot ...
    col += 1
    if col > ncols:
        col = 1
        row += 1

fig.update_layout(width=1000)
fig.show()

Extending groupby to show multiple outputs¶

In [ ]:

# Or using groupby for a single summary
display(df.groupby("Outcome").mean())
display(df.groupby("Outcome").std())

# Or, call all the summaries together. We will explain the .agg function below.
df.groupby("Outcome").agg(["mean", "std"]).round(2)

Multilevel groupby¶

We can also use groupby for multiple levels. Imagine we have a second categorical variable, or some other variable with few discrete values:

In [ ]:

# Using what you learned above, you can see how we can quickly create a new column with .cut(...)
df["Size"] = pd.cut(df['Size5'], bins=[0, 13, np.inf])   # --> intentionally left out for now:   labels=["Small", "Large"]
df

In [ ]:

# Now you can use a multi-level groupby:
display(df.groupby(["Outcome", "Size", ]).count())  # <-- redundant, use the next one instead
display(df.groupby(["Outcome", "Size", ]).size())

Multiple groupby summaries¶

In the above, we had to write 2 lines with the groupby function: once for size and once for mean. But you can get them both in 1 table, using the agg function. agg is short hand for aggregation (which means to form things into a cluster).

In [ ]:

# These 2 lines do exactly the same:
display(  df.groupby(["Outcome", "Size"]).mean()  )
display(  df.groupby(["Outcome", "Size"]).agg('mean') )

In [ ]:

# Now extend it: we have 2 groups (vertical index axis) and 2 .agg functions (horizontal column axis):
display(  df.groupby(["Outcome", "Size"]).agg(["mean", "std"]) )

In [ ]:

# You can specify an entire collection of aggregations, and on which columns you want to do that:
agg_func_math = ['count', 'mean', 'median', 'min', 'max', 'std']
df.groupby(['Outcome'])[["Size5", "TGA"]].agg(agg_func_math).round(2)

Add a new column, not at the end (right-hand side)¶

We saw above that we can create a new column, but that it automatically gets added on the right-hand side of the data frame. If you would like the column elsewhere, use the .insert() function.

In [ ]:

df.insert(0, 'EmptyColumn', np.nan)
df.insert(3, 'Column of ones', [1] * df.shape[0])
df

Replace values in a column¶

We can do a "search and replace" function on the values in a data frame.

Imagine if we wanted to change the Outcome column, and instead of Adequate and Poor we would rather have Good and Bad.

In [ ]:

df["Outcome-newname"] = df['Outcome'].replace({"Adequate": "Good", "Poor": "Bad"})
df

# Try setting the `Outcome` column to numeric values: Adequate -> 1 and Poor -> 0

Styling table displays¶

To help emphasize your message in a table, you might want to colour your table appropriately.

You can read about all the options on this page in the Pandas documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html

In [ ]:

df = pd.read_csv("https://openmv.net/file/raw-material-characterization.csv").set_index("Lot number")
df.style.bar(color=['lightblue'])

In [ ]:

# How to style only a subset of the columns:
(df.style
   .hide_index() # if you don't need your index column, you can drop it away 
   .bar(color='green', subset=['Size5', 'Size10', 'Size15'])
   .set_caption('Raw material outcomes')
)

In [ ]:

import seaborn as sns
cmap = cmap=sns.diverging_palette(0, 50, as_cmap=True)

# Double sort: first on `Outcome`, then on `Size5`
df.sort_values(["Outcome", "Size5"], inplace=True)

(df[["Outcome", "Size5", "Size10", "Size15"]].style
    .background_gradient(cmap)
    .format(precision=2)   # number of places after the decimal
)

In [ ]:

# Show missing values with a colour. First, create an artificial missing value:
df.iloc[4, 3] = np.nan
df.head(7).style.format(precision=2).highlight_null('red')

In [ ]:

# Show the minimum and maximum values with different colours:
(df.style
   .format(precision=2)
   .highlight_min(axis=0, color="lightblue")
   .highlight_max(axis=0, color='orange')
   .highlight_null('red')
)

Smoothing a curve¶

We often want a smoother version of the raw data. One option to use the Savitzky-Golay filter; though there are a number of other options.

In [ ]:

seq = [1.87, 1.88, 1.89, 1.9, 1.92, 1.96, 2.0, 2.1, 2.12, 2.27, 2.29, 2.28, 2.44, 2.48, 2.52, 2.53, 2.54, 2.55, 2.56, 2.57]
absorbances = pd.Series(seq)
time_points = [    81 + i*9  for i in range(len(seq)) ]
df = pd.DataFrame(dict(absorbances=absorbances, time_points=time_points))
fig=df.plot.line(x="time_points", y="absorbances", title="Raw data")
fig.add_hline(y=2.25, line_color="purple", line_dash="dash")

In [ ]:

from scipy.interpolate import UnivariateSpline
from scipy.signal import savgol_filter

filtered_series = savgol_filter(
    x=seq,             
    window_length=5, 
    polyorder=3, 
)  

# Create a data frame of this and plot it:
df_smoothed = pd.DataFrame(
    dict(
        time_points=time_points, 
        filtered_series=filtered_series 
    )
)

# Plot the raw data and smoothed data:
fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x= df['time_points'], 
        y= df["absorbances"],
        mode="markers",
        name="Raw data",
    )
)
fig.add_trace(
    go.Scatter(
        x=df_smoothed['time_points'], 
        y=df_smoothed["filtered_series"],
        mode="lines",
        name="Smoothed fit",
    )
)
fig.add_hline(y=2.25, line_color="purple", line_dash="dash")

In [ ]:

# Note! here we flip x and y around: we want to know what is the expected
# values of time, given the absorbance value of 2.25.
spline = UnivariateSpline(
    x = df_smoothed["filtered_series"],   
    y = df_smoothed['time_points'], 
    #bc_type='not-a-knot',
    #extrapolate=None
)

# Interpolate a new x-axis (absorbance axis) on a very fine scale, between 2.1 and 2.4
interpolated_abs = np.arange(2.1, 2.4, 0.01)
predicted_time = spline(interpolated_abs)

# Create a data frame of this and plot it:
df_interpolated = pd.DataFrame(
    dict(
        predicted_time=predicted_time, 
        interpolated_abs=interpolated_abs 
    )
)
# At what timepoint does the line cross 2.25? Answer is shown in the table: 166.357 seconds
display(df_interpolated.iloc[(df_interpolated["interpolated_abs"] - 2.25).abs().argmin()])

# Plot the raw data and smoothed data:
fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x= df['time_points'], 
        y= df["absorbances"],
        mode="markers",
        name="Raw data",
    )
)
fig.add_trace(
    go.Scatter(
        x=df_smoothed['time_points'], 
        y=df_smoothed["filtered_series"],
        mode="lines",
        name="Smoothed fit",
    )
)
fig.add_trace(
    go.Scatter(
        x= df_interpolated['predicted_time'], 
        y= df_interpolated["interpolated_abs"],
        mode="lines",
        name="Interpolated spline fit",
    )
)
fig.add_hline(y=2.25, line_color="purple", line_dash="dash")
fig.show()

Challenge¶

Apply table styling to the Foods data set, at the start of this notebook. Can you visualize, in a colourful way, some of the food consumption trends?
Apply this same table styling to a small/medium sized data set of your own.

In [ ]:

End of this notebook.

Table of Contents

Course overview¶

Review so far¶

Module 16 Overview¶

Data set import and basic checks¶

List comprehensions¶

Reading only certain rows¶

Import specific columns only¶

Dropping columns or rows¶

Setting an index¶

Visualizing and deleting missing values¶

Using iloc and loc¶

Dropping missing values, specifying a threshold¶

Filtering rows¶

Filtering with the .query function¶

New data set: raw material properties¶

Single level groupby¶

Two new concepts: cutting and subplots¶

Extending groupby to show multiple outputs¶

Multilevel groupby¶

Multiple groupby summaries¶

Add a new column, not at the end (right-hand side)¶

Replace values in a column¶

Styling table displays¶

Smoothing a curve¶

Challenge¶

Filtering with the `.query` function¶

Single level `groupby`¶