All content here is licensed under a Creative Commons Attribution (CC-BY 4.0) license, and all source code is released under a BSD 2-clause license. Please reuse, remix, revise, and reshare this content in any way, keeping this notice.
In module 5 and module 6 you used NumPy to create arrays, and perform mathematical calculations on them. Even though module 6 was about Python functions in general, the applications were all with NumPy.
Now we take a look at Pandas, currently the most widely used Python library for data manipulation. Along the way we will also learn about Jupyter notebooks, and Python's other important built-in data type, called *dictionaries*.
Once again, don't forget to use your version control system. Commit your work regularly, wherever you see this icon, and even more frequently than that. We will show this icon fewer times from now on, as making regular commits (*several per hour!*) should by now be almost automatic.
Dictionary data type

A dictionary is a Python *object* that is a flexible data container for other objects. It contains these objects using what are called *key*-*value* pairs. You create a dictionary like this:
random_objects = {'an int': 45,
'a float': 12.34,
'short_list': [1, 4, 7],
'longer list': [2, 4, 6, 9, 12, 16, 20, 25, 30, 36, 42],
'website': "https://learnche.org",
'nested_tuple': (1, 2.0, [3, 4], '5', (6, 7.0))
}
print(random_objects)
Notice that when you print the random_objects variable, you may not see the dictionary written back exactly as shown above. In Python versions before 3.7, dictionaries were an *unordered* container; since Python 3.7 they preserve the order in which the keys were inserted.
The *keys* and *values* of this dictionary are:
# These return list-like view objects of the keys and values:
random_objects.keys()
random_objects.values()
# What is the "type" of this dictionary?
print('The object is of: {}'.format(type(random_objects)))
# You can access individual elements:
random_objects['short_list']
In the above example, the keys were all *string* objects, but that is not required. You can use integers, floating-point values, strings, tuples, or a mixture of them. Any *hashable* object can be a key, but these cover most cases.
Dictionary values may be any *objects*, even other dictionaries. Yes, a dictionary within a dictionary is possible, and we will use this below. That is partly why we waited until now to introduce dictionaries.
Dictionary objects are excellent *containers*. If you need to return several objects from a function, collect them in a dictionary, and return that single object. It is not required, but it can make your code neater and more logical.
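A quick sketch of that idea (the function and the statistics it returns are made up purely for illustration):

```python
import numpy as np

def summarize(values):
    """Return several related results in one dictionary."""
    arr = np.asarray(values, dtype=float)
    return {'mean': arr.mean(),
            'std': arr.std(ddof=1),
            'n': arr.size}

stats = summarize([2.0, 4.0, 6.0])
print(stats['mean'])  # 4.0
```

The caller now receives one object, and each result is accessed by a descriptive name rather than by its position in a tuple.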
Create a dictionary for yourself with 5 key-value pairs, which summarizes a regression model. The key is the first item below, followed by a description of what you should create as the value:

- intercept: containing a floating-point value which is the intercept of your linear model
- slope: a floating-point slope value
- R2: the $R^2$ value of the regression model
- standard_error: a value which is the model's standard error
- residuals: a NumPy vector of residuals

You can create the above dictionary in a single line of code. But what if you want to add something new to an existing dictionary later?
d = { ... } # create your dictionary
d['new key'] = 'additional value'
And you can overwrite an existing key-value pair in the same way:
random_objects['an int'] = 'replaced with a string!'
This implies you can never have two identical keys. If you try to create a second key which already exists, it will overwrite the value associated with the existing key.
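A minimal sketch of this overwriting behaviour, including what happens if you repeat a key inside a dictionary literal:

```python
d = {'a': 1}
d['a'] = 2          # overwrites the value stored under key 'a'
print(d)            # {'a': 2}

# A duplicate key in a literal silently keeps only the last value:
print({'x': 10, 'x': 99})   # {'x': 99}
```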
At a later time, check out your code for the linear regression function you wrote earlier. Modify the code to return a single dictionary, instead of 4 items in a tuple.
Why use Pandas if you already can use NumPy? Pandas gives you labelled rows and columns through its central object, the DataFrame. Think of a spreadsheet. You can load the Pandas library, similar to how you load the NumPy library, with this command:
import pandas as pd
pd.__version__ # ensure you have a version >= 0.20
Before we start with DataFrames, there is a simpler object in Pandas, called a Series: roughly the equivalent of a vector in NumPy. Let's see some characteristics of a Series:
# Create a Series from a list. Put your own numbers here:
s = pd.Series([ ... ])
print(s)
Notice the index (the column to the left of your numbers)? Let's look at another example:
>>> s = pd.Series([ 5, 9, 1, -4, float('nan'), 5 ])
>>> print(s)
0 5.0
1 9.0
2 1.0
3 -4.0
4 NaN
5 5.0
dtype: float64
If you do not provide any labels for the rows, these will be automatically generated for you, starting from 0.
What if you have your own labels already?
# You call the function with two inputs. One input is
# mandatory (the first one), the other is optional.
s = pd.Series(data = [5, 9, 1, -4, float('nan'), 5 ],
index = ['a', 'b', 'c', 'd', 'e', 'f'])
s.values
type(s.values)
Ah ha! See what you get there in the output from s.values? Pandas is built on top of the NumPy library. The underlying data are still stored as NumPy arrays, and you can access them with the .values attribute. This is partly why understanding NumPy first is helpful before using Pandas.
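A small check of this (the .to_numpy() method, available since Pandas 0.24, is the more modern spelling of the same idea):

```python
import numpy as np
import pandas as pd

s = pd.Series([5, 9, 1], index=['a', 'b', 'c'])
print(type(s.values))      # <class 'numpy.ndarray'>
print(type(s.to_numpy()))  # also a NumPy array
print(s.values + 10)       # NumPy arithmetic works on it directly
```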
Lastly, give your series a nice name:
s.name = 'Random values'
print(s)
Series

The series you created above can be used in calculations. Notice how missing data are handled seamlessly.
import pandas as pd
s = pd.Series(data = [5, 9, 1, -4, float('nan'), 5 ],
index = ['a', 'b', 'c', 'd', 'e', 'f'],
name = 'Calculations')
s * 5 + 2
import numpy as np
np.sqrt(s)
The last line shows that Pandas and NumPy are compatible with each other. You can call NumPy operations on a Pandas object, and the result is returned as a Pandas object to you, with the row labels (indexes) intact.
Also notice that taking the square root of a negative number is not defined for real values, so the square root of $-4$ in row d returns a NaN.
Logical operations are possible too:
s > 4
s.isna()
np.sqrt(s).isna()
s.notna()
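These logical results are themselves boolean Series, so you can chain them with other methods; for example, to count or filter out the missing entries (a short sketch):

```python
import pandas as pd

s = pd.Series([5, 9, 1, -4, float('nan'), 5])
print(s.isna().sum())    # number of missing entries: 1
print((s > 4).sum())     # how many entries exceed 4? (3)
print(s[s.notna()])      # keep only the non-missing entries
```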
Like in NumPy, you can access the data entries using the square bracket notation. (Recent versions of Pandas prefer the explicit accessors: s.iloc[2] for access by position, and s.loc['e'] for access by label, but plain brackets also work.) In Pandas:
s[2]
s['e']
Selected subsets from the series can be accessed too, again using square brackets:
s[[2, 4, 0]]
s[['f', 'd', 'b']]
# Selection based on logic: I want only values greater than 4
s[s > 4]
You can also access a *range* of entries:
s[0:2]
s['a':'c']
Take a careful look at that output. You might have expected the two results to be the same length, but they are not! When accessing with the index names, you get the range inclusive of the last entry. When accessing by index number, it behaves consistently with Python and NumPy: the end point is excluded.
That makes sense. Names of the rows, the index, do not necessarily have to be sequential, like ['a', 'b', ... 'f'] as in this example. Often the index is unordered. E.g. if you have a series related to different Canadian cities, ['Toronto', 'Vancouver', 'Ottawa', 'Montréal', 'Halifax'], then with ['Vancouver':'Montréal'] you expect to see the middle 3 entries, inclusive of Montréal.
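The explicit .iloc (by position, end-exclusive) and .loc (by label, end-inclusive) accessors make this difference visible. A sketch, using the city example with made-up values:

```python
import pandas as pd

cities = pd.Series([10, 20, 30, 40, 50],
                   index=['Toronto', 'Vancouver', 'Ottawa', 'Montréal', 'Halifax'])
print(cities.iloc[0:2])                    # 2 entries: positions 0 and 1
print(cities.loc['Vancouver':'Montréal'])  # 3 entries, inclusive of Montréal
```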
Now we can combine two new concepts you have just learned: Dictionaries and Pandas.
raw_data = {'Germany': 27, 'Belgium': 13, 'Netherlands': 52, 'Sweden': 54, 'Ireland': 5}
tons_herring_eaten = pd.Series(raw_data)
print(tons_herring_eaten)
The row names (index) are taken from the dictionary keys associated with each value. (In Python versions before 3.7, dictionaries did not preserve order, so the rows would not necessarily appear in the order written above; in modern Python the insertion order is kept.)
Now try the following:

- Find the maximum value with the tons_herring_eaten.max() command!
- Sort the values with tons_herring_eaten.sort_values(). Print the variable afterwards.
- In older versions of Pandas you would have used tons_herring_eaten.sort() instead.
- Also try sorting by the row names: tons_herring_eaten.sort_index()
DataFrame

Back in module 5 you created a NumPy matrix of 5 daily temperatures (rows) for 4 cities (columns).
import numpy as np

temp_np = np.array([[7, 9, 12, 10],
                    [1, 4, 5, 2],
                    [-3, 1, -2, -3],
                    [-2, -1, -2, -2],
                    [-3, -1, -2, -4]])
print(('The temperatures are given one column per '
       'city, each row is a daily average '
       'temperature:\n{}').format(temp_np))
max_value_0 = np.amax(temp_np, axis=0)
Now let's try the same in Pandas, creating a DataFrame
from a list-of-lists:
import pandas as pd
rawdata = [[7, 9, 12, 10],
[1, 4, 5, 2],
[-3, 1, -2, -3],
[-2, -1, -2, -2],
[-3, -1, -2, -4]]
temp_df = pd.DataFrame(data=rawdata, columns = ['Toronto', 'Vancouver', 'Ottawa', 'Montreal'])
We saw then that you could calculate things on each *axis* of the NumPy array. You can also do this in Pandas:
# NumPy and Pandas do different things here!
temp_np.max()
temp_df.max()
# Here they are consistent
temp_np.max(axis=0)
temp_df.max(axis=0)
# Calculate across the rows, over all cities
temp_np.max(axis=1)
temp_df.max(axis=1)
In general, the same things you can calculate in NumPy, you can repeat in Pandas:
| NumPy | Pandas | Description |
|---|---|---|
| np.sum | df.sum | Sum of entries |
| np.prod | df.prod | Product (multiplication) of all entries |
| np.mean | df.mean | Arithmetic average |
| np.median | df.median | Median value |
| np.std | df.std | Standard deviation |
| np.var | df.var | Variance |
| np.min | df.min | Minimum value |
| np.max | df.max | Maximum value |
| np.argmin | df.idxmin | Index of the minimum |
| np.argmax | df.idxmax | Index of the maximum |
| np.diff | df.diff | Difference between entries |
temp_df['Toronto'] * 4 - temp_df['Montreal']
The above does exactly what you think it should.
>>> temp_df.diff().abs().max()
# and this?
>>> temp_df.diff().abs().max().idxmax()
You can stack up your sequential operations quite compactly in Pandas. It works because the output from one function is the input for the next one to the right. Refer back to the section on functions, if necessary.
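The chained expression above can be unpacked step by step; each intermediate result is an ordinary Pandas object (a sketch, using the same temperature data):

```python
import pandas as pd

temp_df = pd.DataFrame([[7, 9, 12, 10], [1, 4, 5, 2], [-3, 1, -2, -3],
                        [-2, -1, -2, -2], [-3, -1, -2, -4]],
                       columns=['Toronto', 'Vancouver', 'Ottawa', 'Montreal'])
step1 = temp_df.diff()   # day-to-day change, per city
step2 = step1.abs()      # magnitude of that change
step3 = step2.max()      # largest daily change, in each city
print(step3.idxmax())    # the city with the largest single-day swing
```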
You used a list-of-lists approach above to create your DataFrame. You can also use dictionaries. Each key in the dictionary can contain a list of equal length:
data = {'Herring': [27, 13, 52, 54, 5, 19],
'Coffee': [90, 94, 96, 97, 30, 73],
'Tea': [88, 48, 98, 93, 99, 88]}
countries = ['Germany', 'Belgium', 'Netherlands', 'Sweden', 'Ireland', 'Switzerland']
food_consumed = pd.DataFrame(data, index=countries)
Try to interpret what the following lines return, based on the values you see when you run each command:
food_consumed.T
food_consumed.values
food_consumed.columns
food_consumed.index
food_consumed.head()
food_consumed.tail()
food_consumed.describe()
The first one is particularly helpful if you need to switch rows and columns around.

*Hint: whenever you create a data frame (by hand, or by loading a file, which we will see next), always use* .head() *and* .describe() *to check you have the data correctly loaded. It will save you from errors you would otherwise only discover later.*
We will show code for these commonly-used Pandas operations: shape of an array, unique entries, adding and merging columns, adding rows, deleting rows, and removing missing values.
import pandas as pd

data = {'Herring': [27, 13, 52, 54, 5, 19],
        'Coffee': [90, 94, 96, 97, 30, 73],
        'Tea': [88, 48, 98, 93, 99, 88]}
countries = ['Germany', 'Belgium', 'Netherlands', 'Sweden', 'Ireland', 'Switzerland']
food_consumed = pd.DataFrame(data, index=countries)
# There were 6 countries, and 3 food types. Verify:
food_consumed.shape
# Transposed and then shape:
food_consumed.T.shape
# Interesting: what shapes do summary vectors have?
food_consumed.mean().shape
# Access the column names directly.
# Does not work if there is a space in the name though :(
food_consumed.Tea.unique()
# So this is clearer, in my opinion. It is also more programmatic.
# In other words, you can replace 'Tea' with a string variable, and
# the code will still work.
food_consumed['Tea'].unique()
# Names (indexes) of the unique rows:
food_consumed.index.unique()
# In newer versions of Pandas, you can get counts (n) of
# the unique entries:
food_consumed.nunique() # in each column
food_consumed.nunique(axis=1) # in each row
# Works just like a dictionary!
# If the data are in the same row order
food_consumed['Yoghurt'] = [30, 20, 53, 2, 3, 48]
# Note the row order is different this time:
more_foods = pd.DataFrame(index=['Belgium', 'Germany', 'Ireland', 'Netherlands', 'Sweden', 'Switzerland'],
data={'Garlic': [29, 22, 5, 15, 9, 64]})
# Merge 'more_foods' into the 'food_consumed' data frame
food_consumed = food_consumed.join(more_foods)
# Collect the new data in a Series. Note that 'Tea' is (intentionally) missing!
portugal = pd.Series({'Coffee': 72, 'Herring': 20, 'Yoghurt': 6, 'Garlic': 89},
                     name='Portugal')
# Older Pandas used food_consumed.append(portugal); the .append method
# was removed in Pandas 2.0, so use pd.concat instead:
food_consumed = pd.concat([food_consumed, portugal.to_frame().T])
# See the missing value created?
# Drop a column; it returns the column's values to you
coffee_column = food_consumed.pop('Coffee')
print(coffee_column)
print(food_consumed)
# Leaves the original data untouched; returns only
# a copy, with those columns removed
food_consumed.drop(['Garlic', 'Yoghurt'], axis=1)
# Leaves the original data untouched; returns only
# a copy, with those rows removed.
non_EU_consumption = food_consumed.drop(['Switzerland', ], axis=0)
# Returns a COPY of the array, with no missing values:
cleaned_data = food_consumed.dropna()
# Makes the deletion in place, modifying food_consumed directly
food_consumed.dropna(inplace=True)
# Remove only rows where all values are missing:
food_consumed.dropna(how='all')
You can skip this theory section for now, and go straight to the application below. Read this later, if it is interesting.
CSV stands for Comma-Separated Values. It is a table of data, in a plain text file, where the numbers are separated by commas.
In the figure below, you see an example CSV file and what it might look like if you open one in a text editor. Download the file: https://openmv.net/file/batch-yield-and-purity.csv to follow along. Verify that it looks similar to what is shown.
How is a CSV file processed by Python?
Terminology: the comma symbol is used to separate one number from the next. In this situation, the comma is also known as a *delimiter*. You can imagine that you can devise a file format where you specify some other symbol as the delimiter, for example the 'tab' character, or the semi-colon, or even a space.
You will see the word *delimiter* used below in the code. Now you know what it means.
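For instance, reading a semicolon-delimited file uses the same pd.read_csv function, only with a different delimiter argument. A sketch, with a small in-memory "file" so the example is self-contained:

```python
import io
import pandas as pd

# A small semicolon-delimited 'file', held in memory for this example:
text = "x;y\n1;4\n2;5\n3;6\n"
df = pd.read_csv(io.StringIO(text), delimiter=';')
print(df)
```

For a tab-delimited file you would pass delimiter='\t' instead.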
There are *problems* and *advantages* to CSV files that are worth knowing. One useful convention: a missing value can be indicated by writing NaN between two commas. For example, 71,NaN,73 indicates a value is missing between the 71 and the 73.

The Pandas function pd.read_csv has a lot of flexibility and smart processing built in, to make reading CSV files with headers, missing values, and other settings easy. It is a Swiss-army-knife function: very versatile, but you need to know how to use it.
Download this CSV file: http://openmv.net/file/batch-yield-and-purity.csv and adjust the code below, where necessary:
import os
import pandas as pd
directory = r'C:\location\of\file'
filename = 'batch-yield-and-purity.csv'
full_filename = os.path.join(directory, filename)
yield_purity_pd = pd.read_csv(full_filename)
import pandas as pd
yield_purity_pd = pd.read_csv('http://openmv.net/file/batch-yield-and-purity.csv')
# If you are on a work computer behind a proxy server, you
# have to take a few more steps. Add these 6 lines of code.
import io
import requests
proxyDict = {"http" : "http://replace.with.proxy.address:port"}
url = "http://openmv.net/file/batch-yield-and-purity.csv"
s = requests.get(url, proxies=proxyDict).content
web_dataset = io.StringIO(s.decode('utf-8'))
# Convert the file fetched from the web to a Pandas dataframe
yield_purity_pd = pd.read_csv(web_dataset)
This is as simple as can be:
yield_purity_pd.to_csv('output_filename_here.csv')
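Two commonly useful options of to_csv are shown below (a sketch; the data and file names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'yield': [75.1, 80.2], 'purity': [92.0, 94.5]})
df.to_csv('example_output.csv', index=False)  # omit the row index column
df.to_csv('example_tabs.txt', sep='\t')       # tab-delimited instead of commas
```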
1. Read in the Batch yield and purity dataset above. The dataset is described here: http://openmv.net/info/batch-yield-and-purity . The moment you have created a data frame with Pandas, it is advisable to call df.head() on it; in this case: yield_purity_pd.head(). Does the data frame match the website's description?
2. What is the lowest yield recorded? Make sure your command returns only a floating-point value.
3. And the highest?
4. What is the average purity of the raw material?
5. Check the output of yield_purity_pd.describe(): can you interpret each row of output?
6. In the description given on the website, there is the idea that batch yield is affected by purity. For a cause-and-effect relationship to exist, there should be a correlation. Do you see that? Use yield_purity_pd.corr() to calculate the correlation between the two columns.
7. We use relative standard deviation (RSD) as a way to judge how noisy a variable is. For a single variable, this is defined as: $$\text{RSD} = \dfrac{\text{standard deviation}}{\text{average}}$$ Calculate the numerator (yield_purity_pd.std()) and the denominator separately. Do the values look reasonable? Now divide them. Does yield or purity have the higher RSD?
8. Create a new calculated column, hypothesis $= 4\times(\text{purity} - 50)$, and add that as a new column to the existing data frame.
9. Write the data frame, now with 3 columns, to a CSV file.
10. Open the CSV file in a text editor to ensure the data are properly stored, with the expected accuracy.
There is not too much to say here, other than to show the basic commands (where excel_filename is the path to your Excel file):
colour_data = pd.read_excel(excel_filename,
sheet_name='Colours',
skiprows=5,
index_col=0)
colour_data.head()
You can call the function with various inputs, depending on your situation. Read the full documentation for reading Excel files: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html
Similarly, for writing Excel files, it is often enough to just use:
df = pd.DataFrame(...)
df.to_excel("output.xlsx", sheet_name='Summary')
and it is worth checking the documentation for further function options: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html
Back in worksheet 6 you simulated the cooling process taking place in a fridge.
By the end of the exercise you had assembled 3 columns of equal length in 3 separate variables. Here is how your code might have looked:
import numpy as np

def simulate_cooling(time_final=30, initial_temp=25, delta_t=2.5):
    r"""Models Newton's law of cooling: \dfrac{dT}{dt} = -k (T-F)

    The fridge has a constant temperature, F=5 [°C].
    Heat transfer coefficient = k = 0.08 [1/minutes]

    Discretizing: T_{i+1} = T_i - k (\Delta t)(T_i - F)

    The temperature at time point i+1 (one step in the future, T_{i+1})
    is related to the temperature now, at time i, T_i.

    The total simulation time is `time_final` minutes, starting at
    t=0. The initial temperature of the object in the fridge is
    `initial_temp` °C, and the simulation time steps are `delta_t`
    minutes apart.
    """
    F_temperature = 5           # °C
    heat_transfer_coeff = 0.08  # 1/minutes

    # Create the two outputs of interest. Note that np.arange excludes
    # the stop value, so the last time point is time_final - delta_t.
    time = np.arange(start=0.0, stop=time_final, step=delta_t)
    temp = np.zeros(time.shape)
    temp[0] = initial_temp
    for idx in range(len(time) - 1):
        temp[idx + 1] = temp[idx] - heat_transfer_coeff * delta_t * (temp[idx] - F_temperature)

    # Exact (analytical) solution, for comparison:
    exact = F_temperature + (initial_temp - F_temperature) * np.exp(-heat_transfer_coeff * time)
    return (time, temp, exact)

time, temperature, true_value = simulate_cooling(time_final=30, initial_temp=25)
The moment you collect several output variables that logically belong together, you should combine them into a single variable: a Pandas data frame. Put each variable in a new column, so your function's output is simplified:
simulation = simulate_cooling(time_final=30, initial_temp=25)
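A minimal sketch of that idea, with made-up numbers rather than the cooling simulation itself:

```python
import numpy as np
import pandas as pd

time = np.arange(0.0, 10.0, 2.5)
temp = np.array([25.0, 21.0, 17.8, 15.2])
results = pd.DataFrame({'time': time, 'temperature': temp})
results = results.set_index('time')  # 'time' becomes the index, not a column
print(results.shape)                 # (4, 1): the index no longer counts as a column
```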
Modify the above code. Remove the last 2 lines inside the function, and replace them with code that:

- creates a data frame with one column for time and one for temperature
- adds a column exact, which contains the exact solution, but calculated from the values in the first two columns
- adds a column error, which contains the difference between the true value and the simulated value
- sets the first column, time, to become your index! That actually removes that column. If your data frame were called results inside the function, you do this by writing: results.set_index('time'). Check results.shape before and after doing this.

Load the CSV file from https://openmv.net/info/raw-material-height which is actual data from a process, where the height is a critical parameter to monitor, to ensure it does not go too low.
data['Level'].plot()
We will return to plotting in a later worksheet.
Back in worksheet 4 you used a data set from the Dutch meteorological service (KNMI): temperature readings from a location in the Netherlands, since 1901. Download the file, or use the direct web address: http://projects.knmi.nl/klimatologie/onderzoeksgegevens/homogeen_260/tg_hom_mnd260.txt
The first column is the station number, the next is the date, and the third column is the temperature, measured in units of °C.
Unfortunately, the KNMI service has used commas as the delimiter in line 27 for the column headings, and then spaces as the delimiter in the rest of the file. This makes it hard to find the right settings to import the file. Nevertheless, try it, using the documentation for pd.read_csv.
*Hint:*
knmi = pd.read_csv('URL here, or the file name on your computer', delimiter=..., skiprows=..., header=0)
knmi.head()
Warning: it can be a frustrating exercise dealing with other people's badly formatted data. *But that's reality.*
Wrap up this section by committing all your work. Have you used a good commit message? Push your work, to refer to later, but also as a backup.
*Feedback and comments about this worksheet?* Please provide any anonymous comments, feedback and tips.