All content here is under a Creative Commons Attribution CC-BY 4.0 and all source code is released under a BSD-2 clause license.
Please reuse, remix, revise, and reshare this content in any way, keeping this notice.
In the prior module 7 you had an introduction to the main Pandas objects: Series and DataFrame. You were also introduced to dictionaries. In this worksheet, we see a bit more of dictionaries, and get to apply Pandas to solving practical problems you have seen in prior modules.
Check out this repo using Git. Use your favourite Git user-interface, or at the command line:
git clone git@github.com:kgdunn/python-basic-notebooks.git
If you already have the repo cloned, update it to the latest version with:
git pull
You should have completed worksheet 7.
Dictionary objects
It was said earlier that a dictionary is a Python *object* which is a flexible data container for other objects. It contains objects using what are called *key* - *value* pairs. You create a dictionary like this:
random_objects = {'key1': 45,
                  2: 'Yes, keys can even be integers!',
                  3.0: 'Or floating point objects',
                  (4, 5): 'Or tuples!',
                  }
print(random_objects)
Once you have a dictionary, it is common to operate on the keys, or values, or both - in an iterative loop:
for key, value in random_objects.items():
    print('The key is "{}" and the value is: {}'.format(key, value))
    random_objects[key] = value * 2
If you need only the values, and not the keys:
for value in random_objects.values():
    # Do something here with "value"
    pass
or, if you need only the keys, and not the values:
for key in random_objects.keys():
    # Do something here with "key"
    pass
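The three loop styles above can be tried side by side. This is a minimal sketch; the dictionary `molar_masses` and the names `total` and `symbols` are made up for illustration, with values borrowed from the molar mass exercise later in this worksheet.

```python
# Illustrative data: element symbol -> molar mass in g/mol
molar_masses = {'C': 12.0107, 'O': 15.999, 'H': 1.00784}

# Keys and values together, using .items()
for symbol, mass in molar_masses.items():
    print('{} has a molar mass of {} g/mol'.format(symbol, mass))

# Values only, using .values(): add them up
total = 0.0
for mass in molar_masses.values():
    total += mass

# Keys only, using .keys(): collect them in a list
symbols = []
for symbol in molar_masses.keys():
    symbols.append(symbol)

print(total)
print(symbols)
```

Looping over the dictionary directly (`for symbol in molar_masses:`) also visits the keys, so `.keys()` is optional in the last loop.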
We already saw how to set a new key or overwrite an existing key:
random_objects['key1'] = 'will now be replaced'
random_objects['key2'] = 'is newly added'
You can get a value, from a given key, using the square bracket notation, and then immediately use it for further calculation or processing:
uppercase_value = random_objects['key2'].upper()
# but this will fail:
random_objects['key3']
with a KeyError, because you are trying to access a non-existent key. Here are two possible solutions for when you are not sure whether the key exists, but you need your code to continue running without failing:
# Option 1: try-except
try:
    value = random_objects['key3']
except KeyError:
    # Key not present: use a missing value as fallback
    value = float('nan')
# Now "value" is guaranteed to exist after these lines.

# Or, option 2, in a single line of code:
value = random_objects.get('key3', float('nan'))
You probably will prefer using the last version, since it is compact, and provides the same functionality as the first option.
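To see how `.get()` behaves in each case, here is a small sketch. The variable names `found`, `missing`, `no_default` and `key_was_added` are chosen only for this illustration.

```python
random_objects = {'key1': 45, 2: 'Yes, keys can even be integers!'}

# Existing key: the stored value is returned and the default is ignored
found = random_objects.get('key1', float('nan'))

# Missing key: the default is returned, and no KeyError is raised
missing = random_objects.get('key3', float('nan'))

# With no default given, .get() returns None for a missing key
no_default = random_objects.get('key3')

# .get() does not add the key; use "in" to test whether a key exists
key_was_added = 'key3' in random_objects

print(found, missing, no_default, key_was_added)
```

Note that `.get()` only reads from the dictionary: the fallback value is not stored under the missing key.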
Dictionaries were historically an *unordered* container; from Python 3.7 onwards they are guaranteed to keep the order in which you add the key-value pairs.
That means the above dictionary is created in the order shown in the code, and as you add new key-value pairs sequentially, they will retain that order. This means if you create an empty dictionary, and add pairs ...
testing_order = {}
testing_order['key1'] = 45
testing_order[2] = 'Yes, keys can even be integers!'
testing_order[3.0] = 'Or floating point objects'
testing_order.keys()
... that they will retain the order you added them. Because people do not always upgrade their Python version quickly, you should not count on this behaviour if your code might run on a version older than 3.7.
If you need to test the Python version in the code, use the sys.version_info attribute:
import sys
# Compare as a tuple, so that a version such as 4.0 is also handled correctly
if sys.version_info >= (3, 7):
    print('I can rely on ordered dictionaries!')
    testing_order = dict()
else:
    print('Use the OrderedDict class from "import collections".')
    from collections import OrderedDict
    testing_order = OrderedDict()

testing_order['key1'] = 45
testing_order[2] = 'Yes, keys can even be integers!'
testing_order[3.0] = 'Or floating point objects'

# Guaranteed to be in order, no matter which version of Python you use!
testing_order.keys()
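A short sketch of what "insertion order" means in practice, assuming Python 3.7 or later. The names `keys_in_order` and `keys_after` are illustrative only.

```python
testing_order = {}
testing_order['key1'] = 45
testing_order[2] = 'Yes, keys can even be integers!'
testing_order[3.0] = 'Or floating point objects'

# The keys come back in the order they were added
keys_in_order = list(testing_order.keys())

# Overwriting an existing key keeps its original position ...
testing_order['key1'] = 'will now be replaced'

# ... but deleting a key and adding it again moves it to the end
del testing_order[2]
testing_order[2] = 'added again'
keys_after = list(testing_order.keys())

print(keys_in_order)
print(keys_after)
```

So the order reflects when a key was first inserted, not when its value was last changed.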
Create a dictionary containing the molar mass of pure species. Let the key be the chemical element (as a string), and the value be a floating point molar mass:
C: carbon = 12.0107
O: oxygen = 15.999
N: nitrogen = 14.0067
H: hydrogen = 1.00784
S: sulfur = 32.065
P: phosphorus = 30.973762
Now write a function calculate_molar_mass which accepts 1 input, a chemical formula as a string, and returns the calculated molar mass.
Water, $\text{H}_2\text{O}$, has 2 hydrogens and 1 oxygen. It could be represented as H2O1, and therefore has the molar mass of $(2 \times 1.00784) + (1 \times 15.999) = 18.01468$.
Now try it yourself for an amino acid, Methionine, which is $\text{C}_5\text{H}_{11}\text{N}\text{O}_2 \text{S}$:
# make life easier: explicitly add the '1' for single atoms
methionine = 'C5H11N1O2S1'
met_mm = calculate_molar_mass(methionine)
The molar mass of Methionine is 149.21 g/mol. Try your function on some other amino acids, such as Lysine, $\text{C}_6\text{H}_{14}\text{N}_2\text{O}_2$, which has a molar mass of 146.190 g/mol.
Suggested solution approach:
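Work backwards: start with the dictionary written below (formula = {'C': 5, 'H': 11, 'N': 1, 'O': 2, 'S': 1}), and implement the last 2 steps listed here first. Then write the code to create that dictionary:
Step through the string until you reach a digit (use .isnumeric() on each character); the text before it is the element, e.g. C.
Continue until you reach the next letter (use .isalpha() on each character); the text before it is the count, e.g. 5.
Store the element C as the *key*, and the 5 numeric part as a *value*: formula = {'C': 5, 'H': 11, 'N': 1, 'O': 2, 'S': 1}
For each key in formula, multiply its value by the molar mass of that element.
Add up these products to get the molar mass of the molecule.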
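One possible way to put the suggested approach together is sketched below. It is not the only solution. It assumes single-letter element symbols that are each followed by an explicit count, as in 'C5H11N1O2S1'. The helper name `parse_formula` and the constant `MOLAR_MASSES` are introduced here for illustration.

```python
# Molar masses in g/mol, taken from the list in this worksheet
MOLAR_MASSES = {'C': 12.0107, 'O': 15.999, 'N': 14.0067,
                'H': 1.00784, 'S': 32.065, 'P': 30.973762}


def parse_formula(formula_string):
    """Convert e.g. 'H2O1' to {'H': 2, 'O': 1}.

    Assumes single-letter symbols, each followed by a count.
    """
    formula = {}
    symbol = ''
    digits = ''
    for character in formula_string:
        if character.isalpha():
            # A letter starts a new element: store the previous one first
            if symbol:
                formula[symbol] = int(digits)
            symbol = character
            digits = ''
        elif character.isnumeric():
            # Counts can have several digits, e.g. the 11 in H11
            digits += character
    # Store the final element once the string is exhausted
    if symbol:
        formula[symbol] = int(digits)
    return formula


def calculate_molar_mass(formula_string):
    """Return the molar mass, in g/mol, of the given formula string."""
    total = 0.0
    for symbol, count in parse_formula(formula_string).items():
        total += count * MOLAR_MASSES[symbol]
    return total


print(calculate_molar_mass('H2O1'))
print(calculate_molar_mass('C5H11N1O2S1'))
```

If a count is left out, `int('')` raises a ValueError in this sketch, so the explicit '1' for single atoms is required here.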
Challenge yourself even more: adjust the code so that it can work with natural formulas, where the '1' parts are not given. E.g. your function should be able to handle methionine = 'C5H11NO2S' instead of 'C5H11N1O2S1'.
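For the challenge, one option is to switch from testing each character to a regular expression. This is a different technique from the .isalpha() and .isnumeric() approach suggested above, and is only a sketch: it still assumes single-letter, uppercase element symbols, and treats a missing count as 1.

```python
import re

# Molar masses in g/mol, taken from the list in this worksheet
MOLAR_MASSES = {'C': 12.0107, 'O': 15.999, 'N': 14.0067,
                'H': 1.00784, 'S': 32.065, 'P': 30.973762}


def calculate_molar_mass(formula_string):
    """Return the molar mass, in g/mol; counts of 1 may be omitted."""
    total = 0.0
    # Each match is a pair: (one uppercase letter, zero or more digits)
    for symbol, digits in re.findall(r'([A-Z])(\d*)', formula_string):
        count = int(digits) if digits else 1
        total += count * MOLAR_MASSES[symbol]
    return total


print(calculate_molar_mass('C5H11NO2S'))
print(calculate_molar_mass('C5H11N1O2S1'))
```

Both calls should give the same result, since the only difference between the two strings is whether the '1' is written out.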
A common problem in automated data analysis is reading data from many files in a directory, or sub-directories. Try this:
import os
import fnmatch
import pandas as pd

pattern = '*.xlsx'

# Dataframe for the result:
result = pd.DataFrame(___)
for root, dirs, files in os.walk(r'C:\location\to\your\files'):
    for name in fnmatch.filter(files, pattern):
        full_filename = os.path.join(root, name)

        # Use Pandas to read the Excel file
        excel_values = pd.____

        # Add the result as a new row or column
        # in your Pandas DataFrame, result:
        result.____

# Finally, write the dataframe to CSV or Excel
result.to_excel("output.xlsx", sheet_name='All file results')
You can also use a dictionary instead of a Pandas DataFrame. The keys of the dictionary could be full_filename, while the values of each key could be a list of the number(s) you extracted from the Excel file.
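The dictionary variant might look like the sketch below. So that it runs anywhere, it builds a small temporary directory with made-up file names, and stores the file size in place of numbers read from a real Excel file. The name `results` and the file names are assumptions for illustration only.

```python
import fnmatch
import os
import tempfile

pattern = '*.xlsx'
results = {}

# Build a small throwaway directory tree; the file names are made up
with tempfile.TemporaryDirectory() as top:
    os.makedirs(os.path.join(top, 'sub'))
    for relative in ['a.xlsx', 'notes.txt', os.path.join('sub', 'b.xlsx')]:
        with open(os.path.join(top, relative), 'w') as handle:
            handle.write('placeholder')

    # Walk the directory and all sub-directories
    for root, dirs, files in os.walk(top):
        for name in fnmatch.filter(files, pattern):
            full_filename = os.path.join(root, name)
            # Stand-in for the number(s) you would extract from the file:
            # here the file size, in bytes, is stored in a list
            results[full_filename] = [os.path.getsize(full_filename)]

matched = sorted(os.path.basename(name) for name in results)
print(matched)
```

Only the two files matching the pattern end up as keys; 'notes.txt' is skipped by fnmatch.filter.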
Back in module 3 you had a challenge problem of calculating the moving average from a long vector of data.
You downloaded and used the Ammonia series of data: http://openmv.net/info/ammonia and calculated the moving average over $n=5$ values; called a window of 5 values.
If you look back at your original code, it was probably many lines. Now you can make it even shorter: reduce it down to 3 lines!
import pandas as pd
# Read the ammonia.csv file as a Pandas data frame:
ammonia = pd.read_csv(___)
# Calculate the moving average:
ammonia.___
The last line is obviously the key to solving this. Look at the documentation for df.rolling: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rolling.html
Compare the solution in module 3 with the solution from Pandas.
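As a reference point for that comparison, here is a minimal pure-Python sketch of what a moving average over a window of $n=5$ values computes. It is not necessarily the code you wrote in module 3; the function name `moving_average` and the short list of numbers are made up for illustration and are not the ammonia data.

```python
def moving_average(values, n=5):
    """Return the mean of every full window of n consecutive values."""
    averages = []
    for start in range(len(values) - n + 1):
        window = values[start:start + n]
        averages.append(sum(window) / n)
    return averages


# Made-up numbers, not the ammonia data
data = [1, 2, 3, 4, 5, 6, 7]
result = moving_average(data, n=5)
print(result)
```

This version returns only the full windows, so its output is shorter than the input. With its default settings, the Pandas rolling calculation keeps the original length and fills the first $n-1$ positions with missing values, which is worth keeping in mind when you compare the two.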
*Feedback and comments about this worksheet?* Please provide any anonymous comments, feedback and tips.