Notebook

- 0.1 Preparing for this module
1 More about Dictionary objects
2 ➜ Challenge yourself: working with dictionaries
3 ➜ Challenge yourself: reading data from many files
4 ➜ Challenge yourself: moving average
5 Further tips

All content here is under a Creative Commons Attribution (CC-BY 4.0) license, and all source code is released under a BSD 2-clause license.

Please reuse, remix, revise, and reshare this content in any way, keeping this notice.

Module 8: Overview¶

In the prior module 7 you were introduced to the main Pandas objects, Series and DataFrame, and to dictionaries. In this worksheet we look a little further at dictionaries, and apply Pandas to practical problems you have seen in prior modules.

Check out this repo using Git. Use your favourite Git user-interface, or at the command line:

git clone git@github.com:kgdunn/python-basic-notebooks.git

# If you already have the repo cloned:
git pull

to update it to the latest version.

Preparing for this module¶

You should have completed worksheet 7.

More about `Dictionary` objects¶

As mentioned earlier, a dictionary is a Python *object* that acts as a flexible data container for other objects. It stores those objects as *key* - *value* pairs. You create a dictionary like this:

random_objects = {'key1': 45,
                  2:      'Yes, keys can even be integers!',
                  3.0:    'Or floating point objects',
                  (4,5):  'Or tuples!',
                 }
print(random_objects)

Iterating over the keys-values of a dictionary¶

Once you have a dictionary, it is common to operate on the keys, or values, or both - in an iterative loop:

for key, value in random_objects.items():
    print('The key is "{}" and the value is: {}'.format(key, value))
    random_objects[key] = value * 2

If you need only the values, and not the keys:

for value in random_objects.values():
    # Do something here with "value", for example:
    print(value)

or, if you need only the keys, and not the values:

for key in random_objects.keys():
    # Do something here with "key", for example:
    print(key)
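
The first loop above only overwrites values of keys that already exist, which Python allows during iteration. Adding or deleting keys inside such a loop changes the size of the dictionary, and Python raises a RuntimeError. Here is a minimal sketch of the usual workaround, looping over a copy of the keys. The dictionary `stock` and its contents are made up for illustration:

```python
# Hypothetical example data, not part of the worksheet.
stock = {'apples': 3, 'pears': 0, 'plums': 7}

# Changing the size of a dictionary while looping over it fails.
# Work on a copy here, so that "stock" itself stays intact.
trial = dict(stock)
try:
    for name in trial:
        if trial[name] == 0:
            del trial[name]
except RuntimeError as error:
    print('Not allowed:', error)

# Looping over a list copy of the keys is safe:
for name in list(stock.keys()):
    if stock[name] == 0:
        del stock[name]

print(stock)
```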

Setting and getting key-values¶

We already saw how to set a new key or overwrite an existing key:

random_objects['key1'] = 'will now be replaced'
random_objects['key2'] = 'is newly added'

You can get a value, from a given key, using the square bracket notation, and then immediately use it for further calculation or processing:

uppercase_value = random_objects['key2'].upper()

# but this will fail:
random_objects['key3']

with a KeyError, because you are trying to access a non-existent key. If you are not sure whether a key exists, but you need your code to continue running without failing, here are two possible solutions:

# Option 1: try-except
try:
    value = random_objects['key3']
except KeyError:
    # Key not present: use a missing value as fallback 
    value = float('nan')

# Now "value" is guaranteed to exist after the try-except block.
# Or, option 2, in a single line of code:
value = random_objects.get('key3', float('nan'))

You will probably prefer the second option, since it is compact and provides the same functionality as the first.
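
A short sketch of how `.get` behaves in practice, using a small made-up dictionary called `settings`: the fallback is returned but not stored in the dictionary, and the `in` operator offers a third way to test for a key before using it.

```python
import math

# Hypothetical example data, not part of the worksheet.
settings = {'units': 'kg', 'decimals': 2}

# The fallback value is returned, but the key is NOT added:
value = settings.get('colour', float('nan'))
print(math.isnan(value))       # True
print('colour' in settings)    # False

# Without a second argument, .get returns None for a missing key:
print(settings.get('colour'))  # None

# The "in" operator tests for a key without raising a KeyError:
if 'units' in settings:
    print(settings['units'].upper())  # KG
```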

Ordered vs Unordered dictionaries (advanced)¶

Historically, dictionaries were an *unordered* container. From Python 3.7 onwards they are guaranteed to keep key-values in the order in which you add them.

That means the dictionary above keeps its pairs in the order written in the code, and any new key-values you add later will follow in sequence. In older versions of Python that order was not guaranteed. This means that if you create an empty dictionary, and add pairs ...

testing_order = {}
testing_order['key1'] = 45
testing_order[2] = 'Yes, keys can even be integers!'
testing_order[3.0] = 'Or floating point objects'
testing_order.keys()

... they will retain the order in which you added them. If your code might run on a version of Python older than 3.7, you should not count on this behaviour being available.

If you need to test the Python version in the code, use the sys.version_info attribute:

import sys

if sys.version_info >= (3, 7):
    print('I can rely on ordered dictionaries!')
    testing_order = dict()
else:
    print('Use the OrderedDict class from "import collections".')
    from collections import OrderedDict
    testing_order = OrderedDict()

testing_order['key1'] = 45
testing_order[2] = 'Yes, keys can even be integers!'
testing_order[3.0] = 'Or floating point objects'

# Guaranteed to be in order, no matter which version of Python you use!
testing_order.keys()

➜ Challenge yourself: working with dictionaries¶

Create a dictionary containing the molar mass of pure species. Let the key be the chemical element (as a string), and the value be a floating point molar mass:

C: carbon = 12.0107
O: oxygen = 15.999
N: nitrogen = 14.0067
H: hydrogen = 1.00784
S: sulfur = 32.065
P: phosphorus = 30.973762

Now write a function calculate_molar_mass which accepts 1 input, a chemical formula as a string, and returns the calculated molar mass.

Water, $\text{H}_2\text{O}$, has 2 hydrogens and 1 oxygen. It could be represented as H2O1, and therefore has a molar mass of $(2 \times 1.00784) + (1 \times 15.999) = 18.01468$ g/mol.
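
The water calculation can be sketched directly with two dictionaries. This covers only the summing step, with the element counts typed in by hand; the names `molar_masses`, `water` and `total` are chosen here for illustration.

```python
# Molar masses (g/mol), taken from the list above.
molar_masses = {'C': 12.0107, 'O': 15.999, 'N': 14.0067,
                'H': 1.00784, 'S': 32.065, 'P': 30.973762}

# Water written as element counts: H2O1
water = {'H': 2, 'O': 1}

# Multiply each count by its molar mass and add everything up.
total = 0.0
for element, count in water.items():
    total += count * molar_masses[element]

# Compare with the hand calculation of 18.01468 above.
print(round(total, 5))
```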

Now try it yourself for an amino acid, Methionine, which is $\text{C}_5\text{H}_{11}\text{N}\text{O}_2 \text{S}$:

# make life easier: explicitly add the '1' for single atoms
methionine = 'C5H11N1O2S1'
met_mm = calculate_molar_mass(methionine)

The molar mass of Methionine is 149.21 g/mol. Try your function on some other amino acids, such as Lysine, $\text{C}_6\text{H}_{14}\text{N}_2\text{O}_2$, which has a molar mass of 146.190 g/mol.

Suggested solution approach:

Work backwards: start with the dictionary written below (formula = {'C': 5, 'H': 11, 'N': 1, 'O': 2, 'S': 1}), and implement the last 2 bullet points here. Then write the code to create that dictionary:

* The input string will always start with an alphabetical letter, not a number.
* Start by iterating over every character in the string, until you encounter a number (use .isnumeric() on each character).
* Keep the preceding character(s): in this example, it will be C.
* Keep iterating until the numeric value switches back to an alphabetic one (use .isalpha() on each character).
* Then you have the value(s). In this example, 5.
* Store that letter C in a dictionary as the *key*, and the numeric part 5 as the *value*.
* Keep going, until you have built up a dictionary that should appear as:

formula = {'C': 5, 'H': 11, 'N': 1, 'O': 2, 'S': 1}

* Now iterate over the dictionary, looking up the molar mass in a second dictionary, and add up the molecular weight.
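
If you get stuck on building the dictionary, here is one possible sketch of the character-by-character steps, to compare against your own attempt. It assumes every element symbol is followed by an explicit count, and that no element appears twice in the formula (a repeat would overwrite the earlier count). The function name `parse_formula` is invented for this sketch.

```python
def parse_formula(text):
    """Split a formula such as 'C5H11N1O2S1' into {'C': 5, 'H': 11, ...}.

    Assumes every element symbol is followed by an explicit count.
    """
    formula = {}
    symbol = ''   # letters collected for the current element
    digits = ''   # digits collected for the current count
    for character in text:
        if character.isalpha():
            if digits:
                # A letter after digits: the previous pair is complete.
                formula[symbol] = int(digits)
                symbol, digits = '', ''
            symbol += character
        elif character.isnumeric():
            digits += character
    if symbol and digits:
        # Store the final pair once the string is exhausted.
        formula[symbol] = int(digits)
    return formula


print(parse_formula('C5H11N1O2S1'))
```

Collecting the digits as a string and converting once with int() is what lets a two-digit count such as 11 come through as a single number.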

Challenge yourself even more: adjust the code so that it can work with natural formulas, where the '1' parts are not given. E.g. your function should be able to handle methionine = 'C5H11NO2S' instead of 'C5H11N1O2S1'.

➜ Challenge yourself: reading data from many files¶

A common problem in automated data analysis is reading data from many files in a directory, or sub-directories. Try this:

* Create about 4 to 8 Excel files for yourself in the same directory.
* Put different values in the cells, but always use the same cell location in the files.
* Save each of the files in the directory.
* Create two or three sub-directories, and spread the files into some of those.
* Now read the files, modifying the template code below:

import os
import fnmatch
import pandas as pd

pattern = '*.xlsx'

# DataFrame for the result:
result = pd.DataFrame(___)
for root, dirs, files in os.walk(r'C:\location\to\your\files'):
    for name in fnmatch.filter(files, pattern):
        full_filename = os.path.join(root, name)

        # Use Pandas to read the Excel file
        excel_values = pd.____

        # Add the result as a new row or column
        # in your Pandas DataFrame, result:
        result.____

# Finally, write the DataFrame to CSV or Excel
result.to_excel("output.xlsx", sheet_name='All file results')

You can also use a dictionary instead of a Pandas DataFrame. The keys of the dictionary could be full_filename, while the values of each key could be a list of the number(s) you extracted from the Excel file.
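
A minimal sketch of the dictionary variant, focusing only on the directory walk. So that it runs anywhere, it builds a throwaway directory tree with empty placeholder files (not real Excel workbooks) and stores an empty list per file, where your code would store the numbers read from Excel. The file names and the variable `results` are made up for illustration.

```python
import fnmatch
import os
import tempfile

pattern = '*.xlsx'
results = {}

# Build a throwaway directory tree so the sketch runs anywhere.
# These are empty placeholder files, not real Excel workbooks.
with tempfile.TemporaryDirectory() as top:
    os.mkdir(os.path.join(top, 'batch1'))
    for parts in [('a.xlsx',), ('notes.txt',), ('batch1', 'b.xlsx')]:
        open(os.path.join(top, *parts), 'w').close()

    # Walk the top directory and all of its sub-directories.
    for root, dirs, files in os.walk(top):
        for name in fnmatch.filter(files, pattern):
            full_filename = os.path.join(root, name)
            # Placeholder: in the challenge, this list would hold the
            # number(s) read from the Excel file.
            results[full_filename] = []

for full_filename in sorted(results):
    print(full_filename)
```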

➜ Challenge yourself: moving average¶

Back in module 3 you had a challenge problem of calculating the moving average from a long vector of data.

You downloaded and used the Ammonia series of data: http://openmv.net/info/ammonia and calculated the moving average over a window of $n=5$ values:

* Accumulate the first 5 entries in the window and calculate the average.
* Then throw away the first entry, add the 6th entry to update your window.
* Calculate the average based on the 2nd to the 6th values.
* Keep going until you run out of values.
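
For comparison with the Pandas version, the steps above can be sketched in plain Python. This is one possible approach, not necessarily what you wrote in module 3: it assumes a collections.deque with maxlen, which discards the oldest entry automatically. The function name `moving_average` and the short list of numbers are invented for illustration.

```python
from collections import deque


def moving_average(values, n=5):
    """Return the average of each window of n consecutive values."""
    window = deque(maxlen=n)   # the oldest entry drops out automatically
    averages = []
    for value in values:
        window.append(value)
        if len(window) == n:
            # The window is full: record its average.
            averages.append(sum(window) / n)
    return averages


# Made-up data, not the ammonia series.
print(moving_average([1, 2, 3, 4, 5, 6, 7], n=5))  # [3.0, 4.0, 5.0]
```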

If you look back at your original code, it was probably many lines. Now you can make it even shorter: reduce it down to 3 lines!

import pandas as pd

# Read the ammonia.csv file as a Pandas data frame:
ammonia = pd.read_csv(___)

# Calculate the moving average:
ammonia.___

The last line is obviously the key to solving this. Look at the documentation for df.rolling: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rolling.html

Compare the solution in module 3 with the solution from Pandas.

Further tips¶

Read about different user interfaces for writing and editing your Python code: https://www.datacamp.com/community/tutorials/data-science-python-ide

You have already seen and used Spyder and PyCharm, which are the top two listed. But have you tried Atom, or Jupyter Notebooks?

Get even more comfortable with Pandas DataFrames: https://www.datacamp.com/courses/manipulating-dataframes-with-pandas

Follow the first chapter of that online course for free.
See how to slice, filter and transform your dataframes.

*Feedback and comments about this worksheet?* Please provide any anonymous comments, feedback and tips.
