All content here is licensed under a Creative Commons Attribution (CC-BY 4.0) license, and all source code is released under a BSD 2-clause license. Please reuse, remix, revise, and reshare this content in any way, keeping this notice.
In module 5 and module 6 you used NumPy to create arrays, and perform mathematical calculations on them. Even though module 6 was about Python functions in general, the applications were all with NumPy.
Now we take a look at Pandas, currently the most widely used Python library for data manipulation. Along the way we will also learn about Jupyter notebooks, and Python's other important built-in data type, called *dictionaries*.
Once again, don't forget to use your version control system. Commit your work regularly, wherever you see this icon, and even more frequently than that. We will show this icon fewer times from now on, as making regular commits (*several per hour!*) should by now be almost automatic.
Dictionary data type

A dictionary is a Python *object* that is a flexible data container for other objects. It contains these objects using what are called *key*-*value* pairs. You create a dictionary like this:
random_objects = {'an int': 45,
'a float': 12.34,
'short_list': [1, 4, 7],
'longer list': [2, 4, 6, 9, 12, 16, 20, 25, 30, 36, 42],
'website': "https://learnche.org",
'nested_tuple': (1, 2.0, [3, 4], '5', (6, 7.0))
}
print(random_objects)
Notice that when you print the random_objects variable, you may not see the dictionary written back exactly as shown above. In Python versions before 3.7, dictionaries were an *unordered* container; since Python 3.7 they preserve the order in which the keys were inserted.
The *keys* and *values* of this dictionary are:
# These return list-like view objects of the keys and values:
random_objects.keys()
random_objects.values()
# What is the "type" of this dictionary?
print('The object is of: {}'.format(type(random_objects)))
# You can access individual elements:
random_objects['short_list']
In the above example, the keys were all *string* objects, but that is not required. You can use integers, floating-point values, strings, tuples, or a mixture of them. Any *hashable* object can be a key, but these cover most cases.
Dictionary values may be any *objects*, even other dictionaries. Yes, a dictionary within a dictionary is possible, and we will use this below. That is partly why we waited until now to introduce dictionaries.
Dictionary objects are excellent *containers*. If you need to return several objects from a function, collect them in a dictionary, and return that single object. It is not required, but it can make your code neater and more logical.
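A quick sketch of that idea (the function and the statistics it returns are made up purely for illustration):

```python
import numpy as np

def summarize(values):
    """Return several related results in one dictionary."""
    arr = np.asarray(values, dtype=float)
    return {'mean': arr.mean(),
            'std': arr.std(ddof=1),
            'n': arr.size}

stats = summarize([2.0, 4.0, 6.0])
print(stats['mean'])  # 4.0
```

The caller now receives one object, and each result is accessed by a descriptive name rather than by its position in a tuple.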
Create a dictionary for yourself with 5 key-value pairs, which summarizes a regression model. The key is the first item below, followed by a description of what you should create as the value:

- intercept: containing a floating-point value which is the intercept of your linear model
- slope: a floating-point slope value
- R2: the $R^2$ value of the regression model
- standard_error: a value which is the model's standard error
- residuals: a NumPy vector of residuals

You can create the above dictionary in a single line of code. But what if you want to add something new to an existing dictionary later?
d = { ... } # create your dictionary
d['new key'] = 'additional value'
And you can overwrite an existing key-value pair in the same way:
random_objects['an int'] = 'replaced with a string!'
This implies you can never have two identical keys. If you try to create a second key which already exists, it will overwrite the value associated with the existing key.
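A minimal sketch of this overwriting behaviour, including what happens if you repeat a key inside a dictionary literal:

```python
d = {'a': 1}
d['a'] = 2          # overwrites the value stored under key 'a'
print(d)            # {'a': 2}

# A duplicate key in a literal silently keeps only the last value:
print({'x': 10, 'x': 99})   # {'x': 99}
```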
At a later time, check out your code for the linear regression function you wrote earlier. Modify the code to return a single dictionary, instead of 4 items in a tuple.
Why use Pandas if you already can use NumPy? Pandas gives you labelled rows and columns through its central object, the DataFrame. Think of a spreadsheet. You can load the Pandas library, similar to how you load the NumPy library, with this command:
import pandas as pd
pd.__version__ # ensure you have a version >= 0.20
Before we start with DataFrames, there is a simpler object in Pandas, called a Series: roughly the equivalent of a vector in NumPy. Let's see some characteristics of a Series:
# Create a Series from a list. Put your own numbers here:
s = pd.Series([ ... ])
print(s)
Notice the index (the column to the left of your numbers)? Let's look at another example:
>>> s = pd.Series([ 5, 9, 1, -4, float('nan'), 5 ])
>>> print(s)
0 5.0
1 9.0
2 1.0
3 -4.0
4 NaN
5 5.0
dtype: float64
If you do not provide any labels for the rows, these will be automatically generated for you, starting from 0.
What if you have your own labels already?
# You call the function with two inputs. One input is
# mandatory (the first one), the other is optional.
s = pd.Series(data = [5, 9, 1, -4, float('nan'), 5 ],
index = ['a', 'b', 'c', 'd', 'e', 'f'])
s.values
type(s.values)
Ah ha! See what you get there in the output from s.values? Pandas is built on top of the NumPy library. The underlying data are still stored as NumPy arrays, and you can access them with the .values attribute. This is partly why understanding NumPy first is helpful before using Pandas.
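A small check of this (the .to_numpy() method, available since Pandas 0.24, is the more modern spelling of the same idea):

```python
import numpy as np
import pandas as pd

s = pd.Series([5, 9, 1], index=['a', 'b', 'c'])
print(type(s.values))      # <class 'numpy.ndarray'>
print(type(s.to_numpy()))  # also a NumPy array
print(s.values + 10)       # NumPy arithmetic works on it directly
```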
Lastly, give your series a nice name:
s.name = 'Random values'
print(s)
Series

The series you created above can be used in calculations. Notice how missing data are handled seamlessly.
import pandas as pd
s = pd.Series(data = [5, 9, 1, -4, float('nan'), 5 ],
index = ['a', 'b', 'c', 'd', 'e', 'f'],
name = 'Calculations')
s * 5 + 2
import numpy as np
np.sqrt(s)
The last line shows that Pandas and NumPy are compatible with each other. You can call NumPy operations on a Pandas object, and the result is returned as a Pandas object to you, with the row labels (indexes) intact.
Also notice that taking the square root of a negative number is not defined for real values, so the square root of $-4$ in row d returns a NaN.
Logical operations are possible too:
s > 4
s.isna()
np.sqrt(s).isna()
s.notna()
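These logical results are themselves boolean Series, so you can chain them with other methods; for example, to count or filter out the missing entries (a short sketch):

```python
import pandas as pd

s = pd.Series([5, 9, 1, -4, float('nan'), 5])
print(s.isna().sum())    # number of missing entries: 1
print((s > 4).sum())     # how many entries exceed 4? (3)
print(s[s.notna()])      # keep only the non-missing entries
```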
Like in NumPy, you can access the data entries using the square bracket notation. (Recent versions of Pandas prefer the explicit accessors: s.iloc[2] for access by position, and s.loc['e'] for access by label, but plain brackets also work.) In Pandas:
s[2]
s['e']
Selected subsets from the series can be accessed too, again using square brackets:
s[[2, 4, 0]]
s[['f', 'd', 'b']]
# Selection based on logic: I want only values greater than 4
s[s > 4]
You can also access a *range* of entries:
s[0:2]
s['a':'c']
Take a careful look at that output. You might have expected the two results to be the same length, but they are not! When accessing with the index names, you get the range inclusive of the last entry. When accessing by index number, it behaves consistently with Python and NumPy: the end point is excluded.
That makes sense. Names of the rows, the index, do not necessarily have to be sequential, like ['a', 'b', ... 'f'] as in this example. Often the index is unordered. E.g. if you have a series related to different Canadian cities, ['Toronto', 'Vancouver', 'Ottawa', 'Montréal', 'Halifax'], then with ['Vancouver':'Montréal'] you expect to see the middle 3 entries, inclusive of Montréal.
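The explicit .iloc (by position, end-exclusive) and .loc (by label, end-inclusive) accessors make this difference visible. A sketch, using the city example with made-up values:

```python
import pandas as pd

cities = pd.Series([10, 20, 30, 40, 50],
                   index=['Toronto', 'Vancouver', 'Ottawa', 'Montréal', 'Halifax'])
print(cities.iloc[0:2])                    # 2 entries: positions 0 and 1
print(cities.loc['Vancouver':'Montréal'])  # 3 entries, inclusive of Montréal
```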
Now we can combine two new concepts you have just learned: Dictionaries and Pandas.
raw_data = {'Germany': 27, 'Belgium': 13, 'Netherlands': 52, 'Sweden': 54, 'Ireland': 5}
tons_herring_eaten = pd.Series(raw_data)
print(tons_herring_eaten)
The row names (index) are taken from the dictionary keys associated with each value. (In Python versions before 3.7, dictionaries did not preserve order, so the rows would not necessarily appear in the order written above; in modern Python the insertion order is kept.)
Now try the following:

- Find the maximum value with the tons_herring_eaten.max() command!
- Sort the values with tons_herring_eaten.sort_values(). Print the variable afterwards.
- In older versions of Pandas you would have used tons_herring_eaten.sort() instead.
- Also try sorting by the row names: tons_herring_eaten.sort_index()
DataFrame

Back in module 5 you created a NumPy matrix of 5 daily temperatures (rows) for 4 cities (columns).
import numpy as np

temp_np = np.array([[7, 9, 12, 10],
                    [1, 4, 5, 2],
                    [-3, 1, -2, -3],
                    [-2, -1, -2, -2],
                    [-3, -1, -2, -4]])
print(('The temperatures are given one column per '
       'city, each row is a daily average '
       'temperature:\n{}').format(temp_np))
max_value_0 = np.amax(temp_np, axis=0)
Now let's try the same in Pandas, creating a DataFrame
from a list-of-lists:
import pandas as pd
rawdata = [[7, 9, 12, 10],
[1, 4, 5, 2],
[-3, 1, -2, -3],
[-2, -1, -2, -2],
[-3, -1, -2, -4]]
temp_df = pd.DataFrame(data=rawdata, columns = ['Toronto', 'Vancouver', 'Ottawa', 'Montreal'])
We saw then that you could calculate things on each *axis* of the NumPy array. You can also do this in Pandas:
# NumPy and Pandas do different things here!
temp_np.max()
temp_df.max()
# Here they are consistent
temp_np.max(axis=0)
temp_df.max(axis=0)
# Calculate across the rows, over all cities
temp_np.max(axis=1)
temp_df.max(axis=1)
In general, the same things you can calculate in NumPy, you can repeat in Pandas:
| NumPy | Pandas | Description |
|---|---|---|
| np.sum | df.sum | Sum of entries |
| np.prod | df.prod | Product (multiplication) of all entries |
| np.mean | df.mean | Arithmetic average |
| np.median | df.median | Median value |
| np.std | df.std | Standard deviation |
| np.var | df.var | Variance |
| np.min | df.min | Minimum value |
| np.max | df.max | Maximum value |
| np.argmin | df.idxmin | Index of the minimum |
| np.argmax | df.idxmax | Index of the maximum |
| np.diff | df.diff | Difference between entries |
temp_df['Toronto'] * 4 - temp_df['Montreal']
The above does exactly what you think it should.
>>> temp_df.diff().abs().max()
# and this?
>>> temp_df.diff().abs().max().idxmax()
You can stack up your sequential operations quite compactly in Pandas. It works because the output from one function is the input for the next one to the right. Refer back to the section on functions, if necessary.
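The chained expression above can be unpacked step by step; each intermediate result is an ordinary Pandas object (a sketch, using the same temperature data):

```python
import pandas as pd

temp_df = pd.DataFrame([[7, 9, 12, 10], [1, 4, 5, 2], [-3, 1, -2, -3],
                        [-2, -1, -2, -2], [-3, -1, -2, -4]],
                       columns=['Toronto', 'Vancouver', 'Ottawa', 'Montreal'])
step1 = temp_df.diff()   # day-to-day change, per city
step2 = step1.abs()      # magnitude of that change
step3 = step2.max()      # largest daily change, in each city
print(step3.idxmax())    # the city with the largest single-day swing
```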
You used a list-of-lists approach above to create your DataFrame. You can also use dictionaries. Each key in the dictionary can contain a list of equal length:
data = {'Herring': [27, 13, 52, 54, 5, 19],
'Coffee': [90, 94, 96, 97, 30, 73],
'Tea': [88, 48, 98, 93, 99, 88]}
countries = ['Germany', 'Belgium', 'Netherlands', 'Sweden', 'Ireland', 'Switzerland']
food_consumed = pd.DataFrame(data, index=countries)
Try to interpret what the following lines return, based on the values you see when you run each command:
food_consumed.T
food_consumed.values
food_consumed.columns
food_consumed.index
food_consumed.head()
food_consumed.tail()
food_consumed.describe()
The first one is particularly helpful if you need to switch rows and columns around.

*Hint: whenever you create a data frame (by hand, or by loading a file, which we will see next), always use* .head() *and* .describe() *to check you have the data correctly loaded. It will save you from errors you would otherwise only discover later.*
We will show code for these commonly-used Pandas operations: shape of an array, unique entries, adding and merging columns, adding rows, deleting rows, and removing missing values.
import pandas as pd

data = {'Herring': [27, 13, 52, 54, 5, 19],
        'Coffee': [90, 94, 96, 97, 30, 73],
        'Tea': [88, 48, 98, 93, 99, 88]}
countries = ['Germany', 'Belgium', 'Netherlands', 'Sweden', 'Ireland', 'Switzerland']
food_consumed = pd.DataFrame(data, index=countries)
# There were 6 countries, and 3 food types. Verify:
food_consumed.shape
# Transposed and then shape:
food_consumed.T.shape
# Interesting: what shapes do summary vectors have?
food_consumed.mean().shape
# Access the column names directly.
# Does not work if there is a space in the name though :(
food_consumed.Tea.unique()
# So this is clearer, in my opinion. It is also more programmatic.
# In other words, you can replace 'Tea' with a string variable, and
# the code will still work.
food_consumed['Tea'].unique()
# Names (indexes) of the unique rows:
food_consumed.index.unique()
# In newer versions of Pandas, you can get counts (n) of
# the unique entries:
food_consumed.nunique() # in each column
food_consumed.nunique(axis=1) # in each row
# Works just like a dictionary!
# If the data are in the same row order
food_consumed['Yoghurt'] = [30, 20, 53, 2, 3, 48]
# Note the row order is different this time:
more_foods = pd.DataFrame(index=['Belgium', 'Germany', 'Ireland', 'Netherlands', 'Sweden', 'Switzerland'],
data={'Garlic': [29, 22, 5, 15, 9, 64]})
# Merge 'more_foods' into the 'food_consumed' data frame
food_consumed = food_consumed.join(more_foods)
# Collect the new data in a Series. Note that 'Tea' is (intentionally) missing!
portugal = pd.Series({'Coffee': 72, 'Herring': 20, 'Yoghurt': 6, 'Garlic': 89},
                     name='Portugal')
# Older Pandas used food_consumed.append(portugal); the .append method
# was removed in Pandas 2.0, so use pd.concat instead:
food_consumed = pd.concat([food_consumed, portugal.to_frame().T])
# See the missing value created?
# Drop a column; it returns the column's values to you
coffee_column = food_consumed.pop('Coffee')
print(coffee_column)
print(food_consumed)
# Leaves the original data untouched; returns only
# a copy, with those columns removed
food_consumed.drop(['Garlic', 'Yoghurt'], axis=1)
# Leaves the original data untouched; returns only
# a copy, with those rows removed.
non_EU_consumption = food_consumed.drop(['Switzerland', ], axis=0)
# Returns a COPY of the array, with no missing values:
cleaned_data = food_consumed.dropna()
# Makes the deletion in place, modifying food_consumed directly
food_consumed.dropna(inplace=True)
# Remove only rows where all values are missing:
food_consumed.dropna(how='all')
You can skip this theory section for now, and go straight to the application below. Read this later, if it is interesting.
CSV stands for Comma-Separated Values. It is a table of data, in a plain text file, where the numbers are separated by commas.
In the figure below, you see an example CSV file and what it might look like if you open one in a text editor. Download the file: https://openmv.net/file/batch-yield-and-purity.csv to follow along. Verify that it looks similar to what is shown.
How is a CSV file processed by Python?
Terminology: the comma symbol is used to separate one number from the next. In this situation, the comma is also known as a *delimiter*. You can imagine that you can devise a file format where you specify some other symbol as the delimiter, for example the 'tab' character, or the semi-colon, or even a space.
You will see the word *delimiter* used below in the code. Now you know what it means.
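For instance, reading a semicolon-delimited file uses the same pd.read_csv function, only with a different delimiter argument. A sketch, with a small in-memory "file" so the example is self-contained:

```python
import io
import pandas as pd

# A small semicolon-delimited 'file', held in memory for this example:
text = "x;y\n1;4\n2;5\n3;6\n"
df = pd.read_csv(io.StringIO(text), delimiter=';')
print(df)
```

For a tab-delimited file you would pass delimiter='\t' instead.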
There are *problems* and *advantages* to CSV files that are worth knowing. One useful convention: a missing value can be indicated by writing NaN between two commas. For example, 71,NaN,73 indicates a value is missing between the 71 and the 73.

The Pandas function pd.read_csv has a lot of flexibility and smart processing built in, to make reading CSV files with headers, missing values, and other settings easy. It is a Swiss-army-knife function: very versatile, but you need to know how to use it.
Download this CSV file: http://openmv.net/file/batch-yield-and-purity.csv and adjust the code below, where necessary:
import os
import pandas as pd
directory = r'C:\location\of\file'
filename = 'batch-yield-and-purity.csv'
full_filename = os.path.join(directory, filename)
yield_purity_pd = pd.read_csv(full_filename)
import pandas as pd
yield_purity_pd = pd.read_csv('http://openmv.net/file/batch-yield-and-purity.csv')
# If you are on a work computer behind a proxy server, you
# have to take a few more steps. Add these 6 lines of code.
import io
import requests
proxyDict = {"http" : "http://replace.with.proxy.address:port"}
url = "http://openmv.net/file/batch-yield-and-purity.csv"
s = requests.get(url, proxies=proxyDict).content
web_dataset = io.StringIO(s.decode('utf-8'))
# Convert the file fetched from the web to a Pandas dataframe
yield_purity_pd = pd.read_csv(web_dataset)
This is as simple as can be:
yield_purity_pd.to_csv('output_filename_here.csv')
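Two commonly useful options of to_csv are shown below (a sketch; the data and file names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'yield': [75.1, 80.2], 'purity': [92.0, 94.5]})
df.to_csv('example_output.csv', index=False)  # omit the row index column
df.to_csv('example_tabs.txt', sep='\t')       # tab-delimited instead of commas
```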
1. Read in the Batch yield and purity dataset above. The dataset is described here: http://openmv.net/info/batch-yield-and-purity . The moment you have created a data frame with Pandas, it is advisable to call df.head() on it; in this case: yield_purity_pd.head(). Does the data frame match the website's description?
2. What is the lowest yield recorded? Make sure your command returns only a floating-point value.
3. And the highest?
4. What is the average purity of the raw material?
5. Check the output of yield_purity_pd.describe(): can you interpret each row of output?
6. In the description given on the website, there is the idea that batch yield is affected by purity. For a cause-and-effect relationship to exist, there should be a correlation. Do you see that? Use yield_purity_pd.corr() to calculate the correlation between the two columns.
7. We use relative standard deviation (RSD) as a way to judge how noisy a variable is. For a single variable, this is defined as: $$\text{RSD} = \dfrac{\text{standard deviation}}{\text{average}}$$ Calculate the numerator (yield_purity_pd.std()) and the denominator separately. Do the values look reasonable? Now divide them. Does yield or purity have the higher RSD?
8. Create a new calculated column, hypothesis $= 4\times(\text{purity} - 50)$, and add that as a new column to the existing data frame.
9. Write the data frame, now with 3 columns, to a CSV file.
10. Open the CSV file in a text editor to ensure the data are properly stored, with the expected accuracy.
There is not too much to say here, other than to show the basic commands (where excel_filename is the path to your Excel file):
colour_data = pd.read_excel(excel_filename,
sheet_name='Colours',
skiprows=5,
index_col=0)
colour_data.head()
You can call the function with various inputs, depending on your situation. Read the full documentation for reading Excel files: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html
Similarly, for writing Excel files, it is often enough to just use:
df = pd.DataFrame(...)
df.to_excel("output.xlsx", sheet_name='Summary')
and it is worth checking the documentation for further function options: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html
Back in worksheet 6 you simulated the cooling process taking place in a fridge.
By the end of the exercise you had assembled 3 columns of equal length in 3 separate variables. Here is how your code might have looked:
import numpy as np

def simulate_cooling(time_final=30, initial_temp=25, delta_t=2.5):
    r"""Models Newton's law of cooling: \dfrac{dT}{dt} = -k (T-F)

    The fridge has a constant temperature, F=5 [°C].
    Heat transfer coefficient = k = 0.08 [1/minutes]

    Discretizing: T_{i+1} = T_i - k (\Delta t)(T_i - F)

    The temperature at time point i+1 (one step in the future, T_{i+1})
    is related to the temperature now, at time i, T_i.

    The total simulation time is `time_final` minutes, starting at
    t=0. The initial temperature of the object in the fridge is
    `initial_temp` °C, and the simulation time steps are `delta_t`
    minutes apart.
    """
    F_temperature = 5           # °C
    heat_transfer_coeff = 0.08  # 1/minutes

    # Create the two outputs of interest. Note that np.arange excludes
    # the stop value, so the last time point is time_final - delta_t.
    time = np.arange(start=0.0, stop=time_final, step=delta_t)
    temp = np.zeros(time.shape)
    temp[0] = initial_temp
    for idx in range(len(time) - 1):
        temp[idx + 1] = temp[idx] - heat_transfer_coeff * delta_t * (temp[idx] - F_temperature)

    # Exact (analytical) solution, for comparison:
    exact = F_temperature + (initial_temp - F_temperature) * np.exp(-heat_transfer_coeff * time)
    return (time, temp, exact)

time, temperature, true_value = simulate_cooling(time_final=30, initial_temp=25)
The moment you collect several output variables that logically belong together, you should combine them into a single variable: a Pandas data frame. Put each variable in a new column, so your function's output is simplified:
simulation = simulate_cooling(time_final=30, initial_temp=25)
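A minimal sketch of that idea, with made-up numbers rather than the cooling simulation itself:

```python
import numpy as np
import pandas as pd

time = np.arange(0.0, 10.0, 2.5)
temp = np.array([25.0, 21.0, 17.8, 15.2])
results = pd.DataFrame({'time': time, 'temperature': temp})
results = results.set_index('time')  # 'time' becomes the index, not a column
print(results.shape)                 # (4, 1): the index no longer counts as a column
```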
Modify the above code. Remove the last 2 lines inside the function, and replace them with code that:

- creates a data frame with one column for time and one for temperature
- adds a column exact, which contains the exact solution, but calculated from the values in the first two columns
- adds a column error, which contains the difference between the true value and the simulated value
- sets the first column, time, to become your index! That actually removes that column. If your data frame were called results inside the function, you do this by writing: results.set_index('time'). Check results.shape before and after doing this.

Load the CSV file from https://openmv.net/info/raw-material-height which is actual data from a process, where the height is a critical parameter to monitor, to ensure it does not go too low.
data['Level'].plot()
We will return to plotting in a later worksheet.
Back in worksheet 4 you used a data set from the Dutch meteorological service (KNMI): temperature readings from a location in the Netherlands, since 1901. Download the file, or use the direct web address: http://projects.knmi.nl/klimatologie/onderzoeksgegevens/homogeen_260/tg_hom_mnd260.txt
The first column is the station number, the next is the date, and the third column is the temperature, measured in units of °C.
Unfortunately, the KNMI service has used commas as the delimiter in line 27 for the column headings, and then spaces as the delimiter in the rest of the file. This makes it hard to find the right settings to import the file. Nevertheless, try it, using the documentation for pd.read_csv.
*Hint:*
knmi = pd.read_csv('URL here, or the file name on your computer', delimiter=..., skiprows=..., header=0)
knmi.head()
Warning: it can be a frustrating exercise dealing with other people's badly formatted data. *But that's reality.*
Wrap up this section by committing all your work. Have you used a good commit message? Push your work, to refer to later, but also as a backup.
*Feedback and comments about this worksheet?* Please provide any anonymous comments, feedback and tips.