Pythonic Data Analysis
Unidata Python Workshop
Questions¶
- How can we employ Python language features to make complicated analysis require less code?
- How can we make multi panel plots?
- What can be done to eliminate repeated code that operates on sequences of objects?
- How can functions be used to encapsulate calculations and behaviors?
Objectives¶
1. From Time Series Plotting Episode¶
Here's the basic set of imports and data reading functionality that we established in the Basic Time Series Plotting notebook.
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter, DayLocator
from siphon.simplewebservice.ndbc import NDBC
%matplotlib inline
# Read in some data
df = NDBC.realtime_observations('42039')
# Trim to the last 7 days
df = df[df['time'] > (pd.Timestamp.utcnow() - pd.Timedelta(days=7))]
2. Multi-panel Plots¶
Often we wish to create figures with multiple panels of data. It's common to separate variables of different types into these panels. We also don't want to create each panel as an individual figure and combine them in a tool like Illustrator - imagine having to do that for hundreds of plots!
Previously we specified subplots individually with plt.subplot(). We can instead use the subplots function to specify a number of rows and columns of plots in our figure, which returns the figure and all of the axes (subplots) we ask for in a single call:
# ShareX means that the axes will share range, ticking, etc. for the x axis
fig, (ax1, ax2) = plt.subplots(1, 2, sharex=True, figsize=(18, 6))
# Panel 1
ax1.plot(df.time.values, df.wind_speed, color='tab:orange', label='Windspeed')
ax1.set_xlabel('Time')
ax1.set_ylabel('Speed')
ax1.set_title('Measured Winds')
ax1.legend(loc='upper left')
ax1.grid(True)
# Date formatting only needs to be set once because the x axis is shared
ax1.xaxis.set_major_formatter(DateFormatter('%m/%d'))
ax1.xaxis.set_major_locator(DayLocator())
# Panel 2
ax2.plot(df.time.values, df.pressure, color='black', label='Pressure')
ax2.set_xlabel('Time')
ax2.set_ylabel('hPa')
ax2.set_title('Atmospheric Pressure')
ax2.legend(loc='upper left')
ax2.grid(True)
plt.suptitle('Buoy 42039 Data', fontsize=24)
So even with the sharing of axis information, there's still a lot of repeated code. This current version with just two parameters might still be ok, but:
- What if we had more data being plotted on each axes?
- What if we had many subplots?
- What if we wanted to change one of the parameters?
- What if we wanted to plot data from different files on the same plot?
3. Iteration and Enumeration¶
Iterating over lists is a very useful tool to reduce the amount of repeated code you write. We're going to start out by iterating over a single list with a for loop. Unlike C or other common scientific languages, Python 'knows' how to iterate over certain objects without you needing to specify an index variable and do the bookkeeping on that.
my_list = ['2001 A Space Odyssey',
           'The Princess Bride',
           'Monty Python and the Holy Grail']
for item in my_list:
    print(item)
Using the zip function we can even iterate over multiple lists at the same time with ease:
my_other_list = ['I\'m sorry, Dave. I\'m afraid I can\'t do that.',
                 'My name is Inigo Montoya.',
                 'It\'s only a flesh wound.']
for item in zip(my_list, my_other_list):
    print(item)
That's really handy, but needing to access each part of each item with an index like item[0] isn't very flexible, requires us to remember the layout of the item, and isn't best practice. Instead we can use Python's unpacking syntax to make things nice and intuitive.
for reference, quote in zip(my_list, my_other_list):
    print(reference, '-', quote)
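One detail worth knowing when pairing lists this way: zip stops as soon as the shortest input runs out. The sketch below illustrates that with a deliberately shortened list of quotes, and shows itertools.zip_longest from the standard library as one way to keep the unmatched items. The shortened list and the '(no quote)' placeholder are made up for illustration.

```python
from itertools import zip_longest

titles = ['2001 A Space Odyssey',
          'The Princess Bride',
          'Monty Python and the Holy Grail']
# Deliberately one item short (illustrative data only)
quotes = ["I'm sorry, Dave. I'm afraid I can't do that.",
          'My name is Inigo Montoya.']

# zip silently stops when the shortest input is exhausted
paired = list(zip(titles, quotes))
print(len(paired))

# zip_longest keeps going and fills the gaps with a placeholder value
padded = list(zip_longest(titles, quotes, fillvalue='(no quote)'))
for title, quote in padded:
    print(title, '-', quote)
```

If the lists are supposed to be the same length, a silently dropped item can hide a mistake, so it is worth checking the lengths before looping.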
- Make two new lists named plot_variables and plot_names. Populate them with the variable name and plot label string for windspeed and pressure.
- Using the unpacking syntax, write a for loop that prints a sentence describing the action that would be taken (e.g. Plotting variable wind_speed as Windspeed).
# YOUR CODE GOES HERE
# %load solutions/zip.py
# Cell content replaced by load magic replacement.
plot_variables = ['wind_speed', 'pressure']
plot_names = ['Windspeed', 'Atmospheric Pressure']
for var, name in zip(plot_variables, plot_names):
    print('Plotting variable', var, 'as', name)
zip can also be used to "unzip" items.
zipped_list = [(1, 2),
(3, 4),
(5, 6)]
unzipped = zip(*zipped_list)
print(list(unzipped))
Let's break down what happened there. zip pairs up the elements from all of its input arguments and hands those back to us. The * unpacks zipped_list into separate arguments, so our zip(*zipped_list) is effectively zip((1, 2), (3, 4), (5, 6)): the first elements of each input are grouped together as (1, 3, 5), then the second elements as (2, 4, 6). You can think of it like unzipping or transposing.
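As a small sketch of the same idea applied to the kind of data used in this notebook, the snippet below unzips a list of (variable, label) pairs into two tuples and then zips them back together. The pairs are illustrative values, not something read from the buoy data.

```python
# A list of (variable name, plot label) pairs
pairs = [('wind_speed', 'Windspeed'),
         ('pressure', 'Atmospheric Pressure')]

# Unzip: the * passes each pair to zip as a separate argument
variables, names = zip(*pairs)
print(variables)
print(names)

# Zipping the two tuples again gives back the original pairs
rezipped = list(zip(variables, names))
print(rezipped == pairs)
```

Note that the unzipped results are tuples rather than lists; wrap them in list() if you need to modify them.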
We can use the enumerate function to 'count through' an iterable object as well. This can be useful when placing figures in certain rows/columns or when a counter is needed.
for i, quote in enumerate(my_other_list):
    print(i, ' - ', quote)
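To make the rows/columns use case concrete, here is a minimal sketch that turns the counter from enumerate into a grid position with the built-in divmod. The panel names and the two-column layout are assumptions chosen for illustration.

```python
# Hypothetical list of panels to lay out on a grid with two columns
panel_names = ['wind_speed', 'wind_gust', 'pressure', 'wave_height']
ncols = 2

positions = {}
for i, name in enumerate(panel_names):
    # divmod(i, ncols) returns (i // ncols, i % ncols), i.e. (row, column)
    row, col = divmod(i, ncols)
    positions[name] = (row, col)
    print(i, name, '-> row', row, 'column', col)

# enumerate can also start counting somewhere other than zero
for number, name in enumerate(panel_names, start=1):
    print(f'Panel {number}: {name}')
```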
- Combine what you've learned about enumeration and iteration to produce the following output:
0 - 2001 A Space Odyssey - I'm sorry, Dave. I'm afraid I can't do that.
1 - The Princess Bride - My name is Inigo Montoya.
2 - Monty Python and the Holy Grail - It's only a flesh wound.
# YOUR CODE GOES HERE
# %load solutions/enumerate.py
# Cell content replaced by load magic replacement.
for i, item in enumerate(zip(my_list, my_other_list)):
    reference, quote = item
    print(i, ' - ', reference, ' - ', quote)
4. Functions¶
You're probably already familiar with Python functions, but here's a quick refresher. Functions are used to house blocks of code that we can run repeatedly. Parameters are given as inputs, and values are returned from the function to where it was called. In the world of programming you can think of functions like paragraphs: they encapsulate a complete idea or process.
Let's define a simple function that returns a value:
def silly_add(a, b):
    return a + b
We've re-implemented addition, which isn't incredibly exciting, but that function body could just as well be hundreds of lines of a numerical method, making a plot, or some other task. Using the function is simple:
result = silly_add(3, 4)
print(result)
- Write a function that returns powers of 2 (e.g. calling myfunc(4) returns 2^4).
- Bonus: Using for loop iteration, print the powers of 2 for exponents 0 through 24.
# Your code goes here
# %load solutions/functions.py
# Cell content replaced by load magic replacement.
def myfunc(exp):
    return 2**exp
for i in range(0, 25):
    print(myfunc(i))
Reading buoy data with a function¶
Let's create a function to read in buoy data and trim it down to the last 7 days by only providing the buoy number to the function.
def read_buoy_data(buoy, days=7):
    # Read in some data
    df = NDBC.realtime_observations(buoy)
    # Trim to the last `days` days (7 by default)
    df = df[df['time'] > (pd.Timestamp.utcnow() - pd.Timedelta(days=days))]
    return df
df = read_buoy_data('42039')
df
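The trimming step inside read_buoy_data can be tried without a network connection. The sketch below applies the same boolean-mask filter to a small synthetic DataFrame. The helper name trim_to_recent, the optional now argument, and the fake pressure values are all assumptions made for illustration, and it uses pd.Timestamp.now(tz='UTC') as the default reference time.

```python
import pandas as pd


def trim_to_recent(df, days=7, now=None):
    """Keep only the rows whose 'time' falls within the last `days` days."""
    if now is None:
        now = pd.Timestamp.now(tz='UTC')
    cutoff = now - pd.Timedelta(days=days)
    # The comparison builds a True/False mask; indexing with it keeps the True rows
    return df[df['time'] > cutoff]


# Synthetic data: ten daily observations ending at a fixed reference time
now = pd.Timestamp('2024-01-15 00:00', tz='UTC')
fake = pd.DataFrame({
    'time': pd.date_range(end=now, periods=10, freq='D'),
    'pressure': [1010.0 + i for i in range(10)],
})

recent = trim_to_recent(fake, days=7, now=now)
shorter = trim_to_recent(fake, days=3, now=now)
print(len(fake), len(recent), len(shorter))
```

Passing the reference time in explicitly makes the result repeatable, which is handy when checking that the filter keeps the rows you expect.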
5. Args and Kwargs¶
Functions can also accept a variable number of positional arguments and keyword arguments (conventionally abbreviated args and kwargs in Python). Args are used to pass a variable-length sequence of non-keyword arguments. This means that args don't have a specific keyword they are attached to, and are used in the order provided. Kwargs are arguments that are attached to specific keywords, and therefore have a specific use within a function.
Args Example¶
def arg_func(*args):
    for arg in args:
        print(arg)
arg_func('Welcome', 'to', 'the', 'Python', 'Workshop')
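The same star syntax works in both directions: in a function definition it collects positional arguments into a tuple, and in a function call it unpacks a sequence into separate arguments (as in the zip(*zipped_list) example earlier). Here is a minimal sketch with a hypothetical total function and made-up readings.

```python
def total(*args):
    """Add up however many numbers are passed in."""
    result = 0
    for value in args:
        result += value
    return result


# Collecting: three positional arguments arrive as the tuple (1, 2, 3)
print(total(1, 2, 3))

# Unpacking: the * spreads the list out into separate arguments
readings = [4.5, 5.0, 5.5]
print(total(*readings))
```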
Kwargs Example¶
# Create a function to conduct all basic math operations, using a kwarg
def silly_function(a, b, operation=None):
    if operation == 'add':
        return a + b
    elif operation == 'subtract':
        return a - b
    elif operation == 'multiply':
        return a * b
    elif operation == 'division':
        return a / b
    else:
        raise ValueError('Incorrect value for "operation" provided.')
print(silly_function(3, 4, operation='add'))
print(silly_function(3, 4, operation='multiply'))
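A chain of if/elif branches like this can also be written as a dictionary lookup, which previews the dictionary-based approach used for plot colors later on. The sketch below is one possible alternative, not a replacement for the version above; the names OPERATIONS and silly_function_lookup are made up for illustration, and the functions come from the standard library's operator module.

```python
import operator

# Map each operation name to the function that performs it
OPERATIONS = {
    'add': operator.add,
    'subtract': operator.sub,
    'multiply': operator.mul,
    'division': operator.truediv,
}


def silly_function_lookup(a, b, operation=None):
    """Same behavior as the if/elif version, using a dictionary lookup."""
    try:
        func = OPERATIONS[operation]
    except KeyError:
        raise ValueError('Incorrect value for "operation" provided.') from None
    return func(a, b)


print(silly_function_lookup(3, 4, operation='add'))
print(silly_function_lookup(3, 4, operation='multiply'))
```

Adding a new operation then means adding one dictionary entry rather than another branch.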
Kwargs are commonly used in MetPy, matplotlib, pandas, and many other Python libraries (in fact we've used them in almost every notebook so far!).
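Libraries often accept arbitrary keyword arguments with the double-star form, **kwargs, and pass them along to another function. The sketch below imitates that pattern without any plotting library. Both function names and the default style values are hypothetical stand-ins, not part of matplotlib or MetPy.

```python
def make_line_spec(x, y, **kwargs):
    """Hypothetical stand-in for a plotting call: bundle data with style options."""
    spec = {'x': x, 'y': y, 'color': 'tab:blue', 'linestyle': '-'}
    # kwargs is an ordinary dictionary of whatever keywords the caller supplied
    spec.update(kwargs)
    return spec


def make_dashed_line(x, y, **kwargs):
    """Force a dashed line and forward every other keyword unchanged."""
    return make_line_spec(x, y, linestyle='--', **kwargs)


spec = make_dashed_line([0, 1], [2, 3], color='black', label='Pressure')
print(spec)
```

One caveat of this simple forwarding: calling make_dashed_line with its own linestyle keyword would raise a TypeError, because the same keyword would be passed twice.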
6. Plotting with Iteration¶
Now let's bring what we've learned about iteration to bear on the problem of plotting. We'll start with a basic example and roll into a more involved system at the end.
To begin, let's make an arbitrary number of plots in a single row:
# A list of names of variables we want to plot
plot_variables = ['wind_speed', 'pressure']
# Make our figure, now choosing number of subplots based on length of variable name list
fig, axes = plt.subplots(1, len(plot_variables), sharex=True, figsize=(18, 6))
# Loop over the list of subplots and names together
for ax, var_name in zip(axes, plot_variables):
    ax.plot(df.time.values, df[var_name])
    # Set label/title based on variable name--no longer hard-coded
    ax.set_ylabel(var_name)
    ax.set_title(f'Buoy {var_name}')
    # Set up our formatting--note lack of repetition
    ax.grid(True)
    ax.set_xlabel('Time')
    ax.xaxis.set_major_formatter(DateFormatter('%m/%d'))
    ax.xaxis.set_major_locator(DayLocator())
It's a step forward, but we've lost a lot of formatting information. The lines are both blue, the labels are less than ideal, and the title just uses the variable name. We can use some of Python's features like dictionaries, functions, and string manipulation to help improve the versatility of the plotter.
To start out, let's get the line color functionality back by using a Python dictionary to hold that information. Dictionaries can hold any data type and allow you to access that value with a key (hence the name key-value pair). We'll use the variable name for the key and the value will be the color of line to plot.
colors = {'wind_speed': 'tab:orange', 'wind_gust': 'tab:olive', 'pressure': 'black'}
To access the value, just access that element of the dictionary with the key.
colors['pressure']
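Looking up a key that is not in the dictionary with square brackets raises a KeyError. If a fallback makes sense, the dictionary's get method returns a default instead. In this minimal sketch the 'wave_height' variable and the 'tab:blue' fallback are assumptions chosen for illustration.

```python
colors = {'wind_speed': 'tab:orange', 'wind_gust': 'tab:olive', 'pressure': 'black'}

# Assumed fallback color for variables we have not assigned a color to
default_color = 'tab:blue'

picked = {}
for var_name in ['pressure', 'wave_height']:
    # get() returns the stored value, or the default when the key is missing
    picked[var_name] = colors.get(var_name, default_color)
    print(var_name, '->', picked[var_name])

# The 'in' operator checks for a key without raising an error
print('wave_height' in colors)
```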
Now let's apply that to our plot. We'll use the same code from the previous example, but now look up the line color in the dictionary.
fig, axes = plt.subplots(1, len(plot_variables), sharex=True, figsize=(18, 6))
for ax, var_name in zip(axes, plot_variables):
    # Grab the color from our dictionary and pass it to plot()
    color = colors[var_name]
    ax.plot(df.time.values, df[var_name], color=color)
    ax.set_ylabel(var_name)
    ax.set_title(f'Buoy {var_name}')
    ax.grid(True)
    ax.set_xlabel('Time')
    ax.xaxis.set_major_formatter(DateFormatter('%m/%d'))
    ax.xaxis.set_major_locator(DayLocator())
That's already much better. We need to be able to plot multiple variables on the wind speed/gust plot though. In this case, we'll allow a list of variables for each plot to be given and iterate over them. We'll store this in a list of lists. Each plot has its own list of variables!
plot_variables = [['wind_speed', 'wind_gust'], ['pressure']]
fig, axes = plt.subplots(1, len(plot_variables), sharex=True, figsize=(18, 6))
for ax, var_names in zip(axes, plot_variables):
    for var_name in var_names:
        # Grab the color from our dictionary and pass it to plot()
        color = colors[var_name]
        ax.plot(df.time.values, df[var_name], color=color)
    ax.set_ylabel(var_name)
    ax.set_title(f'Buoy {var_name}')
    ax.grid(True)
    ax.set_xlabel('Time')
    ax.xaxis.set_major_formatter(DateFormatter('%m/%d'))
    ax.xaxis.set_major_locator(DayLocator())
- Create a dictionary of linestyles in which the variable name is the key and the linestyle is the value.
- Use that dictionary to modify the code below to plot the lines with the styles you specified.
# Create your linestyles dictionary and modify the code below
fig, axes = plt.subplots(1, len(plot_variables), sharex=True, figsize=(18, 6))
for ax, var_names in zip(axes, plot_variables):
    for var_name in var_names:
        # Grab the color from our dictionary and pass it to plot()
        color = colors[var_name]
        ax.plot(df.time.values, df[var_name], color=color)
    ax.set_ylabel(var_name)
    ax.set_title(f'Buoy {var_name}')
    ax.grid(True)
    ax.set_xlabel('Time')
    ax.xaxis.set_major_formatter(DateFormatter('%m/%d'))
    ax.xaxis.set_major_locator(DayLocator())
# %load solutions/looping1.py
# Cell content replaced by load magic replacement.
linestyles = {'wind_speed': '-', 'wind_gust': '--', 'pressure': '-'}
fig, axes = plt.subplots(1, len(plot_variables), sharex=True, figsize=(18, 6))
for ax, var_names in zip(axes, plot_variables):
    for var_name in var_names:
        # Grab the color and linestyle from our dictionaries and pass them to plot()
        color = colors[var_name]
        linestyle = linestyles[var_name]
        ax.plot(df.time.values, df[var_name], color=color, linestyle=linestyle)
    ax.set_ylabel(var_name)
    ax.set_title(f'Buoy {var_name}')
    ax.grid(True)
    ax.set_xlabel('Time')
    ax.xaxis.set_major_formatter(DateFormatter('%m/%d'))
    ax.xaxis.set_major_locator(DayLocator())
We're almost back to where we started, but in a much more versatile form! We just need to make the labels and titles look nice. To do that, let's write a function that uses some string manipulation to clean up the variable names and give us an axis/plot title and legend label.
def format_varname(varname):
    parts = varname.split('_')
    title = parts[0].title()
    label = varname.replace('_', ' ').title()
    return title, label
fig, axes = plt.subplots(1, len(plot_variables), sharex=True, figsize=(18, 6))
linestyles = {'wind_speed': '-', 'wind_gust': '--', 'pressure': '-'}
for ax, var_names in zip(axes, plot_variables):
    for var_name in var_names:
        title, label = format_varname(var_name)
        color = colors[var_name]
        linestyle = linestyles[var_name]
        ax.plot(df.time.values, df[var_name], color=color, linestyle=linestyle, label=label)
    ax.set_ylabel(title)
    ax.set_title(f'Buoy {title}')
    ax.grid(True)
    ax.set_xlabel('Time')
    ax.xaxis.set_major_formatter(DateFormatter('%m/%d'))
    ax.xaxis.set_major_locator(DayLocator())
    ax.legend(loc='upper left')
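To see exactly what format_varname hands back, the sketch below repeats the function so it runs on its own and prints the title and label for each variable used in this notebook. The results dictionary is just a convenience for the demonstration.

```python
def format_varname(varname):
    parts = varname.split('_')
    # The title uses only the first word; the label uses the whole name
    title = parts[0].title()
    label = varname.replace('_', ' ').title()
    return title, label


results = {}
for var_name in ['wind_speed', 'wind_gust', 'pressure']:
    title, label = format_varname(var_name)
    results[var_name] = (title, label)
    print(var_name, '-> title:', title, '| label:', label)
```

Because wind_speed and wind_gust share the first word, both produce the title 'Wind', which is why a single axis title works for the panel that shows both lines while the legend labels stay distinct.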
7. Plotting Multiple Files¶
Finally, let's plot data for two buoys on the same figure by iterating over a list of buoy IDs. We can use enumerate to plot each buoy on a new row of the figure. We will also reuse the read_buoy_data function we wrote earlier to read in each buoy's data and avoid all of that repeated code.
buoys = ['42039', '42022']
fig, axes = plt.subplots(len(buoys), len(plot_variables), sharex=True, figsize=(14, 10))
for row, buoy in enumerate(buoys):
    df = read_buoy_data(buoy)
    for col, var_names in enumerate(plot_variables):
        ax = axes[row, col]
        for var_name in var_names:
            title, label = format_varname(var_name)
            color = colors[var_name]
            linestyle = linestyles[var_name]
            ax.plot(df.time.values, df[var_name], color=color, linestyle=linestyle, label=label)
        ax.set_ylabel(title)
        ax.set_title(f'Buoy {buoy} {title}')
        ax.grid(True)
        ax.set_xlabel('Time')
        ax.xaxis.set_major_formatter(DateFormatter('%m/%d'))
        ax.xaxis.set_major_locator(DayLocator())
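One thing to watch for with axes[row, col] indexing: by default plt.subplots squeezes out dimensions of length one, so with a single buoy (or a single column of variables) axes is no longer two-dimensional and the indexing fails. Passing squeeze=False always returns a 2D array. The sketch below shows this with one buoy and no data; the Agg backend line is only there so it runs outside a notebook, and the panel titles are illustrative.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

buoys = ['42039']  # only one buoy this time
plot_variables = [['wind_speed', 'wind_gust'], ['pressure']]

# squeeze=False keeps axes two-dimensional even when there is a single row
fig, axes = plt.subplots(len(buoys), len(plot_variables), sharex=True,
                         figsize=(14, 5), squeeze=False)
print(axes.shape)

for row, buoy in enumerate(buoys):
    for col, var_names in enumerate(plot_variables):
        ax = axes[row, col]  # works for any number of rows and columns
        names = ', '.join(var_names)
        ax.set_title(f'Buoy {buoy}: {names}')
```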