Advanced StationPlots with Mesonet Data
Advanced Surface Observations: Working with Mesonet Data
Unidata Python Workshop
Questions
- How do I read in complicated mesonet data with Pandas?
- How do I merge multiple Pandas DataFrames?
- What's the best way to make a station plot of data?
- How can I make a time series of data from one station?
Objectives
Reading Mesonet Data
In this notebook, we're going to use the Pandas library to read text-based data. Pandas is excellent at handling text, CSV, and other files. However, you sometimes have to help Pandas figure out how your data is formatted. Unfortunately, mesonet data frequently comes in forms that are not the most user-friendly. Throughout this notebook, we'll see how these complicated datasets can be handled nicely by Pandas to create useful station plots for hand analysis or publication.
# Import Pandas
import pandas as pd
West Texas Mesonet
The West Texas Mesonet is a wonderful data source for researchers and storm chasers alike! We have some 5-minute observations from the entire network on 22 March 2019 that we'll analyze in this notebook.
Pandas can parse time into a nice internal storage format as we read in the file. If the time is specified in the file in a somewhat standard form, Pandas will even guess at the format if you tell it which column to use. In this case, however, the time is reported in a horrible format: between one and four characters that, when all four are present, represent hours and minutes as HHMM. Let's take a character string, turn it into an integer, and then use integer string formatting to write out a four-character string.
for t in ['0', '05', '100', '1005']:
    print('{0:04d}'.format(int(t)))
Pandas can be told how to parse non-standard date formats by writing an arbitrary function that takes a string and returns a datetime. Here's what that function looks like in this case. We can use timedelta to convert hours and minutes, and then add them to the start date using date math.
def parse_tx_date(v, start_date=None):
    s = '{0:04d}'.format(int(v))  # regularize the data to a four character string
    hour = pd.to_timedelta(int(s[0:2]), 'hour')
    minute = pd.to_timedelta(int(s[2:4]), 'minute')
    return start_date + hour + minute
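As a quick sanity check, here's the parser applied to a couple of sample time strings. The start date below is the one for this dataset; the sample time values are arbitrary.

```python
import pandas as pd

def parse_tx_date(v, start_date=None):
    s = '{0:04d}'.format(int(v))  # regularize the data to a four character string
    hour = pd.to_timedelta(int(s[0:2]), 'hour')
    minute = pd.to_timedelta(int(s[2:4]), 'minute')
    return start_date + hour + minute

start = pd.Timestamp('2019-03-22')
print(parse_tx_date('5', start_date=start))     # 5 minutes past midnight
print(parse_tx_date('2300', start_date=start))  # 23:00 UTC
```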
# Read in the data and handle the lines that cause issues
# Get a nice date variable corresponding to the start time
start_date = pd.to_datetime('2019-03-22')
print(start_date)
# Pre-apply the start date to our date parsing function, so that pandas only passes one value
from functools import partial
date_parser = partial(parse_tx_date, start_date=start_date)
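If partial is new to you, here's a minimal, unrelated sketch of what it does: it freezes some arguments of a function and returns a new callable that only needs the rest. The power function below is just an illustration, not part of the workshop data.

```python
from functools import partial

def power(base, exponent):
    return base ** exponent

# Freeze exponent=2 to get a one-argument squaring function
square = partial(power, exponent=2)
print(square(5))  # 25
```

This is exactly what we did above: freezing start_date so that Pandas only has to pass the time string.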
filename = 'West_Texas_data/FIVEMIN_82.txt'
tx_data = pd.read_csv(filename, delimiter=',', header=None, error_bad_lines=False, warn_bad_lines=False,
parse_dates=[2], date_parser=date_parser
)
tx_data
# Rename columns to be understandable
tx_data.columns = ['Array_ID', 'QC_flag', 'Time', 'Station_ID', '10m_scalar_wind_speed',
'10m_vector_wind_speed', '10m_wind_direction',
'10m_wind_direction_std', '10m_wind_speed_std',
'10m_gust_wind_speed', '1.5m_temperature',
'9m_temperature', '2m_temperature',
'1.5m_relative_humidity', 'station_pressure', 'rainfall',
'dewpoint', '2m_wind_speed', 'solar_radiation']
tx_data
The West Texas Mesonet provides data on weather, agriculture, and radiation. These different observations are encoded as 1, 2, and 3, respectively, in the Array_ID column. Let's pull out only the meteorological data for this exercise.
# Remove non-meteorological rows
tx_data = tx_data[tx_data['Array_ID'] == 1]
tx_data
The station pressure values are reported 600 hPa lower than they should be, so let's correct that as well!
# Correct pressure
tx_data['station_pressure'] += 600
tx_data['station_pressure']
Finally, let's read in the station metadata file for the West Texas mesonet, so that we can have coordinates to plot data later on.
tx_stations = pd.read_csv('WestTexas_stations.csv')
tx_stations
Oklahoma Data
Try reading in the Oklahoma Mesonet data located in the 201903222300.mdf file using Pandas. Check out the Pandas documentation if you run into issues! Make sure to handle missing values as well. Also read in the Oklahoma station data from the Oklahoma_stations.csv file. Only read in the station ID, latitude, and longitude columns from that file.
# Your code here
def parse_ok_date(v, start_date=None):
    s = '{0:04d}'.format(int(v))  # regularize the data to a four character string
    minute = pd.to_timedelta(int(s), 'minute')
    return start_date + minute
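A quick check of this parser: since it converts the whole value to minutes, the Oklahoma file evidently reports time as minutes since midnight, so 1380 minutes should land at 23:00 UTC (the start date below matches this dataset).

```python
import pandas as pd

def parse_ok_date(v, start_date=None):
    s = '{0:04d}'.format(int(v))  # regularize the data to a four character string
    minute = pd.to_timedelta(int(s), 'minute')
    return start_date + minute

print(parse_ok_date('1380', start_date=pd.Timestamp('2019-03-22')))
```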
# %load solutions/read_ok.py
# Cell content replaced by load magic replacement.
ok_data = pd.read_csv('201903222300.mdf', skiprows=2, delim_whitespace=True, na_values=-999,
parse_dates=[2], date_parser=partial(parse_ok_date, start_date=start_date))
ok_stations = pd.read_csv('Oklahoma_stations.csv', usecols=[1,7,8])
print(ok_data.head())
print(ok_stations.head())
Merging DataFrames
We now have two data files per mesonet: one for the data itself and one for the station metadata. It would be really nice to combine these DataFrames into one per mesonet. Pandas has some built-in methods to do this; see the Pandas documentation on merging. For this example, we'll be using the merge method. First, let's rename columns in the Oklahoma station DataFrame to be more understandable.
# Rename columns so merging can occur
ok_stations.columns = ['STID', 'LAT', 'LON']
Conveniently, we have a STID column in both DataFrames. Let's base our merge on that and see what we get!
# Merge the two data frames based on the Station ID
ok_data = pd.merge(ok_data, ok_stations, on='STID')
ok_data
That was nice! But what if our DataFrames don't have the same column name, and we want to avoid renaming columns? Check out the documentation for pd.merge and see how we can merge the West Texas DataFrames together. Also, subset the data to only be from 2300 UTC, which is when our Oklahoma data was taken. Call the new DataFrame tx_one_time.
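As a hint, here's a toy sketch of merging on differently named key columns with left_on/right_on. The tiny DataFrames below are made up for illustration; only the column names echo our real data.

```python
import pandas as pd

obs = pd.DataFrame({'Station_ID': [1, 2], 'temp': [20.5, 18.2]})
meta = pd.DataFrame({'Logger ID': [1, 2], 'Lat': [33.6, 35.2]})

# The key columns have different names, so name each side explicitly
merged = pd.merge(obs, meta, left_on='Station_ID', right_on='Logger ID')
print(merged)
```

Note that both key columns survive in the merged result, unlike a merge with on=.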
# Your code here
# %load solutions/merge_texas.py
# Cell content replaced by load magic replacement.
# Find common time between TX and OK data
tx_data = pd.merge(tx_data, tx_stations, left_on='Station_ID', right_on='Logger ID')
tx_one_time = tx_data[tx_data['Time'] == '2019-3-22 23:00']
tx_one_time
Creating a Station Plot
Let's say we want to plot temperature, dewpoint, and wind barbs. Given our data from the two mesonets, do we have what we need? If not, use MetPy to calculate what you need!
import metpy.calc as mpcalc
from metpy.units import units
# Your code here
# %load solutions/data_conversion.py
# Cell content replaced by load magic replacement.
ok_dewpoint = mpcalc.dewpoint_rh(ok_data['TAIR'].values * units.degC, ok_data['RELH'].values * units.percent)
ok_u, ok_v = mpcalc.wind_components(ok_data['WSPD'].values * units.mph, ok_data['WDIR'].values * units.degrees)
tx_u, tx_v = mpcalc.wind_components(tx_one_time['10m_scalar_wind_speed'].values * units.mph, tx_one_time['10m_wind_direction'].values * units.degrees)
Now, let's make a Station Plot with our data using MetPy and CartoPy.
from metpy.plots import StationPlot
import cartopy.crs as ccrs
import cartopy.feature as cfeature
import matplotlib.pyplot as plt
# Set up a plot with map features
fig = plt.figure(figsize=(12, 12))
proj = ccrs.Stereographic(central_longitude=-100, central_latitude=35)
ax = fig.add_subplot(1, 1, 1, projection=proj)
ax.add_feature(cfeature.STATES.with_scale('50m'), edgecolor='black')
ax.gridlines()
# Create a station plot pointing to an Axes to draw on as well as the location of points
stationplot = StationPlot(ax, ok_data['LON'].values, ok_data['LAT'].values, transform=ccrs.PlateCarree(),
fontsize=10)
stationplot.plot_parameter('NW', ok_data['TAIR'], color='red')
stationplot.plot_parameter('SW', ok_dewpoint, color='green')
stationplot.plot_barb(ok_u, ok_v)
# Texas Data
stationplot = StationPlot(ax, tx_one_time['Long'].values, tx_one_time['Lat'].values, transform=ccrs.PlateCarree(),
fontsize=10)
stationplot.plot_parameter('NW', tx_one_time['2m_temperature'], color='red')
stationplot.plot_parameter('SW', tx_one_time['dewpoint'], color='green')
stationplot.plot_barb(tx_u, tx_v)
This is an informative plot, but it is rather crowded. Using MetPy's reduce_point_density function, try cleaning up this plot into something that would be presentable/publishable. This function returns a mask, which you'll apply to all arrays in the plotting commands to filter down the data.
# Oklahoma
xy = proj.transform_points(ccrs.PlateCarree(), ok_data['LON'].values, ok_data['LAT'].values)
# Reduce point density so that there's only one point within a 50km circle
ok_mask = mpcalc.reduce_point_density(xy, 50000)
# Texas
# Your code here
# Plot
# Your code here
# %load solutions/reduce_and_plot.py
# Cell content replaced by load magic replacement.
xy = proj.transform_points(ccrs.PlateCarree(), tx_one_time['Long'].values, tx_one_time['Lat'].values)
tx_mask = mpcalc.reduce_point_density(xy, 50000)
#Plot
# Set up a plot with map features
fig = plt.figure(figsize=(12, 12))
proj = ccrs.Stereographic(central_longitude=-100, central_latitude=35)
ax = fig.add_subplot(1, 1, 1, projection=proj)
ax.add_feature(cfeature.STATES.with_scale('50m'), edgecolor='black')
ax.gridlines()
# Create a station plot pointing to an Axes to draw on as well as the location of points
stationplot = StationPlot(ax, ok_data['LON'].values[ok_mask], ok_data['LAT'].values[ok_mask], transform=ccrs.PlateCarree(),
fontsize=10)
stationplot.plot_parameter('NW', ok_data['TAIR'][ok_mask], color='red')
stationplot.plot_parameter('SW', ok_dewpoint[ok_mask], color='green')
stationplot.plot_barb(ok_u[ok_mask], ok_v[ok_mask])
# Texas Data
stationplot = StationPlot(ax, tx_one_time['Long'].values[tx_mask], tx_one_time['Lat'].values[tx_mask], transform=ccrs.PlateCarree(),
fontsize=10)
stationplot.plot_parameter('NW', tx_one_time['2m_temperature'][tx_mask], color='red')
stationplot.plot_parameter('SW', tx_one_time['dewpoint'][tx_mask], color='green')
stationplot.plot_barb(tx_u[tx_mask], tx_v[tx_mask])
Creating Time Series for Stations
What if we want to take data from all times from a single station to make a time series (or meteogram) plot? How can we easily do that with Pandas without having to aggregate the data by hand?
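Before working with the real data, here's a minimal sketch of the resample-based aggregation we'll use below, applied to a made-up 5-minute series (the values are arbitrary):

```python
import numpy as np
import pandas as pd

# Made-up 5-minute observations spanning two hours
times = pd.date_range('2019-03-22 06:00', periods=24, freq='5min')
df = pd.DataFrame({'value': np.arange(24.0)}, index=times)

# Resampling groups the 5-minute rows into hourly bins and averages each bin
hourly = df.resample('H').mean()
print(hourly)
```

Each hourly row is the mean of the twelve 5-minute rows that fall within that hour.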
import numpy as np
# Select daylight hours
tx_daytime = tx_data[(tx_data['Time'] >= '2019-03-22 06:00') & (tx_data['Time'] <= '2019-03-22 20:00')]
# Create sub-tables for each station
tx_grp = tx_daytime.groupby('ID')
# Get data from station DIMM
station_data = tx_grp.get_group('DIMM')
# Create hourly averaged data
# time_bins = pd.cut(station_data['Time'], np.arange(600, 2100, 100))
# xarray has groupby_bins, but pandas has cut
station_data.index = station_data['Time']
station_hourly = station_data.resample('H')
# station_hourly = station_data.groupby(time_bins)
station_hourly_mean = station_hourly.mean()
station_hourly_mean = station_hourly_mean.reset_index()  # get Time back as a regular column rather than the index
# The times are reported at the beginning of the interval, but really represent
# the mean symmetric about the half hour. Let's fix that.
station_hourly_mean['Time'] += pd.to_timedelta(30, 'minutes')
print(station_hourly_mean['Time'])
print(station_data['Time'])
Use the data above to make a time series plot of the instantaneous data and the hourly averaged data:
# Your code here
# %load solutions/mesonet_timeseries.py
# Cell content replaced by load magic replacement.
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(station_hourly_mean['Time'], station_hourly_mean['solar_radiation'])
ax.plot(station_data['Time'], station_data['solar_radiation'])