Introduction to Pandas

Unidata Python Workshop

Questions¶

What is Pandas?
What are the basic Pandas data structures?
How can I read data into Pandas?
What are some of the data operations available in Pandas?

Objectives¶

Data Series
Data Frames
Loading Data in Pandas
Missing Data
Manipulating Data

Data Series¶

Data series are one of the fundamental data structures in Pandas. You can think of them like a dictionary; they have a key (index) and value (data/values) like a dictionary, but also have some handy functionality attached to them.

To start out, let's create a series from scratch. We'll imagine these are temperature observations.

In [1]:

import pandas as pd
temperatures = pd.Series([23, 20, 25, 18])
temperatures

Out[1]:

0    23
1    20
2    25
3    18
dtype: int64

The values on the left are the index (zero based integers by default) and on the right are the values. Notice that the data type is an integer. Any NumPy datatype is acceptable in a series.

That's great, but it'd be more useful if the station were associated with those values. In fact you could say we want the values indexed by station name.

In [2]:

temperatures = pd.Series([23, 20, 25, 18], index=['TOP', 'OUN', 'DAL', 'DEN'])
temperatures

Out[2]:

TOP    23
OUN    20
DAL    25
DEN    18
dtype: int64

Now, very similar to a dictionary, we can use the index to access and modify elements.

In [3]:

temperatures['DAL']

Out[3]:

In [4]:

temperatures[['DAL', 'OUN']]

Out[4]:

DAL    25
OUN    20
dtype: int64

We can also do basic filtering, math, etc.

In [5]:

temperatures[temperatures > 20]

Out[5]:

TOP    23
DAL    25
dtype: int64

In [6]:

temperatures + 2

Out[6]:

TOP    25
OUN    22
DAL    27
DEN    20
dtype: int64

Remember how I said that series are like dictionaries? We can create a series straight from a dictionary.

In [7]:

dps = {'TOP': 14,
       'OUN': 18,
       'DEN': 9,
       'PHX': 11,
       'DAL': 23}

dewpoints = pd.Series(dps)
dewpoints

Out[7]:

TOP    14
OUN    18
DEN     9
PHX    11
DAL    23
dtype: int64

It's also easy to check and see if an index exists in a given series:

In [8]:

'PHX' in dewpoints

Out[8]:

True

In [9]:

'PHX' in temperatures

Out[9]:

False

Series have a name attribute and their index has a name attribute.

In [10]:

temperatures.name = 'temperature'
temperatures.index.name = 'station'

In [11]:

temperatures

Out[11]:

station
TOP    23
OUN    20
DAL    25
DEN    18
Name: temperature, dtype: int64

EXERCISE:

Create a series of pressures for stations TOP, OUN, DEN, and DAL (assign any values you like).
Set the series name and series index name.
Print the pressures for all stations which have a dewpoint below 15.

In [12]:

# YOUR CODE GOES HERE

SOLUTION

In [13]:

# %load solutions/make_series.py

# Cell content replaced by load magic replacement.
pressures = pd.Series([1012.1, 1010.6, 1008.8, 1011.2], index=['TOP', 'OUN', 'DEN', 'DAL'])
pressures.name = 'pressure'
pressures.index.name = 'station'
print(pressures[dewpoints < 15])

station
TOP    1012.1
DEN    1008.8
Name: pressure, dtype: float64

Top

Data Frames¶

Series are great, but what about a bunch of related series? Something like a table or a spreadsheet? Enter the data frame. A data frame can be thought of as a dictionary of data series. They have indexes for their rows and their columns. Each data series can be of a different type, but they will all share a common index.

The easiest way to create a data frame by hand is to use a dictionary.

In [14]:

data = {'station': ['TOP', 'OUN', 'DEN', 'DAL'],
        'temperature': [23, 20, 25, 18],
        'dewpoint': [14, 18, 9, 23]}

df = pd.DataFrame(data)
df

Out[14]:

	station	temperature	dewpoint
0	TOP	23	14
1	OUN	20	18
2	DEN	25	9
3	DAL	18	23

You can access columns (data series) using dictionary type notation or attribute type notation.

In [15]:

df['temperature']

Out[15]:

0    23
1    20
2    25
3    18
Name: temperature, dtype: int64

In [16]:

df.dewpoint

Out[16]:

0    14
1    18
2     9
3    23
Name: dewpoint, dtype: int64

Notice the index is shared and that the name of the column is attached as the series name.

You can also create a new column and assign values. If I only pass a scalar it is duplicated.

In [17]:

df['wspeed'] = 0.
df

Out[17]:

	station	temperature	dewpoint
0	TOP	23	14
1	OUN	20	18
2	DEN	25	9
3	DAL	18	23

Let's set the index to be the station.

In [18]:

df.index = df.station
df

Out[18]:

	station	temperature	dewpoint	wspeed
station
TOP	TOP	23	14	0.0
OUN	OUN	20	18	0.0
DEN	DEN	25	9	0.0
DAL	DAL	18	23	0.0

Well, that's close, but we now have a redundant column, so let's get rid of it.

In [19]:

df = df.drop('station', axis='columns')
df

Out[19]:

	temperature	dewpoint	wspeed
station
TOP	23	14	0.0
OUN	20	18	0.0
DEN	25	9	0.0
DAL	18	23	0.0

We can also add data and order it by providing index values. Note that the next cell contains data that's "out of order" compared to the dataframe shown above. However, by providing the index that corresponds to each value, the data is organized correctly into the dataframe.

In [20]:

df['pressure'] = pd.Series([1010,1000,998,1018], index=['DEN','TOP','DAL','OUN'])
df

Out[20]:

	temperature	dewpoint	wspeed	pressure
station
TOP	23	14	0.0	1000
OUN	20	18	0.0	1018
DEN	25	9	0.0	1010
DAL	18	23	0.0	998

Now let's get a row from the dataframe instead of a column.

In [21]:

df.loc['DEN']

Out[21]:

temperature      25.0
dewpoint          9.0
wspeed            0.0
pressure       1010.0
Name: DEN, dtype: float64

We can even transpose the data easily if we needed that do make things easier to merge/munge later.

In [22]:

df.T

Out[22]:

station	TOP	OUN	DEN	DAL
temperature	23.0	20.0	25.0	18.0
dewpoint	14.0	18.0	9.0	23.0
wspeed	0.0	0.0	0.0	0.0
pressure	1000.0	1018.0	1010.0	998.0

Look at the values attribute to access the data as a 1D or 2D array for series and data frames recpectively.

In [23]:

df.values

Out[23]:

array([[  23.,   14.,    0., 1000.],
       [  20.,   18.,    0., 1018.],
       [  25.,    9.,    0., 1010.],
       [  18.,   23.,    0.,  998.]])

In [24]:

df.temperature.values

Out[24]:

array([23, 20, 25, 18])

EXERCISE:

Add a series of rain observations to the existing data frame.
Apply an instrument correction of -2 to the dewpoint observations.

In [25]:

# YOUR CODE GOES HERE

SOLUTION

In [26]:

# %load solutions/rain_obs.py

# Cell content replaced by load magic replacement.
df['rain'] = [0, 0.4, 0.2, 0]
df.dewpoint = df.dewpoint - 2
df

Out[26]:

	temperature	dewpoint	wspeed	pressure	rain
station
TOP	23	12	0.0	1000	0.0
OUN	20	16	0.0	1018	0.4
DEN	25	7	0.0	1010	0.2
DAL	18	21	0.0	998	0.0

Top

Loading Data in Pandas¶

The real power of pandas is in manupulating and summarizing large sets of tabular data. To do that, we'll need a large set of tabular data. We've included a file in this directory called JAN17_CO_ASOS.txt that has all of the ASOS observations for several stations in Colorado for January of 2017. It's a few hundred thousand rows of data in a tab delimited format. Let's load it into Pandas.

In [27]:

import pandas as pd

In [28]:

df = pd.read_csv('Jan17_CO_ASOS.txt', sep='\t')

In [29]:

df.head()

Out[29]:

	station	valid	tmpc	dwpc	mslp
0	FNL	2017-01-01 00:00	M	M	M
1	FNL	2017-01-01 00:05	M	M	M
2	FNL	2017-01-01 00:10	M	M	M
3	LMO	2017-01-01 00:13	1.00	-7.50	M
4	FNL	2017-01-01 00:15	-3.00	-9.00	M

In [30]:

df = pd.read_csv('Jan17_CO_ASOS.txt', sep='\t', parse_dates=['valid'])

In [31]:

df.head()

Out[31]:

	station	valid	tmpc	dwpc	mslp
0	FNL	2017-01-01 00:00:00	M	M	M
1	FNL	2017-01-01 00:05:00	M	M	M
2	FNL	2017-01-01 00:10:00	M	M	M
3	LMO	2017-01-01 00:13:00	1.00	-7.50	M
4	FNL	2017-01-01 00:15:00	-3.00	-9.00	M

In [32]:

df = pd.read_csv('Jan17_CO_ASOS.txt', sep='\t', parse_dates=['valid'], na_values='M')

In [33]:

df.head()

Out[33]:

	station	valid	tmpc	dwpc	mslp
0	FNL	2017-01-01 00:00:00	NaN	NaN	NaN
1	FNL	2017-01-01 00:05:00	NaN	NaN	NaN
2	FNL	2017-01-01 00:10:00	NaN	NaN	NaN
3	LMO	2017-01-01 00:13:00	1.0	-7.5	NaN
4	FNL	2017-01-01 00:15:00	-3.0	-9.0	NaN

Let's look in detail at those column names. Turns out we need to do some cleaning of this file. Welcome to real world data analysis.

In [34]:

df.columns

Out[34]:

Index(['station', 'valid', ' tmpc ', '  dwpc ', '  mslp'], dtype='object')

In [35]:

df.columns = ['station', 'time', 'temperature', 'dewpoint', 'pressure']

In [36]:

df.head()

Out[36]:

	station	time	temperature	dewpoint	pressure
0	FNL	2017-01-01 00:00:00	NaN	NaN	NaN
1	FNL	2017-01-01 00:05:00	NaN	NaN	NaN
2	FNL	2017-01-01 00:10:00	NaN	NaN	NaN
3	LMO	2017-01-01 00:13:00	1.0	-7.5	NaN
4	FNL	2017-01-01 00:15:00	-3.0	-9.0	NaN

For other formats of data CSV, fixed width, etc. that are tools to read it as well. You can even read excel files straight into Pandas.

Top

Missing Data¶

We've already dealt with some missing data by turning the 'M' string into actual NaN's while reading the file in. We can do one better though and delete any rows that have all values missing. There are similar operations that could be performed for columns. You can even drop if any values are missing, all are missing, or just those you specify are missing.

In [37]:

len(df)

Out[37]:

In [38]:

df = df.dropna(axis='rows', how='all', subset=['temperature', 'dewpoint', 'pressure'])

In [39]:

len(df)

Out[39]:

In [40]:

df.head()

Out[40]:

	station	time	temperature	dewpoint	pressure
3	LMO	2017-01-01 00:13:00	1.00	-7.50	NaN
4	FNL	2017-01-01 00:15:00	-3.00	-9.00	NaN
5	1V6	2017-01-01 00:15:00	0.00	-9.00	NaN
7	0CO	2017-01-01 00:23:00	-12.00	-18.00	NaN
10	LMO	2017-01-01 00:34:00	-0.22	-8.22	NaN

EXERCISE: Our dataframe df has data in which we dropped any entries that were missing all of the temperature, dewpoint and pressure observations. Let's modify our command some and create a new dataframe df2 that only keeps observations that have all three variables (i.e. if a pressure is missing, the whole entry is dropped). This is useful if you were doing some computation that requires a complete observation to work.

In [41]:

# YOUR CODE GOES HERE
# df2 =

SOLUTION

In [42]:

# %load solutions/drop_obs.py

# Cell content replaced by load magic replacement.
df2 = df.dropna(how='any')
df2

Out[42]:

	station	time	temperature	dewpoint	pressure
10422	FNL	2017-01-24 07:15:00	-1.00	-2.00	929.5
10427	FNL	2017-01-24 07:35:00	-1.00	-2.00	929.5
10434	FNL	2017-01-24 07:55:00	-2.00	-3.00	929.5
10435	FNL	2017-01-24 07:56:00	-2.22	-2.78	998.7
10440	FNL	2017-01-24 08:15:00	-3.00	-4.00	929.5
...	...	...	...	...	...
169573	FNL	2017-12-30 19:56:00	-10.00	-12.22	1018.7
169594	FNL	2017-12-30 20:56:00	-10.00	-12.78	1017.1
169615	FNL	2017-12-30 21:56:00	-8.28	-12.22	1016.7
169637	FNL	2017-12-30 22:56:00	-8.28	-11.72	1015.9
169657	FNL	2017-12-30 23:56:00	-8.89	-12.22	1016.2

7953 rows × 5 columns

Lastly, we still have the original index values. Let's reindex to a new zero-based index for only the rows that have valid data in them.

In [43]:

df.reset_index(drop=True)

Out[43]:

	station	time	temperature	dewpoint	pressure
0	LMO	2017-01-01 00:13:00	1.00	-7.50	NaN
1	FNL	2017-01-01 00:15:00	-3.00	-9.00	NaN
2	1V6	2017-01-01 00:15:00	0.00	-9.00	NaN
3	0CO	2017-01-01 00:23:00	-12.00	-18.00	NaN
4	LMO	2017-01-01 00:34:00	-0.22	-8.22	NaN
...	...	...	...	...	...
72545	LMO	2017-12-30 23:35:00	-7.00	-11.28	NaN
72546	1V6	2017-12-30 23:50:00	-5.00	-10.00	NaN
72547	0CO	2017-12-30 23:54:00	-4.00	-10.00	NaN
72548	LMO	2017-12-30 23:55:00	-7.00	-11.00	NaN
72549	FNL	2017-12-30 23:56:00	-8.89	-12.22	1016.2

72550 rows × 5 columns

In [44]:

df.head()

Out[44]:

	station	time	temperature	dewpoint	pressure
3	LMO	2017-01-01 00:13:00	1.00	-7.50	NaN
4	FNL	2017-01-01 00:15:00	-3.00	-9.00	NaN
5	1V6	2017-01-01 00:15:00	0.00	-9.00	NaN
7	0CO	2017-01-01 00:23:00	-12.00	-18.00	NaN
10	LMO	2017-01-01 00:34:00	-0.22	-8.22	NaN

Top

Manipulating Data¶

We can now take our data and do some intersting things with it. Let's start with a simple min/max.

In [45]:

print(f'Min: {df.temperature.min()}\nMax: {df.temperature.max()}')

Min: -28.72
Max: 39.0

You can also do some useful statistics on data with attached methods like corr for correlation coefficient.

In [46]:

df.temperature.corr(df.dewpoint)

Out[46]:

0.7453312035648769

We can also call a groupby on the data frame to start getting some summary information for each station.

In [47]:

df.groupby('station').mean()

Out[47]:

	temperature	dewpoint	pressure
station
0CO	-1.926889	-7.491375	NaN
1V6	7.574364	-4.872335	NaN
FNL	8.656791	-0.228522	1014.852848
LMO	12.006185	0.387110	NaN

EXERCISE: Calculate the min, max, and standard deviation of the temperature field grouped by each station.

In [48]:

# Calculate min

In [49]:

# Calculate max

In [50]:

# Calculate standard deviation

SOLUTION

In [51]:

# %load solutions/calc_stats.py

# Cell content replaced by load magic replacement.
print(df.groupby('station').temperature.min())
print(df.groupby('station').temperature.max())
print(df.groupby('station').temperature.std())

station
0CO   -26.00
1V6   -18.00
FNL   -27.00
LMO   -28.72
Name: temperature, dtype: float64
station
0CO    15.00
1V6    30.00
FNL    37.22
LMO    39.00
Name: temperature, dtype: float64
station
0CO     7.701259
1V6     8.413450
FNL    11.413245
LMO    10.716778
Name: temperature, dtype: float64

Now, let me show you how to do all of that and more in a single call.

In [52]:

df.groupby('station').describe()

Out[52]:

	temperature								dewpoint					pressure
	count	mean	std	min	25%	50%	75%	max	count	mean	...	75%	max	count	mean	std	min	25%	50%	75%	max
station
0CO	25044.0	-1.926889	7.701259	-26.00	-7.00	-2.00	5.00	15.00	25044.0	-7.491375	...	-1.0	11.00	0.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1V6	11632.0	7.574364	8.413450	-18.00	1.00	7.00	13.00	30.00	11632.0	-4.872335	...	0.0	12.00	0.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN
FNL	11504.0	8.656791	11.413245	-27.00	0.61	8.28	16.72	37.22	11500.0	-0.228522	...	7.0	18.28	7953.0	1014.852848	8.87871	929.5	1010.3	1015.4	1020.0	1034.7
LMO	24370.0	12.006185	10.716778	-28.72	4.22	12.50	19.50	39.00	24370.0	0.387110	...	7.0	19.11	0.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN

4 rows × 24 columns

Now let's suppose we're going to make a meteogram or similar and want to get all of the data for a single station.

In [53]:

df.groupby('station').get_group('0CO').head().reset_index(drop=True)

Out[53]:

	station	time	temperature	dewpoint	pressure
0	0CO	2017-01-01 00:23:00	-12.0	-18.0	NaN
1	0CO	2017-01-01 00:43:00	-12.0	-18.0	NaN
2	0CO	2017-01-01 01:03:00	-12.0	-18.0	NaN
3	0CO	2017-01-01 01:23:00	-12.0	-18.0	NaN
4	0CO	2017-01-01 02:03:00	-12.0	-19.0	NaN

EXERCISE:

Round the temperature column to whole degrees.
Group the observations by temperature and use the count method to see how many instances of the rounded temperatures there are in the dataset.

In [54]:

# YOUR CODE GOES HERE

SOLUTION

In [55]:

# %load solutions/temperature_count.py

# Cell content replaced by load magic replacement.
df.temperature = df.temperature.round()
df.groupby('temperature').count()

Out[55]:

	station	time	dewpoint	pressure
temperature
-29.0	1	1	1	0
-28.0	4	4	4	0
-27.0	5	5	4	0
-26.0	15	15	15	0
-25.0	31	31	31	0
...	...	...	...	...
35.0	103	103	103	11
36.0	75	75	75	9
37.0	40	40	40	6
38.0	7	7	7	0
39.0	4	4	4	0

69 rows × 4 columns

Top