Pandas Introduction
Data Series
Series are one of the fundamental data structures in pandas. You can think of them like a dictionary: they have keys (the index) and values (the data), but they also have some handy functionality attached to them.
To start out, let's create a series from scratch. We'll imagine these are temperature observations.
import pandas as pd
temperatures = pd.Series([23, 20, 25, 18])
temperatures
The values on the left are the index (zero-based integers by default) and on the right are the values. Notice that the data type is an integer. Any NumPy data type is acceptable in a series.
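As a quick sketch of that last point (the temps_f name here is just for illustration), you can request a specific NumPy dtype when constructing the series:

```python
import pandas as pd
import numpy as np

# Any NumPy dtype is fair game; here we request 32-bit floats explicitly
temps_f = pd.Series([23, 20, 25, 18], dtype=np.float32)
print(temps_f.dtype)
```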
That's great, but it'd be more useful if the stations were associated with those values. In fact, you could say we want the values indexed by station name.
temperatures = pd.Series([23, 20, 25, 18], index=['TOP', 'OUN', 'DAL', 'DEN'])
temperatures
Now, very similar to a dictionary, we can use the index to access and modify elements.
temperatures['DAL']
temperatures[['DAL', 'OUN']]
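Assignment through the index also works, again much like a dictionary; a minimal sketch using the same series as above:

```python
import pandas as pd

temperatures = pd.Series([23, 20, 25, 18], index=['TOP', 'OUN', 'DAL', 'DEN'])

# Assigning to an existing label overwrites that element in place
temperatures['DAL'] = 26

# Assigning to a new label appends an element, just like a dict
temperatures['PHX'] = 30
print(temperatures)
```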
We can also do basic filtering, math, etc.
temperatures[temperatures > 20]
temperatures + 2
Remember how I said that series are like dictionaries? We can create a series straight from a dictionary.
dps = {'TOP': 14,
'OUN': 18,
'DEN': 9,
'PHX': 11,
'DAL': 23}
dewpoints = pd.Series(dps)
dewpoints
It's also easy to check and see if an index exists in a given series:
'PHX' in dewpoints
'PHX' in temperatures
Series have a name attribute and their index has a name attribute.
temperatures.name = 'temperature'
temperatures.index.name = 'station'
temperatures
- Create a series of pressures for stations TOP, OUN, DEN, and DAL (assign any values you like).
- Set the series name and series index name.
- Print the pressures for all stations which have a dewpoint below 15.
# YOUR CODE GOES HERE
# %load solutions/make_series.py
# Cell content replaced by load magic replacement.
pressures = pd.Series([1012.1, 1010.6, 1008.8, 1011.2], index=['TOP', 'OUN', 'DEN', 'DAL'])
pressures.name = 'pressure'
pressures.index.name = 'station'
print(pressures[dewpoints < 15])
Data Frames
Series are great, but what about a bunch of related series? Something like a table or a spreadsheet? Enter the data frame. A data frame can be thought of as a dictionary of data series. They have indexes for their rows and their columns. Each data series can be of a different type, but they will all share a common index.
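That "dictionary of data series" description can be sketched directly (the t, d, and obs names are just for illustration): the indexes are aligned automatically, with NaN filling any gaps.

```python
import pandas as pd

# Two series with partially overlapping indexes
t = pd.Series([23, 20], index=['TOP', 'OUN'])
d = pd.Series([18, 9], index=['OUN', 'DEN'])

# The frame's index is the union of the series' indexes;
# combinations missing from one series are filled with NaN
obs = pd.DataFrame({'temperature': t, 'dewpoint': d})
print(obs)
```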
The easiest way to create a data frame by hand is to use a dictionary.
data = {'station': ['TOP', 'OUN', 'DEN', 'DAL'],
'temperature': [23, 20, 25, 18],
'dewpoint': [14, 18, 9, 23]}
df = pd.DataFrame(data)
df
You can access columns (data series) using dictionary-style or attribute-style notation.
df['temperature']
df.dewpoint
Notice the index is shared and that the name of the column is attached as the series name.
You can also create a new column and assign values. If we pass only a scalar, it is broadcast to every row.
df['wspeed'] = 0.
df
Let's set the index to be the station.
df.index = df.station
df
Well, that's close, but we now have a redundant column, so let's get rid of it.
df = df.drop('station', axis='columns')
df
We can also add data and order it by providing index values. Note that the next cell contains data that's "out of order" compared to the dataframe shown above. However, by providing the index that corresponds to each value, the data is organized correctly into the dataframe.
df['pressure'] = pd.Series([1010,1000,998,1018], index=['DEN','TOP','DAL','OUN'])
df
Now let's get a row from the dataframe instead of a column.
df.loc['DEN']
We can even transpose the data easily if we need that to make things easier to merge/munge later.
df.T
Look at the values attribute to access the data as a 1D or 2D array for series and data frames, respectively.
df.values
df.temperature.values
- Add a series of rain observations to the existing data frame.
- Apply an instrument correction of -2 to the dewpoint observations.
# YOUR CODE GOES HERE
# %load solutions/rain_obs.py
df['rain'] = [0, 0.4, 0.2, 0]
df.dewpoint = df.dewpoint - 2
df
Loading Data in Pandas
The real power of pandas is in manipulating and summarizing large sets of tabular data. To do that, we'll need a large set of tabular data. We've included a file in this directory called Jan17_CO_ASOS.txt
that has all of the ASOS observations for several stations in Colorado for January of 2017. It's a few hundred thousand rows of data in a tab-delimited format. Let's load it into pandas.
import pandas as pd
df = pd.read_csv('Jan17_CO_ASOS.txt', sep='\t')
df.head()
Notice that the valid column (the observation time) was read in as plain strings. Passing parse_dates tells pandas to convert it to actual datetime objects.
df = pd.read_csv('Jan17_CO_ASOS.txt', sep='\t', parse_dates=['valid'])
df.head()
Missing observations in this file are marked with the string 'M'. Telling read_csv about that with na_values turns them into real NaNs.
df = pd.read_csv('Jan17_CO_ASOS.txt', sep='\t', parse_dates=['valid'], na_values='M')
df.head()
Let's look in detail at those column names. Turns out we need to do some cleaning of this file. Welcome to real world data analysis.
df.columns
df.columns = ['station', 'time', 'temperature', 'dewpoint', 'pressure']
df.head()
For other data formats (CSV, fixed width, etc.) there are tools to read them as well. You can even read Excel files straight into pandas.
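For instance, a plain comma-separated file needs no sep argument at all, since commas are read_csv's default (io.StringIO stands in for a file on disk in this sketch):

```python
import io
import pandas as pd

# read_csv defaults to comma-separated data, so only the file
# (or file-like object) needs to be passed
csv_data = io.StringIO("station,temperature\nTOP,23\nOUN,20")
df_csv = pd.read_csv(csv_data)
print(df_csv)
```

Similar readers such as pd.read_fwf (fixed-width files) and pd.read_excel follow the same pattern.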
Missing Data
We've already dealt with some missing data by turning the 'M' strings into actual NaNs while reading the file in. We can do one better, though, and delete any rows that have all values missing. There are similar operations that could be performed for columns. You can even drop if any values are missing, all are missing, or just those you specify are missing.
len(df)
df = df.dropna(axis='rows', how='all', subset=['temperature', 'dewpoint', 'pressure'])
len(df)
df.head()
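The column-wise variant mentioned above works the same way; here is a sketch on a small made-up frame (the toy name and values are hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'a': [1, 2],
                    'b': [np.nan, np.nan],
                    'c': [3, np.nan]})

# how='all' drops only columns that are entirely missing:
# 'b' goes away, while the partially missing 'c' survives
print(toy.dropna(axis='columns', how='all'))
```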
# YOUR CODE GOES HERE
# df2 =
# %load solutions/drop_obs.py
df2 = df.dropna(how='any')
df2
Lastly, we still have the original index values. Let's reset to a new zero-based index for only the rows that have valid data in them. Note that we assign the result back to df, since reset_index returns a new data frame rather than modifying in place.
df = df.reset_index(drop=True)
df.head()
Manipulating Data
We can now take our data and do some interesting things with it. Let's start with a simple min/max.
print(f'Min: {df.temperature.min()}\nMax: {df.temperature.max()}')
You can also compute some useful statistics on the data with attached methods, like corr for the correlation coefficient.
df.temperature.corr(df.dewpoint)
We can also call groupby on the data frame to start getting some summary information for each station.
df.groupby('station').mean()
# Calculate min
# Calculate max
# Calculate standard deviation
# %load solutions/calc_stats.py
print(df.groupby('station').temperature.min())
print(df.groupby('station').temperature.max())
print(df.groupby('station').temperature.std())
Now, let me show you how to do all of that and more in a single call.
df.groupby('station').describe()
Now let's suppose we're going to make a meteogram or similar and want to get all of the data for a single station.
df.groupby('station').get_group('0CO').head().reset_index(drop=True)
- Round the temperature column to whole degrees.
- Group the observations by temperature and use the count method to see how many instances of the rounded temperatures there are in the dataset.
# YOUR CODE GOES HERE
# %load solutions/temperature_count.py
df.temperature = df.temperature.round()
df.groupby('temperature').count()