{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "

Introduction to Pandas

\n", "

Unidata Python Workshop

\n", "\n", "
\n", "
\n", "\n", "
\n", "\n", "### Questions\n", "1. What is Pandas?\n", "1. What are the basic Pandas data structures?\n", "1. How can I read data into Pandas?\n", "1. What are some of the data operations available in Pandas?\n", "\n", "### Objectives\n", "1. Data Series\n", "1. Data Frames\n", "1. Loading Data in Pandas\n", "1. Missing Data\n", "1. Manipulating Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Data Series\n", "Data series are one of the fundamental data structures in Pandas. You can think of them like a dictionary; they have a key (index) and value (data/values) like a dictionary, but also have some handy functionality attached to them.\n", "\n", "To start out, let's create a series from scratch. We'll imagine these are temperature observations." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 23\n", "1 20\n", "2 25\n", "3 18\n", "dtype: int64" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "temperatures = pd.Series([23, 20, 25, 18])\n", "temperatures" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The values on the left are the index (zero based integers by default) and on the right are the values. Notice that the data type is an integer. Any NumPy datatype is acceptable in a series.\n", "\n", "That's great, but it'd be more useful if the station were associated with those values. In fact you could say we want the values *indexed* by station name." 
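, "\n", "
As a quick aside on the data type remark above, here is a minimal sketch of how a series picks its `dtype` and how one can be requested explicitly. The names `temps_int` and `temps_float` are only illustrative and are not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Integers are stored with an integer dtype by default.
temps_int = pd.Series([23, 20, 25, 18])

# Passing dtype asks pandas to store the same values as 32-bit floats.
temps_float = pd.Series([23, 20, 25, 18], dtype=np.float32)

print(temps_int.dtype)
print(temps_float.dtype)
```

Choosing the dtype up front can matter for memory use on large data sets, but for small examples like ours the default is usually fine.
"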
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "TOP 23\n", "OUN 20\n", "DAL 25\n", "DEN 18\n", "dtype: int64" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "temperatures = pd.Series([23, 20, 25, 18], index=['TOP', 'OUN', 'DAL', 'DEN'])\n", "temperatures" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, very similar to a dictionary, we can use the index to access and modify elements." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "25" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "temperatures['DAL']" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DAL 25\n", "OUN 20\n", "dtype: int64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "temperatures[['DAL', 'OUN']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also do basic filtering, math, etc." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "TOP 23\n", "DAL 25\n", "dtype: int64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "temperatures[temperatures > 20]" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "TOP 25\n", "OUN 22\n", "DAL 27\n", "DEN 20\n", "dtype: int64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "temperatures + 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remember how I said that series are like dictionaries? We can create a series straight from a dictionary." 
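, "\n", "
Before we do that, the dictionary analogy can be sketched with a few dictionary-style methods that series provide. This is only an illustration: the series `demo` is a throwaway copy of the temperature values, so the `temperatures` series used in the rest of the notebook is left untouched.

```python
import pandas as pd

# Throwaway series for illustration only.
demo = pd.Series([23, 20, 25, 18], index=['TOP', 'OUN', 'DAL', 'DEN'])

# keys() returns the index, much like the keys of a dictionary.
print(list(demo.keys()))

# get() returns a default instead of raising KeyError for a missing label.
missing = demo.get('PHX', default=None)
print(missing)

# items() iterates over (label, value) pairs.
for station, temp in demo.items():
    print(station, temp)

# Assigning to a new label adds an entry, as it would in a dictionary.
demo['PHX'] = 30
```

Unlike a plain dictionary, the series keeps its values in a typed array, which is what makes the filtering and math shown above possible.
"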
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "TOP 14\n", "OUN 18\n", "DEN 9\n", "PHX 11\n", "DAL 23\n", "dtype: int64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dps = {'TOP': 14,\n", " 'OUN': 18,\n", " 'DEN': 9,\n", " 'PHX': 11,\n", " 'DAL': 23}\n", "\n", "dewpoints = pd.Series(dps)\n", "dewpoints" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's also easy to check and see if an index exists in a given series:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'PHX' in dewpoints" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'PHX' in temperatures" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Series have a name attribute and their index has a name attribute." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "temperatures.name = 'temperature'\n", "temperatures.index.name = 'station'" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "station\n", "TOP 23\n", "OUN 20\n", "DAL 25\n", "DEN 18\n", "Name: temperature, dtype: int64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "temperatures" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " EXERCISE:\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# YOUR CODE GOES HERE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " SOLUTION\n", "
" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "station\n", "TOP 1012.1\n", "DEN 1008.8\n", "Name: pressure, dtype: float64\n" ] } ], "source": [ "# %load solutions/make_series.py\n", "\n", "# Cell content replaced by load magic replacement.\n", "pressures = pd.Series([1012.1, 1010.6, 1008.8, 1011.2], index=['TOP', 'OUN', 'DEN', 'DAL'])\n", "pressures.name = 'pressure'\n", "pressures.index.name = 'station'\n", "print(pressures[dewpoints < 15])\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Top\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Data Frames\n", "Series are great, but what about a bunch of related series? Something like a table or a spreadsheet? Enter the data frame. A data frame can be thought of as a dictionary of data series. They have indexes for their rows and their columns. Each data series can be of a different type, but they will all share a common index.\n", "\n", "The easiest way to create a data frame by hand is to use a dictionary." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " station temperature dewpoint\n", "0 TOP 23 14\n", "1 OUN 20 18\n", "2 DEN 25 9\n", "3 DAL 18 23" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = {'station': ['TOP', 'OUN', 'DEN', 'DAL'],\n", " 'temperature': [23, 20, 25, 18],\n", " 'dewpoint': [14, 18, 9, 23]}\n", "\n", "df = pd.DataFrame(data)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can access columns (data series) using dictionary type notation or attribute type notation." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 23\n", "1 20\n", "2 25\n", "3 18\n", "Name: temperature, dtype: int64" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['temperature']" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 14\n", "1 18\n", "2 9\n", "3 23\n", "Name: dewpoint, dtype: int64" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dewpoint" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice the index is shared and that the name of the column is attached as the series name.\n", "\n", "You can also create a new column and assign values. If I only pass a scalar it is duplicated." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " station temperature dewpoint wspeed\n", "0 TOP 23 14 0.0\n", "1 OUN 20 18 0.0\n", "2 DEN 25 9 0.0\n", "3 DAL 18 23 0.0" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['wspeed'] = 0.\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's set the index to be the station." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " station temperature dewpoint wspeed\n", "station \n", "TOP TOP 23 14 0.0\n", "OUN OUN 20 18 0.0\n", "DEN DEN 25 9 0.0\n", "DAL DAL 18 23 0.0" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.index = df.station\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Well, that's close, but we now have a redundant column, so let's get rid of it." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " temperature dewpoint wspeed\n", "station \n", "TOP 23 14 0.0\n", "OUN 20 18 0.0\n", "DEN 25 9 0.0\n", "DAL 18 23 0.0" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = df.drop('station', axis='columns')\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also add data and order it by providing index values. Note that the next cell contains data that's \"out of order\" compared to the dataframe shown above. However, by providing the index that corresponds to each value, the data is organized correctly into the dataframe." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " temperature dewpoint wspeed pressure\n", "station \n", "TOP 23 14 0.0 1000\n", "OUN 20 18 0.0 1018\n", "DEN 25 9 0.0 1010\n", "DAL 18 23 0.0 998" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['pressure'] = pd.Series([1010,1000,998,1018], index=['DEN','TOP','DAL','OUN'])\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's get a row from the dataframe instead of a column, using `loc` to select by index label." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "temperature 25.0\n", "dewpoint 9.0\n", "wspeed 0.0\n", "pressure 1010.0\n", "Name: DEN, dtype: float64" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc['DEN']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can even transpose the data easily if we need to, which can make things easier to merge/munge later." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "station TOP OUN DEN DAL\n", "temperature 23.0 20.0 25.0 18.0\n", "dewpoint 14.0 18.0 9.0 23.0\n", "wspeed 0.0 0.0 0.0 0.0\n", "pressure 1000.0 1018.0 1010.0 998.0" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Look at the `values` attribute to access the data as a 1D or 2D NumPy array for series and data frames, respectively." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 23., 14., 0., 1000.],\n", " [ 20., 18., 0., 1018.],\n", " [ 25., 9., 0., 1010.],\n", " [ 18., 23., 0., 998.]])" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.values" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([23, 20, 25, 18])" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.temperature.values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " EXERCISE:\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "# YOUR CODE GOES HERE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " SOLUTION\n", "
" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " temperature dewpoint wspeed pressure rain\n", "station \n", "TOP 23 12 0.0 1000 0.0\n", "OUN 20 16 0.0 1018 0.4\n", "DEN 25 7 0.0 1010 0.2\n", "DAL 18 21 0.0 998 0.0" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# %load solutions/rain_obs.py\n", "\n", "# Cell content replaced by load magic replacement.\n", "df['rain'] = [0, 0.4, 0.2, 0]\n", "df.dewpoint = df.dewpoint - 2\n", "df\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Top\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Loading Data in Pandas\n", "The real power of Pandas is in manipulating and summarizing large sets of tabular data. To do that, we'll need a large set of tabular data. We've included a file in this directory called `Jan17_CO_ASOS.txt` that has all of the ASOS observations for several stations in Colorado for January of 2017. It's roughly 170,000 rows of data in a tab-delimited format. Let's load it into Pandas." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('Jan17_CO_ASOS.txt', sep='\\t')" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " station valid tmpc dwpc mslp\n", "0 FNL 2017-01-01 00:00 M M M\n", "1 FNL 2017-01-01 00:05 M M M\n", "2 FNL 2017-01-01 00:10 M M M\n", "3 LMO 2017-01-01 00:13 1.00 -7.50 M\n", "4 FNL 2017-01-01 00:15 -3.00 -9.00 M" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('Jan17_CO_ASOS.txt', sep='\\t', parse_dates=['valid'])" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " station valid tmpc dwpc mslp\n", "0 FNL 2017-01-01 00:00:00 M M M\n", "1 FNL 2017-01-01 00:05:00 M M M\n", "2 FNL 2017-01-01 00:10:00 M M M\n", "3 LMO 2017-01-01 00:13:00 1.00 -7.50 M\n", "4 FNL 2017-01-01 00:15:00 -3.00 -9.00 M" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('Jan17_CO_ASOS.txt', sep='\\t', parse_dates=['valid'], na_values='M')" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " station valid tmpc dwpc mslp\n", "0 FNL 2017-01-01 00:00:00 NaN NaN NaN\n", "1 FNL 2017-01-01 00:05:00 NaN NaN NaN\n", "2 FNL 2017-01-01 00:10:00 NaN NaN NaN\n", "3 LMO 2017-01-01 00:13:00 1.0 -7.5 NaN\n", "4 FNL 2017-01-01 00:15:00 -3.0 -9.0 NaN" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look in detail at those column names. Turns out we need to do some cleaning of this file. Welcome to real world data analysis." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['station', 'valid', ' tmpc ', ' dwpc ', ' mslp'], dtype='object')" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.columns" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "df.columns = ['station', 'time', 'temperature', 'dewpoint', 'pressure']" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " station time temperature dewpoint pressure\n", "0 FNL 2017-01-01 00:00:00 NaN NaN NaN\n", "1 FNL 2017-01-01 00:05:00 NaN NaN NaN\n", "2 FNL 2017-01-01 00:10:00 NaN NaN NaN\n", "3 LMO 2017-01-01 00:13:00 1.0 -7.5 NaN\n", "4 FNL 2017-01-01 00:15:00 -3.0 -9.0 NaN" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For other data formats (CSV, fixed width, etc.) there are tools to read them as well. You can even read Excel files straight into Pandas." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Top\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Missing Data\n", "We've already dealt with some missing data by turning the 'M' string into actual NaN's while reading the file in. We can do one better though and delete any rows that have all values missing. There are similar operations that could be performed for columns. You can even drop if any values are missing, all are missing, or just those you specify are missing." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "169658" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df)" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "df = df.dropna(axis='rows', how='all', subset=['temperature', 'dewpoint', 'pressure'])" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "72550" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df)" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " station time temperature dewpoint pressure\n", "3 LMO 2017-01-01 00:13:00 1.00 -7.50 NaN\n", "4 FNL 2017-01-01 00:15:00 -3.00 -9.00 NaN\n", "5 1V6 2017-01-01 00:15:00 0.00 -9.00 NaN\n", "7 0CO 2017-01-01 00:23:00 -12.00 -18.00 NaN\n", "10 LMO 2017-01-01 00:34:00 -0.22 -8.22 NaN" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " EXERCISE:\n", "\n", "Our dataframe df has data in which we dropped any entries that were missing all of the temperature, dewpoint and pressure observations. Let's modify our command some and create a new dataframe df2 that only keeps observations that have all three variables (i.e. if a pressure is missing, the whole entry is dropped). This is useful if you were doing some computation that requires a complete observation to work.\n", "
\n" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "# YOUR CODE GOES HERE\n", "# df2 = " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " SOLUTION\n", "
" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " station time temperature dewpoint pressure\n", "10422 FNL 2017-01-24 07:15:00 -1.00 -2.00 929.5\n", "10427 FNL 2017-01-24 07:35:00 -1.00 -2.00 929.5\n", "10434 FNL 2017-01-24 07:55:00 -2.00 -3.00 929.5\n", "10435 FNL 2017-01-24 07:56:00 -2.22 -2.78 998.7\n", "10440 FNL 2017-01-24 08:15:00 -3.00 -4.00 929.5\n", "... ... ... ... ... ...\n", "169573 FNL 2017-12-30 19:56:00 -10.00 -12.22 1018.7\n", "169594 FNL 2017-12-30 20:56:00 -10.00 -12.78 1017.1\n", "169615 FNL 2017-12-30 21:56:00 -8.28 -12.22 1016.7\n", "169637 FNL 2017-12-30 22:56:00 -8.28 -11.72 1015.9\n", "169657 FNL 2017-12-30 23:56:00 -8.89 -12.22 1016.2\n", "\n", "[7953 rows x 5 columns]" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# %load solutions/drop_obs.py\n", "\n", "# Cell content replaced by load magic replacement.\n", "df2 = df.dropna(how='any')\n", "df2\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lastly, we still have the original index values. Let's reindex to a new zero-based index for only the rows that have valid data in them. Note that `reset_index` returns a new data frame rather than changing `df` in place, so `df` keeps its original index unless the result is assigned back, e.g. `df = df.reset_index(drop=True)`." ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " station time temperature dewpoint pressure\n", "0 LMO 2017-01-01 00:13:00 1.00 -7.50 NaN\n", "1 FNL 2017-01-01 00:15:00 -3.00 -9.00 NaN\n", "2 1V6 2017-01-01 00:15:00 0.00 -9.00 NaN\n", "3 0CO 2017-01-01 00:23:00 -12.00 -18.00 NaN\n", "4 LMO 2017-01-01 00:34:00 -0.22 -8.22 NaN\n", "... ... ... ... ... ...\n", "72545 LMO 2017-12-30 23:35:00 -7.00 -11.28 NaN\n", "72546 1V6 2017-12-30 23:50:00 -5.00 -10.00 NaN\n", "72547 0CO 2017-12-30 23:54:00 -4.00 -10.00 NaN\n", "72548 LMO 2017-12-30 23:55:00 -7.00 -11.00 NaN\n", "72549 FNL 2017-12-30 23:56:00 -8.89 -12.22 1016.2\n", "\n", "[72550 rows x 5 columns]" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.reset_index(drop=True)" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " station time temperature dewpoint pressure\n", "3 LMO 2017-01-01 00:13:00 1.00 -7.50 NaN\n", "4 FNL 2017-01-01 00:15:00 -3.00 -9.00 NaN\n", "5 1V6 2017-01-01 00:15:00 0.00 -9.00 NaN\n", "7 0CO 2017-01-01 00:23:00 -12.00 -18.00 NaN\n", "10 LMO 2017-01-01 00:34:00 -0.22 -8.22 NaN" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Top\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Manipulating Data\n", "We can now take our data and do some interesting things with it. Let's start with a simple min/max." ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Min: -28.72\n", "Max: 39.0\n" ] } ], "source": [ "print(f'Min: {df.temperature.min()}\\nMax: {df.temperature.max()}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also compute some useful statistics with attached methods, like `corr` for the correlation coefficient between two series." ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.7453312035648769" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.temperature.corr(df.dewpoint)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also call a `groupby` on the data frame to start getting some summary information for each station." ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " temperature dewpoint pressure\n", "station \n", "0CO -1.926889 -7.491375 NaN\n", "1V6 7.574364 -4.872335 NaN\n", "FNL 8.656791 -0.228522 1014.852848\n", "LMO 12.006185 0.387110 NaN" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby('station').mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " EXERCISE:\n", "\n", "Calculate the min, max, and standard deviation of the temperature field grouped by each station.\n", "
" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "# Calculate min\n" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "# Calculate max\n" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [], "source": [ "# Calculate standard deviation\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " SOLUTION\n", "
" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "station\n", "0CO -26.00\n", "1V6 -18.00\n", "FNL -27.00\n", "LMO -28.72\n", "Name: temperature, dtype: float64\n", "station\n", "0CO 15.00\n", "1V6 30.00\n", "FNL 37.22\n", "LMO 39.00\n", "Name: temperature, dtype: float64\n", "station\n", "0CO 7.701259\n", "1V6 8.413450\n", "FNL 11.413245\n", "LMO 10.716778\n", "Name: temperature, dtype: float64\n" ] } ], "source": [ "# %load solutions/calc_stats.py\n", "\n", "# Cell content replaced by load magic replacement.\n", "print(df.groupby('station').temperature.min())\n", "print(df.groupby('station').temperature.max())\n", "print(df.groupby('station').temperature.std())\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let me show you how to do all of that and more in a single call." ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " temperature \\\n", " count mean std min 25% 50% 75% max \n", "station \n", "0CO 25044.0 -1.926889 7.701259 -26.00 -7.00 -2.00 5.00 15.00 \n", "1V6 11632.0 7.574364 8.413450 -18.00 1.00 7.00 13.00 30.00 \n", "FNL 11504.0 8.656791 11.413245 -27.00 0.61 8.28 16.72 37.22 \n", "LMO 24370.0 12.006185 10.716778 -28.72 4.22 12.50 19.50 39.00 \n", "\n", " dewpoint ... pressure \\\n", " count mean ... 75% max count mean std \n", "station ... \n", "0CO 25044.0 -7.491375 ... -1.0 11.00 0.0 NaN NaN \n", "1V6 11632.0 -4.872335 ... 0.0 12.00 0.0 NaN NaN \n", "FNL 11500.0 -0.228522 ... 7.0 18.28 7953.0 1014.852848 8.87871 \n", "LMO 24370.0 0.387110 ... 7.0 19.11 0.0 NaN NaN \n", "\n", " \n", " min 25% 50% 75% max \n", "station \n", "0CO NaN NaN NaN NaN NaN \n", "1V6 NaN NaN NaN NaN NaN \n", "FNL 929.5 1010.3 1015.4 1020.0 1034.7 \n", "LMO NaN NaN NaN NaN NaN \n", "\n", "[4 rows x 24 columns]" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby('station').describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's suppose we're going to make a meteogram or similar and want to get all of the data for a single station." ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " station time temperature dewpoint pressure\n", "0 0CO 2017-01-01 00:23:00 -12.0 -18.0 NaN\n", "1 0CO 2017-01-01 00:43:00 -12.0 -18.0 NaN\n", "2 0CO 2017-01-01 01:03:00 -12.0 -18.0 NaN\n", "3 0CO 2017-01-01 01:23:00 -12.0 -18.0 NaN\n", "4 0CO 2017-01-01 02:03:00 -12.0 -19.0 NaN" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby('station').get_group('0CO').head().reset_index(drop=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
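\n", " EXERCISE:\n", " Round the temperature values to whole degrees, then group the data by temperature and count the observations at each value.\n", "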
" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "# YOUR CODE GOES HERE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " SOLUTION\n", "
" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " station time dewpoint pressure\n", "temperature \n", "-29.0 1 1 1 0\n", "-28.0 4 4 4 0\n", "-27.0 5 5 4 0\n", "-26.0 15 15 15 0\n", "-25.0 31 31 31 0\n", "... ... ... ... ...\n", " 35.0 103 103 103 11\n", " 36.0 75 75 75 9\n", " 37.0 40 40 40 6\n", " 38.0 7 7 7 0\n", " 39.0 4 4 4 0\n", "\n", "[69 rows x 4 columns]" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# %load solutions/temperature_count.py\n", "\n", "# Cell content replaced by load magic replacement.\n", "df.temperature = df.temperature.round()\n", "df.groupby('temperature').count()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Top\n", "
" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.8" } }, "nbformat": 4, "nbformat_minor": 4 }