# XArray Introduction

## XArray Introduction

#### Questions¶

1. What is XArray?
2. How does XArray fit in with Numpy and Pandas?

#### Objectives¶

1. Create a DataArray.
2. Open netCDF data using XArray
3. Subset the data.

### XArray¶

XArray expands on the capabilities on NumPy arrays, providing a lot of streamlined data manipulation. It is similar in that respect to Pandas, but whereas Pandas excels at working with tabular data, XArray is focused on N-dimensional arrays of data (i.e. grids). Its interface is based largely on the netCDF data model (variables, attributes, and dimensions), but it goes beyond the traditional netCDF interfaces to provide functionality similar to netCDF-java's Common Data Model (CDM).

#### DataArray¶

The DataArray is one of the basic building blocks of XArray. It provides a NumPy ndarray-like object that expands to provide two critical pieces of functionality:

1. Coordinate names and values are stored with the data, making slicing and indexing much more powerful
2. It has a built-in container for attributes
In [1]:
# Convention for import to get shortened namespace
import numpy as np
import xarray as xr

In [2]:
# Create some sample "temperature" data
data = 283 + 5 * np.random.randn(5, 3, 4)
data

Out[2]:
array([[[288.6717353 , 282.86812983, 289.37201191, 289.31986918],
[285.83487764, 282.77615342, 282.93472659, 283.23135707],
[277.945099  , 285.43115054, 287.77345197, 276.61585094]],

[[282.7983004 , 280.01172407, 281.10145953, 280.92659817],
[287.26062971, 289.1618384 , 282.47171833, 286.20171911],
[288.1363946 , 288.28544086, 287.07000053, 278.45082951]],

[[280.28475597, 277.00294807, 281.70299506, 285.02187614],
[283.3136319 , 285.10185196, 288.07733264, 286.67059564],
[280.54805382, 284.38031707, 279.94037853, 286.85580016]],

[[271.38161375, 278.71340601, 279.83642271, 284.36803707],
[280.95591716, 283.94325187, 283.7348996 , 286.52561326],
[287.3256393 , 286.39359029, 284.75952091, 276.56980633]],

[[283.68426701, 280.35584331, 276.3688505 , 288.82948095],
[282.55974942, 284.32670363, 289.76757034, 291.04432312],
[265.49834479, 286.59971571, 282.72050244, 282.44922247]]])

Here we create a basic DataArray by passing it just a numpy array of random data. Note that XArray generates some basic dimension names for us.

In [3]:
temp = xr.DataArray(data)
temp

Out[3]:
<xarray.DataArray (dim_0: 5, dim_1: 3, dim_2: 4)>
array([[[288.6717353 , 282.86812983, 289.37201191, 289.31986918],
[285.83487764, 282.77615342, 282.93472659, 283.23135707],
[277.945099  , 285.43115054, 287.77345197, 276.61585094]],

[[282.7983004 , 280.01172407, 281.10145953, 280.92659817],
[287.26062971, 289.1618384 , 282.47171833, 286.20171911],
[288.1363946 , 288.28544086, 287.07000053, 278.45082951]],

[[280.28475597, 277.00294807, 281.70299506, 285.02187614],
[283.3136319 , 285.10185196, 288.07733264, 286.67059564],
[280.54805382, 284.38031707, 279.94037853, 286.85580016]],

[[271.38161375, 278.71340601, 279.83642271, 284.36803707],
[280.95591716, 283.94325187, 283.7348996 , 286.52561326],
[287.3256393 , 286.39359029, 284.75952091, 276.56980633]],

[[283.68426701, 280.35584331, 276.3688505 , 288.82948095],
[282.55974942, 284.32670363, 289.76757034, 291.04432312],
[265.49834479, 286.59971571, 282.72050244, 282.44922247]]])
Dimensions without coordinates: dim_0, dim_1, dim_2

We can also pass in our own dimension names:

In [4]:
temp = xr.DataArray(data, dims=['time', 'lat', 'lon'])
temp

Out[4]:
<xarray.DataArray (time: 5, lat: 3, lon: 4)>
array([[[288.6717353 , 282.86812983, 289.37201191, 289.31986918],
[285.83487764, 282.77615342, 282.93472659, 283.23135707],
[277.945099  , 285.43115054, 287.77345197, 276.61585094]],

[[282.7983004 , 280.01172407, 281.10145953, 280.92659817],
[287.26062971, 289.1618384 , 282.47171833, 286.20171911],
[288.1363946 , 288.28544086, 287.07000053, 278.45082951]],

[[280.28475597, 277.00294807, 281.70299506, 285.02187614],
[283.3136319 , 285.10185196, 288.07733264, 286.67059564],
[280.54805382, 284.38031707, 279.94037853, 286.85580016]],

[[271.38161375, 278.71340601, 279.83642271, 284.36803707],
[280.95591716, 283.94325187, 283.7348996 , 286.52561326],
[287.3256393 , 286.39359029, 284.75952091, 276.56980633]],

[[283.68426701, 280.35584331, 276.3688505 , 288.82948095],
[282.55974942, 284.32670363, 289.76757034, 291.04432312],
[265.49834479, 286.59971571, 282.72050244, 282.44922247]]])
Dimensions without coordinates: time, lat, lon

This is already improved upon from a numpy array, because we have names for each of the dimensions (or axes in NumPy parlance). Even better, we can take arrays representing the values for the coordinates for each of these dimensions and associate them with the data when we create the DataArray.

In [5]:
# Use pandas to create an array of datetimes
import pandas as pd
times = pd.date_range('2018-01-01', periods=5)
times

Out[5]:
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
'2018-01-05'],
dtype='datetime64[ns]', freq='D')
In [6]:
# Sample lon/lats
lons = np.linspace(-120, -60, 4)
lats = np.linspace(25, 55, 3)


When we create the DataArray instance, we pass in the arrays we just created:

In [7]:
temp = xr.DataArray(data, coords=[times, lats, lons], dims=['time', 'lat', 'lon'])
temp

Out[7]:
<xarray.DataArray (time: 5, lat: 3, lon: 4)>
array([[[288.6717353 , 282.86812983, 289.37201191, 289.31986918],
[285.83487764, 282.77615342, 282.93472659, 283.23135707],
[277.945099  , 285.43115054, 287.77345197, 276.61585094]],

[[282.7983004 , 280.01172407, 281.10145953, 280.92659817],
[287.26062971, 289.1618384 , 282.47171833, 286.20171911],
[288.1363946 , 288.28544086, 287.07000053, 278.45082951]],

[[280.28475597, 277.00294807, 281.70299506, 285.02187614],
[283.3136319 , 285.10185196, 288.07733264, 286.67059564],
[280.54805382, 284.38031707, 279.94037853, 286.85580016]],

[[271.38161375, 278.71340601, 279.83642271, 284.36803707],
[280.95591716, 283.94325187, 283.7348996 , 286.52561326],
[287.3256393 , 286.39359029, 284.75952091, 276.56980633]],

[[283.68426701, 280.35584331, 276.3688505 , 288.82948095],
[282.55974942, 284.32670363, 289.76757034, 291.04432312],
[265.49834479, 286.59971571, 282.72050244, 282.44922247]]])
Coordinates:
* time     (time) datetime64[ns] 2018-01-01 2018-01-02 ... 2018-01-05
* lat      (lat) float64 25.0 40.0 55.0
* lon      (lon) float64 -120.0 -100.0 -80.0 -60.0

...and we can also set some attribute metadata:

In [8]:
temp.attrs['units'] = 'kelvin'
temp.attrs['standard_name'] = 'air_temperature'

temp

Out[8]:
<xarray.DataArray (time: 5, lat: 3, lon: 4)>
array([[[288.6717353 , 282.86812983, 289.37201191, 289.31986918],
[285.83487764, 282.77615342, 282.93472659, 283.23135707],
[277.945099  , 285.43115054, 287.77345197, 276.61585094]],

[[282.7983004 , 280.01172407, 281.10145953, 280.92659817],
[287.26062971, 289.1618384 , 282.47171833, 286.20171911],
[288.1363946 , 288.28544086, 287.07000053, 278.45082951]],

[[280.28475597, 277.00294807, 281.70299506, 285.02187614],
[283.3136319 , 285.10185196, 288.07733264, 286.67059564],
[280.54805382, 284.38031707, 279.94037853, 286.85580016]],

[[271.38161375, 278.71340601, 279.83642271, 284.36803707],
[280.95591716, 283.94325187, 283.7348996 , 286.52561326],
[287.3256393 , 286.39359029, 284.75952091, 276.56980633]],

[[283.68426701, 280.35584331, 276.3688505 , 288.82948095],
[282.55974942, 284.32670363, 289.76757034, 291.04432312],
[265.49834479, 286.59971571, 282.72050244, 282.44922247]]])
Coordinates:
* time     (time) datetime64[ns] 2018-01-01 2018-01-02 ... 2018-01-05
* lat      (lat) float64 25.0 40.0 55.0
* lon      (lon) float64 -120.0 -100.0 -80.0 -60.0
Attributes:
units:          kelvin
standard_name:  air_temperature

Notice what happens if we perform a mathematical operaton with the DataArray: the coordinate values persist, but the attributes are lost. This is done because it is very challenging to know if the attribute metadata is still correct or appropriate after arbitrary arithmetic operations.

In [9]:
# For example, convert Kelvin to Celsius
temp - 273.15

Out[9]:
<xarray.DataArray (time: 5, lat: 3, lon: 4)>
array([[[15.5217353 ,  9.71812983, 16.22201191, 16.16986918],
[12.68487764,  9.62615342,  9.78472659, 10.08135707],
[ 4.795099  , 12.28115054, 14.62345197,  3.46585094]],

[[ 9.6483004 ,  6.86172407,  7.95145953,  7.77659817],
[14.11062971, 16.0118384 ,  9.32171833, 13.05171911],
[14.9863946 , 15.13544086, 13.92000053,  5.30082951]],

[[ 7.13475597,  3.85294807,  8.55299506, 11.87187614],
[10.1636319 , 11.95185196, 14.92733264, 13.52059564],
[ 7.39805382, 11.23031707,  6.79037853, 13.70580016]],

[[-1.76838625,  5.56340601,  6.68642271, 11.21803707],
[ 7.80591716, 10.79325187, 10.5848996 , 13.37561326],
[14.1756393 , 13.24359029, 11.60952091,  3.41980633]],

[[10.53426701,  7.20584331,  3.2188505 , 15.67948095],
[ 9.40974942, 11.17670363, 16.61757034, 17.89432312],
[-7.65165521, 13.44971571,  9.57050244,  9.29922247]]])
Coordinates:
* time     (time) datetime64[ns] 2018-01-01 2018-01-02 ... 2018-01-05
* lat      (lat) float64 25.0 40.0 55.0
* lon      (lon) float64 -120.0 -100.0 -80.0 -60.0

#### Selection¶

We can use the .sel method to select portions of our data based on these coordinate values, rather than using indices (this is similar to the CDM).

In [10]:
temp.sel(time='2018-01-02')

Out[10]:
<xarray.DataArray (lat: 3, lon: 4)>
array([[282.7983004 , 280.01172407, 281.10145953, 280.92659817],
[287.26062971, 289.1618384 , 282.47171833, 286.20171911],
[288.1363946 , 288.28544086, 287.07000053, 278.45082951]])
Coordinates:
time     datetime64[ns] 2018-01-02
* lat      (lat) float64 25.0 40.0 55.0
* lon      (lon) float64 -120.0 -100.0 -80.0 -60.0
Attributes:
units:          kelvin
standard_name:  air_temperature

.sel has the flexibility to also perform nearest neighbor sampling, taking an optional tolerance:

In [11]:
from datetime import timedelta
temp.sel(time='2018-01-07', method='nearest', tolerance=timedelta(days=2))

Out[11]:
<xarray.DataArray (lat: 3, lon: 4)>
array([[283.68426701, 280.35584331, 276.3688505 , 288.82948095],
[282.55974942, 284.32670363, 289.76757034, 291.04432312],
[265.49834479, 286.59971571, 282.72050244, 282.44922247]])
Coordinates:
time     datetime64[ns] 2018-01-05
* lat      (lat) float64 25.0 40.0 55.0
* lon      (lon) float64 -120.0 -100.0 -80.0 -60.0
Attributes:
units:          kelvin
standard_name:  air_temperature
EXERCISE: .interp() works similarly to .sel(). Using .interp(), get an interpolated time series "forecast" for Boulder (40°N, 105°W) or your favorite latitude/longitude location. (Documentation for interp here).
In [12]:
# YOUR CODE GOES HERE

SOLUTION
In [13]:
# %load solutions/interp_solution.py

# Cell content replaced by load magic replacement.
temp.interp(lon=-105, lat=40)

Out[13]:
<xarray.DataArray (time: 5)>
array([283.54083448, 288.68653623, 284.65479695, 283.19641819,
283.88496508])
Coordinates:
* time     (time) datetime64[ns] 2018-01-01 2018-01-02 ... 2018-01-05
lon      int64 -105
lat      int64 40
Attributes:
units:          kelvin
standard_name:  air_temperature

#### Slicing with Selection¶

In [14]:
temp.sel(time=slice('2018-01-01', '2018-01-03'), lon=slice(-110, -70), lat=slice(25, 45))

Out[14]:
<xarray.DataArray (time: 3, lat: 2, lon: 2)>
array([[[282.86812983, 289.37201191],
[282.77615342, 282.93472659]],

[[280.01172407, 281.10145953],
[289.1618384 , 282.47171833]],

[[277.00294807, 281.70299506],
[285.10185196, 288.07733264]]])
Coordinates:
* time     (time) datetime64[ns] 2018-01-01 2018-01-02 2018-01-03
* lat      (lat) float64 25.0 40.0
* lon      (lon) float64 -100.0 -80.0
Attributes:
units:          kelvin
standard_name:  air_temperature

#### .loc¶

All of these operations can also be done within square brackets on the .loc attribute of the DataArray. This permits a much more numpy-looking syntax, though you lose the ability to specify the names of the various dimensions. Instead, the slicing must be done in the correct order.

In [15]:
# As done above
temp.loc['2018-01-02']

Out[15]:
<xarray.DataArray (lat: 3, lon: 4)>
array([[282.7983004 , 280.01172407, 281.10145953, 280.92659817],
[287.26062971, 289.1618384 , 282.47171833, 286.20171911],
[288.1363946 , 288.28544086, 287.07000053, 278.45082951]])
Coordinates:
time     datetime64[ns] 2018-01-02
* lat      (lat) float64 25.0 40.0 55.0
* lon      (lon) float64 -120.0 -100.0 -80.0 -60.0
Attributes:
units:          kelvin
standard_name:  air_temperature
In [16]:
temp.loc['2018-01-01':'2018-01-03', 25:45, -110:-70]

Out[16]:
<xarray.DataArray (time: 3, lat: 2, lon: 2)>
array([[[282.86812983, 289.37201191],
[282.77615342, 282.93472659]],

[[280.01172407, 281.10145953],
[289.1618384 , 282.47171833]],

[[277.00294807, 281.70299506],
[285.10185196, 288.07733264]]])
Coordinates:
* time     (time) datetime64[ns] 2018-01-01 2018-01-02 2018-01-03
* lat      (lat) float64 25.0 40.0
* lon      (lon) float64 -100.0 -80.0
Attributes:
units:          kelvin
standard_name:  air_temperature

This does not work however:

temp.loc[-110:-70, 25:45,'2018-01-01':'2018-01-03']


### Opening netCDF data¶

With its close ties to the netCDF data model, XArray also supports netCDF as a first-class file format. This means it has easy support for opening netCDF datasets, so long as they conform to some of XArray's limitations (such as 1-dimensional coordinates).

In [17]:
# Open sample North American Reanalysis data in netCDF format
ds = xr.open_dataset('../../../data/NARR_19930313_0000.nc')
ds

Out[17]:
<xarray.Dataset>
Dimensions:                       (isobaric1: 29, time1: 1, x: 268, y: 119)
Coordinates:
* time1                         (time1) datetime64[ns] 1993-03-13
* isobaric1                     (isobaric1) float32 100.0 125.0 ... 1000.0
* y                             (y) float32 -3116.548 -3084.0852 ... 714.08594
* x                             (x) float32 -3324.4707 ... 5343.1504
Data variables:
u-component_of_wind_isobaric  (time1, isobaric1, y, x) float32 ...
LambertConformal_Projection   int32 0
lat                           (y, x) float64 17.87 17.96 ... 34.85 34.64
lon                           (y, x) float64 -135.0 -134.8 ... -43.13 -42.91
Geopotential_height_isobaric  (time1, isobaric1, y, x) float32 ...
v-component_of_wind_isobaric  (time1, isobaric1, y, x) float32 ...
Temperature_isobaric          (time1, isobaric1, y, x) float32 ...
Attributes:
Originating_or_generating_Center:     US National Weather Service, Nation...
Originating_or_generating_Subcenter:  North American Regional Reanalysis ...
GRIB_table_version:                   0,131
Generating_process_or_model:          North American Regional Reanalysis ...
Conventions:                          CF-1.6
history:                              Read using CDM IOSP GribCollection v3
featureType:                          GRID
History:                              Translated to CF-1.0 Conventions by...
geospatial_lat_min:                   10.753308882144761
geospatial_lat_max:                   46.8308828962289
geospatial_lon_min:                   -153.88242040519995
geospatial_lon_max:                   -42.666108129242815

This returns a Dataset object, which is a container that contains one or more DataArrays, which can also optionally share coordinates. We can then pull out individual fields:

In [18]:
ds.isobaric1

Out[18]:
<xarray.DataArray 'isobaric1' (isobaric1: 29)>
array([ 100.,  125.,  150.,  175.,  200.,  225.,  250.,  275.,  300.,  350.,
400.,  450.,  500.,  550.,  600.,  650.,  700.,  725.,  750.,  775.,
800.,  825.,  850.,  875.,  900.,  925.,  950.,  975., 1000.],
dtype=float32)
Coordinates:
* isobaric1  (isobaric1) float32 100.0 125.0 150.0 ... 950.0 975.0 1000.0
Attributes:
units:                   hPa
long_name:               Isobaric surface
positive:                down
Grib_level_type:         100
_CoordinateAxisType:     Pressure
_CoordinateZisPositive:  down

or

In [19]:
ds['isobaric1']

Out[19]:
<xarray.DataArray 'isobaric1' (isobaric1: 29)>
array([ 100.,  125.,  150.,  175.,  200.,  225.,  250.,  275.,  300.,  350.,
400.,  450.,  500.,  550.,  600.,  650.,  700.,  725.,  750.,  775.,
800.,  825.,  850.,  875.,  900.,  925.,  950.,  975., 1000.],
dtype=float32)
Coordinates:
* isobaric1  (isobaric1) float32 100.0 125.0 150.0 ... 950.0 975.0 1000.0
Attributes:
units:                   hPa
long_name:               Isobaric surface
positive:                down
Grib_level_type:         100
_CoordinateAxisType:     Pressure
_CoordinateZisPositive:  down

Datasets also support much of the same subsetting operations as DataArray, but will perform the operation on all data:

In [20]:
ds_1000 = ds.sel(isobaric1=1000.0)
ds_1000

Out[20]:
<xarray.Dataset>
Dimensions:                       (time1: 1, x: 268, y: 119)
Coordinates:
* time1                         (time1) datetime64[ns] 1993-03-13
isobaric1                     float32 1000.0
* y                             (y) float32 -3116.548 -3084.0852 ... 714.08594
* x                             (x) float32 -3324.4707 ... 5343.1504
Data variables:
u-component_of_wind_isobaric  (time1, y, x) float32 -5.377701 ... 7.497299
LambertConformal_Projection   int32 0
lat                           (y, x) float64 17.87 17.96 ... 34.85 34.64
lon                           (y, x) float64 -135.0 -134.8 ... -43.13 -42.91
Geopotential_height_isobaric  (time1, y, x) float32 153.24702 ... 203.74702
v-component_of_wind_isobaric  (time1, y, x) float32 -5.755432 ... 9.275818
Temperature_isobaric          (time1, y, x) float32 294.09436 ... 288.96936
Attributes:
Originating_or_generating_Center:     US National Weather Service, Nation...
Originating_or_generating_Subcenter:  North American Regional Reanalysis ...
GRIB_table_version:                   0,131
Generating_process_or_model:          North American Regional Reanalysis ...
Conventions:                          CF-1.6
history:                              Read using CDM IOSP GribCollection v3
featureType:                          GRID
History:                              Translated to CF-1.0 Conventions by...
geospatial_lat_min:                   10.753308882144761
geospatial_lat_max:                   46.8308828962289
geospatial_lon_min:                   -153.88242040519995
geospatial_lon_max:                   -42.666108129242815
In [21]:
ds_1000.Temperature_isobaric

Out[21]:
<xarray.DataArray 'Temperature_isobaric' (time1: 1, y: 119, x: 268)>
array([[[294.09436, 294.1256 , ..., 298.6725 , 298.71936],
[294.07874, 294.14124, ..., 298.6256 , 298.5475 ],
...,
[276.86   , 276.84436, ..., 289.11   , 289.01624],
[277.01624, 276.82874, ..., 289.0475 , 288.96936]]], dtype=float32)
Coordinates:
* time1      (time1) datetime64[ns] 1993-03-13
isobaric1  float32 1000.0
* y          (y) float32 -3116.548 -3084.0852 -3051.622 ... 681.6229 714.08594
* x          (x) float32 -3324.4707 -3292.0078 ... 5310.6875 5343.1504
Attributes:
long_name:           Temperature @ Isobaric surface
units:               K
description:         Temperature
grid_mapping:        LambertConformal_Projection
Grib_Variable_Id:    VAR_7-15-131-11_L100
Grib1_Center:        7
Grib1_Subcenter:     15
Grib1_TableVersion:  131
Grib1_Parameter:     11
Grib1_Level_Type:    100
Grib1_Level_Desc:    Isobaric surface

#### Aggregation operations¶

Not only can you use the named dimensions for manual slicing and indexing of data, but you can also use it to control aggregation operations, like sum:

In [22]:
u_winds = ds['u-component_of_wind_isobaric']
u_winds.std(dim=['x', 'y'])

Out[22]:
<xarray.DataArray 'u-component_of_wind_isobaric' (time1: 1, isobaric1: 29)>
array([[ 8.673963 , 10.212325 , 11.556413 , 12.254429 , 13.372146 ,
15.472462 , 16.091969 , 15.846294 , 15.195834 , 13.936979 ,
12.93888  , 12.060708 , 10.972139 ,  9.722328 ,  8.853286 ,
8.257241 ,  7.679721 ,  7.4516497,  7.2352104,  7.039894 ,
6.883371 ,  6.7821493,  6.7088237,  6.6865997,  6.7247376,
6.745023 ,  6.6859775,  6.5107226,  5.972262 ]], dtype=float32)
Coordinates:
* time1      (time1) datetime64[ns] 1993-03-13
* isobaric1  (isobaric1) float32 100.0 125.0 150.0 ... 950.0 975.0 1000.0
EXERCISE: Using the sample dataset, calculate the mean temperature profile (temperature as a function of pressure) over Colorado within this dataset. For this exercise, consider the bounds of Colorado to be:
• x: -182km to 424km
• y: -1450km to -990km
(37°N to 41°N and 102°W to 109°W projected to Lambert Conformal projection coordinates)
In [23]:
# YOUR CODE GOES HERE

SOLUTION
In [24]:
# %load solutions/mean_profile.py

# Cell content replaced by load magic replacement.
temps = ds.Temperature_isobaric
co_temps = temps.sel(x=slice(-182, 424), y=slice(-1450, -990))
prof = co_temps.mean(dim=['x', 'y'])
prof

Out[24]:
<xarray.DataArray 'Temperature_isobaric' (time1: 1, isobaric1: 29)>
array([[215.078  , 215.76935, 217.243  , 217.82663, 215.83487, 216.10933,
219.99902, 224.66118, 228.80576, 234.88701, 238.78503, 242.66309,
246.44807, 249.26636, 250.84995, 253.37354, 257.0429 , 259.08398,
260.97955, 262.98364, 264.82138, 266.5198 , 268.22467, 269.7471 ,
271.18216, 272.66815, 274.13037, 275.54718, 276.97675]],
dtype=float32)
Coordinates:
* time1      (time1) datetime64[ns] 1993-03-13
* isobaric1  (isobaric1) float32 100.0 125.0 150.0 ... 950.0 975.0 1000.0

### Resources¶

There is much more in the XArray library. To learn more, visit the XArray Documentation