<div style="width:1000px">

<div style="float:right; width:98px; height:98px;">
<img src="https://raw.githubusercontent.com/Unidata/MetPy/master/src/metpy/plots/_static/unidata_150x150.png" alt="Unidata Logo" style="height: 98px;">
</div>

<h1>XArray Introduction</h1>
<h3>Unidata Python Workshop</h3>

<div style="clear:both"></div>
</div>

<hr style="height:2px;">

<div style="float:right; width:250px"><img src="http://xarray.pydata.org/en/stable/_static/dataset-diagram-logo.png" alt="XArray Logo" style="height: 250px;"></div>

### Questions
1. What is XArray?
2. How does XArray fit in with NumPy and Pandas?

### Objectives
1. Create a `DataArray`.
2. Open netCDF data using XArray.
3. Subset the data.

## XArray

XArray expands on the capabilities of NumPy arrays, providing a lot of streamlined data manipulation. It is similar in that respect to Pandas, but whereas Pandas excels at working with tabular data, XArray is focused on N-dimensional arrays of data (i.e., grids). Its interface is based largely on the netCDF data model (variables, attributes, and dimensions), but it goes beyond the traditional netCDF interfaces to provide functionality similar to netCDF-java's Common Data Model (CDM).

### `DataArray`

The `DataArray` is one of the basic building blocks of XArray. It provides a NumPy ndarray-like object that expands to provide two critical pieces of functionality:

1. Coordinate names and values are stored with the data, making slicing and indexing much more powerful
2. It has a built-in container for attributes

In [1]:
# Convention for import to get shortened namespace
import numpy as np
import xarray as xr

In [2]:
# Create some sample "temperature" data
data = 283 + 5 * np.random.randn(5, 3, 4)
data

array([[[288.6717353 , 282.86812983, 289.37201191, 289.31986918],
        [285.83487764, 282.77615342, 282.93472659, 283.23135707],
        [277.945099  , 285.43115054, 287.77345197, 276.61585094]],

       [[282.7983004 , 280.01172407, 281.10145953, 280.92659817],
        [287.26062971, 289.1618384 , 282.47171833, 286.20171911],
        [288.1363946 , 288.28544086, 287.07000053, 278.45082951]],

       [[280.28475597, 277.00294807, 281.70299506, 285.02187614],
        [283.3136319 , 285.10185196, 288.07733264, 286.67059564],
        [280.54805382, 284.38031707, 279.94037853, 286.85580016]],

       [[271.38161375, 278.71340601, 279.83642271, 284.36803707],
        [280.95591716, 283.94325187, 283.7348996 , 286.52561326],
        [287.3256393 , 286.39359029, 284.75952091, 276.56980633]],

       [[283.68426701, 280.35584331, 276.3688505 , 288.82948095],
        [282.55974942, 284.32670363, 289.76757034, 291.04432312],
        [265.49834479, 286.59971571, 282.72050244, 282.44922247]]])

Here we create a basic `DataArray` by passing it just a NumPy array of random data. Note that XArray generates default dimension names for us (`dim_0`, `dim_1`, and `dim_2`).

In [3]:
temp = xr.DataArray(data)
temp

We can also pass in our own dimension names:

In [4]:
temp = xr.DataArray(data, dims=['time', 'lat', 'lon'])
temp

This is already an improvement over a plain NumPy array, because we have names for each of the dimensions (or axes in NumPy parlance). Even better, we can take arrays representing the coordinate values along each of these dimensions and associate them with the data when we create the `DataArray`.

In [5]:
# Use pandas to create an array of datetimes
import pandas as pd
times = pd.date_range('2018-01-01', periods=5)
times

DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05'],
              dtype='datetime64[ns]', freq='D')

In [6]:
# Sample lon/lats
lons = np.linspace(-120, -60, 4)
lats = np.linspace(25, 55, 3)

When we create the `DataArray` instance, we pass in the arrays we just created:

In [7]:
temp = xr.DataArray(data, coords=[times, lats, lons], dims=['time', 'lat', 'lon'])
temp

...and we can also set some attribute metadata:

In [8]:
temp.attrs['units'] = 'kelvin'
temp.attrs['standard_name'] = 'air_temperature'

temp
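
The steps above build the array piece by piece. As a minimal sketch, the same kind of `DataArray` can also be created in a single call by passing `coords` as a dictionary and the metadata through `attrs`. The data here are freshly generated random values, not the numbers printed earlier.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic data with the same shape as the example above: (time, lat, lon)
rng = np.random.default_rng(0)
data = 283 + 5 * rng.standard_normal((5, 3, 4))

temp = xr.DataArray(
    data,
    dims=['time', 'lat', 'lon'],
    # A dict maps each dimension name to its coordinate values,
    # so the order of the entries does not need to match `dims`
    coords={
        'lon': np.linspace(-120, -60, 4),
        'lat': np.linspace(25, 55, 3),
        'time': pd.date_range('2018-01-01', periods=5),
    },
    attrs={'units': 'kelvin', 'standard_name': 'air_temperature'},
)
print(temp.dims)
print(temp.attrs)
```

The dictionary form is a little more verbose, but it removes the need to keep the coordinate list in the same order as the dimension names.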

Notice what happens if we perform a mathematical operation with the `DataArray`: the coordinate values persist, but the attributes are lost. This is done because it is very challenging to know whether the attribute metadata is still correct or appropriate after arbitrary arithmetic operations.

In [9]:
# For example, convert Kelvin to Celsius
temp - 273.15
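
Because arithmetic does not carry the metadata along by default, the result needs its attributes set again. The sketch below shows two possible ways to do that on a tiny made-up array: writing fresh attributes with `assign_attrs`, or asking XArray to keep the originals with `xr.set_options(keep_attrs=True)` and then correcting the units by hand.

```python
import numpy as np
import xarray as xr

# Tiny stand-in for the temperature array (values are made up)
temp = xr.DataArray(
    np.array([283.15, 293.15]),
    dims=['time'],
    attrs={'units': 'kelvin', 'standard_name': 'air_temperature'},
)

# Option 1: do the arithmetic, then attach attributes that describe the result
celsius = (temp - 273.15).assign_attrs(
    units='degC', standard_name='air_temperature'
)

# Option 2: carry the original attributes through the operation.
# The copied units would still say 'kelvin', so they are replaced afterwards.
with xr.set_options(keep_attrs=True):
    celsius_kept = temp - 273.15
celsius_kept = celsius_kept.assign_attrs(units='degC')

print(celsius.attrs)
print(celsius_kept.attrs)
```

The second option is convenient when most attributes remain valid, but it is up to you to fix the ones that the operation has made wrong.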

### Selection
We can use the `.sel` method to select portions of our data based on these coordinate values, rather than using indices (this is similar to the CDM).

In [10]:
temp.sel(time='2018-01-02')

`.sel` has the flexibility to also perform nearest neighbor sampling, taking an optional tolerance:

In [11]:
from datetime import timedelta
temp.sel(time='2018-01-07', method='nearest', tolerance=timedelta(days=2))
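
To see what the tolerance does, the sketch below repeats the lookup on a small made-up time series. The requested date is two days past the last time step, so a two-day tolerance finds a match while a one-day tolerance raises a `KeyError`.

```python
from datetime import timedelta

import numpy as np
import pandas as pd
import xarray as xr

# Five daily values, 2018-01-01 through 2018-01-05 (values are made up)
times = pd.date_range('2018-01-01', periods=5)
temp = xr.DataArray(np.arange(5.0), coords=[times], dims=['time'])

# 2018-01-07 is two days after the last time step,
# so a two-day tolerance is just enough to find a match
hit = temp.sel(time='2018-01-07', method='nearest', tolerance=timedelta(days=2))

# With a one-day tolerance nothing is close enough and the lookup fails
try:
    temp.sel(time='2018-01-07', method='nearest', tolerance=timedelta(days=1))
    found = True
except KeyError:
    found = False

print(hit.time.values, found)
```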

<div class="alert alert-success">
    <b>EXERCISE</b>:
   
.interp() works similarly to .sel(). Using .interp(), get an interpolated time series "forecast" for Boulder (40°N, 105°W) or your favorite latitude/longitude location. (Documentation for interp <a href="http://xarray.pydata.org/en/stable/interpolation.html">here</a>).
</div>


In [12]:
# YOUR CODE GOES HERE

<div class="alert alert-info">
    <b>SOLUTION</b>
</div>

In [13]:
# %load solutions/interp_solution.py

# Cell content replaced by load magic replacement.
temp.interp(lon=-105, lat=40)


### Slicing with Selection

In [14]:
temp.sel(time=slice('2018-01-01', '2018-01-03'), lon=slice(-110, -70), lat=slice(25, 45))
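
One detail worth knowing when slicing by label: as in Pandas, `.sel` includes both end points of a slice, whereas positional slicing with `.isel` follows the usual Python rule and excludes the stop index. A small sketch using a hypothetical one-dimensional array `da`:

```python
import numpy as np
import xarray as xr

# Longitudes -120, -100, -80, -60 with made-up integer data
lons = np.linspace(-120, -60, 4)
da = xr.DataArray(np.arange(4), coords=[lons], dims=['lon'])

# Label-based: both bounds are included when they match a coordinate value
by_label = da.sel(lon=slice(-100, -60))

# Position-based: the stop index (3) is excluded
by_position = da.isel(lon=slice(1, 3))

print(by_label.lon.values)
print(by_position.lon.values)
```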

### `.loc`

All of these operations can also be done within square brackets on the `.loc` attribute of the `DataArray`. This permits a much more NumPy-like syntax, though with positional indexing you can no longer specify the names of the various dimensions. Instead, the slices must be given in the same order as the dimensions.

In [15]:
# As done above
temp.loc['2018-01-02']

In [16]:
temp.loc['2018-01-01':'2018-01-03', 25:45, -110:-70]

This does not work, however, because the slices are not given in dimension order (`time`, `lat`, `lon`):
```python
temp.loc[-110:-70, 25:45,'2018-01-01':'2018-01-03']
```
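
If the order of the dimensions is hard to remember, `.loc` also accepts a dictionary keyed by dimension name, a form described in XArray's indexing documentation. The sketch below, on freshly generated data, checks that it gives the same result as `.sel`.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Rebuild an array like the one used above (random values)
rng = np.random.default_rng(0)
times = pd.date_range('2018-01-01', periods=5)
lats = np.linspace(25, 55, 3)
lons = np.linspace(-120, -60, 4)
temp = xr.DataArray(
    283 + 5 * rng.standard_normal((5, 3, 4)),
    coords=[times, lats, lons],
    dims=['time', 'lat', 'lon'],
)

# Dictionary form: dimensions are named, so their order does not matter
by_name = temp.loc[dict(lon=slice(-110, -70),
                        time=slice('2018-01-01', '2018-01-03'))]

# Equivalent selection with .sel
by_sel = temp.sel(lon=slice(-110, -70),
                  time=slice('2018-01-01', '2018-01-03'))

print(by_name.identical(by_sel))
```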

## Opening netCDF data
With its close ties to the netCDF data model, XArray also supports netCDF as a first-class file format. This means it has easy support for opening netCDF datasets, so long as they conform to some of XArray's limitations (such as 1-dimensional coordinates).

In [17]:
# Open sample North American Regional Reanalysis (NARR) data in netCDF format
ds = xr.open_dataset('../../../data/NARR_19930313_0000.nc')
ds

This returns a `Dataset` object, a container holding one or more `DataArray`s that can optionally share coordinates. We can then pull out individual fields:

In [18]:
ds.isobaric1

or

In [19]:
ds['isobaric1']
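
The two access styles are interchangeable only when the variable name is a valid Python identifier. The sketch below uses a small synthetic `Dataset` as a stand-in for the NARR file; the variable names mirror those used in this notebook, but the values and grid sizes are made up.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for the NARR file (made-up values and sizes)
ds = xr.Dataset(
    {
        'Temperature_isobaric': (('isobaric1', 'y', 'x'),
                                 np.full((2, 3, 4), 280.0)),
        'u-component_of_wind_isobaric': (('isobaric1', 'y', 'x'),
                                         np.zeros((2, 3, 4))),
    },
    coords={'isobaric1': [1000.0, 850.0]},
)

# Attribute and dictionary access return the same DataArray
same = ds.Temperature_isobaric.identical(ds['Temperature_isobaric'])

# A name containing a hyphen is not a valid Python identifier,
# so only the dictionary form can reach it
u_winds = ds['u-component_of_wind_isobaric']

print(same, u_winds.dims)
```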

`Dataset`s also support many of the same subsetting operations as `DataArray`, but will perform the operation on all of the data they contain:

In [20]:
ds_1000 = ds.sel(isobaric1=1000.0)
ds_1000

In [21]:
ds_1000.Temperature_isobaric
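
To illustrate how a single `.sel` call acts on a whole `Dataset`, here is a sketch with a synthetic dataset (all names and values are hypothetical). Variables that have the `isobaric1` dimension are reduced to the chosen level, while a variable without that dimension passes through with its data unchanged.

```python
import numpy as np
import xarray as xr

levels = [1000.0, 850.0, 700.0]
ds = xr.Dataset(
    {
        # Two variables defined on pressure levels
        'temperature': (('isobaric1', 'y', 'x'),
                        np.arange(24.0).reshape(3, 2, 4)),
        'u_wind': (('isobaric1', 'y', 'x'), np.ones((3, 2, 4))),
        # One variable with no pressure dimension
        'surface_pressure': (('y', 'x'), np.full((2, 4), 1013.0)),
    },
    coords={'isobaric1': levels},
)

ds_1000 = ds.sel(isobaric1=1000.0)

# The pressure dimension is gone from every variable that had it
print(ds_1000['temperature'].dims)
print(ds_1000['u_wind'].dims)
print(ds_1000['surface_pressure'].dims)
```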

### Aggregation operations

Not only can you use the named dimensions for manual slicing and indexing of data, but you can also use them to control aggregation operations, such as `std`:

In [22]:
u_winds = ds['u-component_of_wind_isobaric']
u_winds.std(dim=['x', 'y'])
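
For comparison, the sketch below runs the same kind of reduction on made-up data and checks it against the NumPy equivalent, where the axes have to be given by position instead of by name.

```python
import numpy as np
import xarray as xr

# Random stand-in for a wind field with dimensions (isobaric1, y, x)
rng = np.random.default_rng(0)
data = rng.standard_normal((3, 4, 5))
u_winds = xr.DataArray(data, dims=['isobaric1', 'y', 'x'])

# Named dimensions: no need to remember which axis is which
by_name = u_winds.std(dim=['x', 'y'])

# Plain NumPy: the same reduction written with axis positions
by_axis = data.std(axis=(1, 2))

print(by_name.dims)
print(np.allclose(by_name.values, by_axis))
```

Only the `isobaric1` dimension remains, giving one standard deviation per pressure level.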

<div class="alert alert-success">
    <b>EXERCISE</b>:

Using the sample dataset, calculate the mean temperature profile (temperature as a function of pressure) over Colorado within this dataset. For this exercise, consider the bounds of Colorado to be:
     <ul>
         <li>x: -182km to 424km</li>
         <li>y: -1450km to -990km</li>
    </ul>
    
(37°N to 41°N and 102°W to 109°W projected to Lambert Conformal projection coordinates)
</div>

In [23]:
# YOUR CODE GOES HERE

<div class="alert alert-info">
    <b>SOLUTION</b>
</div>

In [24]:
# %load solutions/mean_profile.py

# Cell content replaced by load magic replacement.
temps = ds.Temperature_isobaric
co_temps = temps.sel(x=slice(-182, 424), y=slice(-1450, -990))
prof = co_temps.mean(dim=['x', 'y'])
prof


## Resources

There is much more in the XArray library. To learn more, visit the [XArray Documentation](http://xarray.pydata.org/en/stable/index.html).