
NetCDF and CF: The Basics


Overview

This workshop teaches some of the basics of the Climate and Forecast (CF) metadata conventions for netCDF data files, with hands-on work available in Jupyter notebooks using Python. Along with an introduction to netCDF and CF, we present the CF data model and discuss some netCDF implementation details to consider when deciding how to write data with CF and netCDF. We cover gridded data as well as in situ data (stations, soundings, etc.) and touch on storing geometry data in CF.

This assumes a basic understanding of netCDF.

Outline

  1. Gridded Data
  2. Observational Data
  3. Exercises
  4. References

Gridded Data

Let's say we're working with some numerical weather forecast model output. We'll walk through the steps necessary to store this data in netCDF, using the Climate and Forecast (CF) metadata conventions to ensure that our data are usable by as many tools as possible.

To start, let's assume the following about our data:

  • It is three-dimensional forecast temperature at several output times
  • The native coordinate system of the model is on a regular grid that represents the Earth on a Lambert conformal projection.

We'll also go ahead and generate some arrays of data below to get started:

In [1]:
# Import some useful Python tools
from datetime import datetime, timedelta

import numpy as np

# Twelve hours of hourly output starting at 22Z today
start = datetime.utcnow().replace(hour=22, minute=0, second=0, microsecond=0)
times = np.array([start + timedelta(hours=h) for h in range(13)])

# 3km spacing in x and y
x = np.arange(-150, 153, 3)
y = np.arange(-100, 100, 3)

# Standard pressure levels in hPa
press = np.array([1000, 925, 850, 700, 500, 300, 250])

temps = np.random.randn(times.size, press.size, y.size, x.size)

Creating the file and dimensions

The first step is to create a new file and set up the shared dimensions we'll be using in the file. We'll be using the netCDF4-python library to do all of the requisite netCDF API calls.

In [2]:
from netCDF4 import Dataset
nc = Dataset('forecast_model.nc', 'w', format='NETCDF4_CLASSIC', diskless=True)

We're going to start by adding some global attribute metadata. These are recommendations from the standard (not required), but they're easy to add and help users keep the data straight, so let's go ahead and do it.

In [3]:
nc.Conventions = 'CF-1.7'
nc.title = 'Forecast model run'
nc.institution = 'Unidata'
nc.source = 'WRF-1.5'
nc.history = str(datetime.utcnow()) + ' Python'
nc.references = ''
nc.comment = ''

At this point, this is the CDL representation of this dataset:

netcdf forecast_model {
  attributes:
    :Conventions = "CF-1.7" ;
    :title = "Forecast model run" ;
    :institution = "Unidata" ;
    :source = "WRF-1.5" ;
    :history = "2019-07-16 02:21:52.005718 Python" ;
    :references = "" ;
    :comment = "" ;
}

Next, before adding variables to the file to define each of the data fields in this file, we need to define the dimensions that exist in this data set. We set each of x, y, and pressure to the size of the corresponding array. We set forecast_time to be an "unlimited" dimension, which allows the dataset to grow along that dimension if we write additional data to it later.

In [4]:
nc.createDimension('forecast_time', None)
nc.createDimension('x', x.size)
nc.createDimension('y', y.size)
nc.createDimension('pressure', press.size)
nc
Out[4]:
<class 'netCDF4._netCDF4.Dataset'>
root group (NETCDF4_CLASSIC data model, file format HDF5):
    Conventions: CF-1.7
    title: Forecast model run
    institution: Unidata
    source: WRF-1.5
    history: 2020-09-04 15:37:01.984586 Python
    references: 
    comment: 
    dimensions(sizes): forecast_time(0), x(101), y(67), pressure(7)
    variables(dimensions): 
    groups: 

The CDL representation now shows our dimensions:

netcdf forecast_model {
  dimensions:
    forecast_time = UNLIMITED (currently 0) ;
    x = 101 ;
    y = 67 ;
    pressure = 7 ;
  attributes:
    :Conventions = "CF-1.7" ;
    :title = "Forecast model run" ;
    :institution = "Unidata" ;
    :source = "WRF-1.5" ;
    :history = "2019-07-16 02:21:52.005718 Python" ;
    :references = "" ;
    :comment = "" ;
}

Creating and filling a variable

So far, all we've done is outline basic information about our dataset: broad metadata and the dimensions. Now we create a variable to hold one particular data field, in this case the forecast air temperature. When defining this variable, we specify the datatype for the values being stored and the relevant dimensions, and we enable optional compression.

In [5]:
temps_var = nc.createVariable('Temperature', datatype=np.float32,
                              dimensions=('forecast_time', 'pressure', 'y', 'x'),
                              zlib=True)

Now that we have the variable, we tell Python to write our array of data to it.

In [6]:
temps_var[:] = temps
temps_var
Out[6]:
<class 'netCDF4._netCDF4.Variable'>
float32 Temperature(forecast_time, pressure, y, x)
unlimited dimensions: forecast_time
current shape = (13, 7, 67, 101)
filling on, default _FillValue of 9.969209968386869e+36 used

If instead we wanted to write data sporadically, say once per time step, we could do that as well (though the for loop below might actually sit at a higher level in the program):

In [7]:
next_slice = 0
for temp_slice in temps:
    temps_var[next_slice] = temp_slice
    next_slice += 1

At this point, this is the CDL representation of our dataset:

netcdf forecast_model {
  dimensions:
    forecast_time = UNLIMITED (currently 13) ;
    x = 101 ;
    y = 67 ;
    pressure = 7 ;
  variables:
    float Temperature(forecast_time, pressure, y, x) ;
  attributes:
    :Conventions = "CF-1.7" ;
    :title = "Forecast model run" ;
    :institution = "Unidata" ;
    :source = "WRF-1.5" ;
    :history = "2019-07-16 02:21:52.005718 Python" ;
    :references = "" ;
    :comment = "" ;
}

We can also add attributes to this variable to define metadata. The CF conventions require a units attribute to be set for all variables that represent a dimensional quantity. The value of this attribute needs to be parsable by the UDUNITS library. Here we set it to a value of 'Kelvin'. We also set the standard (optional) attributes of long_name and standard_name. The former contains a longer description of the variable, while the latter comes from a controlled vocabulary in the CF conventions. This allows users of data to understand, in a standard fashion, what a variable represents. If we had missing values, we could also set the missing_value attribute to an appropriate value.

NASA Dataset Interoperability Recommendations:

Section 2.2 - Include Basic CF Attributes

Include where applicable: units, long_name, standard_name, valid_min / valid_max, scale_factor / add_offset and others.

In [8]:
temps_var.units = 'Kelvin'
temps_var.standard_name = 'air_temperature'
temps_var.long_name = 'Forecast air temperature'
temps_var.missing_value = -9999
temps_var
Out[8]:
<class 'netCDF4._netCDF4.Variable'>
float32 Temperature(forecast_time, pressure, y, x)
    units: Kelvin
    standard_name: air_temperature
    long_name: Forecast air temperature
    missing_value: -9999.0
unlimited dimensions: forecast_time
current shape = (13, 7, 67, 101)
filling on, default _FillValue of 9.969209968386869e+36 used

The resulting CDL (truncated to the variables only) looks like:

  variables:
    float Temperature(forecast_time, pressure, y, x) ;
      Temperature:units = "Kelvin" ;
      Temperature:standard_name = "air_temperature" ;
      Temperature:long_name = "Forecast air temperature" ;
      Temperature:missing_value = -9999.0 ;
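
The NASA recommendation above also mentions valid_min / valid_max and scale_factor / add_offset, which we don't use in this walkthrough. As a rough illustration only (hypothetical variable name and values, not executed as part of building forecast_model.nc), packing the temperatures into 16-bit integers might look something like this:

# Illustration only: a packed companion variable with the recommended attributes
packed_var = nc.createVariable('Temperature_packed', np.int16,
                               dimensions=('forecast_time', 'pressure', 'y', 'x'),
                               zlib=True)
packed_var.units = 'Kelvin'
packed_var.long_name = 'Forecast air temperature (packed)'
# unpacked = packed * scale_factor + add_offset
packed_var.scale_factor = np.float32(0.01)
packed_var.add_offset = np.float32(273.15)
# valid_min/valid_max are given in the packed (stored) type
packed_var.valid_min = np.int16(-30000)
packed_var.valid_max = np.int16(30000)
# netCDF4-python applies the packing automatically when writing float data
packed_var[:] = temps + 273.15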

Coordinate variables

To properly orient our data in time and space, we need to go beyond dimensions (which define common sizes and alignment) and include values along these dimensions; these are called "Coordinate Variables". Generally, they are defined by creating a one-dimensional variable with the same name as the corresponding dimension.

To start, we define variables which define our x and y coordinate values. These variables include standard_names which allow associating them with projections (more on this later) as well as an optional axis attribute to make clear what standard direction this coordinate refers to.

In [9]:
x_var = nc.createVariable('x', np.float32, ('x',))
x_var[:] = x
x_var.units = 'km'
x_var.axis = 'X' # Optional
x_var.standard_name = 'projection_x_coordinate'
x_var.long_name = 'x-coordinate in projected coordinate system'

y_var = nc.createVariable('y', np.float32, ('y',))
y_var[:] = y
y_var.units = 'km'
y_var.axis = 'Y' # Optional
y_var.standard_name = 'projection_y_coordinate'
y_var.long_name = 'y-coordinate in projected coordinate system'

We also define a coordinate variable pressure to reference our data in the vertical dimension. The standard_name of 'air_pressure' is sufficient to identify this coordinate variable as the vertical axis, but let's go ahead and specify the axis as well. We also specify the attribute positive to indicate whether the variable increases when going up or down. In the case of pressure, this is technically optional.

In [10]:
press_var = nc.createVariable('pressure', np.float32, ('pressure',))
press_var[:] = press
press_var.units = 'hPa'
press_var.axis = 'Z'  # Optional
press_var.standard_name = 'air_pressure'
press_var.positive = 'down'  # Optional

Time coordinates must have a units attribute whose string value has a form similar to 'seconds since 2019-01-06 12:00:00.00'. 'seconds', 'minutes', 'hours', and 'days' are the most commonly used units for time. Because months and years vary in length, they are not recommended.

Before we can write data, we first need to convert our list of Python datetime instances to numeric values. The cftime library makes this easy, using the unit string defined above.

In [11]:
from cftime import date2num
time_units = 'hours since {:%Y-%m-%d 00:00}'.format(times[0])
time_vals = date2num(times, time_units)
time_vals
Out[11]:
array([22., 23., 24., 25., 26., 27., 28., 29., 30., 31., 32., 33., 34.])
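
If we want to sanity-check that conversion, cftime can also go the other direction (a quick check, not part of the original walkthrough):

from cftime import num2date

# Convert the first few numeric values back to datetime-like objects
num2date(time_vals[:3], time_units)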

Now we can create the forecast_time variable just as we did before for the other coordinate variables:

In [12]:
time_var = nc.createVariable('forecast_time', np.int32, ('forecast_time',))
time_var[:] = time_vals
time_var.units = time_units
time_var.axis = 'T'  # Optional
time_var.standard_name = 'time'  # Optional
time_var.long_name = 'time'

The CDL representation of the variables now contains much more information:

  dimensions:
    forecast_time = UNLIMITED (currently 13) ;
    x = 101 ;
    y = 67 ;
    pressure = 7 ;
  variables:
    float x(x) ;
      x:units = "km" ;
      x:axis = "X" ;
      x:standard_name = "projection_x_coordinate" ;
      x:long_name = "x-coordinate in projected coordinate system" ;
    float y(y) ;
      y:units = "km" ;
      y:axis = "Y" ;
      y:standard_name = "projection_y_coordinate" ;
      y:long_name = "y-coordinate in projected coordinate system" ;
    float pressure(pressure) ;
      pressure:units = "hPa" ;
      pressure:axis = "Z" ;
      pressure:standard_name = "air_pressure" ;
      pressure:positive = "down" ;
    int forecast_time(forecast_time) ;
      forecast_time:units = "hours since 2019-07-16 00:00" ;
      forecast_time:axis = "T" ;
      forecast_time:standard_name = "time" ;
      forecast_time:long_name = "time" ;
    float Temperature(forecast_time, pressure, y, x) ;
      Temperature:units = "Kelvin" ;
      Temperature:standard_name = "air_temperature" ;
      Temperature:long_name = "Forecast air temperature" ;
      Temperature:missing_value = -9999.0 ;

Auxiliary Coordinates

Our data are still not CF-compliant because they do not contain latitude and longitude information, which is needed to properly locate the data. To solve this, we need to add variables with latitude and longitude. These are called "auxiliary coordinate variables", not because they are extra, but because they are not simple one-dimensional variables.

Below, we first generate longitude and latitude values from our projected coordinates using the pyproj library.

In [13]:
from pyproj import Proj
X, Y = np.meshgrid(x, y)
lcc = Proj({'proj':'lcc', 'lon_0':-105, 'lat_0':40, 'a':6371000.,
            'lat_1':25})
lon, lat = lcc(X * 1000, Y * 1000, inverse=True)

Now we can create the needed variables. Both are dimensioned on y and x and are two-dimensional. The longitude variable is identified as actually containing such information by its required units of 'degrees_east', as well as the optional 'longitude' standard_name attribute. The case is the same for latitude, except the units are 'degrees_north' and the standard_name is 'latitude'.

In [14]:
lon_var = nc.createVariable('lon', np.float64, ('y', 'x'))
lon_var[:] = lon
lon_var.units = 'degrees_east'
lon_var.standard_name = 'longitude'  # Optional
lon_var.long_name = 'longitude'

lat_var = nc.createVariable('lat', np.float64, ('y', 'x'))
lat_var[:] = lat
lat_var.units = 'degrees_north'
lat_var.standard_name = 'latitude'  # Optional
lat_var.long_name = 'latitude'

With these variables created, we identify them as containing coordinates for the Temperature variable by setting its coordinates attribute to a space-separated list of the auxiliary coordinate variable names:

In [15]:
temps_var.coordinates = 'lon lat'

This yields the following CDL:

  variables:
    double lon(y, x) ;
      lon:units = "degrees_east" ;
      lon:standard_name = "longitude" ;
      lon:long_name = "longitude" ;
    double lat(y, x) ;
      lat:units = "degrees_north" ;
      lat:standard_name = "latitude" ;
      lat:long_name = "latitude" ;
    float Temperature(forecast_time, pressure, y, x) ;
      Temperature:units = "Kelvin" ;
      Temperature:standard_name = "air_temperature" ;
      Temperature:long_name = "Forecast air temperature" ;
      Temperature:missing_value = -9999.0 ;
      Temperature:coordinates = "lon lat" ;

Coordinate System Information

With our data specified on a Lambert conformal projected grid, it would be good to include this information in our metadata. We can do this using a "grid mapping" variable. This uses a dummy scalar variable as a namespace for holding all of the required information. Relevant variables then reference the dummy variable with their grid_mapping attribute.

Below we create a variable and set it up for a Lambert conformal conic projection on a spherical earth. The grid_mapping_name attribute describes which of the CF-supported grid mappings we are specifying. The names of additional attributes vary between the mappings.

In [16]:
proj_var = nc.createVariable('lambert_projection', np.int32, ())
proj_var.grid_mapping_name = 'lambert_conformal_conic'
proj_var.standard_parallel = 25.
proj_var.latitude_of_projection_origin = 40.
proj_var.longitude_of_central_meridian = -105.
proj_var.semi_major_axis = 6371000.0
proj_var
Out[16]:
<class 'netCDF4._netCDF4.Variable'>
int32 lambert_projection()
    grid_mapping_name: lambert_conformal_conic
    standard_parallel: 25.0
    latitude_of_projection_origin: 40.0
    longitude_of_central_meridian: -105.0
    semi_major_axis: 6371000.0
unlimited dimensions: 
current shape = ()
filling on, default _FillValue of -2147483647 used

Now that we've created the variable, all that's left is to set the grid_mapping attribute on our Temperature variable to the name of our dummy variable:

In [17]:
temps_var.grid_mapping = 'lambert_projection'  # or proj_var.name

This yields the CDL:

  variables:
    int lambert_projection ;
      lambert_projection:grid_mapping_name = "lambert_conformal_conic" ;
      lambert_projection:standard_parallel = 25.0 ;
      lambert_projection:latitude_of_projection_origin = 40.0 ;
      lambert_projection:longitude_of_central_meridian = -105.0 ;
      lambert_projection:semi_major_axis = 6371000.0 ;
    float Temperature(forecast_time, pressure, y, x) ;
      Temperature:units = "Kelvin" ;
      Temperature:standard_name = "air_temperature" ;
      Temperature:long_name = "Forecast air temperature" ;
      Temperature:missing_value = -9999.0 ;
      Temperature:coordinates = "lon lat" ;
      Temperature:grid_mapping = "lambert_projection" ;
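
One payoff of recording the grid mapping: downstream code can rebuild the projection directly from these attributes instead of hard-coding it. A rough sketch with pyproj (assuming the spherical earth defined above; tools like MetPy can do this interpretation for you):

# Reconstruct the projection from the grid mapping variable's attributes
read_proj = Proj({'proj': 'lcc',
                  'lat_1': proj_var.standard_parallel,
                  'lat_0': proj_var.latitude_of_projection_origin,
                  'lon_0': proj_var.longitude_of_central_meridian,
                  'a': proj_var.semi_major_axis,
                  'b': proj_var.semi_major_axis})  # sphere: b == a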

Cell Bounds

NASA Dataset Interoperability Recommendations:

Section 2.3 - Use CF “bounds” attributes

CF conventions state: “When gridded data does not represent the point values of a field but instead represents some characteristic of the field within cells of finite ‘volume,’ a complete description of the variable should include metadata that describes the domain or extent of each cell, and the characteristic of the field that the cell values represent.”

For example, if a rain gauge is read every 3 hours but its bucket is only dumped every 6 hours, the file might look like this:

netcdf precip_bucket_bounds {
  dimensions:
      lat = 12 ;
      lon = 19 ;
      time = 8 ;
      tbv = 2 ;
  variables:
      float lat(lat) ;
      float lon(lon) ;
      float time(time) ;
        time:units = "hours since 2019-07-12 00:00:00.00";
        time:bounds = "time_bounds" ;
      float time_bounds(time, tbv) ;
      float precip(time, lat, lon) ;
        precip:units = "inches" ;
  data:
    time = 3, 6, 9, 12, 15, 18, 21, 24;
    time_bounds = 0, 3, 0, 6, 6, 9, 6, 12, 12, 15, 12, 18, 18, 21, 18, 24;
}

So the time coordinate and its bounds look like this:

|---X
|-------X
        |---X
        |-------X
                |---X
                |-------X
                        |---X
                        |-------X
0   3   6   9  12  15  18  21  24
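
A minimal netCDF4-python sketch of building the time/bounds portion of that file (lat, lon, and precip omitted for brevity) might be:

# Sketch: create the time coordinate and its bounds variable from the CDL above
bnds_nc = Dataset('precip_bucket_bounds.nc', 'w',
                  format='NETCDF4_CLASSIC', diskless=True)
bnds_nc.createDimension('time', 8)
bnds_nc.createDimension('tbv', 2)

t = bnds_nc.createVariable('time', np.float32, ('time',))
t.units = 'hours since 2019-07-12 00:00:00.00'
t.bounds = 'time_bounds'  # names the variable holding the cell bounds
t[:] = [3, 6, 9, 12, 15, 18, 21, 24]

t_bounds = bnds_nc.createVariable('time_bounds', np.float32, ('time', 'tbv'))
t_bounds[:] = [[0, 3], [0, 6], [6, 9], [6, 12],
               [12, 15], [12, 18], [18, 21], [18, 24]]
bnds_nc.close()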

Observational Data

So far we've focused on how to handle storing data that are arranged in a grid. What about observational data? The CF conventions address this with the conventions for Discrete Sampling Geometries (DSG).

For data that are regularly sampled (say, all at the same heights) this is straightforward. First, let's define some sample profile data at ten heights between 10 m and 1000 m:

In [18]:
lons = np.array([-97.1, -105, -80])
lats = np.array([35.25, 40, 27])
heights = np.linspace(10, 1000, 10)
temps = np.random.randn(lats.size, heights.size)
stids = ['KOUN', 'KBOU', 'KJUP']

Creation and basic setup

First we create a new file and define some dimensions. Since this is profile data, heights will be one dimension and station will be the other. We set the global featureType attribute to 'profile' to indicate that this file holds "an ordered set of data points along a vertical line at a fixed horizontal position and fixed time", and we add a string-length dimension to help store our station ids as character arrays.

In [19]:
nc.close()
nc = Dataset('obs_data.nc', 'w', format='NETCDF4_CLASSIC', diskless=True)
nc.createDimension('station', lats.size)
nc.createDimension('heights', heights.size)
nc.createDimension('str_len', 4)
nc.Conventions = 'CF-1.7'
nc.featureType = 'profile'
nc
Out[19]:
<class 'netCDF4._netCDF4.Dataset'>
root group (NETCDF4_CLASSIC data model, file format HDF5):
    Conventions: CF-1.7
    featureType: profile
    dimensions(sizes): station(3), heights(10), str_len(4)
    variables(dimensions): 
    groups: 

Which gives this CDL:

netcdf obs_data {
  dimensions:
    station = 3 ;
    heights = 10 ;
    str_len = 4 ;
  attributes:
    :Conventions = "CF-1.7" ;
    :featureType = "profile" ;
}

We can create our coordinates with:

In [20]:
lon_var = nc.createVariable('lon', np.float64, ('station',))
lon_var.units = 'degrees_east'
lon_var.standard_name = 'longitude'

lat_var = nc.createVariable('lat', np.float64, ('station',))
lat_var.units = 'degrees_north'
lat_var.standard_name = 'latitude'

The standard calls these "instance variables" because each value corresponds to an instance of a feature. From here we can create our height coordinate variable:

In [21]:
heights_var = nc.createVariable('heights', np.float32, ('heights',))
heights_var.units = 'meters'
heights_var.standard_name = 'altitude'
heights_var.positive = 'up'
heights_var[:] = heights

Station IDs

Now we can also write our station IDs to a variable. This is a 2D variable, but one of the dimensions is simply there to allow treating the strings as character arrays. We also give it a cf_role attribute with a value of 'profile_id' to help software identify individual profiles:

In [22]:
stid_var = nc.createVariable('stid', 'c', ('station', 'str_len'))
stid_var.cf_role = 'profile_id'
stid_var.long_name = 'Station identifier'
stid_var[:] = stids
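
If assigning the Python strings directly doesn't work with your version of netCDF4-python, the library's stringtochar helper makes the conversion to a character array explicit (an equivalent alternative, not required here):

from netCDF4 import stringtochar

# Explicitly convert the station IDs to a (station, str_len) character array
stid_var[:] = stringtochar(np.array(stids, dtype='S4'))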

Now our CDL looks like:

netcdf obs_data {
  dimensions:
    station = 3 ;
    heights = 10 ;
    str_len = 4 ;
  variables:
    double lon(station) ;
      lon:units = "degrees_east" ;
      lon:standard_name = "longitude" ;
    double lat(station) ;
      lat:units = "degrees_north" ;
      lat:standard_name = "latitude" ;
    float heights(heights) ;
      heights:units = "meters" ;
      heights:standard_name = "altitude";
      heights:positive = "up" ;
    char stid(station, str_len) ;
      stid:cf_role = "profile_id" ;
      stid:long_name = "Station identifier" ;
  attributes:
    :Conventions = "CF-1.7" ;
    :featureType = "profile" ;
}

Writing the field

Now all that's left is to write our profile data, which looks fairly standard. We also add a scalar variable for the time at which these profiles were captured:

In [23]:
time_var = nc.createVariable('time', np.float32, ())
time_var.units = 'minutes since 2019-07-16 17:00'
time_var.standard_name = 'time'
time_var[:] = [5.]

temp_var = nc.createVariable('temperature', np.float32, ('station', 'heights'))
temp_var.units = 'celsius'
temp_var.standard_name = 'air_temperature'
temp_var.coordinates = 'lon lat heights time'
temp_var[:] = temps

Note the use of the coordinates attribute to list the coordinates that apply to the temperature data; lon, lat, and time are auxiliary coordinate variables here, since they are not proper coordinate variables (one-dimensional variables with the same name as their dimension). This yields the CDL for the variables:

  variables:
    double lon(station) ;
      lon:units = "degrees_east" ;
      lon:standard_name = "longitude" ;
    double lat(station) ;
      lat:units = "degrees_north" ;
      lat:standard_name = "latitude" ;
    float heights(heights) ;
      heights:units = "meters" ;
      heights:standard_name = "altitude";
      heights:positive = "up" ;
    char stid(station, str_len) ;
      stid:cf_role = "profile_id" ;
      stid:long_name = "Station identifier" ;
    float time ;
      time:units = "minutes since 2019-07-16 17:00" ;
      time:standard_name = "time" ;
    float temperature(station, heights) ;
      temperature:units = "celsius" ;
      temperature:standard_name = "air_temperature" ;
      temperature:coordinates = "lon lat heights time" ;

These standards for storing DSGs extend to time series, trajectories, and combinations of them. They can also accommodate differing amounts of data per feature using ragged arrays, as sketched below. For more information, see the main CF conventions document or the annotated DSG examples.
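
For instance, the contiguous ragged array form stores all profiles back to back along a single sample dimension, with a count variable whose sample_dimension attribute names that dimension. A rough sketch (hypothetical file and sizes; lon, lat, time, and the coordinates attribute omitted for brevity):

# Sketch of a contiguous ragged array: profiles with differing numbers of levels
ragged = Dataset('ragged_profiles.nc', 'w', format='NETCDF4_CLASSIC', diskless=True)
ragged.Conventions = 'CF-1.7'
ragged.featureType = 'profile'
ragged.createDimension('profile', 3)
ragged.createDimension('obs', None)  # all samples from all profiles, back to back

row_size = ragged.createVariable('row_size', np.int32, ('profile',))
row_size.long_name = 'number of obs for this profile'
row_size.sample_dimension = 'obs'    # links the counts to the sample dimension
row_size[:] = [4, 10, 7]             # e.g. profiles with 4, 10, and 7 levels

z = ragged.createVariable('z', np.float32, ('obs',))
z.units = 'meters'
z.standard_name = 'altitude'
z.positive = 'up'

temp = ragged.createVariable('temperature', np.float32, ('obs',))
temp.units = 'celsius'
temp.standard_name = 'air_temperature'
ragged.close()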

EXERCISE:
  • Create another 3D variable representing relative humidity.
  • Create another variable for surface precipitation.
In [24]:
# YOUR CODE GOES HERE

References

  • See CF Conventions doc (1.7)
  • See Jonathan Gregory's old CF presentation
  • See CF presentation I gave at Oct 2018 nc training workshop
  • See NASA ESDS “Dataset Interoperability Recommendations for Earth Science” (web page)
  • See CF Data Model (cfdm) python package tutorial
  • See Tim Whiteaker's cfgeom python package (GitHub repo)(tutorial)