The HYSETS database contains a multitude of datasets that, when combined, provide a rich and comprehensive environment for testing in applications such as hydrological modelling. Four categories of data are combined to provide this sandbox environment: hydrometric, watershed delineation, meteorological and physiographic data. The methods to extract, validate and combine these sources of data are presented according to each category.
The daily hydrometric data were collected independently for the three countries covered in this database, namely Canada, the United States and Mexico. The Canadian hydrometric data were provided through the Environment and Climate Change Canada (ECCC) Water Survey of Canada (WSC) National Water Data Archive (HYDAT), available at: https://collaboration.cmc.ec.gc.ca/cmc/hydrometrics/www/. The data are downloadable in a single Microsoft Access table and include station metadata, such as station location, drainage area and flow regime, as well as the actual daily flow data. A filter was applied to select only the stations whose flow regime is natural, i.e. not regulated by man-made structures over the study period. This was performed to ensure that the hydrometric data and meteorological data were as naturally correlated as possible.
Hydrometric data for the United States were collected from the United States Geological Survey (USGS) National Water Information System (NWIS) web portal23, available at: https://waterdata.usgs.gov/nwis/uv/?referred_module=sw. Data were batch-downloaded and processed to obtain full time-series for each station. While some metadata are made available, such as drainage area, station coordinates and statistics on historical flow, there is no information on the flow regime or on the presence of regulation structures. An alternative method to filter out regulated stations was thus devised. To determine which sites were regulated or were affected by regulation, the streamflow dataset was cross-checked against the peak flow statistics database from NWIS. Stations whose peaks were influenced by regulation works at least once in the archive were removed from the HYSETS dataset. Therefore, all regulated or partly regulated stations were excluded to keep the data as close to natural as possible. While this method seems to have produced the desired results, it is possible that the filtering method is not perfectly accurate and as such, users should verify individual hydrographs if any doubts arise on the river’s regulation status.
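The cross-check against the peak flow database can be illustrated with a minimal sketch. The record layout, the station identifiers and the choice of qualification codes "5" and "6" as regulation markers are assumptions made for illustration and should be checked against the NWIS documentation; this is not the processing code used to build HYSETS.

```python
# Hypothetical peak-flow records: (station_id, water_year, qualification_codes).
# Codes "5" and "6" are ASSUMED here to mark peaks affected by regulation
# or diversion; verify against the NWIS documentation before reuse.
REGULATION_CODES = {"5", "6"}

def regulated_stations(peak_records):
    """Return the set of stations with at least one regulation-affected peak."""
    flagged = set()
    for station_id, _year, codes in peak_records:
        if REGULATION_CODES & set(codes):
            flagged.add(station_id)
    return flagged

def keep_natural(station_ids, peak_records):
    """Keep only stations never flagged as regulated in the peak-flow archive."""
    flagged = regulated_stations(peak_records)
    return [s for s in station_ids if s not in flagged]

# Illustrative data: STN_B has a single flagged peak and is therefore removed.
peaks = [
    ("STN_A", 1990, []),
    ("STN_A", 1991, ["2"]),
    ("STN_B", 1990, []),
    ("STN_B", 1991, ["6"]),
]
natural = keep_natural(["STN_A", "STN_B"], peaks)
```

A single flagged peak is enough to exclude a station, which mirrors the "at least once in the archive" rule described above.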
Finally, hydrometric data in Mexico were collected from the “Banco Nacional de Datos de Aguas Superficiales” (BANDAS) produced and maintained by the National Water Commission (CONAGUA) of the Mexican Ministry of Environment. The data were downloaded from the BANDAS web portal and were filtered according to data quality (visual inspection, hydrological model calibration performance) and time series length. The BANDAS data are available at: www.conagua.gob.mx/CONAGUA07/Contenido/Documentos/Portada%20BANDAS.htm.
Metadata is available for every station in the database and includes station coordinates, drainage area and information about the presence/absence of regulation structures. The stations included in the HYSETS database are all located on basins exempt from any regulation structures. Finally, a filter was applied to ensure that all stations had at least one year of recorded streamflow data. All hydrometric data were converted into units of cubic meters per second (m3/s).
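As a rough illustration of the last two steps, the sketch below converts a daily series from cubic feet per second to cubic meters per second and checks the minimum record length. That the input is in ft3/s, and that "one year" means 365 non-missing daily values, are assumptions made for this example only.

```python
CFS_TO_CMS = 0.3048 ** 3  # one cubic foot expressed in cubic metres

def to_cms(flows_cfs):
    """Convert daily flows from ft3/s to m3/s; missing values (None) are kept."""
    return [None if q is None else q * CFS_TO_CMS for q in flows_cfs]

def has_min_record(flows, min_days=365):
    """True if the series holds at least `min_days` non-missing daily values.

    Interpreting "one year" as 365 valid days is an assumption.
    """
    return sum(q is not None for q in flows) >= min_days

# Illustrative series: 100 ft3/s is roughly 2.83 m3/s.
example_cms = to_cms([100.0, None])
short_record = [1.0] * 364 + [None] * 10
long_record = [1.0] * 365
```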
The watershed delineation boundaries are a critical component of the dataset as all meteorological data need to be extracted according to those limits for each watershed. For most of the watersheds, the water management agencies provide the official boundaries directly in the form of a shapefile or geodatabase. The Canadian data are available at: http://donnees.ec.gc.ca/data/water/products/national-hydrometric-network-basin-polygons/?lang=en and the United States boundaries are available at the following website: water.usgs.gov/GIS/metadata/usgswrd/XML/streamgagebasins.xml.
The drainage area was made available for most hydrometric gauges by the water management agencies that collate and curate those sources of data. However, a filter was applied to remove all stations that did not have an official drainage area value at the hydrometric gauge, as the value is key in determining if the watershed bounds are acceptable or not. The drainage areas were validated using the watershed delineation boundaries as described above in the geospatial analysis software QGIS 3.4. However, in some instances, the water management agencies did not provide watershed boundary files as they had not been produced or made available publicly. In those cases, estimated watershed contours were taken from the Global Streamflow Indices and Metadata (GSIM) project21,22 where available. For catchments where GSIM boundaries were kept for the data extraction, a flag (“flag_GSIM_boundaries”) was set to 1 to inform users that the boundaries are from GSIM and not from the official agencies. The GSIM-derived area is also identified in those cases in the dataset, under the “Drainage_Area_GSIM_km2” heading. For catchments smaller than or equal to 50 km2 in size according to the official gauge, a bounding box of equal surface area around the catchment outlet was provided as the contour of the catchment, as catchment delineations are difficult at those scales due to the resolution and hydrometric gauge accuracy. Furthermore, weather data and other catchment attributes are coarser than the area of the catchments in most cases. These catchments are represented by the “flag_artificial_boundaries” indicator in the dataset files.
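One possible reading of the bounding-box rule for small catchments is sketched below. The text does not specify the shape or placement of the box, so a square centred on the outlet and a fixed length of 111.32 km per degree of latitude are assumptions, as are the function names.

```python
import math

KM_PER_DEG = 111.32  # approximate length of one degree of latitude (assumption)

def artificial_boundary(lat, lon, area_km2):
    """Square box (lon_min, lat_min, lon_max, lat_max) centred on the outlet.

    The box has the same surface area as the official drainage area.
    """
    half_side_km = math.sqrt(area_km2) / 2.0
    dlat = half_side_km / KM_PER_DEG
    dlon = half_side_km / (KM_PER_DEG * math.cos(math.radians(lat)))
    return (lon - dlon, lat - dlat, lon + dlon, lat + dlat)

def box_area_km2(box):
    """Approximate area of a small lat/lon box, used here as a sanity check."""
    lon_min, lat_min, lon_max, lat_max = box
    mid_lat = math.radians((lat_min + lat_max) / 2.0)
    height = (lat_max - lat_min) * KM_PER_DEG
    width = (lon_max - lon_min) * KM_PER_DEG * math.cos(mid_lat)
    return height * width

# Illustrative 25 km2 catchment with its outlet at 45 N, 73 W.
box = artificial_boundary(45.0, -73.0, 25.0)
```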
In the HYSETS database, all boundaries are provided in a WGS84-projected ESRI shapefile and include the following properties: Watershed ID (to link to the data in the netCDF files), source of the data, name of the hydrometric station, official ID of the hydrometric station and the flag to identify if the boundaries were derived from GSIM. All drainage areas are in km2.
The HYSETS database contains meteorological data from five sources. The following sections detail the methods applied to integrate the data into the database at the catchment scale. All precipitation data are provided in millimetres per day (mm/d), and temperature values are in degrees Celsius (°C).
Three weather station products were used to cover the North American domain: the Environment and Climate Change Canada (ECCC) weather stations for Canada, available at: https://climate.weather.gc.ca/, the Global Historical Climatology Network Daily (GHCND) station database24,25 for the United States and Mexico, available at: https://www.ncdc.noaa.gov/ghcnd-data-access and the station-based serially complete dataset for North America (SCDNA)26, available at: https://zenodo.org/record/3735534. The ECCC and GHCND datasets were combined to provide a North American weather station dataset of raw observations with potentially incomplete coverage for the desired 1950–2018 period. The SCDNA was also added to provide another dataset of stations with complete records for the 1979–2018 period.
The daily historical ECCC weather data were downloaded from the ECCC web portal for the years 1950–2018. Data include daily precipitation, maximum and minimum temperature for over 8578 stations across Canada, with varying levels of data completeness and record length. While the GHCND database contained Canada’s weather stations as well, the ECCC database was more complete and was preferred. Over Mexico, a total of 5249 precipitation and 5071 temperature stations are available in the GHCND dataset. Similarly, there are 55693 precipitation and 16011 temperature stations across the United States. While these stations undergo some quality control (QC), both data with and without QC were extracted and provided in the HYSETS database.
These stations were combined into a 69520-station dataset for precipitation and a 29960-station dataset for temperature. The extraction process was performed twice for station data: once using the quality-controlled GHCND database and once using all available data, including the non-quality-controlled data.
The SCDNA provides daily precipitation, maximum and minimum temperature for 27280 stations over North America. Strict quality control was performed on the original station data by the SCDNA authors. Missing data were infilled/reconstructed using information from neighbouring stations as well as three reanalysis products (ERA5, JRA-55 and MERRA-2). Strategies based on quantile mapping, statistical interpolation and machine learning were used to implement the corrections. Overall, this dataset was shown to provide a better agreement with station observations than four gridded products.
All three station datasets (ECCC/GHCND with and without QC and SCDNA) were weighted separately using Thiessen polygons for each watershed. The stations contributing to the Thiessen polygon calculation were those located within an artificial boundary defined as the real watershed boundaries extended by a buffer of 1° of latitude and longitude. This step was performed in order to exclude stations that would be too distant to represent the watershed conditions. Due to the highly variable nature of gauge-data quality and station longevity in the case of the ECCC/GHCND combined datasets, Thiessen polygons were computed for each day, using the available data for each day. Therefore, there are some discontinuities when a station is added or removed, or when a station temporarily has no record for a given period of time. There are also cases where no data at all were available for a given period. In those cases, the meteorological data fields are set to NaN, and the user is encouraged to replace those data using other means (either manual replacement or replacement with one of the gridded products as described below). The 1950–1978 period for the SCDNA dataset was also set to NaN.
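The day-by-day weighting can be sketched with a discrete approximation of Thiessen polygons: a set of sample points inside the watershed is assigned to the nearest station that reported on that day, and each station's weight is its share of the points. The sample points, planar distances and data layout are assumptions for illustration, and the 1° buffer selection of candidate stations is assumed to have been done beforehand.

```python
import math

def thiessen_weights(sample_points, stations):
    """Share of watershed sample points closest to each station."""
    counts = {name: 0 for name in stations}
    for px, py in sample_points:
        nearest = min(
            stations,
            key=lambda n: (stations[n][0] - px) ** 2 + (stations[n][1] - py) ** 2,
        )
        counts[nearest] += 1
    total = len(sample_points)
    return {name: c / total for name, c in counts.items()}

def daily_catchment_mean(sample_points, stations, observations):
    """Recompute the weights each day from the stations that reported a value."""
    n_days = len(next(iter(observations.values())))
    series = []
    for day in range(n_days):
        available = {n: xy for n, xy in stations.items()
                     if not math.isnan(observations[n][day])}
        if not available:
            series.append(math.nan)  # no station at all: field left as NaN
            continue
        w = thiessen_weights(sample_points, available)
        series.append(sum(w[n] * observations[n][day] for n in available))
    return series

# Illustrative case: two stations, three days, station A stops reporting on day 2.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
stations = {"A": (0.0, 0.0), "B": (3.0, 0.0)}
obs = {"A": [2.0, math.nan, math.nan], "B": [4.0, 6.0, math.nan]}
series = daily_catchment_mean(points, stations, obs)
```

On the second day the whole watershed is attributed to station B, which is the kind of discontinuity mentioned above.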
All three sets of catchment-averaged data are available in the HYSETS database, opening the possibility of performing quantitative assessments of the impacts of using more (but perhaps less reliable) meteorological data in impact studies.
Natural Resources Canada gridded climate data for Canada
The Natural Resources Canada (NRCan) gridded climate data product was made available from Natural Resources Canada’s Canadian Forest Service and covers the entirety of Canada up to approximately 84°N latitude27,28,29. It includes daily precipitation, maximum and minimum temperature data on a ~10 km spatial grid. Data cover the period 1950–2010 inclusively. Data points falling within the catchment boundaries were averaged to obtain a single continuous time-series, as the NRCan dataset contains no missing data. When catchments were too small to contain a data point, the closest data point to the catchment centroid was used to populate the time series for that watershed. Information on the NRCan dataset can be found at the following website: https://cfs.nrcan.gc.ca/projects/3/4.
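The averaging rule, including the fallback for small catchments, can be sketched as follows. The polygon and grid representations are hypothetical, and the centroid is approximated by the mean of the polygon vertices, which is a simplification rather than the method used for HYSETS.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test; `polygon` is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def catchment_series(polygon, grid):
    """Average the grid series inside the polygon, else use the nearest point.

    `grid` maps (x, y) grid-point coordinates to a daily time series.
    """
    inside = [s for (x, y), s in grid.items() if point_in_polygon(x, y, polygon)]
    if not inside:
        # Vertex mean used as a rough centroid (assumption).
        cx = sum(x for x, _ in polygon) / len(polygon)
        cy = sum(y for _, y in polygon) / len(polygon)
        nearest = min(grid, key=lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2)
        return list(grid[nearest])
    return [sum(vals) / len(vals) for vals in zip(*inside)]

# Illustrative grid with three points and two catchments.
grid = {(0.5, 0.5): [1.0, 3.0], (1.5, 0.5): [3.0, 5.0], (5.0, 5.0): [9.0, 9.0]}
large = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
small = [(4.0, 4.0), (4.2, 4.0), (4.2, 4.2), (4.0, 4.2)]
large_series = catchment_series(large, grid)
small_series = catchment_series(small, grid)
```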
Livneh gridded climate data for continental USA, Mexico and southern Canada
The Livneh database includes interpolated precipitation and temperature data on a regular 0.0625 × 0.0625° grid over the continental United States, Mexico and southern Canada30,31. The data cover the period 1915–2015, although only the portion 1950–2015 was used in this dataset. It includes daily precipitation, maximum and minimum temperatures for the entire period without any missing data. Some discontinuities are present at the United States/Mexico border as station density and quality differ and influence the interpolation process. Similar but smaller discontinuities are present at the United States/Canada border. The catchment-averaging process was the same as for the NRCan dataset. The Livneh data were provided by the NOAA/OAR/ESRL PSD, Boulder, Colorado, USA, from their Web site at https://www.esrl.noaa.gov/psd/.
ERA5 and ERA5-Land reanalyses
The ERA532 and ERA5-Land33 reanalyses are hourly products developed by the European Centre for Medium-Range Weather Forecasts (ECMWF). The reanalyses provide estimates of a multitude of hydrometeorological and atmospheric variables including precipitation and temperature on regular grids covering the entire surface of the Earth. ERA5 was first implemented with a 0.25° × 0.25° spatial resolution and data are available from 1979–2019, although the HYSETS database stops in 2018 to remain consistent with the other data sources. The ERA5-Land reanalysis is a refined version of ERA5 with a spatial resolution of approximately 9 km. It was driven by the ERA5 reanalysis and a mask was applied such that only land masses are modelled in the refined domain. ERA5-Land covers the period from 1981 onwards, and as such the years 1981–2018 are available at present in the HYSETS database.
As both the ERA5 and ERA5-Land products are hourly and data are archived without UTC offsets, it was necessary to shift the data according to the grid point locations. The longitude of the watershed centroid was used to determine the time zone to which it belongs. Based on this time zone, the hourly data were shifted by the same number of hours to realign the daily cycle between 00:00 and 24:00. For instance, a watershed located in the time zone -7 will have the whole time series shifted by 7 hours to match its proper daily cycle. The ERA5 and ERA5-Land data were downloaded from the Copernicus Climate Data Store, available at: https://climate.copernicus.eu/climate-reanalysis. Note that the HYSETS dataset contains modified ERA5 and ERA5-Land reanalysis data from the Copernicus Climate Change Service Information and that neither the European Commission nor ECMWF is responsible for any use that may be made of the Copernicus Information or Data it contains.
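The realignment can be illustrated with a short sketch. The rule used to map longitude to a time zone is not given in the text, so a nominal offset of one hour per 15° of longitude is assumed here, and the function names are hypothetical. The example sums hourly values, as would suit precipitation; temperature would instead use the daily maximum and minimum.

```python
def utc_offset_from_longitude(lon_deg):
    """Nominal time zone from longitude: one hour per 15 degrees (assumption)."""
    return round(lon_deg / 15.0)

def daily_totals_local(hourly_utc, lon_deg):
    """Sum hourly UTC values over complete local days.

    For an offset of -7, local midnight falls at 07:00 UTC, so each local
    day spans UTC indices 7..30, 31..54, and so on.
    """
    offset = utc_offset_from_longitude(lon_deg)
    start = (-offset) % 24  # UTC hour index of the first local midnight
    return [
        sum(hourly_utc[i:i + 24])
        for i in range(start, len(hourly_utc) - 23, 24)
    ]

# Illustrative series: three UTC days of hourly values 0, 1, ..., 71 at 105 W.
hourly = list(range(72))
daily = daily_totals_local(hourly, -105.0)
```

Only complete local days are returned, so the first and last partial days of the record are dropped in this sketch.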
Once the data had been processed to the daily scale, the extraction followed the same process as for the NRCan and Livneh datasets. However, for the reanalysis products, the data are available for all watersheds given that reanalyses cover the entire globe.
Snow Water Equivalent (SWE) data
Two snow-water equivalent databases are provided in HYSETS. The first is a 9-year (2010–2018) time series of watershed-averaged daily high-resolution (roughly 1 km) data provided by the Snow Data Assimilation System (SNODAS) analysis34 available at: https://nsidc.org/data/g02158. The SNODAS data were averaged at the catchment scale using points within the watershed boundaries, or the closest point for the smallest of watersheds that did not contain a point within their limits. SWE data units are millimetres (mm) and represent the value expected on the ground for that day. Any missing values are replaced by NaNs. SNODAS data incorporate multiple sources of data and provide the best possible estimate of SWE at each point location. HYSETS simply averages those values at the catchment scale for users to easily interpret and analyse the data with respect to the rest of the data. SNODAS’ main limitation is that its spatial coverage is restricted to the continental USA and the lower portions of Canada below 54°N latitude. This means that many snowy catchments in northern Canada and Alaska are not covered by SNODAS.
The second dataset is the ERA5-Land reanalysis product, which covers the period 1981–2018. It was extracted using the same method as for the ERA5-Land precipitation and temperature data and was also averaged at the daily scale using an hourly UTC offset depending on the time zone. One main advantage of the ERA5-Land reanalysis SWE is that it covers the entire globe and thus is available for all catchments, even those above 54°N that are not covered by SNODAS.
One of the strengths of the dataset is the inclusion of a multitude of properties to describe and characterize each of the watersheds. The process is similar for all of the properties, but some variables required slightly more complex operations than others.
The first set of data was based on geographic and topographic properties and was derived from the EarthEnv-DEM90 digital elevation model35 available at: https://www.earthenv.org/DEM.html. The process was performed in a meta-software called PAVICS-Hydro (Power Analytics and Visualization for Climate Science – hydrological modelling toolbox) being developed for this purpose, available at: https://pavics-sdi.readthedocs.io/. This set includes mean watershed elevation (meters), slope (degrees), aspect (degrees), Gravelius index (unitless) and perimeter (kilometers). The elevation and perimeter are self-explanatory. The slope is the average slope when considering the individual elevation differences between tiles and can be seen as an indicator of the catchment relief, with higher slopes indicating more mountainous regions. The aspect is the main orientation of the catchment, i.e. where the average slope points towards. The Gravelius index is the ratio of the perimeter of the watershed compared to the perimeter of a circle of the same area. Higher values indicate more elongated or less compact catchments. All the DEM points falling within a watershed boundary were used to compute these characteristics.
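As a worked illustration of the Gravelius index defined above, the perimeter of a circle with area A is 2·sqrt(π·A), so the index is P / (2·sqrt(π·A)). The function name below is hypothetical and the snippet is only a minimal sketch of that definition.

```python
import math

def gravelius_index(perimeter_km, area_km2):
    """Ratio of the watershed perimeter to that of a circle of equal area."""
    return perimeter_km / (2.0 * math.sqrt(math.pi * area_km2))

# A circle gives exactly 1; a 10 km x 10 km square gives about 1.128.
circle = gravelius_index(2.0 * math.pi * 10.0, math.pi * 10.0 ** 2)
square = gravelius_index(40.0, 100.0)
```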
The database also provides elevation band data for each catchment in 100-meter intervals. This information was extracted from the EarthEnv 90 m DEM by applying a zonal histogram to a reclassified DEM based on 100-meter intervals in the QGIS software. The data are provided in a separate .csv file “HYSETS_elevation_bands.csv” and represent the percentage of catchment area lying below that elevation. Therefore, the curves are cumulative sums of these areas. This will allow users to provide information to more complex routines and models such as the well-known CemaNeige snow accounting and melt model. The elevation bands can also be used to adjust precipitation rates and temperatures per elevation band based on the average catchment elevation and the elevations in the bands, depending on the desired lapse rates and correction methods.
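The cumulative nature of the elevation band curves can be illustrated with a minimal sketch. It works directly on a list of DEM cell elevations rather than on a QGIS zonal histogram, assumes non-negative elevations, and does not reproduce the layout of the .csv file.

```python
def cumulative_band_percentages(elevations_m, band_width=100):
    """Percentage of cells lying below the upper edge of each elevation band."""
    top = (int(max(elevations_m)) // band_width + 1) * band_width
    edges = range(band_width, top + band_width, band_width)
    n = len(elevations_m)
    return {edge: 100.0 * sum(z < edge for z in elevations_m) / n
            for edge in edges}

# Illustrative catchment with four DEM cells.
bands = cumulative_band_percentages([50.0, 150.0, 160.0, 250.0])
```

In this example a quarter of the cells lie below 100 m, three quarters below 200 m, and all of them below 300 m.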
The land use percentages reflect which fraction of the watershed is covered by each of the classification categories. The North American Land Change Monitoring System (NALCMS) imagery data from 2010 were used for this purpose36,37. NALCMS was developed by Canada, the United States and Mexico to track the evolution of land use over time. A static dataset for 2010 is available and contains 19 land use classes that were combined to form 7 meta-categories: forests, shrubs, croplands, wetlands, water, urban and permanent snow/ice. For example, coniferous, deciduous and mixed forests were combined into the “forest” category. The original data are available from the Commission for Environmental Cooperation website.
For each watershed, the NALCMS raster dataset was queried through the PAVICS zonal statistics toolbox for each of the 19 original categories. The values were then aggregated to the 7 categories used in the HYSETS dataset and the relative fraction of each was computed.
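The aggregation step can be sketched as a simple lookup and sum. Only the grouping of the three forest types is stated in the text; the other class labels and their mapping below are hypothetical placeholders, not the actual NALCMS legend or the HYSETS mapping, and classes outside the mapping are not handled.

```python
# Hypothetical mapping from raw class labels to meta-categories. Only the
# forest grouping comes from the text; the rest is illustrative.
META_CATEGORY = {
    "coniferous_forest": "forest",
    "deciduous_forest": "forest",
    "mixed_forest": "forest",
    "shrubland": "shrubs",
    "cropland": "croplands",
    "wetland": "wetlands",
    "water": "water",
    "urban": "urban",
    "snow_ice": "snow_ice",
}

def land_use_fractions(pixel_counts):
    """Aggregate per-class pixel counts into meta-category fractions."""
    totals = {}
    for raw_class, count in pixel_counts.items():
        meta = META_CATEGORY[raw_class]
        totals[meta] = totals.get(meta, 0) + count
    n_pixels = sum(totals.values())
    return {meta: count / n_pixels for meta, count in totals.items()}

# Illustrative zonal counts for one watershed.
counts = {"coniferous_forest": 50, "deciduous_forest": 20,
          "mixed_forest": 10, "water": 20}
fractions = land_use_fractions(counts)
```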
The dataset used to characterize catchment geology is the GLobal HYdrogeology MaPS (GLHYMPS) of subsurface permeability and porosity38. This dataset provides quantitative estimates of permeability and porosity below the soil horizon. Catchment averages of these two variables were calculated by weighting the contribution of each spatial polygon by the fraction of the catchment it covers. The arithmetic mean was used for porosity, but for permeability, the geometric mean was taken. The same process as for NALCMS was performed, except that the GLHYMPS vector data were first converted to a raster format. The same zonal statistics tools from PAVICS were used to extract the average values of both variables for each catchment. The permeability units are in m2 whereas the porosity is archived as a fraction. The GLHYMPS dataset is available at: https://dataverse.scholarsportal.info/dataset.xhtml?persistentId=doi:10.5683/SP2/DLGXYO.
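The two averaging rules differ only in the space in which the weighted mean is taken, as the minimal sketch below shows. The input layout (area fraction, porosity, permeability in m2 per polygon) is an assumption for illustration and does not reflect the raster-based workflow actually used.

```python
import math

def weighted_means(polygons):
    """Area-weighted arithmetic mean of porosity and geometric mean of permeability.

    `polygons` is a list of (area_fraction, porosity, permeability_m2) tuples.
    """
    total = sum(f for f, _, _ in polygons)
    porosity = sum(f * phi for f, phi, _ in polygons) / total
    # Geometric mean: average the logarithms, then transform back.
    log_k = sum(f * math.log10(k) for f, _, k in polygons) / total
    return porosity, 10.0 ** log_k

# Illustrative catchment covered equally by two geological units.
porosity, permeability = weighted_means([(0.5, 0.1, 1e-12), (0.5, 0.3, 1e-14)])
```

The geometric mean of 1e-12 and 1e-14 m2 is 1e-13 m2, whereas an arithmetic mean would be dominated by the more permeable unit.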
For all the above-mentioned properties, if for any reason the extraction process could not be performed (watershed outside product boundaries, unavailability at a given location), the values were replaced by NaNs and a flag was set in the metadata and properties file.
Monthly hydrometeorological data
One aspect that must be noted is that the data are all averaged temporally at the daily scale and spatially at the catchment scale. This means that for large catchments, it is possible that the daily data are not very representative of the localized precipitation and runoff events. For this reason, the HYSETS database also includes monthly-aggregated data for all catchments, which allow products to be evaluated in terms of bias and mass balance over longer periods. The daily data are still made available for all catchments and users are invited to consider which timescale is more appropriate for their use case.
The monthly hydrometeorological data include data from the seven temperature and precipitation data sources as well as streamflow. SWE values were not provided at the monthly scale as SWE typically varies less than the other variables. All monthly data are combined into a single netCDF file named “HYSETS_2020_monthly_meteorolgical_data.nc”.
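Users who wish to build comparable monthly series from the daily data could proceed along the lines of the sketch below. The aggregation rules are not stated in the text, so summing precipitation, averaging temperature and streamflow, and skipping NaN days are all assumptions made for illustration.

```python
import math
from collections import defaultdict
from datetime import date

def monthly_aggregate(dates, values, how):
    """Aggregate daily values by (year, month); `how` is "sum" or "mean".

    NaN days are skipped, which is an assumption, not the HYSETS rule.
    """
    groups = defaultdict(list)
    for day, value in zip(dates, values):
        if not math.isnan(value):
            groups[(day.year, day.month)].append(value)
    result = {}
    for key, vals in groups.items():
        total = sum(vals)
        result[key] = total if how == "sum" else total / len(vals)
    return result

# Illustrative three-day series straddling a month boundary.
dates = [date(2000, 1, 30), date(2000, 1, 31), date(2000, 2, 1)]
precip = [1.0, 2.0, 4.0]
monthly_sum = monthly_aggregate(dates, precip, "sum")
monthly_mean = monthly_aggregate(dates, precip, "mean")
```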