summary_database() scans the specified directory and returns a data.table() summarizing all available CMIP6 files, matched against the output file index loaded with load_cmip6_index().

summary_database(
  dir,
  by = c("activity", "experiment", "variant", "frequency", "variable", "source",
    "resolution"),
  mult = c("skip", "latest"),
  append = FALSE,
  miss = c("keep", "overwrite"),
  recursive = FALSE,
  update = FALSE,
  warning = TRUE
)

Arguments

dir

A single string indicating the directory where CMIP6 model output NetCDF files are stored.

by

The grouping columns used to summarize the database status. Should be a subset of:

  • "experiment": root experiment identifiers

  • "source": model identifiers

  • "variable": variable identifiers

  • "activity": activity identifiers

  • "frequency": sampling frequency

  • "variant": variant label

  • "resolution": approximate horizontal resolution

mult

Action to take when multiple files match the same case in the CMIP6 index. If "latest", the file with the latest modification time is used. If "skip", all matched files are skipped and the case is kept as unmatched. Default: "skip".
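The two policies can be illustrated with a minimal sketch in base R. resolve_matches() and the candidates table are hypothetical and are not part of epwshiftr; they only mimic the behaviour described above for a single index case.

```r
# Hypothetical helper, not part of epwshiftr: resolve several candidate
# files that match the same index case.
resolve_matches <- function(candidates, mult = c("skip", "latest")) {
  mult <- match.arg(mult)
  # nothing to resolve when zero or one file matched
  if (nrow(candidates) <= 1L) return(candidates)
  if (mult == "skip") {
    # treat the case as unmatched: return zero rows
    return(candidates[0L, , drop = FALSE])
  }
  # "latest": keep the file with the most recent modification time
  candidates[which.max(as.numeric(candidates$mtime)), , drop = FALSE]
}

# Invented example data: two files matching the same case
candidates <- data.frame(
  path = c("a/tas_day.nc", "b/tas_day.nc"),
  mtime = as.POSIXct(c("2020-01-01", "2021-06-01"), tz = "UTC"),
  stringsAsFactors = FALSE
)

skipped <- resolve_matches(candidates, "skip")
latest <- resolve_matches(candidates, "latest")
```

With "skip" the case ends up with no file attached, whereas "latest" keeps only the most recently modified candidate.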

append

If TRUE, the status of CMIP6 files will only be updated if they are not found in the previous summary. This is useful when CMIP6 files are stored in different directories. Default: FALSE.

miss

Action to take when files matched in the previous summary no longer exist when the current summary is run. Only applicable when append is TRUE. If "keep", the metadata for the missing output files is kept. If "overwrite", the existing metadata for those outputs is first removed from the output file index and then overwritten based on the newly matched files, if possible. Default: "keep".
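As a rough sketch of the multi-directory workflow, assuming the output file index has already been created, a second directory can be scanned with append = TRUE so that cases matched in the first pass are left untouched. The wrapper summarise_two_dirs() and its argument names are hypothetical.

```r
# Hypothetical wrapper, not part of epwshiftr. Defining it has no side
# effects; calling it requires epwshiftr and an existing output file index.
summarise_two_dirs <- function(first_dir, second_dir) {
  # load the previously created output file index
  epwshiftr::load_cmip6_index()
  # first pass: match files stored in the first directory
  epwshiftr::summary_database(first_dir)
  # second pass: only update cases not found in the previous summary and
  # keep the metadata of previously matched files that are now missing
  epwshiftr::summary_database(second_dir, append = TRUE, miss = "keep")
}
```

The value of the second call is returned, so the result reflects files found in either directory.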

recursive

If TRUE, scan recursively into directories. Default: FALSE.

update

If TRUE, the output file index will be updated based on the matched NetCDF files in the specified directory. If FALSE, only the currently loaded index will be updated, and the actual index database file saved in get_data_dir() will remain unchanged. Default: FALSE.

warning

If TRUE, warning messages are shown when target files are missing or multiple files match the same case. Default: TRUE.

Value

A data.table::data.table() containing corresponding grouping columns plus:

Column          Type            Description
datetime_start  POSIXct         Start date and time of simulation
datetime_end    POSIXct         End date and time of simulation
file_num        Integer         Total number of files per group
file_size       Units (Mbytes)  Approximate total size of files
dl_num          Integer         Total number of files downloaded
dl_percent      Units (%)       Total percentage of files downloaded
dl_size         Units (Mbytes)  Total size of files downloaded

Two extra data.table::data.table() objects are also attached as attributes:

  • not_found: A data.table::data.table() containing metadata for CMIP6 outputs that are listed in the current CMIP6 output file index but whose recorded file paths are no longer valid and cannot be found in the current database.

  • not_matched: A data.table::data.table() containing metadata for CMIP6 output files that are found in the current database but are not listed in the current CMIP6 output file index.
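Both tables can be read with attr(). The sketch below uses a mock object in place of a real summary_database() result; its columns and values are invented for illustration, and a plain data.frame stands in for the data.table.

```r
# Mock stand-in for a summary_database() result (invented values)
sm <- data.frame(
  experiment = "ssp585",
  file_num = 3L,
  dl_num = 2L,
  stringsAsFactors = FALSE
)
# Attach mock attribute tables under the documented attribute names
attr(sm, "not_found") <- data.frame(
  file_path = "old/location/file.nc",
  stringsAsFactors = FALSE
)
attr(sm, "not_matched") <- data.frame(
  file_path = character(),
  stringsAsFactors = FALSE
)

# Retrieve the attached tables by name
not_found <- attr(sm, "not_found")
not_matched <- attr(sm, "not_matched")

# Count the entries in each table
n_not_found <- nrow(not_found)
n_not_matched <- nrow(not_matched)
```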

For the meaning of grouping columns, see init_cmip6_index().

Details

The database here can be any directory that stores the NetCDF files for CMIP6 GCMs. It can also be the same as get_data_dir(), where epwshiftr stores the output file index, if you want to keep the output file index and the output files in the same place.

summary_database() uses the tracking_id, datetime_start and datetime_end global attributes of each NetCDF file to match against the output file index. As a result, the names of the NetCDF files do not need to follow the CMIP6 file name encoding.
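The idea of matching on attributes rather than on file names can be sketched with a base R merge. This is only an illustration under simplifying assumptions, not the package's implementation: the index and files tables are invented, and the attributes are assumed to have already been read from each NetCDF file.

```r
# Invented excerpt of an output file index
index <- data.frame(
  tracking_id = c("id-1", "id-2"),
  datetime_start = c("2015-01-01", "2015-01-01"),
  datetime_end = c("2015-12-31", "2015-12-31"),
  stringsAsFactors = FALSE
)

# Invented attributes read from files on disk; note the arbitrary file names
files <- data.frame(
  tracking_id = c("id-1", "id-9"),
  datetime_start = c("2015-01-01", "2015-01-01"),
  datetime_end = c("2015-12-31", "2015-12-31"),
  file_path = c("/data/renamed_file.nc", "/data/other.nc"),
  stringsAsFactors = FALSE
)

# Join on the three attributes; file names play no role in the match
keys <- c("tracking_id", "datetime_start", "datetime_end")
matched <- merge(index, files, by = keys)
```

Here only the file carrying tracking_id "id-1" is matched, even though its name does not follow the CMIP6 encoding.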

summary_database() appends columns to the CMIP6 output file index, including:

  • file_path: the full path of matched NetCDF file for every case.

summary_database() uses future.apply underneath to speed up data processing where applicable. You can use your preferred future backend to run the data extraction in parallel. By default, summary_database() uses the future::sequential backend, which processes files sequentially.
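A minimal sketch of switching backends, assuming the future package is installed: the hypothetical wrapper below sets a multisession plan for one call and restores the previous plan afterwards. The wrapper name and the workers default are illustrative choices.

```r
# Hypothetical wrapper, not part of epwshiftr. Defining it has no side
# effects; calling it requires the future and epwshiftr packages.
summarise_in_parallel <- function(dir, workers = 2L) {
  # switch to a multisession backend and remember the previous plan
  old_plan <- future::plan(future::multisession, workers = workers)
  # restore the previous plan when the function exits
  on.exit(future::plan(old_plan), add = TRUE)
  epwshiftr::summary_database(dir)
}
```

Setting the plan once at the top of a script with future::plan() has the same effect for all subsequent calls.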

Examples

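if (FALSE) {
# "path/to/cmip6" is a placeholder for the directory that stores the NetCDF files
summary_database("path/to/cmip6")

# summarize by root experiment only
summary_database("path/to/cmip6", by = "experiment")
}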