summary_database() scans the specified directory and returns a data.table() summarizing which CMIP6 files are available, matched against the output file index loaded using load_cmip6_index().
A single string indicating the directory where CMIP6 model output NetCDF files are stored.
The grouping columns used to summarize the database status. Should be a subset of:
"experiment": root experiment identifiers
"source": model identifiers
"variable": variable identifiers
"activity": activity identifiers
"frequency": sampling frequency
"variant": variant label
"resolution": approximate horizontal resolution
Action to take when multiple files match the same case in the CMIP6 index. If "latest", the file with the latest modification time will be used. If "skip", all matched files will be skipped and the case will be kept as unmatched. Default: "skip".
If TRUE, the status of CMIP6 files will only be updated if they are not found in the previous summary. This is useful if CMIP6 files are stored in different directories. Default: FALSE.
Action to take when files matched in the previous summary no longer exist when running the current summary. Only applicable when append is set to TRUE. If "keep", the metadata for the missing output files will be kept. If "overwrite", the existing metadata of those output files will first be removed from the output file index and then overwritten based on the newly matched files, if possible. Default: "keep".
If TRUE, scan recursively into directories. Default: FALSE.
If TRUE, the output file index will be updated based on the matched NetCDF files in the specified directory. If FALSE, only the currently loaded index will be updated, and the actual index database file saved in get_data_dir() will remain unchanged. Default: FALSE.
If TRUE, warning messages will be shown when target files are missing or when multiple files match the same case. Default: TRUE.
A data.table::data.table() containing the corresponding grouping columns plus:
Column | Type | Description |
datetime_start | POSIXct | Start date and time of simulation |
datetime_end | POSIXct | End date and time of simulation |
file_num | Integer | Total number of files per group |
file_size | Units (Mbytes) | Approximate total size of files |
dl_num | Integer | Total number of files downloaded |
dl_percent | Units (%) | Percentage of files downloaded |
dl_size | Units (Mbytes) | Total size of files downloaded |
Two extra data.table::data.table() objects are also attached as attributes:
not_found: A data.table::data.table() that contains metadata for CMIP6 outputs that are listed in the current CMIP6 output file index but whose recorded file paths are no longer valid and cannot be found in the current database.
not_matched: A data.table::data.table() that contains metadata for CMIP6 output files that are found in the current database but are not listed in the current CMIP6 output file index.
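The two attached tables can be read with base R's attr(). The sketch below is a minimal illustration, not part of the package: get_unresolved() is a hypothetical helper, and "path/to/cmip6" is a placeholder directory. The guarded call mirrors the style of the examples at the end of this page.

```r
# Hypothetical helper (not part of epwshiftr): collect the two tables that
# summary_database() attaches to its result as attributes.
get_unresolved <- function(smry) {
    list(
        # Listed in the index, but the recorded file path no longer exists.
        not_found = attr(smry, "not_found", exact = TRUE),
        # Found in the directory, but absent from the index.
        not_matched = attr(smry, "not_matched", exact = TRUE)
    )
}

if (FALSE) {
    # "path/to/cmip6" is a placeholder; replace it with your own directory.
    smry <- summary_database("path/to/cmip6", by = "experiment")
    unresolved <- get_unresolved(smry)
    unresolved$not_found
    unresolved$not_matched
}
```

Either element is NULL if the corresponding attribute is absent from the object passed in.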
For the meaning of grouping columns, see init_cmip6_index().
The database here can be any directory that stores the NetCDF files for CMIP6 GCMs. It can also be the same as get_data_dir(), where epwshiftr stores the output file index, if you want to keep the output file index and the output files in the same place.
summary_database() uses the tracking_id, datetime_start and datetime_end global attributes of each NetCDF file to match against the output file index. The names of the NetCDF files therefore do not have to follow the CMIP6 file name encoding.
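To inspect the attributes used for matching in a given file, a sketch along the following lines may help. It assumes the ncdf4 package is installed. read_global_atts() is a hypothetical helper written for illustration and is not how epwshiftr is confirmed to read these values internally. The attribute names are the ones listed above.

```r
# Hypothetical helper (not part of epwshiftr): read selected global
# attributes from one NetCDF file using the ncdf4 package.
read_global_atts <- function(path,
                             names = c("tracking_id", "datetime_start", "datetime_end")) {
    nc <- ncdf4::nc_open(path)
    # Always close the file handle, even if reading fails.
    on.exit(ncdf4::nc_close(nc), add = TRUE)

    vapply(names, function(nm) {
        # varid = 0 asks ncdf4 for a global (file-level) attribute.
        att <- ncdf4::ncatt_get(nc, 0, nm)
        # Return NA when the attribute is not present in the file.
        if (isTRUE(att$hasatt)) as.character(att$value)[1L] else NA_character_
    }, character(1))
}

if (FALSE) {
    # "path/to/file.nc" is a placeholder; replace it with a real file.
    read_global_atts("path/to/file.nc")
}
```

The result is a named character vector, so a missing attribute shows up as NA under its name rather than raising an error.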
summary_database() will append 5 columns in the CMIP6 output file index:
file_path: the full path of matched NetCDF file for every case.
summary_database() uses future.apply underneath to speed up data processing where applicable. You can use your preferred future backend to process data in parallel. By default, summary_database() uses the future::sequential backend, which runs everything sequentially.
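As a rough sketch of switching backends, the snippet below selects a multisession backend from the future package before calling summary_database(), then restores the previous one. The worker count and "path/to/cmip6" are placeholders chosen for illustration. Whether parallel processing helps will depend on the number and size of files.

```r
if (FALSE) {
    # Use two background R sessions instead of the default sequential backend.
    # plan() returns the previous plan, which is kept so it can be restored.
    old_plan <- future::plan(future::multisession, workers = 2)

    # "path/to/cmip6" is a placeholder; replace it with your own directory.
    smry <- summary_database("path/to/cmip6", by = "experiment")

    # Restore the backend that was active before.
    future::plan(old_plan)
}
```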
if (FALSE) {
summary_database()
summary_database(by = "experiment")
}