The Earth System Grid Federation (ESGF) is an international collaboration for the software that powers most global climate change research, notably assessments by the Intergovernmental Panel on Climate Change (IPCC).
The ESGF search service exposes RESTful APIs that can be used by clients to query the contents of the underlying search index, and return results matching the given constraints. The documentation of the APIs can be found using this link
EsgfQuery is the workhorse for dealing with ESGF search services.
query_esgf(host = "https://esgf-node.llnl.gov/esg-search")
The URL to the ESGF Search API service. This should be the URL of the ESGF search service excluding the final endpoint name. Usually this is http://<hostname>/esg-search. Default is set to the LLNL (Lawrence Livermore National Laboratory) Index Node, which is "https://esgf-node.llnl.gov/esg-search".
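As a rough illustration of how the host URL and the final endpoint name fit together, the sketch below assembles a search URL by hand in base R. It is not how the package builds its queries: `build_search_url()` is a hypothetical helper, the endpoint name `search` is an assumption based on the public ESGF search API, and multi-valued constraints are expressed here by repeating the parameter name.

```r
# Hypothetical helper (not part of the package): build an ESGF search URL
# from a host URL and named query parameters.
build_search_url <- function(host, ..., endpoint = "search") {
    params <- list(...)
    # One "name=value" pair per value; repeat the name for multiple values
    pairs <- unlist(lapply(names(params), function(nm) {
        vals <- vapply(
            as.character(params[[nm]]),
            utils::URLencode,
            character(1),
            reserved = TRUE
        )
        paste0(nm, "=", vals)
    }))
    paste0(host, "/", endpoint, "?", paste(pairs, collapse = "&"))
}

url <- build_search_url(
    "https://esgf-node.llnl.gov/esg-search",
    project = "CMIP6",
    frequency = c("day", "mon"),
    limit = 1
)
print(url)
```

In practice the EsgfQuery object takes care of this, and $url() returns the query URL it would actually send.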
EsgfQuery object
query_esgf() returns an EsgfQuery object, which is an R6 object with quite a few methods that can be classified into 3 categories:
Value listing: methods to list all possible values of facets, shards, etc.
Parameter getter & setter: methods to get the query parameter values or set them before sending the actual query to the ESGF search services.
Query responses: methods to collect results for the query response.
When creating an EsgfQuery object, a facet listing query is sent to the index node to get all available facets and shards for the default project (CMIP6).
The EsgfQuery object provides three value-listing methods to extract data from the response of the facet listing query:
EsgfQuery$list_all_facets(): List all available facet names.
EsgfQuery$list_all_shards(): List all available shards.
EsgfQuery$list_all_values(): List all available values of a specific facet.
The ESGF search services support a lot of parameters. The EsgfQuery
contains dedicated methods to set values for most of them, including:
Most common keywords: facets, offset, limit, fields, type, replica, latest, distrib and shards.
Most common facets: project, activity_id, experiment_id, source_id, variable_id, frequency, variant_label, nominal_resolution and data_node.
All methods act in a similar way:
If input is given, the corresponding parameter is set and the updated
EsgfQuery
object is returned.
This makes it possible to chain different parameter setters, e.g.
EsgfQuery$project("CMIP6")$frequency("day")$limit(1)
sets the parameters project, frequency and limit sequentially.
For parameters that take character inputs, you can put a preceding ! to negate the constraints, e.g. EsgfQuery$project(!"CMIP6") searches for all projects except for CMIP6.
If no input is given, the current parameter value is returned. For example,
directly calling EsgfQuery$project()
returns the current value of the
project
parameter. The returned value can be one of two types:
NULL
, i.e. there is no constraint on the corresponding parameter
An EsgfQueryParam
object which is essentially a list of three elements:
value
: The input values
negate
: Whether there is a preceding !
in the input
name
: The parameter name
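Evaluating `!"CMIP6"` directly is an error in R, so a setter that accepts this notation has to inspect the unevaluated argument. The sketch below shows one way this could be done in base R with substitute(). It is only an illustration of the idea, not the package's implementation, and `capture_param()` is a hypothetical name. The result mirrors the three elements described above (value, negate, name).

```r
# Hypothetical helper (not the package's code): detect a leading `!` on the
# unevaluated argument and return a list shaped like an EsgfQueryParam.
capture_param <- function(value, name) {
    expr <- substitute(value)
    # A negated input is a call whose function is `!`
    negate <- is.call(expr) && identical(expr[[1]], as.name("!"))
    # Drop the `!` so that only the actual values are evaluated
    if (negate) expr <- expr[[2]]
    list(
        value = eval(expr, parent.frame()),
        negate = negate,
        name = name
    )
}

p1 <- capture_param("CMIP6", "project")
p2 <- capture_param(!c("CMIP5", "CMIP6"), "project")
str(p1)
str(p2)
```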
Besides the methods for specific keywords and facets, you can specify arbitrary query parameters using the EsgfQuery$params() method. For details on the usage, please see the documentation.
The query is not sent unless related methods are called:
EsgfQuery$count(): Count the total number of records that match the query.
You can return only the total number of matched records by calling
EsgfQuery$count(facets = FALSE)
You can also count the matched records for specified facets, e.g.
EsgfQuery$count(facets = c("source_id", "activity_id"))
EsgfQuery$collect(): Collect the query results and format them into a data.table
The EsgfQuery object also provides several other helper functions:
EsgfQuery$build_cache(): By default, EsgfQuery$build_cache() is called when initializing a new EsgfQuery object, so in general there is no need to call it separately. EsgfQuery$build_cache() sends a facet listing query to the index node and stores the response internally. The response contains all available facets and shards and is used as a source for validating user input for parameter setters.
EsgfQuery$url(): Returns the actual query URL or the wget script URL which can be used to download all files matching the given constraints.
EsgfQuery$response(): Returns the actual response of EsgfQuery$count() and EsgfQuery$collect(). It is a named list generated from the JSON response using jsonlite::fromJSON().
EsgfQuery$print(): Print a summary of the current EsgfQuery object including the host URL, the built time of facet cache and all query parameters.
new()
Create a new EsgfQuery object
During initialization, a facet listing query is sent to the index node to get all available facets and shards.
This information will be used to validate inputs for facets such as activity_id and source_id.
EsgfQuery$new(host = "https://esgf-node.llnl.gov/esg-search")
host
The URL to the ESGF Search API service. This should be
the URL of the ESGF search service excluding the final
endpoint name. Usually this is http://<hostname>/esg-search
.
Default is to use the LLNL (Lawrence Livermore National Laboratory) Index Node, which is
"https://esgf-node.llnl.gov/esg-search"
.
build_cache()
Build facet cache used for input validation
A facet cache is data that is fetched using a facet listing query to the index node. It contains all available facets and shards that can be used as parameter values within a specific project.
By default, $build_cache() is called when initializing a new EsgfQuery object for the default project (CMIP6). So in general, there is no need to call this method, unless you want to rebuild the cache for a different project after calling $project().
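To make the role of the facet cache more concrete, the sketch below shows how cached facet values could be used to validate a setter's input. This is an assumption about the general approach rather than the package's actual code: `check_facet_values()` is a hypothetical helper, and the `cache` list is a small made-up stand-in for the real facet listing response.

```r
# Hypothetical validator: stop if any input value is not among the values
# known for the facet in the cache.
check_facet_values <- function(values, allowed, facet) {
    unknown <- setdiff(values, allowed)
    if (length(unknown) > 0L) {
        stop(
            sprintf(
                "Invalid value(s) for facet '%s': %s",
                facet, paste(unknown, collapse = ", ")
            ),
            call. = FALSE
        )
    }
    invisible(values)
}

# Made-up stand-in for a facet cache; a real cache would come from the
# facet listing query sent by $build_cache().
cache <- list(activity_id = c("C4MIP", "CFMIP", "GeoMIP", "ScenarioMIP"))

ok <- check_facet_values("ScenarioMIP", cache$activity_id, "activity_id")
msg <- tryCatch(
    check_facet_values(c("ScenarioMIP", "NotAnActivity"), cache$activity_id, "activity_id"),
    error = function(e) conditionMessage(e)
)
print(msg)
```

Because the cache is tied to a project, a check like this only makes sense against a cache built for the project currently set on the query.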
project()
Get or set the project
facet parameter.
value
The parameter value. Default: "CMIP6"
.
There are two options:
If value
is not given, current value is returned.
A character vector or NULL
. Note that you can put a preceding !
to negate the facet constraints. For example, $project(!c("CMIP5", "CMIP6"))
searches for all project
s except for CMIP5
and CMIP6
.
activity_id()
Get or set the activity_id
facet parameter.
value
The parameter value. Default: NULL
.
There are two options:
If value
is not given, current value is returned.
A character vector or NULL
. Note that you can put a preceding !
to negate the facet constraints. For example, $activity_id(!c("C4MIP", "GeoMIP"))
searches for all activity_id
s except for C4MIP
and GeoMIP
.
experiment_id()
Get or set the experiment_id
facet parameter.
value
The parameter value. Default: NULL
.
There are two options:
If value
is not given, current value is returned.
A character vector or NULL
. Note that you can put a preceding !
to negate the facet constraints. For example, $experiment_id(!c("ssp126", "ssp245"))
searches for all experiment_id
s except for ssp126
and ssp245
.
source_id()
Get or set the source_id
facet parameter.
value
The parameter value. Default: NULL
.
There are two options:
If value
is not given, current value is returned.
A character vector or NULL
. Note that you can put a preceding !
to negate the facet constraints. For example, $source_id(!c("CESM2", "CESM2-FV2"))
searches for all source_id
s except for CESM2
and CESM2-FV2
.
variable_id()
Get or set the variable_id
facet parameter.
value
The parameter value. Default: NULL
.
There are two options:
If value
is not given, current value is returned.
A character vector or NULL
. Note that you can put a preceding !
to negate the facet constraints. For example, $variable_id(!c("tas", "pr"))
searches for all variable_id
s except for tas
and pr
.
frequency()
Get or set the frequency
facet parameter.
value
The parameter value. Default: NULL
.
There are two options:
If value
is not given, current value is returned.
A character vector or NULL
. Note that you can put a preceding !
to negate the facet constraints. For example, $frequency(!c("day", "mon"))
searches for all frequency
s except for day
and mon
.
variant_label()
Get or set the variant_label
facet parameter.
value
The parameter value. Default: NULL
.
There are two options:
If value
is not given, current value is returned.
A character vector or NULL
. Note that you can put a preceding !
to negate the facet constraints. For example, $variant_label(!c("r1i1p1f1", "r2i1p1f1"))
searches for all variant_label
s except for r1i1p1f1
and r2i1p1f1
.
nominal_resolution()
Get or set the nominal_resolution
facet parameter.
value
The parameter value. Default: NULL
.
There are two options:
If value
is not given, current value is returned.
A character vector or NULL
. Note that you can put a preceding !
to negate the facet constraints. For example, $nominal_resolution(!c("50 km", "1x1 degree"))
searches for all nominal_resolution
s except for 50 km
and 1x1 degree
.
data_node()
Get or set the data_node
parameter.
value
The parameter value. Default: NULL
.
There are two options:
If value
is not given, current value is returned.
A character vector or NULL
. Note that you can put a preceding !
to negate the facet constraints. For example, $data_node(!c("cmip.bcc.cma.cn", "esg.camscma.cn"))
searches for all data_node
s except for cmip.bcc.cma.cn
and esg.camscma.cn
.
facets()
Get or set the facets
parameter for facet counting query.
Note that $facets()
only affects
$count()
method when sending a query of facet counting.
value
The facet parameter value. Default: NULL
.
There are two options:
If value
is not given, current value is returned.
A character vector or NULL
. The special notation "*"
can be used to indicate that all available facets should be considered.
fields()
Get or set the fields
parameter.
By default, all available metadata fields are returned for each query. $fields() can be used to limit the number of fields returned in the query response.
value
The facet parameter value. Default: "*"
.
There are two options:
If value
is not given, current value is returned.
A character vector or NULL
. The special notation "*"
can be used to indicate that all available fields should be considered.
If value
is given, the modified EsgfQuery
object.
Otherwise, an EsgfQueryParam
object which is essentially a list of three elements:
value
: input values.
negate
: Whether there is a preceding !
.
name
: Parameter name.
\dontrun{
# get current value
q$fields()
# set the fields
q$fields(c("activity_id", "source_id"))
# use all available fields
q$fields("*")
# remove the parameter
# acts the same as above because the default `fields` in ESGF search
# services is `*` if `fields` is not specified
q$fields(NULL)
}
shards()
Get or set the shards
parameter.
By default, a distributed query targets all ESGF Nodes. $shards()
can be used to execute a distributed search that targets only one or
more specific nodes.
All available shards can be retrieved using
$list_all_shards()
method.
value
The facet parameter value. There are two options:
If value
is not given, current value is returned.
A character vector or NULL
.
If value
is given, the modified EsgfQuery
object.
Otherwise, an EsgfQueryParam
object which is essentially a list of three elements:
value
: input values.
negate
: Whether there is a preceding !
.
name
: Parameter name.
\dontrun{
# get current value
q$shards()
# set the parameter
q$shards("localhost:8983/solr/datasets")
# negate the constraints
q$shards(!"localhost:8983/solr/datasets")
# only applicable for distributed queries
q$distrib(FALSE)$shards("localhost:8983/solr/datasets") # Error
# remove the parameter
q$shards(NULL)
}
replica()
Get or set the replica
parameter.
By default, a query returns all records (masters and replicas)
matching the search criteria, i.e. $replica(NULL)
.
To return only master records, use $replica(FALSE)
; to return only
replicas, use $replica(TRUE)
.
value
The facet parameter value. Default: NULL
.
There are two options:
If value
is not given, current value is returned.
A flag or NULL
.
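The three states of a flag parameter such as replica (NULL, TRUE, FALSE) can be pictured as follows. This is a minimal sketch under the assumption that an unset parameter is simply left out of the query string and that the search service expects lowercase `true`/`false`; `flag_to_query()` is a hypothetical helper, not a function of the package.

```r
# Hypothetical helper: turn an optional flag into a query string fragment.
flag_to_query <- function(name, value) {
    # NULL means "no constraint": the parameter is omitted entirely
    if (is.null(value)) return(character())
    stopifnot(is.logical(value), length(value) == 1L, !is.na(value))
    paste0(name, "=", tolower(as.character(value)))
}

all_records  <- flag_to_query("replica", NULL)   # masters and replicas
masters_only <- flag_to_query("replica", FALSE)
replica_only <- flag_to_query("replica", TRUE)
print(c(masters_only, replica_only))
```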
latest()
Get or set the latest
parameter.
By default, a query to the ESGF search services returns only the very
last, up-to-date version of the matching records, i.e.
$latest(TRUE)
. You can use $latest(FALSE)
to return all versions.
value
The facet parameter value. Default: TRUE
.
There are two options:
If value
is not given, current value is returned.
A flag.
type()
Get or set the type
parameter.
There are three types in total: Dataset
, File
or Aggregation
.
value
The facet parameter value. Default: "Dataset"
.
There are two options:
If value
is not given, current value is returned.
A string.
limit()
Get or set the limit
parameter.
$limit()
can be used to limit the number of records to return.
Note that the maximum number of records to return per query for ESGF
search services is 10,000. A warning is issued if input value is
greater than that. In this case, limit
will be reset to 10,000.
value
The facet parameter value. Default: 10
.
There are two options:
If value
is not given, current value is returned.
An integer.
offset()
Get or set the offset
parameter.
If the query returns records that exceed the
limit
number,
$offset()
can be used to paginate through the available results.
value
The facet parameter value. Default: 0
.
There are two options:
If value
is not given, current value is returned.
An integer.
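Since a single query returns at most 10,000 records, collecting a larger result set means issuing several queries with increasing offsets. The sketch below only computes the offsets for such a loop. `page_offsets()` is a hypothetical helper, and the total of 25,000 records is a made-up number standing in for the result of a counting query.

```r
# Hypothetical helper: offsets needed to page through `total` records when
# each query returns at most `limit` records.
page_offsets <- function(total, limit = 10000L) {
    stopifnot(total >= 0, limit >= 1)
    if (total == 0) return(integer())
    as.integer(seq(0, total - 1, by = limit))
}

# Made-up total; in practice this could come from a counting query
offsets <- page_offsets(total = 25000, limit = 10000)
print(offsets)

# Each offset would then drive one request, for example (untested usage of
# the documented setters):
#   q$limit(10000)$offset(offsets[[i]])$collect()
```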
distrib()
Get or set the distrib
parameter.
By default, the query is sent to all ESGF Nodes, i.e.
$distrib(TRUE)
.
$distrib(FALSE)
can be used to execute the query only on the
target node.
value
The facet parameter value. Default: TRUE
.
There are two options:
If value
is not given, current value is returned.
A flag.
params()
Get or set other parameters.
$params()
can be used to specify other parameters that do not have
a dedicated method, e.g. version
, master_id
, etc. It can also be
used to overwrite existing parameter values specified using methods
like $activity_id()
.
...
Parameter values to set. There are three options:
If not given, existing parameters that do not have a dedicated method are returned.
If NULL
, all existing parameters that do not have a dedicated
method are removed.
A named vector, e.g. $params(score = 1, table_id = "day")
will
set score
to 1
and table_id
to day
.
The !
notation can still be used to negate the constraints, e.g.
$params(table_id = !c("3hr", "day"))
searches for all table_id
except for 3hr
and day
.
If parameters are specified, the modified EsgfQuery
object,
invisibly.
Otherwise, an empty list for $params(NULL)
or a list of
EsgfQueryParam
objects.
\dontrun{
# get current values
# default is an empty list (`list()`)
q$params()
# set the parameter
q$params(table_id = c("3hr", "day"), member_id = "00")
q$params()
# reset existing parameters
q$frequency("day")
q$params(frequency = "mon")
q$frequency() # frequency value has been changed using $params()
# negating the constraints is also supported
q$params(table_id = !c("3hr", "day"))
# use NULL to remove all parameters
q$params(NULL)$params()
}
url()
Get the URL of actual query or wget script
wget
Whether to return the URL of the wget script that can be
used to download all files matching the given constraints.
Default: FALSE
.
\dontrun{
q$url()
# get the wget script URL
q$url(wget = TRUE)
# You can download the wget script using the URL directly. For
# example, the code below downloads the script and saves it as
# 'wget.sh' in R's temporary folder:
download.file(q$url(TRUE), file.path(tempdir(), "wget.sh"), mode = "wb")
}
count()
Send a query of facet counting and fetch the results
facets
NULL
, a flag or a character vector. There are three
options:
If NULL
or FALSE
, only the total number of matched records is
returned.
If TRUE
, the value of $facets()
is used to limit the facets. This is the default value.
If a character vector, it is used to limit the facets.
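The three options for the facets argument can be summarized as a small decision function. The sketch below is an illustration of the rules listed above, not the package's implementation: `resolve_facets()` is a hypothetical helper, and `current` stands in for the value previously set with $facets().

```r
# Hypothetical helper: decide which facets a counting query should use.
resolve_facets <- function(facets = TRUE, current = NULL) {
    # NULL or FALSE: no facet counting, only the total number of records
    if (is.null(facets) || isFALSE(facets)) return(NULL)
    # TRUE: fall back to the value set earlier via $facets()
    if (isTRUE(facets)) return(current)
    # Otherwise the input itself names the facets to count
    stopifnot(is.character(facets))
    facets
}

r_total   <- resolve_facets(FALSE)
r_current <- resolve_facets(TRUE, current = c("activity_id", "source_id"))
r_given   <- resolve_facets("source_id", current = "activity_id")
print(r_current)
print(r_given)
```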
collect()
Send the actual query and fetch the results
$collect()
sends the actual query to the ESGF search services and
returns the results in a data.table::data.table. The columns depend
on the value of query type and fields
parameter.
A data.table.
response()
Get the response of last sent query
The response of the last sent query is always stored internally and
can be retrieved using $response()
. It is a named list generated
from the JSON response using jsonlite::fromJSON()
.
print()
Print a summary of the current EsgfQuery
object
$print()
gives the summary of current EsgfQuery
object including
the host URL, the built time of facet cache and all query parameters.
## ------------------------------------------------
## Method `EsgfQuery$new`
## ------------------------------------------------
if (FALSE) {
q <- EsgfQuery$new(host = "https://esgf-node.llnl.gov/esg-search")
q
}
## ------------------------------------------------
## Method `EsgfQuery$build_cache`
## ------------------------------------------------
if (FALSE) {
q$build_cache()
}
## ------------------------------------------------
## Method `EsgfQuery$list_all_facets`
## ------------------------------------------------
if (FALSE) {
q$list_all_facets()
}
## ------------------------------------------------
## Method `EsgfQuery$list_all_shards`
## ------------------------------------------------
if (FALSE) {
q$list_all_shards()
}
## ------------------------------------------------
## Method `EsgfQuery$list_all_values`
## ------------------------------------------------
if (FALSE) {
q$list_all_values()
}
## ------------------------------------------------
## Method `EsgfQuery$project`
## ------------------------------------------------
if (FALSE) {
# get current value
q$project()
# set the parameter
q$project("CMIP6")
# negate the project constraints
q$project(!"CMIP6")
# remove the parameter
q$project(NULL)
}
## ------------------------------------------------
## Method `EsgfQuery$activity_id`
## ------------------------------------------------
if (FALSE) {
# get current value
q$activity_id()
# set the parameter
q$activity_id("ScenarioMIP")
# negate the constraints
q$activity_id(!c("CFMIP", "ScenarioMIP"))
# remove the parameter
q$activity_id(NULL)
}
## ------------------------------------------------
## Method `EsgfQuery$experiment_id`
## ------------------------------------------------
if (FALSE) {
# get current value
q$experiment_id()
# set the parameter
q$experiment_id(c("ssp126", "ssp585"))
# negate the constraints
q$experiment_id(!c("ssp126", "ssp585"))
# remove the parameter
q$experiment_id(NULL)
}
## ------------------------------------------------
## Method `EsgfQuery$source_id`
## ------------------------------------------------
if (FALSE) {
# get current value
q$source_id()
# set the parameter
q$source_id(c("BCC-CSM2-MR", "CESM2"))
# negate the constraints
q$source_id(!c("BCC-CSM2-MR", "CESM2"))
# remove the parameter
q$source_id(NULL)
}
## ------------------------------------------------
## Method `EsgfQuery$variable_id`
## ------------------------------------------------
if (FALSE) {
# get current value
q$variable_id()
# set the parameter
q$variable_id(c("tas", "pr"))
# negate the constraints
q$variable_id(!c("tas", "pr"))
# remove the parameter
q$variable_id(NULL)
}
## ------------------------------------------------
## Method `EsgfQuery$frequency`
## ------------------------------------------------
if (FALSE) {
# get current value
q$frequency()
# set the parameter
q$frequency(c("1hr", "day"))
# negate the constraints
q$frequency(!c("1hr", "day"))
# remove the parameter
q$frequency(NULL)
}
## ------------------------------------------------
## Method `EsgfQuery$variant_label`
## ------------------------------------------------
if (FALSE) {
# get current value
q$variant_label()
# set the parameter
q$variant_label(c("r1i1p1f1", "r1i2p1f1"))
# negate the constraints
q$variant_label(!c("r1i1p1f1", "r1i2p1f1"))
# remove the parameter
q$variant_label(NULL)
}
## ------------------------------------------------
## Method `EsgfQuery$nominal_resolution`
## ------------------------------------------------
if (FALSE) {
# get current value
q$nominal_resolution()
# set the parameter
q$nominal_resolution(c("100 km", "1x1 degree"))
# negate the constraints
q$nominal_resolution(!c("100 km", "1x1 degree"))
# remove the parameter
q$nominal_resolution(NULL)
}
## ------------------------------------------------
## Method `EsgfQuery$data_node`
## ------------------------------------------------
if (FALSE) {
# get current value
q$data_node()
# set the parameter
q$data_node("esg.lasg.ac.cn")
# negate the constraints
q$data_node(!"esg.lasg.ac.cn")
# remove the parameter
q$data_node(NULL)
}
## ------------------------------------------------
## Method `EsgfQuery$facets`
## ------------------------------------------------
if (FALSE) {
# get current value
q$facets()
# set the facets
q$facets(c("activity_id", "source_id"))
# use all available facets
q$facets("*")
}
## ------------------------------------------------
## Method `EsgfQuery$fields`
## ------------------------------------------------
if (FALSE) {
# get current value
q$fields()
# set the fields
q$fields(c("activity_id", "source_id"))
# use all available fields
q$fields("*")
# remove the parameter
# acts the same as above because the default `fields` in ESGF search
# services is `*` if `fields` is not specified
q$fields(NULL)
}
## ------------------------------------------------
## Method `EsgfQuery$shards`
## ------------------------------------------------
if (FALSE) {
# get current value
q$shards()
# set the parameter
q$shards("localhost:8983/solr/datasets")
# negate the constraints
q$shards(!"localhost:8983/solr/datasets")
# only applicable for distributed queries
q$distrib(FALSE)$shards("localhost:8983/solr/datasets") # Error
# remove the parameter
q$shards(NULL)
}
## ------------------------------------------------
## Method `EsgfQuery$replica`
## ------------------------------------------------
if (FALSE) {
# get current value
q$replica()
# set the parameter
q$replica(TRUE)
# remove the parameter
q$replica(NULL)
}
## ------------------------------------------------
## Method `EsgfQuery$latest`
## ------------------------------------------------
if (FALSE) {
# get current value
q$latest()
# set the parameter
q$latest(TRUE)
}
## ------------------------------------------------
## Method `EsgfQuery$type`
## ------------------------------------------------
if (FALSE) {
# get current value
q$type()
# set the parameter
q$type("Dataset")
}
## ------------------------------------------------
## Method `EsgfQuery$limit`
## ------------------------------------------------
if (FALSE) {
# get current value
q$limit()
# set the parameter
q$limit(10L)
# `limit` is reset to 10,000 if input is greater than that
q$limit(10001L) # warning
}
## ------------------------------------------------
## Method `EsgfQuery$offset`
## ------------------------------------------------
if (FALSE) {
# get current value
q$offset()
# set the parameter
q$offset(0L)
}
## ------------------------------------------------
## Method `EsgfQuery$distrib`
## ------------------------------------------------
if (FALSE) {
# get current value
q$distrib()
# set the parameter
q$distrib(TRUE)
}
## ------------------------------------------------
## Method `EsgfQuery$params`
## ------------------------------------------------
if (FALSE) {
# get current values
# default is an empty list (`list()`)
q$params()
# set the parameter
q$params(table_id = c("3hr", "day"), member_id = "00")
q$params()
# reset existing parameters
q$frequency("day")
q$params(frequency = "mon")
q$frequency() # frequency value has been changed using $params()
# negating the constraints is also supported
q$params(table_id = !c("3hr", "day"))
# use NULL to remove all parameters
q$params(NULL)$params()
}
## ------------------------------------------------
## Method `EsgfQuery$url`
## ------------------------------------------------
if (FALSE) {
q$url()
# get the wget script URL
q$url(wget = TRUE)
# You can download the wget script using the URL directly. For
# example, the code below downloads the script and saves it as
# 'wget.sh' in R's temporary folder:
download.file(q$url(TRUE), file.path(tempdir(), "wget.sh"), mode = "wb")
}
## ------------------------------------------------
## Method `EsgfQuery$count`
## ------------------------------------------------
if (FALSE) {
# get the total number of matched records
q$count(NULL) # or q$count(facets = FALSE)
# count records for specific facets
q$facets(c("activity_id", "source_id"))$count()
# same as above
q$count(facets = c("activity_id", "source_id"))
}
## ------------------------------------------------
## Method `EsgfQuery$collect`
## ------------------------------------------------
if (FALSE) {
q$fields("source_id")
q$collect()
}
## ------------------------------------------------
## Method `EsgfQuery$response`
## ------------------------------------------------
if (FALSE) {
q$response()
}
## ------------------------------------------------
## Method `EsgfQuery$print`
## ------------------------------------------------
if (FALSE) {
q$print()
}