| Title: | Feature Stores for the 'diseasy' Framework |
|---|---|
| Description: | Simple feature stores and tools for creating personalised feature stores. 'diseasystore' powers feature stores which can automatically link and aggregate features to a given stratification level. These feature stores are automatically time-versioned (powered by the 'SCDB' package) and allow you to easily and dynamically compute features as part of your continuous integration. |
| Authors: | Rasmus Skytte Randløv [aut, cre] |
| Maintainer: | Rasmus Skytte Randløv <[email protected]> |
| License: | GPL (>= 3) |
| Version: | 0.3.1.9000 |
| Built: | 2025-02-28 12:41:27 UTC |
| Source: | https://github.com/ssi-dk/diseasystore |
Existence aware pick operator
env %.% field
Arguments: env, field
Error if the field does not exist in env, otherwise it returns field.
t <- list(a = 1, b = 2)
t$a          # 1
t %.% a      # 1
t$c          # NULL
try(t %.% c) # Gives error since "c" does not exist in "t"
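The behaviour described above (return the field if present, raise an error otherwise) can be sketched in base R. This is an illustrative re-implementation under a hypothetical name, `%pick%`, and is not the package source; the real `%.%` may differ in its checks and error message.

```r
# Hypothetical operator for illustration only; not the diseasystore implementation.
`%pick%` <- function(env, field) {
  # Capture the unquoted field name, e.g. `a` in `t %pick% a`
  field_name <- as.character(substitute(field))

  # names() works for both lists and environments
  if (!(field_name %in% names(env))) {
    stop("Field '", field_name, "' not found in ", deparse(substitute(env)), call. = FALSE)
  }

  env[[field_name]]
}

t <- list(a = 1, b = 2)
t %pick% a                                # 1
result <- try(t %pick% c, silent = TRUE)  # "c" is not a field of "t"
inherits(result, "try-error")             # TRUE
```

The point of such an operator is that a misspelled field fails loudly instead of silently yielding NULL as `$` does.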
Provides the SQL code for a time interval (in years).
add_years(reference_date, years, conn)
Arguments: reference_date, years, conn
SQL query for the time interval.
conn <- SCDB::get_connection(drv = RSQLite::SQLite())
dplyr::copy_to(conn, data.frame(birth = as.Date("2001-04-03")), "test_age") |>
  dplyr::mutate(first_birthday = !!add_years("birth", 1, conn))
DBI::dbDisconnect(conn)
Provides sortable labels for age groups
age_labels(age_cuts)
Arguments: age_cuts
A vector of labels with zero-padded numerics so they can be sorted easily.
age_labels(c(5, 12, 20, 30))
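To see why zero-padding makes the labels sortable, the following base R sketch builds labels from a vector of cut points. The helper `make_age_labels` and its exact label format ("00-04", ..., "30+") are assumptions made for illustration; the output of age_labels itself may be formatted differently.

```r
# Hypothetical helper for illustration only; assumes positive, distinct cut points.
make_age_labels <- function(age_cuts) {
  age_cuts <- sort(age_cuts)
  lower <- c(0, age_cuts)   # lower bound of every group
  upper <- age_cuts - 1     # upper bound of every group except the last (open-ended)

  # Pad all numbers to the width of the largest lower bound
  width <- nchar(as.character(max(lower)))
  pad <- function(x) formatC(x, width = width, flag = "0", format = "d")

  n <- length(lower)
  c(paste0(pad(lower[-n]), "-", pad(upper)), paste0(pad(lower[n]), "+"))
}

labels <- make_age_labels(c(5, 12, 20, 30))
labels
# "00-04" "05-11" "12-19" "20-29" "30+"

# Without padding, character sorting puts "12-19" before "5-11"
sort(c("5-11", "12-19"), method = "radix")
# "12-19" "5-11"
```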
Provides the SQL code to compute the age of a person on a given date.
age_on_date(birth, reference_date, conn)
Arguments: birth, reference_date, conn
SQL query that computes the age on the given date.
conn <- SCDB::get_connection(drv = RSQLite::SQLite())
dplyr::copy_to(conn, data.frame(birth = as.Date("2001-04-03")), "test_age") |>
  dplyr::mutate(age = !!age_on_date("birth", as.Date("2024-02-28"), conn))
DBI::dbDisconnect(conn)
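As a cross-check of what such a query is expected to return, the age in whole years can be computed locally in base R. The helper `age_in_years` below is a hypothetical name introduced here; it is a sketch of the arithmetic, not the SQL that age_on_date generates.

```r
# Hypothetical local helper for illustration only.
age_in_years <- function(birth, reference_date) {
  b <- as.POSIXlt(birth)
  r <- as.POSIXlt(reference_date)

  # Difference in calendar years ...
  age <- r$year - b$year

  # ... minus one if the birthday has not yet occurred in the reference year
  had_birthday <- (r$mon > b$mon) | (r$mon == b$mon & r$mday >= b$mday)
  age - ifelse(had_birthday, 0L, 1L)
}

age_in_years(as.Date("2001-04-03"), as.Date("2024-02-28"))  # 22
age_in_years(as.Date("2001-04-03"), as.Date("2024-04-03"))  # 23
```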
Feature aggregators
key_join_sum(.data, feature)
key_join_max(.data, feature)
key_join_min(.data, feature)
key_join_count(.data, feature)
Arguments: .data, feature
A dplyr::summarise to aggregate the features together using the given function (sum/max/min/count).
# Primarily used within the framework but can be used individually:
data <- dplyr::mutate(mtcars, key_name = rownames(mtcars), .before = dplyr::everything())
key_join_sum(data, "mpg")   # sum(mtcars$mpg)
key_join_max(data, "mpg")   # max(mtcars$mpg)
key_join_min(data, "mpg")   # min(mtcars$mpg)
key_join_count(data, "mpg") # nrow(mtcars)
Detect available diseasystores
available_diseasystores()
The installed diseasystores on the search path.
available_diseasystores() # DiseasystoreGoogleCovid19 + more from other packages
Helper function to get options related to diseasy
diseasyoption(option, class = NULL, namespace = NULL, .default = NULL)
Arguments: option, class, namespace, .default
If option is given, the most specific option within the diseasy framework for the given option and class.
If option is missing, all options related to diseasy packages.
# Retrieve default option for source conn
diseasyoption("source_conn")

# Retrieve DiseasystoreGoogleCovid19 specific option for source conn
diseasyoption("source_conn", "DiseasystoreGoogleCovid19")

# Try to retrieve specific option for source conn for a non existent / un-configured diseasystore
diseasyoption("source_conn", "DiseasystoreNonExistent") # Returns default source_conn

# Try to retrieve specific non-existent option
diseasyoption("non_existent", "DiseasystoreGoogleCovid19", .default = "Use this")
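The "most specific option" lookup can be pictured as trying a class-specific option name first and falling back to the general one. The base R sketch below assumes option names of the form "diseasystore.<class>.<option>" and "diseasystore.<option>"; the helper `resolve_option` is hypothetical and simplified (it fixes the namespace and does not list all options when option is missing), so it should not be read as the package's implementation.

```r
# Hypothetical, simplified lookup for illustration only.
resolve_option <- function(option, class = NULL, namespace = "diseasystore", .default = NULL) {
  # Most specific candidate first, general candidate last
  candidates <- c(
    if (!is.null(class)) paste(namespace, class, option, sep = "."),
    paste(namespace, option, sep = ".")
  )

  for (name in candidates) {
    value <- getOption(name)
    if (!is.null(value) && !identical(value, "")) return(value)
  }

  .default
}

# Made-up option values for the demonstration
old <- options(
  "diseasystore.source_conn" = "default/dir",
  "diseasystore.DiseasystoreGoogleCovid19.source_conn" = "covid/dir"
)

resolve_option("source_conn")                               # "default/dir"
resolve_option("source_conn", "DiseasystoreGoogleCovid19")  # "covid/dir"
resolve_option("source_conn", "DiseasystoreNonExistent")    # "default/dir"
resolve_option("non_existent", .default = "Use this")       # "Use this"

options(old)  # restore the previous option values
```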
Check for the existence of a diseasystore for the case definition
diseasystore_exists(label)
Arguments: label
TRUE if the given diseasystore can be matched to a diseasystore on the search path. FALSE otherwise.
diseasystore_exists("Google COVID-19")            # TRUE
diseasystore_exists("Non existent diseasystore")  # FALSE
This DiseasystoreBase R6 class forms the basis of all feature stores.
It defines the primary methods of each feature store as well as all of the public methods.
A new instance of the DiseasystoreBase R6 class.
ds_map (named list(character))
A list that maps features known by the feature store to the corresponding feature handlers that compute the features. Read only.
available_features (character())
A list of available features in the feature store. Read only.
available_observables (character())
A list of available observables in the feature store. Read only.
available_stratifications (character())
A list of available stratifications in the feature store. Read only.
observables_regex (character(1))
A regular expression that identifies the observables among the features in the feature store. Read only.
label (character(1))
A human readable label of the feature store. Read only.
source_conn (DBIConnection or file path)
Used to specify where data is located. Read only. Can be DBIConnection or file path depending on the diseasystore.
target_conn (DBIConnection)
A database connection to store the computed features in. Read only.
target_schema (character)
The schema to place the feature store in. Read only. If the database backend does not support schema, the tables will be prefixed with <target_schema>.
start_date (Date)
Study period start. Read only.
end_date (Date)
Study period end. Read only.
min_start_date (Date)
(Minimum) Study period start. Read only.
max_end_date (Date)
(Maximum) Study period end. Read only.
slice_ts (Date or character)
Date or timestamp (parsable by as.POSIXct) to slice the (time-versioned) data on. Read only.
new()
Creates a new instance of the DiseasystoreBase R6 class.
DiseasystoreBase$new(
  start_date = NULL,
  end_date = NULL,
  slice_ts = NULL,
  source_conn = NULL,
  target_conn = NULL,
  target_schema = NULL,
  verbose = diseasyoption("verbose", self)
)
start_date (Date)
Study period start.
end_date (Date)
Study period end.
slice_ts (Date or character)
Date or timestamp (parsable by as.POSIXct) to slice the (time-versioned) data on.
source_conn (DBIConnection or file path)
Used to specify where data is located. Can be DBIConnection or file path depending on the diseasystore.
target_conn (DBIConnection)
A database connection to store the computed features in.
target_schema (character)
The schema to place the feature store in. If the database backend does not support schema, the tables will be prefixed with <target_schema>.
verbose (boolean)
Boolean that controls whether debugging information is printed.
A new instance of the DiseasystoreBase R6 class.
get_feature()
Computes, stores, and returns the requested feature for the study period.
DiseasystoreBase$get_feature(
  feature,
  start_date = self %.% start_date,
  end_date = self %.% end_date,
  slice_ts = self %.% slice_ts
)
feature (character)
The name of a feature defined in the feature store.
start_date (Date)
Study period start.
end_date (Date)
Study period end.
slice_ts (Date or character)
Date or timestamp (parsable by as.POSIXct) to slice the (time-versioned) data on.
A tbl_dbi with the requested feature for the study period.
key_join_features()
Joins various features from the feature store assuming a primary feature (observable) that contains keys to which the secondary features (defined by stratification) are joined.
DiseasystoreBase$key_join_features(
  observable,
  stratification = NULL,
  start_date = self %.% start_date,
  end_date = self %.% end_date
)
observable (character)
The observable to provide data or prediction for.
stratification (list(quosures) or NULL)
Use rlang::quos(...) to specify stratification. If given, the expressions in stratification are evaluated to give the stratification level.
start_date (Date)
Study period start.
end_date (Date)
Study period end.
A tbl_dbi with the requested joined features for the study period.
clone()
The objects of this class are cloneable with this method.
DiseasystoreBase$clone(deep = FALSE)
deep
Whether to make a deep clone.
# DiseasystoreBase is mostly used as the basis of other, more specific, classes
# The DiseasystoreBase can be initialised individually if needed.
ds <- DiseasystoreBase$new(source_conn = NULL,
                           target_conn = DBI::dbConnect(RSQLite::SQLite()))
rm(ds)
This DiseasystoreEcdcRespiratoryViruses R6 class brings support for using the EU-ECDC Respiratory viruses weekly data repository.
See the vignette("diseasystore-ecdc-respiratory-viruses") for details on how to configure the feature store.
A new instance of the DiseasystoreEcdcRespiratoryViruses R6 class.
diseasystore::DiseasystoreBase -> DiseasystoreEcdcRespiratoryViruses
new()
Creates a new instance of the DiseasystoreEcdcRespiratoryViruses R6 class.
DiseasystoreEcdcRespiratoryViruses$new(...)
...
Arguments passed to the ?DiseasystoreBase constructor.
A new instance of the DiseasystoreEcdcRespiratoryViruses R6 class.
clone()
The objects of this class are cloneable with this method.
DiseasystoreEcdcRespiratoryViruses$clone(deep = FALSE)
deep
Whether to make a deep clone.
ds <- DiseasystoreEcdcRespiratoryViruses$new(
  source_conn = ".",
  target_conn = DBI::dbConnect(RSQLite::SQLite())
)
rm(ds)
This DiseasystoreGoogleCovid19 R6 class brings support for using the Google Health COVID-19 Open Data repository.
See the vignette("diseasystore-google-covid-19") for details on how to configure the feature store.
A new instance of the DiseasystoreGoogleCovid19 R6 class.
diseasystore::DiseasystoreBase -> DiseasystoreGoogleCovid19
clone()
The objects of this class are cloneable with this method.
DiseasystoreGoogleCovid19$clone(deep = FALSE)
deep
Whether to make a deep clone.
ds <- DiseasystoreGoogleCovid19$new(
  source_conn = ".",
  target_conn = DBI::dbConnect(RSQLite::SQLite())
)
rm(ds)
simulist features
This DiseasystoreSimulist R6 class brings support for individual level data.
A new instance of the DiseasystoreSimulist R6 class.
diseasystore::DiseasystoreBase -> DiseasystoreSimulist
new()
Creates a new instance of the DiseasystoreSimulist R6 class.
DiseasystoreSimulist$new(...)
...
Arguments passed to the ?DiseasystoreBase constructor.
A new instance of the DiseasystoreSimulist R6 class.
clone()
The objects of this class are cloneable with this method.
DiseasystoreSimulist$clone(deep = FALSE)
deep
Whether to make a deep clone.
ds <- DiseasystoreSimulist$new(
  source_conn = ".",
  target_conn = DBI::dbConnect(duckdb::duckdb())
)
rm(ds)
Drop feature stores from DB
drop_diseasystore(
  pattern = NULL,
  schema = diseasyoption("target_schema", namespace = "diseasystore"),
  conn = SCDB::get_connection()
)
Arguments: pattern, schema, conn
NULL (called for side effects)
conn <- SCDB::get_connection(drv = RSQLite::SQLite())
drop_diseasystore(conn = conn)
DBI::dbDisconnect(conn)
This FeatureHandler R6 class handles individual features for the feature stores.
They define the three methods associated with features (compute, get and key_join).
A new instance of the FeatureHandler R6 class.
compute (function)
A function of the form "function(start_date, end_date, slice_ts, source_conn, ds (optional), ...)".
This function should compute the feature from the source connection.
get (function)
A function of the form "function(target_table, slice_ts, target_conn)".
This function should retrieve the computed feature from the target connection.
key_join (function)
One of the aggregators from aggregators.
new()
Creates a new instance of the FeatureHandler R6 class.
FeatureHandler$new(compute = NULL, get = NULL, key_join = NULL)
compute (function)
A function of the form "function(start_date, end_date, slice_ts, source_conn, ds (optional), ...)".
This function should return a data.frame with the computed feature (computed from the source connection).
The data.frame should contain the following columns:
key_*: One (or more) columns containing keys to link this feature with other features
*: One (or more) columns containing the features that are computed
valid_from, valid_until: A set of columns containing the time period for which this feature information is valid.
get (function)
(Optional). A function of the form "function(target_table, slice_ts, target_conn, ...)".
This function should retrieve the computed feature from the target connection.
key_join (function)
A function like one of the aggregators from aggregators().
The function should return an expression of the form: dplyr::summarise(.data, dplyr::across(.cols = tidyselect::all_of(feature), .fns = list(n = ~ aggregation function), .names = "{.fn}"), .groups = "drop")
A new instance of the FeatureHandler R6 class.
clone()
The objects of this class are cloneable with this method.
FeatureHandler$clone(deep = FALSE)
deep
Whether to make a deep clone.
# The FeatureHandler is typically configured as part of making a new Diseasystore.
# Most often, we need only specify `compute` and `key_join` to get a functioning FeatureHandler

# In this example we use mtcars as the basis for our features
conn <- SCDB::get_connection(drv = RSQLite::SQLite())

# We use mtcars as our basis. First we add the rownames as an actual column
data <- dplyr::mutate(mtcars, key_name = rownames(mtcars), .before = dplyr::everything())

# Then we add some imaginary times where these cars were produced
data <- dplyr::mutate(data,
                      production_start = as.Date(Sys.Date()) + floor(runif(nrow(mtcars)) * 100),
                      production_end = production_start + floor(runif(nrow(mtcars)) * 365))

dplyr::copy_to(conn, data, "mtcars")

# In this example, the feature we want is the "maximum miles per gallon"
# The feature in question in the mtcars data set is then "mpg" and when we need to reduce
# our data set, we want to use the "max()" function.

# We first write a compute function for the mpg in our modified mtcars data set
# Our goal is to get the mpg of all cars that were in production between start_date and end_date
compute_mpg <- function(start_date, end_date, slice_ts, source_conn) {
  out <- SCDB::get_table(source_conn, "mtcars", slice_ts = slice_ts) |>
    dplyr::filter({{ start_date }} <= .data$production_end,
                  .data$production_start <= {{ end_date }}) |>
    # select() keeps and renames the columns (strings are treated as column names)
    dplyr::select("key_name", "mpg",
                  "valid_from" = "production_start", "valid_until" = "production_end")

  return(out)
}

# We can now combine into our FeatureHandler
fh_max_mpg <- FeatureHandler$new(compute = compute_mpg, key_join = key_join_max)

DBI::dbDisconnect(conn)
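The shape a compute function is expected to return (key_* columns, feature columns, and valid_from / valid_until) can also be illustrated without a database. The sketch below uses base R only; the data values and the helper `overlaps_period` are made up for illustration and are not part of diseasystore. It mirrors the inclusive overlap filter used in compute_mpg.

```r
# Made-up feature data following the key_* / feature / valid_from / valid_until layout
feature <- data.frame(
  key_name    = c("a", "b", "c"),
  mpg         = c(21.0, 22.8, 18.7),
  valid_from  = as.Date(c("2024-01-01", "2024-03-01", "2024-06-01")),
  valid_until = as.Date(c("2024-02-01", "2024-05-01", "2024-07-01")),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows whose validity period overlaps the study period
overlaps_period <- function(df, start_date, end_date) {
  df[start_date <= df$valid_until & df$valid_from <= end_date, ]
}

in_period <- overlaps_period(feature, as.Date("2024-01-15"), as.Date("2024-03-15"))
in_period$key_name  # "a" "b"
max(in_period$mpg)  # 22.8, the kind of reduction a max aggregator performs
```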
Get the diseasystore for the case definition
get_diseasystore(label)
Arguments: label
The diseasystore generator for the diseasystore matching the given label.
ds <- get_diseasystore("Google COVID-19") # Returns the DiseasystoreGoogleCovid19 generator
source_conn_path: static URL / directory. This helper determines whether source_conn is a file path or URL and creates the full path to the file as needed based on the type of source_conn.
source_conn_github: static GitHub API URL / git directory. This helper determines whether source_conn is a git directory or a GitHub API URL and creates the full path to the file as needed based on the type of source_conn.
A GitHub token can be configured in the "GITHUB_PAT" environment variable to avoid rate limiting.
If the basename of the requested file contains a date, the function will use fuzzy-matching to determine the closest matching, chronologically earlier, file location to return.
source_conn_path(source_conn, file)
source_conn_github(source_conn, file, pull = TRUE)
Arguments: source_conn, file, pull
(character(1))
The full path to the requested file.
# Simulating a data directory
source_conn <- "data_dir"
dir.create(source_conn)
write.csv(mtcars, file.path(source_conn, "mtcars.csv"))
write.csv(iris, file.path(source_conn, "iris.csv"))

# Get file path for mtcars.csv
source_conn_path(source_conn, "mtcars.csv")

# Clean up
unlink(source_conn, recursive = TRUE)
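The date-based fuzzy matching mentioned for source_conn_github can be pictured as "pick the latest dated file that is not after the requested date". The base R sketch below is an assumption about that behaviour, not the package's algorithm: the helper `closest_earlier_file`, the YYYY-MM-DD date format, and the choice to treat an exact date match as eligible are all introduced here for illustration.

```r
# Hypothetical helper for illustration only.
closest_earlier_file <- function(requested, available) {
  date_pattern <- "[0-9]{4}-[0-9]{2}-[0-9]{2}"
  requested_date <- as.Date(regmatches(requested, regexpr(date_pattern, requested)))

  # Only consider files that carry a date and share the rest of the file name
  candidates <- available[grepl(date_pattern, available)]
  candidate_dates <- as.Date(regmatches(candidates, regexpr(date_pattern, candidates)))
  same_stem <- sub(date_pattern, "", candidates) == sub(date_pattern, "", requested)

  # Eligible files are dated on or before the requested date
  eligible <- same_stem & candidate_dates <= requested_date
  if (!any(eligible)) return(NA_character_)

  candidates[eligible][which.max(as.numeric(candidate_dates[eligible]))]
}

available <- c(
  "2023-11-10_ILIARIRates.csv",
  "2023-11-17_ILIARIRates.csv",
  "2023-12-01_ILIARIRates.csv"
)

closest_earlier_file("2023-11-24_ILIARIRates.csv", available)
# "2023-11-17_ILIARIRates.csv"
```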
This function runs a battery of tests of the given diseasystore.
The supplied diseasystore must be a generator for the diseasystore, not an instance of the diseasystore.
The tests assume that data has been made available locally to run the majority of the tests. The location of the local data should be configured in the options for "source_conn" of the given diseasystore before calling test_diseasystore.
test_diseasystore(
  diseasystore_generator = NULL,
  conn_generator = NULL,
  data_files = NULL,
  target_schema = "test_ds",
  test_start_date = NULL,
  skip_backends = NULL,
  ...
)
Arguments: diseasystore_generator, conn_generator, data_files, target_schema, test_start_date, skip_backends
...: Other parameters passed to the diseasystore generator.
NULL (called for side effects)
withr::local_options("diseasystore.DiseasystoreEcdcRespiratoryViruses.pull" = FALSE)

conn_generator <- function(skip_backends = NULL) {
  switch(
    ("SQLiteConnection" %in% skip_backends) + 1,
    list(DBI::dbConnect(RSQLite::SQLite())), # SQLiteConnection not in skip_backends
    list()                                   # SQLiteConnection in skip_backends
  )
}

test_diseasystore(
  DiseasystoreEcdcRespiratoryViruses,
  conn_generator,
  data_files = "data/snapshots/2023-11-24_ILIARIRates.csv",
  target_schema = "test_ds",
  test_start_date = as.Date("2022-06-20"),
  slice_ts = "2023-11-24"
)
Transform case definition to PascalCase
to_diseasystore_case(label)
Arguments: label
The given label formatted to match a Diseasystore.
to_diseasystore_case("Google COVID-19") # DiseasystoreGoogleCovid19
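The label-to-class-name transformation can be sketched in base R as: split the label on non-alphanumeric characters, capitalise each word, and prefix "Diseasystore". The helper `to_pascal_label` is a hypothetical stand-in written for illustration; to_diseasystore_case may handle edge cases differently.

```r
# Hypothetical helper for illustration only.
to_pascal_label <- function(label) {
  # Split on anything that is not a letter or digit and drop empty pieces
  words <- strsplit(label, "[^[:alnum:]]+")[[1]]
  words <- words[nzchar(words)]

  # Upper-case the first character of each word, lower-case the rest
  words <- paste0(toupper(substring(words, 1, 1)), tolower(substring(words, 2)))

  paste0("Diseasystore", paste(words, collapse = ""))
}

to_pascal_label("Google COVID-19")  # "DiseasystoreGoogleCovid19"
```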