Retrieve and Preprocess Data for Machine Learning Models
Source: R/get_Data_formatted_for_ml_and_best.m.R
get_Data_formatted_for_ml_and_best.m.Rd
This function processes data from a given SQLite database or XPT file, calculates liver toxicity scores, and prepares data for machine learning models. It can also tune hyperparameters and apply error correction methods.
Usage
get_Data_formatted_for_ml_and_best.m(
  path_db,
  rat_studies = FALSE,
  studyid_metadata = NULL,
  fake_study = FALSE,
  use_xpt_file = FALSE,
  Round = FALSE,
  Impute = FALSE,
  reps,
  holdback,
  Undersample = FALSE,
  hyperparameter_tuning = FALSE,
  error_correction_method
)
Arguments
- path_db
A character string representing the path to the SQLite database or XPT file.
- rat_studies
A logical flag to filter for rat studies (default is FALSE).
- studyid_metadata
A data frame containing metadata for the study IDs. If NULL, metadata is generated (default is NULL).
- fake_study
A logical flag to use fake study data (default is FALSE).
- use_xpt_file
A logical flag to indicate whether to use an XPT file instead of a SQLite database (default is FALSE).
- Round
A logical flag to round liver toxicity scores (default is FALSE).
- Impute
A logical flag to impute missing values in the dataset (default is FALSE).
- reps
An integer specifying the number of repetitions for cross-validation.
- holdback
A numeric value indicating the fraction of data to hold back for validation.
- Undersample
A logical flag to undersample the majority class (default is FALSE).
- hyperparameter_tuning
A logical flag to perform hyperparameter tuning (default is FALSE).
- error_correction_method
A character string specifying the error correction method. Must be one of 'Flip', 'Prune', or 'None'.
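The constraints described above can be checked before the function is called. The following is a minimal sketch and not part of the package: `check_ml_args` is a hypothetical helper, and the requirements that `reps` be a positive whole number and that `holdback` lie strictly between 0 and 1 are assumptions inferred from the argument descriptions.

```r
# Hypothetical helper (not part of the package): validates a few arguments
# before they are passed to get_Data_formatted_for_ml_and_best.m().
check_ml_args <- function(reps, holdback, error_correction_method) {
  # Assumption: reps must be a single positive whole number.
  stopifnot(is.numeric(reps), length(reps) == 1,
            reps >= 1, reps == round(reps))
  # Assumption: holdback is a fraction strictly between 0 and 1.
  stopifnot(is.numeric(holdback), length(holdback) == 1,
            holdback > 0, holdback < 1)
  # The documentation lists exactly these three allowed values.
  # Note that match.arg() also accepts unambiguous partial matches.
  method <- match.arg(error_correction_method, c("Flip", "Prune", "None"))
  list(reps = as.integer(reps), holdback = holdback,
       error_correction_method = method)
}

args_ok <- check_ml_args(reps = 5, holdback = 0.2,
                         error_correction_method = "Flip")
```

An invalid value, such as `holdback = 1.5` or an unlisted method name, stops with an error instead of being passed on.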
Value
A list containing:
- Data
A data frame containing the preprocessed data ready for machine learning.
- best.m
The best machine learning model after hyperparameter tuning, if applicable.
Details
This function performs several key steps:
Retrieves study IDs from an SQLite database or XPT file.
Uses the provided study metadata or generates it when none is supplied; generated metadata includes a random assignment of "Target_Organ" values (either "Liver" or "not_Liver").
Calculates liver toxicity scores using the get_liver_om_lb_mi_tox_score_list function.
Harmonizes the calculated scores using the get_col_harmonized_scores_df function.
Prepares the data for machine learning and tunes hyperparameters (if enabled) using the get_ml_data_and_tuned_hyperparameters function.
Returns the processed data and the best model.
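The random "Target_Organ" assignment mentioned in the steps above might look roughly like the sketch below. This is an illustration under stated assumptions, not the package's implementation: `make_fake_metadata` is a hypothetical helper, and the `STUDYID` column name and the `seed` argument are assumptions.

```r
# Hypothetical helper (not the package's implementation): builds a metadata
# data frame with a randomly assigned Target_Organ for each study ID.
make_fake_metadata <- function(study_ids, seed = NULL) {
  # Optional seed so the random labels can be reproduced.
  if (!is.null(seed)) set.seed(seed)
  data.frame(
    STUDYID = study_ids,  # assumed column name
    Target_Organ = sample(c("Liver", "not_Liver"),
                          size = length(study_ids), replace = TRUE),
    stringsAsFactors = FALSE
  )
}

meta <- make_fake_metadata(c("S001", "S002", "S003", "S004"), seed = 1)
```

Because labels produced this way are random, they carry no real signal; metadata of this kind is presumably best suited to checking that the pipeline runs end to end.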
Examples
if (FALSE) { # \dontrun{
  result <- get_Data_formatted_for_ml_and_best.m(
    path_db = "path/to/database.db",
    rat_studies = TRUE,
    reps = 5,
    holdback = 0.2,
    error_correction_method = "Flip"
  )
  # Access the processed data and the best model
  processed_data <- result$Data
  best_model <- result$best.m
} # }