Documentation for `get_ml_data_and_tuned_hyperparameters` Function
Source: vignettes/get_ml_data_and_tuned_hyperparameters.Rmd
Overview
The `get_ml_data_and_tuned_hyperparameters` function processes and prepares machine learning data for modeling, with optional preprocessing steps such as missing value imputation, undersampling, and hyperparameter tuning. It also supports error correction via the "Flip" and "Prune" methods.
Function Definition
get_ml_data_and_tuned_hyperparameters <- function(Data,
                                                  studyid_metadata,
                                                  Impute = FALSE,
                                                  Round = FALSE,
                                                  reps,
                                                  holdback,
                                                  Undersample = FALSE,
                                                  hyperparameter_tuning = FALSE,
                                                  error_correction_method = NULL) {
  # Function implementation
}

# Example call
result <- get_ml_data_and_tuned_hyperparameters(Data = scores_df,
                                                studyid_metadata = metadata_df,
                                                Impute = TRUE,
                                                Round = TRUE,
                                                reps = 10,
                                                holdback = 0.25,
                                                Undersample = TRUE,
                                                hyperparameter_tuning = TRUE,
                                                error_correction_method = "Flip")

# Access the final data and best mtry hyperparameter
rfData <- result$rfData
best_mtry <- result$best.m
Parameters
- `Data` (data frame): Input data containing the scores. This will typically be a data frame named `scores_df`.
- `studyid_metadata` (data frame): A data frame containing metadata, typically including the `STUDYID` column, which is used for joining with `Data`.
- `Impute` (logical): If `TRUE`, missing values in the dataset will be imputed using random forest imputation.
- `Round` (logical): If `TRUE`, specific columns will be rounded according to the rules described in the function.
- `reps` (numeric): The number of repetitions for cross-validation. A value of 0 skips repetition.
- `holdback` (numeric): The fraction of data to hold back for testing. A value of 1 means leave-one-out cross-validation.
- `Undersample` (logical): If `TRUE`, the training data will be undersampled to balance the target classes.
- `hyperparameter_tuning` (logical): If `TRUE`, hyperparameter tuning will be performed using cross-validation.
- `error_correction_method` (character): Specifies the error correction method to use. Can be one of `"Flip"`, `"Prune"`, or `"None"`. Defaults to `NULL`, which means no correction.
Returns
A list containing:
- `rfData`: The final prepared data after preprocessing, imputation, and any error correction methods.
- `best.m`: The best `mtry` hyperparameter for the random forest model (determined through tuning or default).
Function Workflow
Data Merging
- The function first joins the metadata (`studyid_metadata`) with the input data (`Data`) based on the `STUDYID` column.
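A minimal sketch of this join in base R, using `merge()`. The toy data frames and the `avg_score` column are invented for illustration, and the function itself may use a different join helper; only the `STUDYID` key and the `Target_Organ` column come from the description above.

```r
# Hypothetical toy inputs; avg_score is an illustrative score column.
scores_df <- data.frame(
  STUDYID = c("S1", "S2", "S3"),
  avg_score = c(0.8, 1.4, 0.2)
)
metadata_df <- data.frame(
  STUDYID = c("S1", "S2", "S3"),
  Target_Organ = c("Liver", "not_Liver", "Liver")
)

# Inner join on STUDYID: keeps only studies present in both data frames.
merged <- merge(metadata_df, scores_df, by = "STUDYID")
print(merged)
```

An inner join silently drops studies that lack metadata or scores, so comparing `nrow(merged)` with `nrow(scores_df)` is a cheap sanity check.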
Target Variable Encoding
- The target variable (`Target_Organ`) is encoded such that:
  - `'Liver'` is encoded as `1`.
  - `'not_Liver'` is encoded as `0`.
- This encoding facilitates the modeling process.
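The encoding can be sketched with a named lookup vector. This is an illustration on made-up data, not necessarily how the function implements it; a lookup is used here because any unexpected label becomes `NA` instead of being silently mapped to `0`.

```r
# Hypothetical merged data; only Target_Organ and its two labels come from the text.
merged <- data.frame(
  STUDYID = c("S1", "S2", "S3"),
  Target_Organ = c("Liver", "not_Liver", "Liver"),
  stringsAsFactors = FALSE
)

# Named lookup: 'Liver' -> 1, 'not_Liver' -> 0; other labels yield NA.
encoding <- c(Liver = 1, not_Liver = 0)
merged$Target_Organ <- unname(encoding[as.character(merged$Target_Organ)])
print(merged$Target_Organ)
```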
Missing Value Imputation
- If `Impute` is `TRUE`, missing values are imputed using the `randomForest::rfImpute` function.
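A short sketch of `randomForest::rfImpute` on invented data. The predictor names `score_a` and `score_b` are hypothetical, and the formula interface with default settings is an assumption about how the function calls it. Note that `rfImpute` needs a complete response and only fills in missing predictor values.

```r
# Requires the randomForest package; the sketch is skipped if it is not installed.
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(1)  # rfImpute is stochastic

  # Hypothetical data: two illustrative predictors with a few missing values.
  toy <- data.frame(
    Target_Organ = factor(rep(c(0, 1), each = 10)),
    score_a = c(rnorm(10, mean = 0), rnorm(10, mean = 2)),
    score_b = runif(20)
  )
  toy$score_a[c(3, 15)] <- NA
  toy$score_b[8] <- NA

  # The response (left of ~) must have no NAs; predictors are imputed.
  imputed <- randomForest::rfImpute(Target_Organ ~ ., data = toy)
  print(anyNA(imputed))
}
```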
Data Splitting
- The data is split into training and testing sets:
  - A fraction of the data (`holdback`) is held back for testing.
  - For each repetition (`reps`), the data is split again.
- If `Undersample` is `TRUE`, the training set is undersampled to balance the target classes.
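The repeated split and the undersampling step can be sketched in base R as follows. The data, the `score_a` column, and the strategy of shrinking every class to the size of the rarest one are assumptions for illustration; the sketch also does not cover the leave-one-out case (`holdback = 1`).

```r
set.seed(42)  # make the random splits reproducible

# Hypothetical prepared data with an imbalanced 0/1 target.
rfData <- data.frame(
  Target_Organ = factor(c(rep(1, 30), rep(0, 10))),
  score_a = rnorm(40)  # illustrative predictor
)
holdback <- 0.25
reps <- 3

for (i in seq_len(reps)) {
  # Hold back a random fraction of the rows for testing.
  test_idx <- sample(nrow(rfData), size = round(holdback * nrow(rfData)))
  test <- rfData[test_idx, ]
  train <- rfData[-test_idx, ]

  # Undersample: draw the same number of rows from every class,
  # namely the size of the rarest class in this training split.
  n_min <- min(table(train$Target_Organ))
  rows_by_class <- split(seq_len(nrow(train)), train$Target_Organ)
  keep <- unlist(lapply(rows_by_class,
                        function(rows) rows[sample.int(length(rows), n_min)]))
  train_balanced <- train[keep, ]
  print(table(train_balanced$Target_Organ))
}
```

Only the training rows are undersampled here; the held-back test set keeps its original class proportions.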
Hyperparameter Tuning
- If `hyperparameter_tuning` is `TRUE`:
  - The function performs hyperparameter tuning for the random forest model using cross-validation with `trainControl` from the caret package.
  - The `mtry` parameter is tuned, which controls the number of variables randomly sampled as candidates at each split.
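A hedged sketch of tuning `mtry` with caret. The 5-fold setting, the candidate grid, and the toy predictors (`score_a`, `score_b`, `score_c`) are assumptions chosen for illustration; the function's actual resampling settings are not stated above.

```r
# Requires caret and randomForest; the sketch is skipped if either is missing.
if (requireNamespace("caret", quietly = TRUE) &&
    requireNamespace("randomForest", quietly = TRUE)) {
  suppressPackageStartupMessages(library(caret))
  set.seed(123)

  # Hypothetical training data with three illustrative predictors.
  n <- 60
  train_df <- data.frame(
    Target_Organ = factor(rep(c("not_Liver", "Liver"), each = n / 2)),
    score_a = c(rnorm(n / 2, mean = 0), rnorm(n / 2, mean = 1.5)),
    score_b = runif(n),
    score_c = rnorm(n)
  )

  # Cross-validation setup; the number of folds is an assumption.
  control <- trainControl(method = "cv", number = 5)

  # Candidate mtry values: 1 up to the number of predictors.
  grid <- expand.grid(mtry = 1:3)

  fit <- train(Target_Organ ~ ., data = train_df, method = "rf",
               trControl = control, tuneGrid = grid)

  # The mtry value with the best cross-validated performance.
  best.m <- fit$bestTune$mtry
  print(best.m)
}
```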
Model Training
- A random forest model is trained on the prepared data using the randomForest package.
- The `best.m` hyperparameter is selected based on the tuning or set to a default value.
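The final step might look like the sketch below. The fallback shown, `floor(sqrt(p))` for `p` predictors, is the randomForest package's own default for classification; whether the function uses that same default is an assumption, as are the toy data and the `tuned_mtry` placeholder.

```r
# Requires the randomForest package; the sketch is skipped if it is not installed.
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(7)

  # Hypothetical prepared data with three illustrative predictors.
  rfData <- data.frame(
    Target_Organ = factor(rep(c(0, 1), each = 25)),
    score_a = c(rnorm(25, mean = 0), rnorm(25, mean = 2)),
    score_b = runif(50),
    score_c = rnorm(50)
  )
  n_predictors <- ncol(rfData) - 1

  # Placeholder: would hold the tuned value if tuning had been run.
  tuned_mtry <- NULL

  # Assumed fallback when no tuning was done: floor(sqrt(p)), at least 1.
  best.m <- if (is.null(tuned_mtry)) {
    max(floor(sqrt(n_predictors)), 1)
  } else {
    tuned_mtry
  }

  model <- randomForest::randomForest(Target_Organ ~ ., data = rfData,
                                      mtry = best.m)
  print(model$mtry)
}
```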