Documentation for `get_rf_input_param_list_output_cv_imp` Function
Source: `vignettes/get_rf_input_param_list_output_cv_imp.Rmd`
Purpose
The `get_rf_input_param_list_output_cv_imp` function prepares the data needed to train and evaluate a Random Forest (RF) model, then runs cross-validation and computes variable importance scores. It supports several configurations, including imputation, hyperparameter tuning, and restricting the analysis to rat studies. The function reads study data from either an XPT file or an SQLite database, harmonizes it, and then trains and evaluates the model.
Input Parameters
| Parameter | Type | Description | Default Value |
|---|---|---|---|
| `path_db` | character | Path to the SQLite database. | N/A |
| `rat_studies` | logical | If `TRUE`, limits the studies to rat studies. | `FALSE` |
| `studyid_metadata` | data.frame | A data frame containing metadata for the studies. | N/A |
| `fake_study` | logical | If `TRUE`, uses a fake study for data processing. | `FALSE` |
| `use_xpt_file` | logical | If `TRUE`, reads data from an XPT file instead of a database. | `FALSE` |
| `Round` | logical | If `TRUE`, rounds the liver scores. | `FALSE` |
| `Impute` | logical | If `TRUE`, imputes missing values in the data. | `FALSE` |
| `reps` | integer | Number of repetitions for model evaluation. | N/A |
| `holdback` | numeric | The proportion of data to hold back for validation. | N/A |
| `Undersample` | logical | If `TRUE`, undersamples the data to balance the classes. | `FALSE` |
| `hyperparameter_tuning` | logical | If `TRUE`, tunes hyperparameters for the Random Forest model. | `FALSE` |
| `error_correction_method` | character | The error correction method. Options: `"Flip"`, `"Prune"`, or `"None"`. | N/A |
| `best.m` | numeric | A predefined value for the number of trees in the Random Forest model. If `NULL`, the function will determine this automatically. | `NULL` |
| `testReps` | integer | Number of test repetitions for model evaluation. | N/A |
| `indeterminateUpper` | numeric | Upper threshold for indeterminate predictions. | N/A |
| `indeterminateLower` | numeric | Lower threshold for indeterminate predictions. | N/A |
| `Type` | character | The type of Random Forest model to use. Options include classification or regression models. | N/A |
| `nTopImportance` | integer | The number of top important features to consider for the model. | N/A |
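
To make the two indeterminate thresholds concrete, here is a minimal sketch of how a lower and an upper cut-off could split predicted probabilities into three groups. The helper `label_predictions` is hypothetical and not part of the package, and the exact rule the package applies (including whether the boundaries are inclusive) is an assumption.

```r
# Hypothetical helper, not part of the package: label predicted probabilities
# using a lower and an upper cut-off. Treating the boundaries as inclusive
# is an assumption made for this sketch.
label_predictions <- function(prob, indeterminateLower, indeterminateUpper) {
  stopifnot(indeterminateLower <= indeterminateUpper)
  ifelse(prob >= indeterminateUpper, "positive",
         ifelse(prob <= indeterminateLower, "negative", "indeterminate"))
}

# Made-up probabilities for illustration only
probs <- c(0.05, 0.40, 0.75, 0.95)
labels <- label_predictions(probs,
                            indeterminateLower = 0.1,
                            indeterminateUpper = 0.9)
print(labels)
```

With these cut-offs, only predictions at or beyond 0.1 and 0.9 receive a definite label; everything in between is treated as indeterminate.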
Output
The function returns the result of `get_rf_model_with_cv`: a Random Forest model trained with cross-validation (CV), together with the cross-validation results and the variable importance scores.
Key Steps
- Data Source Selection:
  - If `use_xpt_file` is `TRUE`, the function loads data from an XPT file.
  - If `fake_study` is `TRUE`, it fetches data from a SQLite database and filters based on `rat_studies`.
  - If neither condition is met, it retrieves study IDs from the database using `get_repeat_dose_parallel_studyids`.
- Data Harmonization:
  - The function calls `get_liver_om_lb_mi_tox_score_list` to calculate liver scores for the studies, which are then harmonized using `get_col_harmonized_scores_df`.
- Machine Learning Data Preparation:
  - The function prepares data for Random Forest model training by calling `get_ml_data_and_tuned_hyperparameters`. This step involves imputation, optional hyperparameter tuning, and data balancing.
- Random Forest Model Training and Evaluation:
  - The function calls `get_rf_model_with_cv` to train and evaluate the Random Forest model with cross-validation. The model's performance is evaluated across multiple repetitions (`testReps`), with the option to include top importance features.
- Error Correction:
  - If specified, the function applies an error correction method (`"Flip"`, `"Prune"`, or `"None"`).
- Return:
  - The function returns the trained Random Forest model along with cross-validation results and feature importance scores.
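
The data balancing and holdback ideas in the steps above can be sketched in base R. This is an illustration only: the helpers `split_holdback` and `undersample`, the toy data, and the choice to undersample the training portion after splitting are all assumptions, not the package's actual implementation.

```r
set.seed(42)  # make the random sampling reproducible

# Toy data (made up): 'y' is a hypothetical binary outcome with imbalanced classes
toy <- data.frame(
  y  = factor(c(rep("0", 30), rep("1", 10))),
  x1 = rnorm(40),
  x2 = rnorm(40)
)

# Hypothetical helper: set aside a proportion of rows for testing
split_holdback <- function(df, holdback = 0.2) {
  n_test    <- round(nrow(df) * holdback)
  test_idx  <- sample(seq_len(nrow(df)), n_test)
  train_idx <- setdiff(seq_len(nrow(df)), test_idx)
  list(train = df[train_idx, , drop = FALSE],
       test  = df[test_idx, , drop = FALSE])
}

# Hypothetical helper: keep every row of the smallest class and an equally
# sized random sample of each larger class
undersample <- function(df, outcome = "y") {
  counts <- table(df[[outcome]])
  n_min  <- min(counts)
  keep <- unlist(lapply(names(counts), function(cls) {
    idx <- which(df[[outcome]] == cls)
    if (length(idx) > n_min) sample(idx, n_min) else idx
  }))
  df[sort(keep), , drop = FALSE]
}

# One repetition: split first, then balance only the training portion
parts     <- split_holdback(toy, holdback = 0.2)
train_bal <- undersample(parts$train)

print(table(train_bal$y))  # class counts are equal after undersampling
print(nrow(parts$test))    # rows held back for testing
```

Repeating the split with different random draws gives the kind of repeated evaluation that `reps` and `testReps` control; balancing only the training rows keeps the held-back rows representative of the original class mix.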
Example Usage
```r
# metadata_df is assumed to be an existing data frame of study metadata
result <- get_rf_input_param_list_output_cv_imp(
  path_db = "path/to/database",
  rat_studies = TRUE,
  studyid_metadata = metadata_df,
  fake_study = FALSE,
  use_xpt_file = FALSE,
  Round = TRUE,
  Impute = TRUE,
  reps = 10,
  holdback = 0.2,
  Undersample = TRUE,
  hyperparameter_tuning = TRUE,
  error_correction_method = "Flip",
  best.m = NULL,
  testReps = 5,
  indeterminateUpper = 0.9,
  indeterminateLower = 0.1,
  Type = "classification",
  nTopImportance = 10
)
```