Documentation for get_prediction_plot Function
Source: vignettes/get_prediction_plot.Rmd
Function Purpose
The get_prediction_plot function performs model building
and prediction for a dataset using a random forest model. It iterates
over multiple test repetitions, trains the model on the training data,
and makes predictions on the test data. The function then generates a
histogram to visualize the distribution of predictions for the outcome
variable (LIVER).
Input Parameters
The function accepts the following input parameters:
| Parameter | Description | Type |
|---|---|---|
| Data | The dataset to use for training and testing. If NULL, it will be fetched using the get_Data_formatted_for_ml_and_best.m function. | DataFrame (optional) |
| path_db | The path to the database that contains the dataset. | String |
| rat_studies | A flag indicating whether to use rat studies data. Default is FALSE. | Boolean |
| studyid_metadata | Metadata related to the study IDs. Default is NULL. | DataFrame (optional) |
| fake_study | A flag indicating whether to use fake study data. Default is FALSE. | Boolean |
| use_xpt_file | A flag indicating whether to use an XPT file. Default is FALSE. | Boolean |
| Round | A flag indicating whether to round the predictions. Default is FALSE. | Boolean |
| Impute | A flag indicating whether to impute missing values. Default is FALSE. | Boolean |
| reps | The number of repetitions for the cross-validation process. | Integer |
| holdback | The proportion of data to hold back for testing during cross-validation. | Numeric |
| Undersample | A flag indicating whether to perform undersampling on the dataset. Default is FALSE. | Boolean |
| hyperparameter_tuning | A flag indicating whether to perform hyperparameter tuning. Default is FALSE. | Boolean |
| error_correction_method | The method to use for error correction (e.g., “Flip”, “Prune”, or “None”). | String |
| testReps | The number of test repetitions for model evaluation. | Integer |
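As a minimal sketch of how these parameters fit together, the argument list below mirrors the parameter table. The database path and the values chosen for reps, holdback, and testReps are illustrative placeholders rather than documented defaults, and the call is only attempted when get_prediction_plot is available in the session and the placeholder path exists.

```r
# Illustrative argument list; names follow the parameter table.
# The path and the numeric values are placeholders, not documented defaults.
args <- list(
  Data = NULL,                      # NULL: data is fetched from path_db
  path_db = "path/to/database.db",  # hypothetical path
  rat_studies = FALSE,
  studyid_metadata = NULL,
  fake_study = FALSE,
  use_xpt_file = FALSE,
  Round = FALSE,
  Impute = FALSE,
  reps = 1,                         # illustrative value
  holdback = 0.25,                  # illustrative value
  Undersample = FALSE,
  hyperparameter_tuning = FALSE,
  error_correction_method = "None",
  testReps = 5                      # illustrative value
)

# Only call the function if it is loaded and the database file exists.
if (exists("get_prediction_plot", mode = "function") && file.exists(args$path_db)) {
  prediction_plot <- do.call(get_prediction_plot, args)
  print(prediction_plot)
}
```

Building the arguments as a named list keeps every flag visible in one place, which can make it easier to compare runs that differ in a single setting such as Undersample or Impute.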
Output
The function returns a histogram plot visualizing the predicted
probabilities for the LIVER variable across test
repetitions. The plot shows the distribution of predictions
(probabilities) for both classes (LIVER = “Y” or “N”).
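The kind of plot described here can be sketched with simulated values. The data below are random numbers, not output from get_prediction_plot, and colouring the bars by the true LIVER label is an assumption about how the two classes are distinguished; the bin width and axis labels are likewise illustrative.

```r
library(ggplot2)

set.seed(1)

# Simulated stand-in for stored predictions: one row per study,
# one column per test repetition (NA when a study was not in that test set).
n_studies <- 40
test_reps <- 5
pred_matrix <- matrix(runif(n_studies * test_reps), nrow = n_studies)
pred_matrix[sample(length(pred_matrix), 60)] <- NA

# Simulated true outcome labels, for illustration only.
liver <- factor(sample(c("N", "Y"), n_studies, replace = TRUE),
                levels = c("N", "Y"))

# Average each study's predicted probability over the repetitions
# in which it was part of the test set.
plot_data <- data.frame(
  LIVER = liver,
  Probability = rowMeans(pred_matrix, na.rm = TRUE)
)

# Drop studies that never appeared in a test set (rowMeans gives NaN there).
plot_data <- plot_data[!is.nan(plot_data$Probability), ]

# Overlaid histograms of the averaged probabilities, one per LIVER class.
prediction_hist <- ggplot(plot_data, aes(x = Probability, fill = LIVER)) +
  geom_histogram(binwidth = 0.05, position = "identity", alpha = 0.6) +
  labs(x = "Mean predicted probability of LIVER = Y",
       y = "Number of studies")

print(prediction_hist)
```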
Key Steps
- Data Preparation:
  - If Data is NULL, the function fetches and formats the data using the get_Data_formatted_for_ml_and_best.m function.
- Cross-Validation:
  - The dataset is divided into training and testing sets for each repetition (testReps).
  - If Undersample is enabled, undersampling is applied to balance the dataset.
- Model Training:
  - A random forest model is trained using the training set for each repetition.
- Prediction:
  - The model makes predictions on the test set.
  - The predicted probabilities are stored for each repetition.
- Result Visualization:
  - The predictions are averaged across repetitions, and a histogram is created to visualize the distribution of the predicted probabilities for the LIVER variable.
- Plot:
  - The histogram is displayed using ggplot2, showing the predicted probabilities for the LIVER outcome (coded as “Y” or “N”).
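The split, undersample, train, and predict steps above can be sketched as a loop over test repetitions. This is a sketch under stated assumptions, not the function's actual implementation: the data are simulated, the randomForest package stands in for whichever random forest implementation the function uses, the test set is drawn as a simple random holdback, and undersampling is done by randomly dropping majority-class rows.

```r
library(randomForest)  # assumed random forest implementation

set.seed(42)

# Simulated data: a two-level LIVER outcome and three numeric predictors.
n <- 120
Data <- data.frame(
  LIVER = factor(sample(c("N", "Y"), n, replace = TRUE, prob = c(0.7, 0.3)),
                 levels = c("N", "Y")),
  x1 = rnorm(n),
  x2 = rnorm(n),
  x3 = rnorm(n)
)

testReps <- 5       # illustrative value
holdback <- 0.25    # illustrative value
Undersample <- TRUE

# One column of predicted P(LIVER = "Y") per repetition;
# rows that were used for training in a repetition stay NA.
predictions <- matrix(NA_real_, nrow = n, ncol = testReps)

for (i in seq_len(testReps)) {
  # Random holdback split (an assumption about how the split is made).
  test_idx <- sample(n, size = floor(holdback * n))
  train <- Data[-test_idx, ]
  test <- Data[test_idx, ]

  if (Undersample) {
    # Randomly drop majority-class rows so both classes have equal size.
    by_class <- split(seq_len(nrow(train)), train$LIVER)
    n_min <- min(lengths(by_class))
    keep <- unlist(lapply(by_class, function(idx) {
      idx[sample.int(length(idx), n_min)]
    }))
    train <- train[keep, ]
  }

  # Train on the (possibly undersampled) training set.
  rf <- randomForest(LIVER ~ ., data = train, ntree = 200)

  # Store the predicted probability of the "Y" class for the test rows.
  predictions[test_idx, i] <- predict(rf, newdata = test, type = "prob")[, "Y"]
}

# Average each row's predictions over the repetitions in which it was tested.
mean_prediction <- rowMeans(predictions, na.rm = TRUE)
```

Storing one column per repetition keeps the raw per-repetition probabilities available, so the averaging step stays separate from model fitting and the averaged values can be passed straight to a histogram.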