Documentation for get_prediction_plot Function
Source: vignettes/get_prediction_plot.Rmd
Function Purpose
The get_prediction_plot function performs model building
and prediction for a dataset using a random forest model. It iterates
over multiple test repetitions, trains the model on the training data,
and makes predictions on the test data. The function then generates a
histogram to visualize the distribution of predictions for the outcome
variable (LIVER).
Input Parameters
The function accepts the following input parameters:
| Parameter | Description | Type |
|---|---|---|
| Data | The dataset to use for training and testing. If NULL, it will be fetched using the get_Data_formatted_for_ml_and_best.m function. | DataFrame (optional) |
| path_db | The path to the database that contains the dataset. | String |
| rat_studies | A flag indicating whether to use rat studies data. Default is FALSE. | Boolean |
| studyid_metadata | Metadata related to the study IDs. Default is NULL. | DataFrame (optional) |
| fake_study | A flag indicating whether to use fake study data. Default is FALSE. | Boolean |
| use_xpt_file | A flag indicating whether to use an XPT file. Default is FALSE. | Boolean |
| Round | A flag indicating whether to round the predictions. Default is FALSE. | Boolean |
| Impute | A flag indicating whether to impute missing values. Default is FALSE. | Boolean |
| reps | The number of repetitions for the cross-validation process. | Integer |
| holdback | The proportion of data to hold back for testing during cross-validation. | Numeric |
| Undersample | A flag indicating whether to perform undersampling on the dataset. Default is FALSE. | Boolean |
| hyperparameter_tuning | A flag indicating whether to perform hyperparameter tuning. Default is FALSE. | Boolean |
| error_correction_method | The method to use for error correction (e.g., “Flip”, “Prune”, or “None”). | String |
| testReps | The number of test repetitions for model evaluation. | Integer |
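To make the holdback and Undersample parameters more concrete, the following base-R sketch shows one plausible reading of them: a random share of rows is held back for testing, and the remaining training rows are undersampled so both LIVER classes are equally represented. This is an illustration only, not the function's actual code. The toy data frame, the score1 and score2 columns, and the choice to undersample only the training rows are assumptions made for this example.

```r
set.seed(42)  # reproducible illustration

# Toy data (hypothetical): 100 studies with an imbalanced LIVER outcome.
toy <- data.frame(
  LIVER  = factor(rep(c("Y", "N"), times = c(30, 70)), levels = c("N", "Y")),
  score1 = rnorm(100),
  score2 = rnorm(100)
)

holdback    <- 0.25  # proportion of rows held back for testing
Undersample <- TRUE  # balance the classes in the training rows

# Hold back a random subset of rows for testing.
n_test   <- round(holdback * nrow(toy))
test_idx <- sample(seq_len(nrow(toy)), n_test)
train    <- toy[-test_idx, ]
test     <- toy[test_idx, ]

# Undersampling (assumed variant): draw, without replacement, as many rows
# from each class as the smallest class contains.
if (Undersample) {
  n_min    <- min(table(train$LIVER))
  keep_idx <- unlist(lapply(levels(train$LIVER), function(cls) {
    rows <- which(train$LIVER == cls)
    rows[sample.int(length(rows), n_min)]
  }))
  train <- train[keep_idx, ]
}

table(train$LIVER)  # both classes now have the same count
```

Undersampling only the training rows keeps the test set representative of the original class balance, which is one common design choice; the function itself may balance the data at a different point.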
Output
The function returns a histogram plot visualizing the predicted
probabilities for the LIVER variable across test
repetitions. The plot shows the distribution of predictions
(probabilities) for both classes (LIVER = “Y” or “N”).
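A histogram of this kind can be sketched with ggplot2 as follows. The data here are simulated, and the column names (prob_Y, LIVER), the bin width, and the choice to colour bars by the observed class are assumptions for illustration; the plot returned by get_prediction_plot may be styled differently.

```r
library(ggplot2)

set.seed(1)  # reproducible illustration

# Hypothetical averaged predictions: one row per study, holding the observed
# LIVER class and the predicted probability of LIVER = "Y".
preds <- data.frame(
  LIVER  = factor(rep(c("Y", "N"), each = 50), levels = c("N", "Y")),
  prob_Y = c(rbeta(50, 5, 2), rbeta(50, 2, 5))
)

# Overlaid histograms, one per observed class, on a shared probability axis.
p <- ggplot(preds, aes(x = prob_Y, fill = LIVER)) +
  geom_histogram(binwidth = 0.05, boundary = 0, position = "identity",
                 alpha = 0.6, colour = "white") +
  coord_cartesian(xlim = c(0, 1)) +
  labs(x = "Predicted probability of LIVER = \"Y\"",
       y = "Number of studies",
       fill = "Observed LIVER")

print(p)
```

Well-separated humps near 0 and 1 would suggest the model distinguishes the two classes; heavy overlap around 0.5 would suggest it does not.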
Key Steps
- Data Preparation:
  - If Data is NULL, the function fetches and formats the data using the get_Data_formatted_for_ml_and_best.m function.
- Cross-Validation:
  - The dataset is divided into training and testing sets for each repetition (testReps).
  - If Undersample is enabled, undersampling is applied to balance the dataset.
- Model Training:
  - A random forest model is trained using the training set for each repetition.
- Prediction:
  - The model makes predictions on the test set.
  - The predicted probabilities are stored for each repetition.
- Result Visualization:
  - The predictions are averaged across repetitions, and a histogram is created to visualize the distribution of the predicted probabilities for the LIVER variable.
- Plot:
  - The histogram is displayed using ggplot2, showing the predicted probabilities for the LIVER outcome (coded as “Y” or “N”).
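The training, prediction, and averaging steps above can be sketched with the randomForest package. This is a minimal illustration under stated assumptions, not the function's implementation: the toy data, the predictor names, ntree = 200, and the per-row averaging scheme (each row's probability is averaged over the repetitions in which it was held back) are all hypothetical choices.

```r
library(randomForest)

set.seed(123)  # reproducible illustration

# Toy data (hypothetical): LIVER depends weakly on score1.
n   <- 120
toy <- data.frame(score1 = rnorm(n), score2 = rnorm(n))
toy$LIVER <- factor(ifelse(toy$score1 + rnorm(n) > 0, "Y", "N"),
                    levels = c("N", "Y"))

testReps <- 5     # number of test repetitions
holdback <- 0.25  # proportion of rows held back in each repetition

prob_sum <- numeric(n)  # running sum of P(LIVER = "Y") per row
n_tested <- integer(n)  # how often each row landed in a test set

for (rep_i in seq_len(testReps)) {
  # New random train/test split for this repetition.
  test_idx <- sample(n, round(holdback * n))
  rf <- randomForest(LIVER ~ ., data = toy[-test_idx, ], ntree = 200)

  # type = "prob" returns one column of class probabilities per level.
  prob <- predict(rf, newdata = toy[test_idx, ], type = "prob")[, "Y"]

  # Store this repetition's predictions for the held-back rows.
  prob_sum[test_idx] <- prob_sum[test_idx] + prob
  n_tested[test_idx] <- n_tested[test_idx] + 1L
}

# Average over the repetitions in which each row was actually tested;
# rows that were never held back stay NA.
mean_prob <- ifelse(n_tested > 0, prob_sum / n_tested, NA_real_)
summary(mean_prob)
```

The resulting mean_prob vector is the kind of per-study averaged probability that the histogram step would then plot.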