Function Documentation: get_imp_features_from_rf_model_with_cv
Your Name
2025-01-02
Source: vignettes/get_imp_features_from_rf_model_with_cv.Rmd
Purpose
The `get_imp_features_from_rf_model_with_cv` function performs cross-validation with test repetitions on a random forest model, calculates feature importance using Gini importance, and returns the top n important features. It is primarily used for evaluating feature importance in classification tasks by using Random Forest with optional under-sampling and custom test repetitions.
Input Parameters
The function accepts the following parameters:
- `Data`: A data frame containing the training data (typically with rows as samples and columns as features). The first column is assumed to be the target variable.
- `Undersample`: A logical value (`TRUE` or `FALSE`) indicating whether to apply under-sampling to balance the classes in the training data. Default is `FALSE`.
- `best.m`: A numeric value representing the number of variables to be considered at each split of the Random Forest model (or a function to determine this). Default is `NULL`.
- `testReps`: A numeric value indicating the number of test repetitions (must be at least 2).
- `Type`: A numeric value indicating the type of importance to be calculated: `1` for Mean Decrease Accuracy and `2` for Mean Decrease Gini.
- `nTopImportance`: A numeric value indicating the number of top important features to return based on their importance scores.
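As an illustration of the expected input, the sketch below builds a small data frame with the target in the first column and shows one way class balancing by under-sampling could work. The data, the column names, and the helper `undersample_to_minority` are hypothetical and are not part of the package; the function's own under-sampling logic may differ.

```r
# Hypothetical example data: the first column is the target, the rest are features.
set.seed(42)
scores_df <- data.frame(
  target = factor(rep(c("case", "control"), times = c(20, 60))),
  feat_a = rnorm(80),
  feat_b = rnorm(80),
  feat_c = rnorm(80)
)

# Hypothetical helper (an assumption, not the package's implementation):
# randomly under-sample every class down to the size of the smallest class.
undersample_to_minority <- function(df, target_col = 1) {
  classes <- droplevels(as.factor(df[[target_col]]))
  n_min <- min(table(classes))
  keep <- unlist(
    lapply(split(seq_len(nrow(df)), classes), function(idx) {
      idx[sample.int(length(idx), n_min)]
    }),
    use.names = FALSE
  )
  df[sort(keep), , drop = FALSE]
}

balanced_df <- undersample_to_minority(scores_df)
table(balanced_df$target)
```

After balancing, both classes contain the same number of rows, so the majority class no longer dominates the training data.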
Output
The function returns a list containing:
- `gini_scores`: A matrix of Gini importance scores for each feature across the different cross-validation iterations. The matrix has rows representing features and columns representing test iterations.
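A short sketch of how a matrix shaped like `gini_scores` can be summarized may help here. The matrix below is mock data standing in for `result$gini_scores`, and the feature names and repetition labels are assumptions made for illustration.

```r
# Mock stand-in for result$gini_scores:
# 4 features (rows) x 3 test repetitions (columns). Values are invented.
gini_scores <- matrix(
  c(5.1, 4.8, 5.3,
    0.9, 1.2, 1.0,
    3.3, 3.0, 3.6,
    2.0, 2.2, 1.8),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("feat_a", "feat_b", "feat_c", "feat_d"),
    paste0("rep", 1:3)
  )
)

# Average each feature's score over the repetitions, then rank.
mean_importance <- sort(rowMeans(gini_scores), decreasing = TRUE)

# Keep the two highest-ranked features (analogous to nTopImportance = 2).
top_features <- head(mean_importance, 2)
names(top_features)
```

Averaging across columns smooths out the variation between repetitions before features are ranked.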
Key Steps
- Initialize Metrics: The function starts by defining several empty vectors to track performance metrics like Sensitivity, Specificity, PPV, NPV, and others, which are initialized but not used in the current version.
- Prepare Data: The function prepares the data by renaming the columns of the input `Data` for consistency and initializing a new data frame (`rfTestData`) to store prediction results across iterations.
- Cross-Validation Setup: The function sets up a cross-validation loop with test repetitions. For each repetition, it selects a random subset of data to test and uses the rest for training. Optionally, under-sampling can be applied to balance the dataset.
- Model Training: A Random Forest model is trained on the training data in each iteration using the `randomForest` package. It uses the specified value for `best.m` to control the number of variables considered at each split.
- Calculate Gini Importance: After training the model, Gini importance scores are calculated for each feature using the `randomForest::importance` function. The Gini scores are aggregated across all test repetitions.
- Aggregate and Sort Importance Scores: After completing the cross-validation iterations, the mean Gini importance scores for each feature are calculated and sorted in decreasing order.
- Plot Feature Importance: A dotchart is generated to visualize the top `nTopImportance` features based on their importance scores.
- Return Results: The function returns a list containing the Gini importance scores across all iterations.
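The steps above can be sketched roughly as follows. This is a minimal illustration, not the function's actual source: the data, the variable names, and the random fold assignment are assumptions, and it requires the `randomForest` package to be installed.

```r
library(randomForest)  # assumes the randomForest package is installed

set.seed(1)
# Hypothetical data: target in the first column, features in the rest.
dat <- data.frame(
  target = factor(rep(c("a", "b"), each = 50)),
  x1 = c(rnorm(50, mean = 0), rnorm(50, mean = 2)),  # informative feature
  x2 = rnorm(100),                                   # noise
  x3 = rnorm(100)                                    # noise
)

test_reps <- 5
# Assumption: each row is assigned at random to one of test_reps test subsets.
fold <- sample(rep(seq_len(test_reps), length.out = nrow(dat)))

# Rows are features, columns are test repetitions.
gini_scores <- matrix(
  NA_real_, nrow = ncol(dat) - 1, ncol = test_reps,
  dimnames = list(names(dat)[-1], paste0("rep", seq_len(test_reps)))
)

for (i in seq_len(test_reps)) {
  train <- dat[fold != i, ]  # hold out subset i, train on the rest
  fit <- randomForest(
    x = train[, -1], y = train[[1]],
    mtry = 2,            # plays the role of best.m
    importance = TRUE
  )
  # type = 2 requests the mean decrease in Gini (node impurity).
  gini_scores[, i] <- randomForest::importance(fit, type = 2)[, 1]
}

# Aggregate across repetitions and sort in decreasing order.
mean_importance <- sort(rowMeans(gini_scores), decreasing = TRUE)

# Plot only in interactive sessions; rev() puts the top feature at the top.
if (interactive()) {
  dotchart(rev(mean_importance), xlab = "Mean decrease in Gini")
}
```

In this sketch the informative feature `x1` should rank first, since it is the only column that separates the two classes.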
```r
# Example of how to call the function
result <- get_imp_features_from_rf_model_with_cv(
  Data = scores_df,
  Undersample = FALSE,
  best.m = 3,
  testReps = 5,
  Type = 2,
  nTopImportance = 10
)
```