Function Documentation: get_imp_features_from_rf_model_with_cv
Your Name
2025-05-01
Source: vignettes/get_imp_features_from_rf_model_with_cv.Rmd

Purpose
The get_imp_features_from_rf_model_with_cv function
performs repeated train/test cross-validation with a random forest
model, computes Gini-based feature importance for each repetition, and
returns the top nTopImportance features. It is primarily used for
evaluating feature importance in classification tasks, using
Random Forest with optional under-sampling and a configurable number
of test repetitions.
Input Parameters
The function accepts the following parameters:
- Data: A data frame containing the training data (typically with rows as samples and columns as features). The first column is assumed to be the target variable.
- Undersample: A logical value (TRUE or FALSE) indicating whether to apply under-sampling to balance the classes in the training data. Default is FALSE.
- best.m: A numeric value giving the number of variables considered at each split of the Random Forest model (or a function used to determine this). Default is NULL; see the sketch after this list for one common way to choose it.
- testReps: A numeric value indicating the number of test repetitions (must be at least 2).
- Type: A numeric value indicating the type of importance to be calculated: 1 for Mean Decrease Accuracy, 2 for Mean Decrease Gini.
- nTopImportance: A numeric value indicating the number of top important features to return based on their importance scores.
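When best.m is left NULL, a common heuristic for classification forests is to try roughly the square root of the number of predictors, and randomForest::tuneRF can search around that value using out-of-bag error. The snippet below is a minimal sketch of that idea, not part of the package: scores_df is a hypothetical data frame whose first column is the target, and the tuneRF settings shown are illustrative.

```r
library(randomForest)

# Hypothetical data frame: first column is the target, the rest are features
p <- ncol(scores_df) - 1

# Common heuristic for classification: mtry ~ sqrt(number of predictors)
best_m_guess <- floor(sqrt(p))

# Optionally let tuneRF search around that value (returns a table of mtry vs. OOB error)
tune_res <- randomForest::tuneRF(
  x = scores_df[, -1],
  y = as.factor(scores_df[[1]]),
  mtryStart  = best_m_guess,
  ntreeTry   = 500,
  stepFactor = 1.5,
  improve    = 0.01,
  trace      = FALSE
)
best_m <- tune_res[which.min(tune_res[, "OOBError"]), "mtry"]
```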
Output
The function returns a list containing:
- gini_scores: A matrix of Gini importance scores for each feature across the different cross-validation iterations, with rows representing features and columns representing test iterations.
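Because gini_scores has one row per feature and one column per test iteration, a per-feature summary can be obtained by averaging across columns. A minimal sketch, assuming result is the list returned by the function:

```r
# Mean Gini importance per feature, averaged over the test repetitions
mean_gini <- rowMeans(result$gini_scores)

# Rank features from most to least important and show the top 10
head(sort(mean_gini, decreasing = TRUE), 10)
```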
Key Steps
1. Initialize Metrics: The function starts by defining several empty vectors to track performance metrics such as Sensitivity, Specificity, PPV, and NPV; these are initialized but not used in the current version.
2. Prepare Data: The function renames the columns of the input Data for consistency and initializes a new data frame (rfTestData) to store prediction results across iterations.
3. Cross-Validation Setup: The function sets up a cross-validation loop with test repetitions. For each repetition, it selects a random subset of the data for testing and uses the rest for training. Optionally, under-sampling can be applied to balance the dataset.
4. Model Training: In each iteration, a Random Forest model is trained on the training data using the randomForest package, with best.m controlling the number of variables considered at each split.
5. Calculate Gini Importance: After training the model, Gini importance scores are calculated for each feature using the randomForest::importance function, and the scores are aggregated across all test repetitions.
6. Aggregate and Sort Importance Scores: After the cross-validation iterations complete, the mean Gini importance score for each feature is calculated and the features are sorted in decreasing order.
7. Plot Feature Importance: A dotchart is generated to visualize the top nTopImportance features based on their importance scores.
8. Return Results: The function returns a list containing the Gini importance scores across all iterations. A sketch of this overall loop is shown after this list.
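The following is a minimal sketch of the loop these steps describe, not the package's actual implementation. The helper name run_cv_gini, the simple downsampling rule, and defaults such as ntree = 500 are illustrative assumptions.

```r
library(randomForest)

# Illustrative sketch of the repeated train/test loop described above
run_cv_gini <- function(Data, Undersample = FALSE, best.m = NULL,
                        testReps = 5, nTopImportance = 10) {
  target   <- as.factor(Data[[1]])
  features <- Data[, -1, drop = FALSE]
  if (is.null(best.m)) best.m <- floor(sqrt(ncol(features)))

  gini_scores <- matrix(NA_real_, nrow = ncol(features), ncol = testReps,
                        dimnames = list(colnames(features), NULL))

  for (rep in seq_len(testReps)) {
    # Hold out a random subset for testing; train on the rest
    test_idx <- sample(nrow(Data), size = floor(nrow(Data) / testReps))
    train_x  <- features[-test_idx, , drop = FALSE]
    train_y  <- target[-test_idx]

    # Optional under-sampling: downsample every class to the size of the smallest one
    if (Undersample) {
      n_min <- min(table(train_y))
      keep  <- unlist(lapply(split(seq_along(train_y), train_y),
                             function(i) sample(i, n_min)))
      train_x <- train_x[keep, , drop = FALSE]
      train_y <- train_y[keep]
    }

    # Train a Random Forest with best.m variables tried at each split
    rf <- randomForest(x = train_x, y = train_y, mtry = best.m, ntree = 500)

    # Mean Decrease Gini for every feature in this repetition
    gini_scores[, rep] <- randomForest::importance(rf, type = 2)[, 1]
  }

  # Average over repetitions, sort, and plot the top features
  mean_gini <- sort(rowMeans(gini_scores), decreasing = TRUE)
  dotchart(rev(head(mean_gini, nTopImportance)),
           xlab = "Mean Decrease Gini", main = "Top features")

  list(gini_scores = gini_scores)
}
```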
```r
# Example of how to call the function
result <- get_imp_features_from_rf_model_with_cv(
  Data = scores_df,
  Undersample = FALSE,
  best.m = 3,
  testReps = 5,
  Type = 2,
  nTopImportance = 10
)
```