Get Important Features from Random Forest Model with Cross-Validation — get_imp_features_from_rf_model_with

This function runs cross-validation with repeated tests on a random forest model, calculates feature importance (Mean Decrease Accuracy or Mean Decrease Gini, as selected by Type), and returns the top nTopImportance features.

Usage

get_imp_features_from_rf_model_with_cv(
  scores_data_df,
  Undersample = FALSE,
  best.m = NULL,
  testReps,
  Type,
  nTopImportance
)

Arguments

scores_data_df: A data frame containing the training data (rows as samples, columns as features). The first column is assumed to be the target variable.
Undersample: A logical value indicating whether to apply under-sampling to balance the classes in the training data. Default is FALSE.
best.m: A numeric value giving the number of variables to consider at each split of the Random Forest model (or a function that determines this number). Default is NULL.
testReps: A numeric value giving the number of test repetitions (must be at least 2).
Type: A numeric value selecting the type of importance to calculate: 1 for Mean Decrease Accuracy, 2 for Mean Decrease Gini.
nTopImportance: A numeric value giving the number of top features to return, ranked by importance score.

Value

A list containing:

gini_scores: A matrix of Gini importance scores for each feature across the different cross-validation iterations. The matrix has rows representing features and columns representing test iterations.
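
As a rough illustration of how the returned matrix might be summarised, the sketch below averages each feature's scores across iterations and keeps the highest-ranked ones. The matrix values, the feature names, and the use of the mean as the aggregate are assumptions made for this example. The function's own aggregation may differ.

```r
# Hypothetical gini_scores matrix: 4 features (rows) x 3 test iterations (columns).
# matrix() fills column by column, so each line below is one iteration.
gini_scores <- matrix(
  c(5.0, 1.0, 3.0, 0.5,   # iteration 1
    6.0, 1.5, 2.0, 0.5,   # iteration 2
    4.0, 0.5, 4.0, 0.5),  # iteration 3
  nrow = 4,
  dimnames = list(c("feat_a", "feat_b", "feat_c", "feat_d"), paste0("rep", 1:3))
)

# Average each feature's importance over the iterations (assumed aggregate).
mean_importance <- rowMeans(gini_scores)

# Sort in decreasing order and keep the top n features.
n_top <- 2
top_features <- head(sort(mean_importance, decreasing = TRUE), n_top)
print(top_features)
```

Averaging over columns gives a single ranking. A median or a rank-based summary would be a reasonable alternative if individual iterations are noisy.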

Details

This function trains a Random Forest model using cross-validation with specified repetitions and calculates the feature importance using Gini importance scores. The function also supports optional under-sampling to balance the class distribution in the training set.
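
One common way to under-sample is to draw, from every class, as many rows as the smallest class contains. The sketch below shows that idea in base R. The helper name undersample_majority and the toy data are hypothetical, and the package may balance classes differently.

```r
# Hypothetical helper: down-sample every class to the size of the smallest class.
# Assumes the target variable is in the first column of `train_df`.
undersample_majority <- function(train_df) {
  target <- droplevels(as.factor(train_df[[1]]))
  n_min <- min(table(target))
  # For each class, pick n_min row indices at random without replacement.
  keep <- unlist(
    lapply(split(seq_len(nrow(train_df)), target), function(idx) {
      idx[sample.int(length(idx), n_min)]
    }),
    use.names = FALSE
  )
  train_df[sort(keep), , drop = FALSE]
}

set.seed(1)
toy <- data.frame(
  label = factor(c(rep("case", 3), rep("control", 9))),
  x = rnorm(12)
)
balanced <- undersample_majority(toy)
print(table(balanced$label))
```

Under-sampling is applied to the training rows only, so the held-out rows keep their original class distribution.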

The function performs the following steps:

Initializes performance metric trackers.
Prepares the input data for cross-validation.
Performs cross-validation, where each repetition involves training the model on a subset of data and testing on the remaining data.
Optionally applies under-sampling to the training data.
Trains a Random Forest model on each fold and calculates Gini importance scores.
Aggregates and sorts the Gini importance scores to identify the top features.
Plots the importance of top features.
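
The core of these steps can be sketched as follows, assuming the randomForest package is the underlying implementation and that each repetition holds out one fold. The fold assignment, the iris data, and the fixed mtry value are assumptions for illustration. Performance metrics, under-sampling, and plotting are omitted.

```r
library(randomForest)  # assumed dependency; provides randomForest() and importance()

set.seed(42)
data_df <- iris[, c(5, 1:4)]   # target (Species) moved to the first column
test_reps <- 5                 # plays the role of testReps

# Randomly assign every row to one of the test_reps folds.
fold_id <- sample(rep(seq_len(test_reps), length.out = nrow(data_df)))

features <- names(data_df)[-1]
gini_scores <- matrix(
  NA_real_, nrow = length(features), ncol = test_reps,
  dimnames = list(features, paste0("rep", seq_len(test_reps)))
)

for (i in seq_len(test_reps)) {
  # Train on every fold except fold i; rows with fold_id == i are held out.
  train_df <- data_df[fold_id != i, , drop = FALSE]
  fit <- randomForest(x = train_df[, -1], y = train_df[[1]], mtry = 2)
  # type = 2 requests the mean decrease in node impurity (Gini for classification).
  gini_scores[, i] <- importance(fit, type = 2)[features, 1]
}

# Aggregate across repetitions and keep the three highest-ranked features.
top_features <- head(sort(rowMeans(gini_scores), decreasing = TRUE), 3)
print(top_features)
```

Requesting type = 1 (Mean Decrease Accuracy) from randomForest would additionally require fitting with importance = TRUE.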

Examples

if (FALSE) { # \dontrun{
# Example of calling the function
result <- get_imp_features_from_rf_model_with_cv(
  scores_data_df = scores_df,
  Undersample = FALSE,
  best.m = 3,
  testReps = 5,
  Type = 2,
  nTopImportance = 10
)
} # }