
This function performs cross-validation with repeated test splits on a random forest model, computes feature importance scores (Gini importance by default usage; see Type), and identifies the nTopImportance most important features.

Usage

get_imp_features_from_rf_model_with_cv(
  Data = NULL,
  Undersample = FALSE,
  best.m = NULL,
  testReps,
  Type,
  nTopImportance
)

Arguments

Data

A data frame containing the training data (rows as samples, columns as features). The first column is assumed to be the target variable.

Undersample

A logical value indicating whether to apply under-sampling to balance the classes in the training data. Default is FALSE.
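As an illustration of what under-sampling means here, the following is a minimal base R sketch that down-samples every class to the size of the smallest class. The helper name undersample_to_minority and the toy data are hypothetical; the strategy this function applies internally is not documented on this page and may differ.

```r
# Hypothetical helper (not part of this package): randomly keep the same
# number of rows from each class, equal to the size of the smallest class.
undersample_to_minority <- function(df, target_col = 1) {
  classes <- df[[target_col]]
  n_min <- min(table(classes))
  keep <- unlist(
    lapply(split(seq_len(nrow(df)), classes), function(idx) {
      # sample.int on the length avoids the sample() pitfall for length-1 idx
      idx[sample.int(length(idx), n_min)]
    }),
    use.names = FALSE
  )
  df[sort(keep), , drop = FALSE]
}

set.seed(42)
toy <- data.frame(
  label = factor(c(rep("a", 8), rep("b", 3))),  # target in the first column
  x = rnorm(11)
)
balanced <- undersample_to_minority(toy)
table(balanced$label)
```

Only the training portion of each split would normally be balanced this way, so that the test portion keeps the original class distribution.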

best.m

A numeric value giving the number of variables randomly considered as candidates at each split of the random forest (often called mtry), or a function that determines this number. Default is NULL.

testReps

A numeric value indicating the number of test repetitions (must be at least 2).

Type

A numeric value indicating the type of importance to be calculated. 1 for Mean Decrease Accuracy and 2 for Mean Decrease Gini.
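The two codes match the type argument of importance() in the randomForest package. The sketch below assumes that package is the backend, which this page does not state, and uses the built-in iris data purely for illustration. It is skipped when randomForest is not installed.

```r
# Assumption: the randomForest package is available; this page does not
# confirm which backend the function uses.
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(1)
  rf <- randomForest::randomForest(
    Species ~ ., data = iris,
    mtry = 2,            # plays the role of best.m
    ntree = 100,
    importance = TRUE    # required for Mean Decrease Accuracy
  )
  mda  <- randomForest::importance(rf, type = 1)  # Mean Decrease Accuracy
  gini <- randomForest::importance(rf, type = 2)  # Mean Decrease Gini
  print(mda)
  print(gini)
}
```

Mean Decrease Accuracy is permutation-based and needs importance = TRUE at training time, whereas Mean Decrease Gini is computed from node impurity during tree construction.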

nTopImportance

A numeric value indicating the number of top important features to return based on their importance scores.

Value

A list containing:

gini_scores

A matrix of Gini importance scores with one row per feature and one column per test iteration.
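A matrix of this shape can be summarised per feature to rank features. The sketch below uses a made-up 4 x 3 matrix and averages across iterations; averaging is an assumption, since this page does not say how the scores are aggregated.

```r
# Toy importance matrix: 4 features (rows) x 3 test iterations (columns).
# Values are invented for illustration only.
gini_scores <- matrix(
  c(0.9, 1.1, 1.0,
    0.2, 0.1, 0.3,
    2.0, 1.8, 2.2,
    0.5, 0.6, 0.4),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("f1", "f2", "f3", "f4"), paste0("rep", 1:3))
)

# Assumption: aggregate by the mean across iterations, then rank.
mean_importance <- rowMeans(gini_scores)
n_top <- 2
top_features <- names(sort(mean_importance, decreasing = TRUE))[seq_len(n_top)]
top_features
```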

Details

This function trains a Random Forest model using cross-validation with specified repetitions and calculates the feature importance using Gini importance scores. The function also supports optional under-sampling to balance the class distribution in the training set.

The function performs the following steps:

  • Initializes performance metric trackers.

  • Prepares the input data for cross-validation.

  • Performs cross-validation, where each repetition involves training the model on a subset of data and testing on the remaining data.

  • Optionally applies under-sampling to the training data.

  • Trains a Random Forest model on each fold and calculates Gini importance scores.

  • Aggregates and sorts the Gini importance scores to identify the top features.

  • Plots the importance of top features.
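The splitting step above can be sketched in base R as follows, assuming a k-fold style scheme in which each repetition holds out a different subset for testing. The helper name make_splits is hypothetical, and the actual splitting scheme used by the function is not documented here.

```r
# Hypothetical helper: assign each sample to one of test_reps folds and
# return the train/test row indices for every repetition.
make_splits <- function(n_samples, test_reps) {
  stopifnot(test_reps >= 2)  # mirrors the documented constraint on testReps
  fold_id <- sample(rep(seq_len(test_reps), length.out = n_samples))
  lapply(seq_len(test_reps), function(k) {
    list(
      train = which(fold_id != k),  # rows used to fit the model
      test  = which(fold_id == k)   # rows held out for this repetition
    )
  })
}

set.seed(1)
splits <- make_splits(n_samples = 20, test_reps = 5)
lengths(lapply(splits, `[[`, "test"))
```

In a full loop, a model would be fitted on each train index set (after optional under-sampling), and its importance scores stored as one column of the result matrix.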

Examples

if (FALSE) { # \dontrun{
# Example call. scores_df is assumed to be a data frame with samples in rows
# and the target variable in the first column.
result <- get_imp_features_from_rf_model_with_cv(
  Data = scores_df,
  Undersample = FALSE,
  best.m = 3,
  testReps = 5,
  Type = 2,
  nTopImportance = 10
)
} # }