Get Important Features from Random Forest Model with Cross-Validation
Source: R/get_imp_features_from_rf_model_with_cv.R
get_imp_features_from_rf_model_with_cv.Rd
This function performs cross-validation with test repetitions on a random forest model, calculates feature importance using Gini importance, and returns the top n important features, where n is given by nTopImportance.
Usage
get_imp_features_from_rf_model_with_cv(
Data = NULL,
Undersample = FALSE,
best.m = NULL,
testReps,
Type,
nTopImportance
)
Arguments
- Data
A data frame containing the training data (rows as samples, columns as features). The first column is assumed to be the target variable.
- Undersample
A logical value indicating whether to apply under-sampling to balance the classes in the training data. Default is FALSE.
- best.m
A numeric value representing the number of variables to consider at each split of the Random Forest model (or a function to determine this). Default is NULL.
- testReps
A numeric value indicating the number of test repetitions (must be at least 2).
- Type
A numeric value indicating the type of importance to be calculated: 1 for Mean Decrease Accuracy and 2 for Mean Decrease Gini.
- nTopImportance
A numeric value indicating the number of top important features to return based on their importance scores.
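As a rough illustration of how these arguments fit together, the sketch below puts the target variable in the first column and then calls the function. The two-class subset of iris and the argument values are assumptions chosen for illustration only. The call is guarded so the snippet still runs when the package is not loaded.

```r
# Illustrative data: a two-class subset of iris (an assumption, not from the package).
dat <- iris[iris$Species != "setosa", ]
dat$Species <- droplevels(dat$Species)

# Data expects the target variable in the first column, so move Species to the front.
dat <- dat[, c("Species", setdiff(names(dat), "Species"))]

# Only call the function if it is available in the current session.
if (exists("get_imp_features_from_rf_model_with_cv")) {
  res <- get_imp_features_from_rf_model_with_cv(
    Data = dat,
    Undersample = FALSE,   # classes are already balanced here
    best.m = NULL,         # documented default
    testReps = 5,          # must be at least 2
    Type = 2,              # 2 = Mean Decrease Gini
    nTopImportance = 3     # number of top features to return
  )
}
```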
Value
A list containing:
- gini_scores
A matrix of Gini importance scores for each feature across the different cross-validation iterations. The matrix has rows representing features and columns representing test iterations.
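One possible way to work with a matrix of this shape is to average the scores across iterations and rank the features. The sketch below uses a small made-up matrix in place of a real result. The object `res`, the feature names, and the numbers are all hypothetical.

```r
# Hypothetical result: 4 features (rows) x 3 test iterations (columns).
res <- list(gini_scores = matrix(
  c(0.9, 1.1, 1.0,
    3.2, 2.8, 3.0,
    0.2, 0.3, 0.1,
    2.0, 2.2, 1.8),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("f1", "f2", "f3", "f4"), c("iter1", "iter2", "iter3"))
))

# Average each feature's score over the iterations.
mean_scores <- rowMeans(res$gini_scores)

# Rank features from most to least important and keep the top two names.
ranked <- sort(mean_scores, decreasing = TRUE)
top2 <- names(ranked)[1:2]
```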
Details
This function trains a Random Forest model using cross-validation with specified repetitions and calculates the feature importance using Gini importance scores. The function also supports optional under-sampling to balance the class distribution in the training set.
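The under-sampling step is not spelled out here. A common approach, shown below as a minimal sketch rather than the package's actual implementation, is to randomly down-sample every class to the size of the smallest one. The data frame `train` and its columns are hypothetical.

```r
set.seed(1)

# Hypothetical imbalanced training set with the target in the first column.
train <- data.frame(
  class = factor(c(rep("A", 40), rep("B", 10))),
  x1 = rnorm(50),
  x2 = rnorm(50)
)

# Size of the smallest class.
n_min <- min(table(train$class))

# For each class, keep a random sample of n_min row indices.
keep <- unlist(lapply(
  split(seq_len(nrow(train)), train$class),
  function(idx) idx[sample.int(length(idx), n_min)]
))

# Every class now has n_min rows.
balanced <- train[keep, , drop = FALSE]
```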
The function performs the following steps:
- Initializes performance metric trackers.
- Prepares the input data for cross-validation.
- Performs cross-validation, where each repetition involves training the model on a subset of data and testing on the remaining data.
- Optionally applies under-sampling to the training data.
- Trains a Random Forest model on each fold and calculates Gini importance scores.
- Aggregates and sorts the Gini importance scores to identify the top features.
- Plots the importance of top features.
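The steps above can be approximated with the randomForest package, as in the following sketch. It is not the function's source code. It assumes each of the testReps repetitions holds out one randomly assigned fold, and the variable names (`fold_id`, `scores`, `top_features`) are illustrative. Evaluation on the held-out fold is omitted for brevity.

```r
library(randomForest)

set.seed(42)

# Example data with the target in the first column.
dat <- data.frame(Species = iris$Species, iris[, 1:4])

test_reps <- 5   # treated here as the number of folds (assumption)
imp_type <- 2    # 1 = Mean Decrease Accuracy, 2 = Mean Decrease Gini
n_top <- 3       # number of top features to keep

# Randomly assign every sample to one of the folds.
fold_id <- sample(rep(seq_len(test_reps), length.out = nrow(dat)))

# One row per feature, one column per repetition.
features <- colnames(dat)[-1]
scores <- matrix(
  NA_real_, nrow = length(features), ncol = test_reps,
  dimnames = list(features, paste0("rep", seq_len(test_reps)))
)

for (i in seq_len(test_reps)) {
  # Train on all folds except fold i.
  train <- dat[fold_id != i, , drop = FALSE]
  fit <- randomForest(
    x = train[, -1, drop = FALSE],
    y = train[[1]],
    importance = TRUE   # needed for Mean Decrease Accuracy (type = 1)
  )
  # importance() returns a one-column matrix when type is given.
  scores[, i] <- importance(fit, type = imp_type)[features, 1]
}

# Aggregate across repetitions and keep the highest-scoring features.
top_features <- head(sort(rowMeans(scores), decreasing = TRUE), n_top)
```

A call such as `barplot(rev(top_features), horiz = TRUE)` would then give a simple plot of the selected features.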