Purpose

The get_imp_features_from_rf_model_with_cv function performs cross-validation with repeated train/test splits on a random forest model, calculates feature importance using Gini importance, and returns the top n important features. It is primarily used to evaluate feature importance in classification tasks with Random Forest, with optional under-sampling and a configurable number of test repetitions.

Input Parameters

The function accepts the following parameters:

  • Data: A data frame containing the training data (typically with rows as samples and columns as features). The first column is assumed to be the target variable.
  • Undersample: A logical value (TRUE or FALSE) indicating whether to apply under-sampling to balance the classes in the training data. Default is FALSE.
  • best.m: A numeric value representing the number of variables to be considered at each split of the Random Forest model (or a function to determine this). Default is NULL.
  • testReps: A numeric value indicating the number of test repetitions (must be at least 2).
  • Type: A numeric value indicating the type of importance to be calculated. 1 for Mean Decrease Accuracy and 2 for Mean Decrease Gini.
  • nTopImportance: A numeric value indicating the number of top important features to return based on their importance scores.
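
As an illustration of the expected Data layout, the target variable occupies the first column and the remaining columns hold the features. The object name scores_df and the column names below are hypothetical:

```r
# Hypothetical example data: a factor target in the first column,
# followed by numeric feature columns.
set.seed(1)
scores_df <- data.frame(
  Class = factor(sample(c("case", "control"), 40, replace = TRUE)),
  matrix(rnorm(40 * 5), nrow = 40,
         dimnames = list(NULL, paste0("feature", 1:5)))
)
head(scores_df)
```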

Output

The function returns a list containing:

  • gini_scores: A matrix of Gini importance scores for each feature across the different cross-validation iterations. The matrix has rows representing features and columns representing test iterations.
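
For example, with the 5-feature scores_df above and testReps = 5, gini_scores would be a 5 x 5 matrix. A quick way to inspect it (a sketch, assuming result holds the returned list):

```r
# Rows are features, columns are test repetitions
dim(result$gini_scores)       # e.g. 5 5 for 5 features and testReps = 5
rownames(result$gini_scores)  # feature names taken from the columns of Data
```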

Key Steps

  1. Initialize Metrics: The function begins by defining several empty vectors for performance metrics such as Sensitivity, Specificity, PPV, and NPV; these are initialized but not used in the current version.

  2. Prepare Data: The function prepares the data by renaming the columns of the input Data for consistency and initializing a new data frame (rfTestData) to store prediction results across iterations.

  3. Cross-Validation Setup: The function sets up a cross-validation loop with test repetitions. For each repetition, it selects a random subset of the data for testing and uses the rest for training. Optionally, under-sampling can be applied to balance the training set (see the sketch after this list).

  4. Model Training: A Random Forest model is trained on the training data in each iteration using the randomForest package. It uses the specified value for best.m to control the number of variables considered at each split.

  5. Calculate Gini Importance: After training the model, Gini importance scores are calculated for each feature using the randomForest::importance function. The Gini scores are aggregated across all test repetitions.

  6. Aggregate and Sort Importance Scores: After completing the cross-validation iterations, the mean Gini importance scores for each feature are calculated and sorted in decreasing order.

  7. Plot Feature Importance: A dotchart is generated to visualize the top nTopImportance features based on their importance scores.

  8. Return Results: The function returns a list containing the Gini importance scores across all iterations.
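
For reference, the following is a condensed sketch of the loop described in steps 3 to 7. It is not the package's implementation: the 70/30 split, the use of caret::downSample for under-sampling, and the helper name run_rf_importance_cv are assumptions made for illustration.

```r
library(randomForest)

# Condensed sketch of steps 3 to 7 (illustrative only; the 70/30 split and the
# caret::downSample call are assumptions, not taken from the package source).
run_rf_importance_cv <- function(Data, Undersample = FALSE, best.m = NULL,
                                 testReps = 5, nTopImportance = 10) {
  target <- names(Data)[1]
  gini_scores <- matrix(NA_real_, nrow = ncol(Data) - 1, ncol = testReps,
                        dimnames = list(names(Data)[-1], NULL))

  for (i in seq_len(testReps)) {
    # Step 3: random train/test split for this repetition; the held-out rows
    # are not used further here, mirroring the current version
    test_idx <- sample(nrow(Data), size = floor(0.3 * nrow(Data)))
    train    <- Data[-test_idx, ]

    # Optional under-sampling to balance the classes in the training data
    if (Undersample) {
      train <- caret::downSample(x = train[, -1], y = train[[1]], yname = target)
      train <- train[, c(target, setdiff(names(train), target))]
    }

    # Step 4: fit the Random Forest, passing best.m as mtry when supplied
    rf <- if (is.null(best.m)) {
      randomForest(as.formula(paste(target, "~ .")), data = train)
    } else {
      randomForest(as.formula(paste(target, "~ .")), data = train, mtry = best.m)
    }

    # Step 5: Mean Decrease Gini for every feature in this repetition
    gini_scores[, i] <- randomForest::importance(rf, type = 2)[rownames(gini_scores), 1]
  }

  # Steps 6 and 7: average across repetitions, sort, and plot the top features
  mean_gini <- sort(rowMeans(gini_scores), decreasing = TRUE)
  dotchart(rev(head(mean_gini, nTopImportance)), xlab = "Mean Decrease Gini")

  # Step 8: return the per-repetition Gini scores
  list(gini_scores = gini_scores)
}
```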

Example

```r
# Example of how to call the function
result <- get_imp_features_from_rf_model_with_cv(
  Data = scores_df,
  Undersample = FALSE,
  best.m = 3,
  testReps = 5,
  Type = 2,
  nTopImportance = 10
)
```
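
The returned scores can then be summarized across repetitions; the rowMeans aggregation below is one reasonable choice, not something prescribed by the function itself:

```r
# Average Gini importance per feature across the repetitions,
# then keep the names of the 10 most important features.
mean_gini    <- sort(rowMeans(result$gini_scores), decreasing = TRUE)
top_features <- names(mean_gini)[1:10]
top_features
```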