Perform Cross-Validation with Random Forest and Feature Importance Calculation

This function runs cross-validation on a Random Forest model, tracks performance metrics (sensitivity, specificity, accuracy), handles indeterminate predictions, and computes feature importance as either Mean Decrease Accuracy or Mean Decrease Gini. It returns performance summaries and feature importance rankings aggregated over the specified number of test repetitions.

Usage

get_rf_model_output_cv_imp(
  scores_data_df,
  Undersample = FALSE,
  best.m = NULL,
  testReps,
  indeterminateUpper,
  indeterminateLower,
  Type,
  nTopImportance
)

Arguments

Undersample: A logical flag indicating whether to apply undersampling to the training data. Defaults to FALSE.
best.m: A numeric value representing the number of features to sample for the Random Forest model, or NULL to calculate it automatically.
testReps: An integer specifying the number of repetitions for cross-validation. Must be at least 2.
indeterminateUpper: A numeric threshold marking the upper bound of the indeterminate range. Predictions above it are not considered indeterminate.
indeterminateLower: A numeric threshold marking the lower bound of the indeterminate range. Predictions below it are not considered indeterminate.
Type: An integer specifying the type of importance to compute. 1 for MeanDecreaseAccuracy, 2 for MeanDecreaseGini.
nTopImportance: An integer specifying the number of top features to display based on their importance scores.
scores_data_df: A data frame containing the features and target variable for training and testing the model.
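
The two indeterminate thresholds can be read as the edges of a band of predicted probabilities in which no class call is made. The sketch below illustrates that reading in base R. The helper name mark_indeterminate, the class labels, and the strict comparisons at the band edges are assumptions for illustration, not the package's actual implementation.

```r
# Hypothetical helper, not part of the package: turn predicted probabilities
# into class labels, leaving anything inside the indeterminate band as NA.
mark_indeterminate <- function(prob, lower, upper) {
  stopifnot(is.numeric(prob), lower <= upper)
  label <- rep(NA_character_, length(prob))
  label[prob > upper] <- "positive"  # assumed: strictly above the upper bound
  label[prob < lower] <- "negative"  # assumed: strictly below the lower bound
  label
}

probs <- c(0.05, 0.20, 0.50, 0.80, 0.95)
calls <- mark_indeterminate(probs, lower = 0.2, upper = 0.8)
print(calls)
```

Under these assumptions, values exactly on a threshold stay indeterminate. Whether the real function treats the edges as inclusive is not stated here.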

Value

A list with the following elements:

performance_metrics: A vector of aggregated performance metrics (e.g., sensitivity, specificity, accuracy).
feature_importance: A matrix containing the importance of the top nTopImportance features, ordered by their importance score.
raw_results: A list containing raw results for debugging or further analysis, including sensitivity, specificity, accuracy, and Gini scores across all test repetitions.
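
As a rough illustration of how metrics like these can be computed when some predictions are NA, the base R sketch below drops indeterminate calls before counting. The helper name summarise_calls, the "positive"/"negative" labels, and the choice to exclude NA calls from the denominators are all assumptions. The function documented here may aggregate differently.

```r
# Hypothetical helper: sensitivity, specificity and accuracy from class calls,
# ignoring indeterminate (NA) calls. Assumes labels "positive" / "negative".
summarise_calls <- function(truth, pred) {
  keep <- !is.na(pred)          # determinate calls only
  truth <- truth[keep]
  pred <- pred[keep]
  tp <- sum(pred == "positive" & truth == "positive")
  tn <- sum(pred == "negative" & truth == "negative")
  fp <- sum(pred == "positive" & truth == "negative")
  fn <- sum(pred == "negative" & truth == "positive")
  c(sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    accuracy = (tp + tn) / length(pred),
    indeterminate_rate = mean(!keep))
}

truth <- c("positive", "positive", "positive", "negative", "negative", "negative")
pred  <- c("positive", NA, "negative", "negative", "negative", "positive")
metrics <- summarise_calls(truth, pred)
print(metrics)
```

Reporting the indeterminate rate alongside the other metrics makes it visible how many samples were excluded from them.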

Details

The function splits the input data into training and testing sets according to the specified number of test repetitions (testReps). In each iteration, it trains a Random Forest model and makes predictions on the test data. Predictions that fall between indeterminateLower and indeterminateUpper are treated as indeterminate and marked as NA. The function tracks sensitivity, specificity, and accuracy across iterations, and reports the top nTopImportance features ranked by either Mean Decrease Accuracy (Type = 1) or Mean Decrease Gini (Type = 2).
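
The loop described above might look roughly like the following sketch, which uses the randomForest package. The function name cv_rf_sketch, the target column name, the random fold assignment, and the fallback of mtry to floor(sqrt(p)) (randomForest's own classification default) are assumptions made for illustration. The package's real splitting and aggregation logic may differ.

```r
# Sketch only: k-fold loop around randomForest with an indeterminate band.
# `cv_rf_sketch`, the column name `target`, and the fold scheme are assumptions.
cv_rf_sketch <- function(df, k, lower, upper, type = 1, mtry = NULL) {
  folds <- sample(rep(seq_len(k), length.out = nrow(df)))  # random fold labels
  x_cols <- setdiff(names(df), "target")
  if (is.null(mtry)) mtry <- max(floor(sqrt(length(x_cols))), 1)
  neg <- levels(df$target)[1]
  pos <- levels(df$target)[2]   # second factor level is treated as positive
  acc <- numeric(k)
  imp <- matrix(0, length(x_cols), k, dimnames = list(x_cols, NULL))
  for (i in seq_len(k)) {
    train <- df[folds != i, , drop = FALSE]
    test <- df[folds == i, , drop = FALSE]
    fit <- randomForest::randomForest(x = train[x_cols], y = train$target,
                                      mtry = mtry, importance = TRUE)
    p <- predict(fit, test[x_cols], type = "prob")[, pos]
    # Probabilities inside [lower, upper] become NA (indeterminate)
    label <- ifelse(p > upper, pos, ifelse(p < lower, neg, NA))
    acc[i] <- mean(label == as.character(test$target), na.rm = TRUE)
    # type = 1: MeanDecreaseAccuracy, type = 2: MeanDecreaseGini
    imp[, i] <- randomForest::importance(fit, type = type)[x_cols, 1]
  }
  list(accuracy = mean(acc, na.rm = TRUE),
       importance = sort(rowMeans(imp), decreasing = TRUE))
}

# Usage on a two-class subset of iris, run only if randomForest is installed
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(1)
  df <- iris[iris$Species != "setosa", ]
  df$target <- droplevels(df$Species)
  df$Species <- NULL
  res <- cv_rf_sketch(df, k = 5, lower = 0.2, upper = 0.8, type = 1)
  print(res)
}
```

Averaging importance across folds before ranking is one simple way to get a stable ordering. Taking the first nTopImportance entries of the sorted vector would then give the top features.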

Examples

if (FALSE) { # \dontrun{
#Example usage of the function
result <- get_rf_model_output_cv_imp(
  scores_data_df = your_data,
  Undersample = FALSE,
  best.m = 3,
  testReps = 5,
  indeterminateUpper = 0.8,
  indeterminateLower = 0.2,
  Type = 1,
  nTopImportance = 10
)

#View performance metrics
print(result$performance_metrics)

#View top features by importance
print(result$feature_importance)
} # }