This function prepares data for training a Random Forest (RF) model with cross-validation, optionally imputes missing values and tunes hyperparameters, and evaluates the model's performance. It supports both real and fake study data, with options for filtering to rat studies, applying error correction, and selecting features by importance.

Usage

get_rf_input_param_list_output_cv_imp(
  path_db,
  rat_studies = FALSE,
  studyid_metadata,
  fake_study = FALSE,
  use_xpt_file = FALSE,
  Round = FALSE,
  Impute = FALSE,
  reps,
  holdback,
  Undersample = FALSE,
  hyperparameter_tuning = FALSE,
  error_correction_method,
  best.m = NULL,
  testReps,
  indeterminateUpper,
  indeterminateLower,
  Type,
  nTopImportance
)

Arguments

path_db

A character string specifying the path to the SQLite database or directory containing the XPT file.

rat_studies

A logical value indicating whether to filter for rat studies. Default is FALSE.

studyid_metadata

A data frame containing metadata for the studies.

fake_study

A logical value indicating whether to use fake study data. Default is FALSE.

use_xpt_file

A logical value indicating whether to use XPT file data. Default is FALSE.

Round

A logical value indicating whether to round the liver scores. Default is FALSE.

Impute

A logical value indicating whether to impute missing values. Default is FALSE.

reps

An integer specifying the number of repetitions for model evaluation.

holdback

A numeric value specifying the proportion of data to hold back for validation.

Undersample

A logical value indicating whether to undersample the data to balance classes. Default is FALSE.

hyperparameter_tuning

A logical value indicating whether to tune the Random Forest model's hyperparameters. Default is FALSE.

error_correction_method

A character string specifying the error correction method. Options are 'Flip', 'Prune', or 'None'.

best.m

A numeric value specifying the number of trees in the Random Forest model. If NULL, the function determines this automatically.

testReps

An integer specifying the number of test repetitions for model evaluation.

indeterminateUpper

A numeric value for the upper threshold of indeterminate predictions.

indeterminateLower

A numeric value for the lower threshold of indeterminate predictions.

Type

A character string specifying the type of Random Forest model to use. Options include 'classification' and 'regression'.

nTopImportance

An integer specifying the number of top important features to consider for the model.
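The role of indeterminateUpper and indeterminateLower can be illustrated with a small base R sketch. It assumes that a predicted probability falling strictly between the two thresholds is treated as indeterminate; the package's actual rule may differ, and classify_with_band is a hypothetical helper, not part of the package.

```r
# Hypothetical helper (not part of the package): label predicted probabilities
# using an indeterminate band.
# Assumption: values strictly between `lower` and `upper` are indeterminate.
classify_with_band <- function(prob, lower = 0.1, upper = 0.9) {
  stopifnot(is.numeric(prob), lower <= upper)
  ifelse(prob >= upper, "positive",
         ifelse(prob <= lower, "negative", "indeterminate"))
}

# Illustrative probabilities only; these are not model output.
probs <- c(0.05, 0.40, 0.95)
labels <- classify_with_band(probs, lower = 0.1, upper = 0.9)
print(labels)
```

Widening the band (for example, lower = 0.3 and upper = 0.7 swapped for 0.1 and 0.9) leaves more predictions unlabelled, which typically trades coverage for confidence in the remaining calls.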

Value

A list containing the trained Random Forest model, cross-validation results, and feature importance scores. The list is returned by the get_rf_model_with_cv function.
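To show how feature importance scores and nTopImportance might interact, here is a minimal base R sketch. The importance values and feature names are made up for illustration, and top_features is a hypothetical helper; the structure of the list actually returned by the function is not documented here and is not assumed.

```r
# Made-up importance scores for illustration; names and values are hypothetical.
importance <- c(feature_1 = 12.4, feature_2 = 9.1, feature_3 = 3.3, feature_4 = 7.8)

# Hypothetical helper: return the names of the n highest-scoring features.
top_features <- function(imp, n) {
  stopifnot(is.numeric(imp), !is.null(names(imp)), n >= 1)
  ranked <- sort(imp, decreasing = TRUE)
  names(ranked)[seq_len(min(n, length(ranked)))]
}

selected <- top_features(importance, n = 2)
print(selected)
```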

Details

The function performs the following steps:

  • Fetches the study data based on the specified parameters.

  • Calculates liver scores and harmonizes the data.

  • Prepares data for machine learning, including imputation and optional hyperparameter tuning.

  • Trains and evaluates the Random Forest model with cross-validation.

  • Applies error correction (if specified) and selects the most important features.
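The resampling part of these steps can be sketched in base R. This is only an illustration under stated assumptions: holdback is taken to be the fraction of rows set aside for testing, Undersample is taken to mean reducing every class in the training rows to the size of the smallest class, and reps is taken to be the number of independent splits. The toy data frame, its column names, and split_once are hypothetical and do not reflect the package's internals.

```r
set.seed(1)

# Toy data standing in for harmonized study features (hypothetical columns).
toy <- data.frame(
  target = factor(c(rep("1", 30), rep("0", 70))),
  feature_a = rnorm(100),
  feature_b = rnorm(100)
)

# Hypothetical helper: one holdback split, with optional undersampling of the
# training rows so every class has as many rows as the smallest class.
split_once <- function(data, holdback = 0.2, undersample = FALSE) {
  n <- nrow(data)
  n_test <- round(holdback * n)
  stopifnot(n_test >= 1, n_test < n)
  test_idx <- sample(n, size = n_test)
  train <- data[-test_idx, , drop = FALSE]
  test <- data[test_idx, , drop = FALSE]
  if (undersample) {
    n_min <- min(table(train$target))
    rows_by_class <- split(seq_len(nrow(train)), train$target)
    keep <- unlist(
      lapply(rows_by_class, function(idx) idx[sample.int(length(idx), n_min)]),
      use.names = FALSE
    )
    train <- train[keep, , drop = FALSE]
  }
  list(train = train, test = test)
}

# Ten independent splits, analogous to reps = 10 and holdback = 0.2.
splits <- replicate(10, split_once(toy, holdback = 0.2, undersample = TRUE),
                    simplify = FALSE)
print(table(splits[[1]]$train$target))
```

Each element of splits holds a balanced training set and an untouched test set, so a model could be fitted on train and scored on test once per repetition.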

Examples

if (FALSE) { # \dontrun{
# Example usage of the function
result <- get_rf_input_param_list_output_cv_imp(
  path_db = "path/to/database",
  rat_studies = TRUE,
  studyid_metadata = metadata_df,
  fake_study = FALSE,
  use_xpt_file = FALSE,
  Round = TRUE,
  Impute = TRUE,
  reps = 10,
  holdback = 0.2,
  Undersample = TRUE,
  hyperparameter_tuning = TRUE,
  error_correction_method = "Flip",
  best.m = NULL,
  testReps = 5,
  indeterminateUpper = 0.9,
  indeterminateLower = 0.1,
  Type = "classification",
  nTopImportance = 10
)
} # }