Documentation for `get_Data_formatted_for_ml_and_best.m` Function
Source: vignettes/get_Data_formatted_for_ml_and_best.m.Rmd
Purpose
The function `get_Data_formatted_for_ml_and_best.m` is designed to
retrieve and preprocess data for machine learning (ML)
models from a given SQLite database or XPT file. It performs several
tasks such as fetching study IDs, retrieving study metadata, calculating
liver toxicity scores, and tuning hyperparameters for ML models. The
final output is a list containing processed data ready for machine
learning and the best model.
Input Parameters
Parameter | Description | Type | Default Value |
---|---|---|---|
`path_db` | Path to the SQLite database or XPT file location. | character | None |
`rat_studies` | Flag to filter for rat studies. | logical | FALSE |
`studyid_metadata` | Optional metadata for study IDs. If NULL, will be generated. | data.frame | NULL |
`fake_study` | Flag to use fake study data. | logical | FALSE |
`use_xpt_file` | Flag to indicate whether to use an XPT file instead of SQLite database. | logical | FALSE |
`Round` | Flag to round liver toxicity scores. | logical | FALSE |
`Impute` | Flag to impute missing values in the dataset. | logical | FALSE |
`reps` | Number of repetitions for cross-validation. | integer | None |
`holdback` | Fraction of data to hold back for validation. | numeric | None |
`Undersample` | Flag to undersample the majority class. | logical | FALSE |
`hyperparameter_tuning` | Flag to perform hyperparameter tuning for the model. | logical | FALSE |
`error_correction_method` | Method to handle error correction. Must be one of ‘Flip’, ‘Prune’, or ‘None’. | character | None |
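The restriction on `error_correction_method` in the table can be checked up front, before any data is read. The snippet below is a minimal sketch of such a check, not the package's own code: `check_error_correction_method` is a hypothetical helper name, and it uses only base R's `match.arg`.

```r
# Hypothetical helper (not part of the documented function): confirms that
# error_correction_method is one of the three values listed in the table.
check_error_correction_method <- function(error_correction_method) {
  # match.arg() returns the matched choice and stops with an error when the
  # value matches none of the choices. Note that it also accepts unambiguous
  # prefixes, so "Pr" would be resolved to "Prune".
  match.arg(error_correction_method, choices = c("Flip", "Prune", "None"))
}

method <- check_error_correction_method("Prune")  # accepted
# check_error_correction_method("Drop")           # would stop with an error
```

Failing early on an unsupported value avoids running the slower database and scoring steps with an option that cannot be honored.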
Output
The function returns a list with the following elements:
- `Data`: A data frame containing the preprocessed data ready for machine learning.
- `best.m`: The best machine learning model after hyperparameter tuning, if applicable.
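A call and the handling of its result might look like the sketch below. The argument names come from the parameter table; the database path and the values chosen for `reps`, `holdback`, and `error_correction_method` are placeholders for illustration only, not recommended settings. The call is guarded so the snippet also runs in a session where the function has not been loaded.

```r
# Argument names follow the parameter table; every value here is an
# illustrative placeholder.
args <- list(
  path_db = "path/to/studies.db",      # hypothetical SQLite file
  rat_studies = FALSE,
  studyid_metadata = NULL,             # let the function generate metadata
  fake_study = FALSE,
  use_xpt_file = FALSE,
  Round = TRUE,
  Impute = TRUE,
  reps = 1L,                           # assumed value
  holdback = 0.25,                     # assumed value
  Undersample = TRUE,
  hyperparameter_tuning = FALSE,
  error_correction_method = "None"
)

fn_name <- "get_Data_formatted_for_ml_and_best.m"

# Only call the function when it is actually available in this session.
if (exists(fn_name, mode = "function")) {
  result <- do.call(fn_name, args)
  str(result$Data)   # the preprocessed data frame
  result$best.m      # the best model, if tuning was requested
}
```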
Key Steps
- Fetch Study IDs:
  - If `use_xpt_file` is `TRUE`, it retrieves study IDs from directories within the specified path.
  - If `use_xpt_file` is `FALSE` and `fake_study` is `TRUE`, the function connects to an SQLite database and retrieves the study IDs from the ‘dm’ table.
  - If `fake_study` is `FALSE`, it fetches repeat-dose and parallel study IDs from the database.
- Process Study Metadata:
  - If `studyid_metadata` is not provided, it generates metadata by selecting unique study IDs and assigning random “Target_Organ” values (either “Liver” or “not_Liver”).
- Calculate Liver Toxicity Scores:
  - The function calculates liver toxicity scores using the `get_liver_om_lb_mi_tox_score_list` function.
- Harmonize Scores:
  - The calculated liver toxicity scores are harmonized using the `get_col_harmonized_scores_df` function, optionally rounding them based on the `Round` parameter.
- Machine Learning Data Preparation:
  - The function prepares the data for machine learning and performs hyperparameter tuning (if `hyperparameter_tuning` is `TRUE`) using the `get_ml_data_and_tuned_hyperparameters` function.
- Return Processed Data and Best Model:
  - The final output consists of the processed data and the best machine learning model (`best.m`).
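
Two of the steps above, reading study IDs from directories and generating placeholder metadata, are simple enough to sketch in base R. This is an illustration under stated assumptions rather than the function's actual source: the helper names, the `STUDYID` column name, and the one-folder-per-study layout are all assumptions.

```r
# Hypothetical helper: with use_xpt_file = TRUE, treat each immediate
# subdirectory of path_db as one study and use its name as the study ID.
list_xpt_study_ids <- function(path_db) {
  list.dirs(path_db, full.names = FALSE, recursive = FALSE)
}

# Hypothetical helper: build placeholder metadata when studyid_metadata is
# NULL. Labels are drawn at random, so they carry no real toxicity information.
make_random_metadata <- function(study_ids) {
  ids <- unique(study_ids)
  data.frame(
    STUDYID = ids,  # assumed column name
    Target_Organ = sample(c("Liver", "not_Liver"), length(ids), replace = TRUE),
    stringsAsFactors = FALSE
  )
}

# Demonstration on a temporary directory holding two made-up study folders.
demo_root <- file.path(tempdir(), "xpt_demo")
dir.create(file.path(demo_root, "STUDY001"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(demo_root, "STUDY002"), recursive = TRUE, showWarnings = FALSE)

set.seed(1)  # only to make the random labels reproducible
meta <- make_random_metadata(list_xpt_study_ids(demo_root))
print(meta)
```

Because the generated `Target_Organ` labels are random, a model trained on them is only useful for exercising the pipeline; supplying real labels through `studyid_metadata` is what makes the result meaningful.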