SENDQSAR: QSAR Modeling with SEND Database
About
This package facilitates developing Quantitative Structure-Activity Relationship (QSAR) models using the SEND database. It streamlines data acquisition, preprocessing, descriptor calculation, and model evaluation, enabling researchers to efficiently explore molecular descriptors and create robust predictive models.
Features
- Automated Data Processing: Simplifies data acquisition and preprocessing steps.
- Comprehensive Analysis: Provides z-score calculations for various parameters such as body weight, liver-to-body weight ratio, and laboratory tests.
- Machine Learning Integration: Supports classification modeling, hyperparameter tuning, and performance evaluation.
- Visualization Tools: Includes histograms, bar plots, and AUC curves for better data interpretation.
Functions Overview
Data Acquisition and Processing
-
get_compile_data
- Fetches data from the database specified by the database path into a structured data frame for analysis. -
get_bw_score
- Calculates body weight (BW) z-scores for each animal. -
get_livertobw_zscore
- Computes liver-to-body weight z-scores. -
get_lb_score
- Calculates z-scores for laboratory test (LB) results. -
get_mi_score
- Computes z-scores for microscopic findings (MI). -
get_liver_om_lb_mi_tox_score_list
- Combines z-scores of LB, MI, and liver-to-BW into a single data frame. -
get_col_harmonized_scores_df
- Harmonizes column names across studies.
Machine Learning Preparation and Modeling
-
get_ml_data_and_tuned_hyperparameters
- Prepares data and tunes hyperparameters for machine learning. -
get_rf_model_with_cv
- Builds a random forest model with cross-validation and outputs performance metrics. -
get_zone_exclusioned_rf_model_with_cv
- Introduces an indeterminate zone for improved classification accuracy. -
get_imp_features_from_rf_model_with_cv
- Computes feature importance for model interpretation. -
get_auc_curve_with_rf_model
- Generates AUC curves to evaluate model performance.
Visualization and Reporting
-
get_histogram_barplot
- Creates bar plots for target variable classes. -
get_reprtree_from_rf_model
- Builds representative decision trees for interpretability. -
get_prediction_plot
- Visualizes prediction probabilities with histograms.
Automated Pipelines
-
get_Data_formatted_for_ml_and_best.m
- Formats data for machine learning pipelines. -
get_rf_input_param_list_output_cv_imp
- Automates preprocessing, modeling, and evaluation in one step. -
get_zone_exclusioned_rf_model_cv_imp
- Similar to the above function, but excludes uncertain predictions based on thresholds.
Workflow
- Input Database Path: Provide the database path containing nonclinical study results for each STUDYID.
- Preprocessing: Use functions 1-8 to clean, harmonize, and prepare data.
- Model Building: Employ machine learning functions (9-18) for training, validation, and evaluation.
- Visualization: Generate plots and performance metrics for better interpretation.
Installation
# Install from GitHub
devtools::install_github("aminuldu07/SENDQSAR")
Examples
Example 1: Basic Data Compilation
library(SENDQSAR)
data <- get_compile_data("/path/to/database")
Example 2: Z-Score Calculation
bw_scores <- get_bw_score(data)
liver_scores <- get_livertobw_zscore(data)
Example 3: Machine Learning Model
model <- get_rf_model_with_cv(data, n_repeats=10)
print(model$confusion_matrix)
Example 4: Visualization
get_histogram_barplot(data, target_col="target_variable")
Contact
For more information, visit the project GitHub Page or contact email@example.com.