SENDQSAR: An R Package for QSAR Modeling with the SEND Database
About
- This package facilitates the development of Quantitative Structure-Activity Relationship (QSAR) models using the SEND (Standard for Exchange of Nonclinical Data) database. It streamlines data acquisition, pre-processing, organ-wise toxicity score calculation, descriptor calculation, and model evaluation, so that researchers can explore molecular descriptors and build predictive models efficiently.
- Detailed descriptions of each function are available in the “Articles” section of the GitHub-hosted website.
Features
- Automated Data Processing: Simplifies data acquisition and pre-processing steps.
- Comprehensive Analysis: Provides z-score calculations for various parameters such as body weight, liver-to-body weight ratio, and laboratory tests.
- Machine Learning Integration: Supports classification modeling, hyperparameter tuning, and performance evaluation.
- Visualization Tools: Includes, but is not limited to, histograms, bar plots, and AUC curves for better data interpretation.
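As an illustration of the z-score calculations listed above, here is a minimal base R sketch. It assumes that z-scores are expressed relative to the control group's mean and standard deviation; the data frame and its column names (USUBJID, ARM, BW) are hypothetical, and the package's own functions may compute the scores differently.

```r
# Toy body-weight data; the column names are hypothetical, not the package's.
bw <- data.frame(
  USUBJID = paste0("A", 1:6),
  ARM     = c("Control", "Control", "Control", "High", "High", "High"),
  BW      = c(300, 310, 290, 250, 260, 240),
  stringsAsFactors = FALSE
)

# Assumption: reference statistics come from the control animals only.
ctrl    <- bw$BW[bw$ARM == "Control"]
ctrl_mu <- mean(ctrl)
ctrl_sd <- sd(ctrl)

# z-score: distance of each animal from the control mean, in control SD units.
bw$BW_zscore <- (bw$BW - ctrl_mu) / ctrl_sd
print(bw)
```

In this toy data the treated animals sit several control standard deviations below the control mean, which is the kind of signal a body-weight score is meant to capture.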
Workflow
- Input Database Path: Provide the path to the database or .xpt files containing nonclinical study data in SEND format.
- Data Pre-processing: Use functions f1 to f8 to clean, harmonize, and prepare data for machine learning (ML).
- Model Building: Use the ML functions (f9 to f18) for model training and evaluation.
- Visualization: Generate plots and performance metrics for better interpretation (f12 to f15).
- Automated Pipelines: Use functions f16 to f18 to perform the above workflows in a single step by providing the database path and a .csv file containing the label (TOXIC/NON-TOXIC) of each STUDYID.
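The label file mentioned in the last step can be prepared in base R. The sketch below is only an assumption about its layout: the column names STUDYID and LABEL, the file name, and the study identifiers are all hypothetical, so check the function documentation for the names the package actually expects.

```r
# Hypothetical study labels; column names and IDs are made up for illustration.
labels <- data.frame(
  STUDYID = c("0123456", "2345678", "3456789"),
  LABEL   = c("TOXIC", "NON-TOXIC", "TOXIC"),
  stringsAsFactors = FALSE
)

# Write the file to a temporary location.
csv_path <- file.path(tempdir(), "study_labels.csv")
write.csv(labels, csv_path, row.names = FALSE)

# Read it back as character so that leading zeros in STUDYID are preserved.
labels_in <- read.csv(csv_path, colClasses = "character")

# Basic sanity checks before handing the file to a pipeline.
stopifnot(all(labels_in$LABEL %in% c("TOXIC", "NON-TOXIC")))
stopifnot(!anyDuplicated(labels_in$STUDYID))
print(table(labels_in$LABEL))
```

Reading STUDYID as character avoids silently converting identifiers to numbers, which would break the match against the study IDs in the database.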
Modular Functions Overview
- Liver Toxicity Score Calculation for Individual STUDYID:
  - f1: get_compile_data - Fetches structured data from the specified database path.
  - f2: get_bw_score - Calculates body weight z-scores for each animal (depends on f1).
  - f3: get_livertobw_zscore - Computes liver-to-body weight z-scores (depends on f1).
  - f4: get_lb_score - Calculates z-scores for laboratory test results (depends on f1).
  - f5: get_mi_score - Computes z-scores for microscopic findings (depends on f1).
- Liver Toxicity Score Calculation and Aggregation for Multiple STUDYID:
  - f6: get_liver_om_lb_mi_tox_score_list - Combines z-scores for LB, MI, and liver-to-BW ratio into a single data frame. Internally calls f1 to f5.
- Machine Learning Data Preparation:
  - f7: get_col_harmonized_scores_df - Harmonizes the column names of the score data frame for consistency (depends on f6).
  - f8: get_ml_data_and_tuned_hyperparameters - Prepares data and tunes hyperparameters for machine learning (depends on f7).
- Machine Learning Model Building and Performance Evaluation:
  - Model Training
    - f9: get_rf_model_with_cv - Builds a random forest model with cross-validation (depends on f8).
  - Improved Classification Accuracy
    - f10: get_zone_exclusioned_rf_model_with_cv - Enhances classification accuracy by excluding uncertain predictions (depends on f8).
  - Feature Importance
    - f11: get_imp_features_from_rf_model_with_cv - Computes feature importance for model interpretation.
  - Model Performance Visualization
    - f12: get_auc_curve_with_rf_model - Generates AUC curves to evaluate model performance.
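The idea behind f10, excluding uncertain predictions, can be sketched in base R without fitting a model. This is an illustration under stated assumptions, not the package's implementation: the predicted probabilities, the true labels, and the zone bounds (lower, upper) are all made up.

```r
# Hypothetical predicted probabilities of the TOXIC class and true labels.
pred <- data.frame(
  STUDYID    = paste0("S", 1:8),
  prob_toxic = c(0.95, 0.80, 0.55, 0.45, 0.30, 0.10, 0.70, 0.05),
  truth      = c("TOXIC", "TOXIC", "NON-TOXIC", "TOXIC",
                 "NON-TOXIC", "NON-TOXIC", "TOXIC", "NON-TOXIC"),
  stringsAsFactors = FALSE
)

# Assumed bounds of the indeterminate zone; not the package defaults.
lower <- 0.40
upper <- 0.60

# Accuracy over all predictions, using a 0.5 cut-off.
pred$call <- ifelse(pred$prob_toxic >= 0.5, "TOXIC", "NON-TOXIC")
acc_all   <- mean(pred$call == pred$truth)

# Drop predictions that fall inside the zone, then re-evaluate.
confident     <- pred[pred$prob_toxic < lower | pred$prob_toxic > upper, ]
acc_confident <- mean(confident$call == confident$truth)

cat("All predictions:   n =", nrow(pred), " accuracy =", acc_all, "\n")
cat("Outside the zone:  n =", nrow(confident), " accuracy =", acc_confident, "\n")
```

The trade-off is coverage: accuracy on the retained studies rises because the two borderline calls are removed, but those studies no longer receive a prediction.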
Notes for Modular Functions
- Data Preparation
  - Functions f1 to f8 must be executed sequentially to prepare the Data argument required by f9, f10, f11, and f12.
  - Alternatively, the composite function f16 can be used to generate the Data argument directly, combining the functionality of f1 to f8.
Composite Functions Overview
Combine multiple modular functions for complex operations.
- Visualization and Reporting:
  - f13: get_histogram_barplot - Creates bar plots for target variable classes (depends on functions f1 to f8).
  - f14: get_reprtree_from_rf_model - Builds representative decision trees (depends on functions f1 to f8).
  - f15: get_prediction_plot - Visualizes prediction probabilities with histograms (depends on functions f1 to f8).
- Automated Pipelines:
  - f16: get_Data_formatted_for_ml_and_best.m
    - Creates machine learning-ready data by executing f1 to f8.
    - Formats data for ML pipelines.
    - Provides the same result as f8 by merging the functionality of f1 to f7.
  - f17: get_rf_input_param_list_output_cv_imp
    - Automates pre-processing, modeling, and evaluation.
    - Provides the same result as f9 by merging the functionality of f1 to f8.
  - f18: get_zone_exclusioned_rf_model_cv_imp
    - Similar to f17 but excludes uncertain predictions.
    - Provides the same result as f10 by merging the functionality of f1 to f8.
    - Optional argument for hyperparameter tuning.
Helper Functions
- h1: get_treatment_group_&_dose - Retrieves treatment groups from the tx domain.
- h2: get_repeat_dose_parallel_studyids
  - Retrieves STUDYIDs for repeat-dose and parallel study designs.
  - Optional filtering for “rat” species studies.
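The kind of selection h2 describes can be sketched on a toy Trial Summary (TS) table. This is a hedged illustration only: it assumes the study type, design, and species are read from the TS parameters SSTYP, SDESIGN, and SPECIES, and the helper ids_with and all study IDs are hypothetical. The package may use different logic.

```r
# Toy TS (Trial Summary) records in long format; study IDs are made up.
ts_domain <- data.frame(
  STUDYID  = rep(c("S1", "S2", "S3"), each = 3),
  TSPARMCD = rep(c("SSTYP", "SDESIGN", "SPECIES"), times = 3),
  TSVAL    = c("REPEAT DOSE TOXICITY", "PARALLEL", "RAT",
               "REPEAT DOSE TOXICITY", "PARALLEL", "DOG",
               "SINGLE DOSE TOXICITY", "PARALLEL", "RAT"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: STUDYIDs whose TS parameter `code` equals `value`.
ids_with <- function(ts_df, code, value) {
  hit <- ts_df$TSPARMCD == code & toupper(ts_df$TSVAL) == value
  unique(ts_df$STUDYID[hit])
}

repeat_dose <- ids_with(ts_domain, "SSTYP", "REPEAT DOSE TOXICITY")
parallel    <- ids_with(ts_domain, "SDESIGN", "PARALLEL")
rat         <- ids_with(ts_domain, "SPECIES", "RAT")

# Repeat-dose studies with a parallel design.
selected <- intersect(repeat_dose, parallel)

# Optional species filter.
selected_rat <- intersect(selected, rat)

print(selected)
print(selected_rat)
```

Keeping the species filter as a separate intersection mirrors the optional nature of the rat filter described above.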
Functions in Development
- fid1: get_all_LB_TESTCD_score - Calculates scores for each LBTESTCD based on get_lb_score.
- fid2: get_indiv_score_om_lb_mi_domain_df - Returns domain-specific scores including liver-to-BW ratio, LB, and MI scores.
Installation
# Install from GitHub
devtools::install_github("aminuldu07/SENDQSAR")
Examples
Example 1: Basic Data Compilation
library(SENDQSAR)
data <- get_compile_data("/path/to/database")
Example 2: Z-Score Calculation
bw_scores <- get_bw_score(data)
liver_scores <- get_livertobw_zscore(data)
Example 3: Machine Learning Model
model <- get_rf_model_with_cv(data, n_repeats=10)
print(model$confusion_matrix)
Example 4: Visualization
get_histogram_barplot(data, target_col="target_variable")
Contact
For more information, visit the project GitHub Page or contact email@example.com.