Supervised Learning
1. Supervised Learning: Developing Predictive Models
With patterns identified, supervised learning uses labeled data to build models that predict outcomes, such as disease states or treatment responses, turning insights into practical applications.
2. Data Preprocessing: Ensuring Quality
Raw genomic data requires preprocessing. Techniques such as normalization, log transformation, removal of near-zero-variance (NZV) genes, imputation of missing values, and noise filtering enhance data quality, ensuring models focus on relevant signals.
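A minimal sketch of these steps, assuming a samples-by-genes expression matrix and scikit-learn (the placeholder data and the variance cutoff are illustrative, not prescribed by the text):

```python
# Preprocessing sketch: impute, log-transform, drop NZV genes, standardize.
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer

X = np.random.rand(100, 500)           # placeholder expression matrix (samples x genes)
X[X < 0.02] = np.nan                   # simulate missing values

X = SimpleImputer(strategy="median").fit_transform(X)   # impute missing values
X = np.log1p(X)                                         # log transformation
X = VarianceThreshold(threshold=0.01).fit_transform(X)  # remove near-zero-variance genes
X = (X - X.mean(axis=0)) / X.std(axis=0)                # per-gene standardization
```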
3. Data Splitting: Testing Real Performance
3.1. Holdout Test Dataset
We split the data (e.g., 70% training, 30% testing) so the model is evaluated on samples it has never seen.
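A sketch of a stratified 70/30 split with scikit-learn (the synthetic data is a placeholder):

```python
# Holdout split sketch: 70% training, 30% testing.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 20)               # placeholder feature matrix
y = np.random.randint(0, 2, size=100)     # placeholder binary labels

# stratify=y keeps the class ratio the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)
```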
3.2. Cross-Validation (CV)
For small datasets, k-fold CV (e.g., 5-fold) repeatedly trains on k-1 folds and validates on the remaining fold, averaging the results.
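A 5-fold CV sketch; the random forest here is an illustrative model choice:

```python
# Cross-validation sketch: 5-fold CV with averaged accuracy.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X = np.random.rand(60, 20)                # small placeholder dataset
y = np.random.randint(0, 2, size=60)

scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print(scores.mean(), scores.std())        # average accuracy across the 5 folds
```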
3.3. Bootstrap Resampling
Bootstrapping samples the data with replacement and uses the out-of-bag (never-drawn) samples to estimate error; repeating the process many times makes the estimate reliable.
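A hand-rolled bootstrap sketch with out-of-bag (OOB) error estimation; the 100 repetitions and the model choice are assumptions for illustration:

```python
# Bootstrap sketch: resample with replacement, evaluate on out-of-bag samples.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(80, 20)
y = np.random.randint(0, 2, size=80)
rng = np.random.default_rng(0)

errors = []
for _ in range(100):
    idx = rng.choice(len(X), size=len(X), replace=True)   # sample with replacement
    oob = np.setdiff1d(np.arange(len(X)), idx)            # samples never drawn
    if len(oob) == 0:
        continue
    model = RandomForestClassifier(random_state=0).fit(X[idx], y[idx])
    errors.append(1 - model.score(X[oob], y[oob]))        # OOB error for this round

print(np.mean(errors))                                    # averaged for reliability
```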
4. Predictive Models
Algorithms vary by task: classification models predict discrete labels such as disease status, while regression models predict continuous outcomes such as treatment response.
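As a sketch, fitting one such classifier and producing predictions might look like this (the random forest and placeholder data are illustrative):

```python
# Model-fitting sketch: train a classifier, predict labels and probabilities.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X_train = np.random.rand(70, 20)
y_train = np.random.randint(0, 2, size=70)
X_test = np.random.rand(30, 20)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
pred = model.predict(X_test)               # predicted class labels
prob = model.predict_proba(X_test)[:, 1]   # class probabilities, useful for ROC analysis
```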
5. Model Evaluation: Assessing Performance
Confusion matrices summarize correct and incorrect classifications, while ROC curves assess how well the model discriminates between classes across decision thresholds.
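A sketch computing both metrics on held-out data (logistic regression and the synthetic data are illustrative):

```python
# Evaluation sketch: confusion matrix and ROC AUC on a held-out set.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score

X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)
X_tr, X_te, y_tr, y_te = X[:70], X[70:], y[:70], y[70:]

model = LogisticRegression().fit(X_tr, y_tr)
print(confusion_matrix(y_te, model.predict(X_te)))           # rows: true, cols: predicted
print(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))  # threshold-free AUC
```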
5.1. Handling Challenges: Addressing Real-World Issues
5.1.1. Class Imbalance
- Sampling: Down-sample majority or up-sample minority (e.g., SMOTE).
- Case Weights: Assign higher weights to minority class.
- Threshold Adjustment: Optimize the decision threshold on the ROC curve.

Genomic data often has class imbalance (e.g., many more healthy samples than diseased ones) or correlated predictors (e.g., genes in the same pathway). Sampling, case weights, or dimension reduction such as PCA address these issues; a sketch of the first two follows below.
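A sketch of up-sampling and case weights with scikit-learn (SMOTE itself is provided by the separate imbalanced-learn package; the data here is synthetic):

```python
# Imbalance sketch: up-sample the minority class, or weight classes instead.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

X = np.random.rand(100, 5)
y = np.array([0] * 90 + [1] * 10)          # 90/10 imbalance, e.g., healthy vs. diseased

# Option 1: up-sample the minority class to match the majority
X_min, X_maj = X[y == 1], X[y == 0]
X_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)
X_bal = np.vstack([X_maj, X_up])
y_bal = np.array([0] * len(X_maj) + [1] * len(X_up))

# Option 2: weight classes inversely to their frequency
model = LogisticRegression(class_weight="balanced").fit(X, y)
```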
6. Overall Purpose
Unlike unsupervised learning’s exploratory focus, supervised learning uses labeled data for prediction, transforming patterns into practical applications. Each step, from data splitting to regularization, improves generalization; the workflow applies to both classification and regression, offering robust tools for genomic research.