Unsupervised Learning
1. Unsupervised Learning: Why Explore the Data?
With a statistical foundation, we know our data’s basics, but we’re curious: Are there hidden patterns? For example, do some genes cluster together, hinting at shared roles? Unsupervised learning explores the data further by detecting inherent patterns without predefined labels. This step bridges initial data understanding to more focused investigations.
2. Distance Metrics: Defining Similarity
To group data, we need a similarity rule. Clustering hinges on measuring similarity or dissimilarity between data points, which relies on distance metrics. The choice of metric shapes the clustering outcome, making it a critical decision. Common distance metrics include: Euclidean Distance, Manhattan Distance, Correlation Distance, Cosine Distance, and Jaccard Distance. You can choose an appropriate metric based on your data type, data characteristics, and goal.
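As a quick illustration, here is a minimal sketch of how the same pair of profiles yields different distances under each metric, using SciPy; the toy vectors are made up purely for demonstration.

```python
# Minimal sketch: one pair of toy expression profiles measured with
# several distance metrics (scipy.spatial.distance).
import numpy as np
from scipy.spatial import distance

gene_a = np.array([2.0, 4.0, 6.0, 8.0])   # toy expression profile
gene_b = np.array([1.0, 3.0, 5.0, 9.0])

print("Euclidean:  ", distance.euclidean(gene_a, gene_b))
print("Manhattan:  ", distance.cityblock(gene_a, gene_b))
print("Correlation:", distance.correlation(gene_a, gene_b))  # 1 - Pearson correlation
print("Cosine:     ", distance.cosine(gene_a, gene_b))       # 1 - cosine similarity

# Jaccard distance applies to binary vectors (e.g., presence/absence calls).
present_a = np.array([1, 0, 1, 1], dtype=bool)
present_b = np.array([1, 1, 0, 1], dtype=bool)
print("Jaccard:    ", distance.jaccard(present_a, present_b))
```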
3. Clustering: Spotting Natural Groups
3.1. Hierarchical Clustering
Hierarchical clustering builds a tree structure (a dendrogram) that reveals nested relationships, such as patient subtypes, making it great for exploring how groups relate to one another.
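A minimal sketch of this idea on simulated data, assuming SciPy and matplotlib are available; the two "subtype" groups and the choice of average linkage are illustrative, not prescriptive.

```python
# Minimal sketch: agglomerative clustering and a dendrogram with SciPy.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (10, 5)),    # simulated "subtype A" samples
               rng.normal(3, 1, (10, 5))])   # simulated "subtype B" samples

Z = linkage(X, method="average", metric="euclidean")  # build the tree bottom-up
labels = fcluster(Z, t=2, criterion="maxclust")       # cut the tree into 2 groups

dendrogram(Z)
plt.title("Toy dendrogram")
plt.show()
```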
3.2. K-means Clustering
K-means partitions the data into k clusters by assigning each point to the nearest centroid and minimizing within-cluster distances; it scales well to big datasets.
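For comparison, a minimal K-means sketch with scikit-learn on similar toy data; the value of k, the random seed, and the simulated groups are all illustrative choices.

```python
# Minimal sketch: K-means with scikit-learn.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 5)),   # two simulated groups
               rng.normal(4, 1, (50, 5))])

km = KMeans(n_clusters=2, n_init=10, random_state=0)  # k must be chosen up front
labels = km.fit_predict(X)                            # one cluster label per sample
print(km.cluster_centers_.shape)                      # (2, 5): one centroid per cluster
```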
3.3. Hierarchical Clustering vs. K-means
Hierarchical Clustering:
- Strengths: Ideal for exploratory analysis; provides a full hierarchical structure without needing to specify k, revealing nested relationships.
- Weaknesses: High computational complexity (O(n³)), less practical for large datasets.
K-means:
- Strengths: Fast computation, scalable to big data.
- Weaknesses: Requires predefining k, sensitive to initialization, and assumes roughly uniform (spherical, similarly sized) clusters.
3.4. Choosing k: Picking the Right Number of Groups
K-means requires k to be chosen in advance. Silhouette analysis checks how well each point fits its assigned cluster, the gap statistic compares within-cluster dispersion to that of random reference data, and tools such as NbClust aggregate many indices to suggest the best number.
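As one concrete option, here is a minimal silhouette-based sketch using scikit-learn; the toy data and the candidate range of k are illustrative, and the gap statistic and NbClust follow the same "scan several k and score each" idea.

```python
# Minimal sketch: pick k by scanning candidates and scoring each with
# the average silhouette width (higher is better).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(i * 4, 1, (50, 5)) for i in range(3)])  # 3 simulated groups

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"k={k}: silhouette={scores[k]:.3f}")

print("Best k by silhouette:", max(scores, key=scores.get))
```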
4. Dimensionality Reduction: Seeing the Big Picture
Clustering groups data based on similarity, but high-dimensional datasets (e.g., gene expression with thousands of features) pose a visualization challenge. Dimensionality reduction techniques bridge this gap by mapping high-dimensional data into lower dimensions as part of unsupervised learning. These methods reveal underlying structures, enhancing our ability to interpret clustering results or discover patterns not immediately apparent in high-dimensional space.
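As a minimal sketch (toy matrix, scikit-learn's PCA), this shows the basic fit-and-transform pattern that most of the methods below share: fit on the high-dimensional matrix, keep two components, and plot the resulting coordinates.

```python
# Minimal sketch: project a high-dimensional toy matrix onto 2 components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1000))        # 100 samples x 1000 toy features

pca = PCA(n_components=2)
coords = pca.fit_transform(X)           # (100, 2) coordinates for a scatter plot
print(pca.explained_variance_ratio_)    # variance fraction captured by each PC
```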
4.1. Comparison of Techniques
- PCA: Linear, global variance, fast, misses nonlinear patterns.
- NMF/ICA: Linear, specific patterns (non-negative/independent), biologically interpretable, less visualization-focused.
- MDS: Linear, distance-preserving, balances global structure, less variance-centric than PCA.
- t-SNE: Nonlinear, local focus, cluster-rich, slow and stochastic.
- UMAP: Nonlinear, local-global balance, fast and robust.
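To make the contrast concrete, here is a minimal sketch embedding the same toy data with PCA, t-SNE, and UMAP; it assumes the umap-learn package is installed, and the perplexity and seed settings are arbitrary illustrations rather than recommendations.

```python
# Minimal sketch: one linear (PCA) and two nonlinear (t-SNE, UMAP) embeddings
# of the same toy data, each reduced to 2D for plotting.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap  # provided by the umap-learn package

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(i * 4, 1, (40, 50)) for i in range(3)])  # 3 simulated groups

pca_coords  = PCA(n_components=2).fit_transform(X)
tsne_coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
umap_coords = umap.UMAP(n_components=2, random_state=0).fit_transform(X)
```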
4.2. Selection Guidance
- Scale: PCA for small data; t-SNE/UMAP for larger sets; NMF for interpretability.
- Goal: Variance (PCA), distances (MDS), local clusters (t-SNE), comprehensive view (UMAP).
- Genomics: Bulk data suits PCA/MDS, single-cell prefers t-SNE/UMAP, expression patterns favor NMF.
These methods enhance clustering: PCA/MDS offer broad views, NMF interprets components, t-SNE/UMAP refine local/global structures. Discrepancies with clustering (e.g., K-means vs. MDS) guide metric or k adjustments.
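One way to spot such discrepancies is to overlay cluster labels on an embedding, as in this minimal sketch (toy data; K-means plus scikit-learn's MDS): if well-separated groups in the plot receive mixed labels, the metric or k may need revisiting.

```python
# Minimal sketch: cross-check K-means labels against an MDS view of the data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(i * 4, 1, (40, 50)) for i in range(3)])  # 3 simulated groups

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
coords = MDS(n_components=2, random_state=0).fit_transform(X)

plt.scatter(coords[:, 0], coords[:, 1], c=labels)  # mixed colors within one blob = rethink metric or k
plt.title("K-means labels on MDS coordinates")
plt.show()
```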
5. Overall Purpose
Unsupervised learning maps the data’s natural landscape, revealing groups and structures. It’s like finding landmarks on a research map, pointing to where more focused investigations should go.