Cluster analysis concepts
 Interactive exploration of clustering

- number of genes; distance between objects (tissues or gene expression time series)
- agglomeration method for tree construction
- number of groups via cutree
- mean bootstrapped Jaccard similarity
- PC ordination with colors
- assignment (mouseover)
 
 Exploring clusters with tissue-of-origin data
 
 Some definitions

 
 Example: Euclidean distance
- High-school analytic geometry: distance between two points in \(R^3\)
- \(p_1 = (x_1, y_1, z_1)\), \(p_2 = (x_2, y_2, z_2)\)
- \(\Delta x = x_1 - x_2\), etc.
- \(d(p_1, p_2) = \sqrt{(\Delta x)^2 + (\Delta y)^2 + (\Delta z)^2}\)
 
 What is the ward.D2 agglomeration method?
- Enables very rapid update upon change of distance or # genes

 
 What is the Jaccard similarity coefficient?

 
 What is the bootstrap distribution of a statistic?
- classic example from 1983: correlation of grades and achievement test scores
- arbitrary replication of multivariate records

 
 What is the bootstrap distribution of a statistic?
- sampling with replacement from the base records

 
 How to use the bootstrap distribution?
- estimate quantiles of the empirical distribution

 
 Bootstrap distributions of Jaccard

- dispersion should be reckoned along with mean
 
 Now that we know the definitions:
 
 Summary
- number of features (and detailed selection of features, not explored here) has impact
- agglomeration procedure has substantial impact
- number of clusters based on tree cutting
- bootstrap distribution of Jaccard index measures cluster stability
- ordination using principal components can be illuminating
- procedure is sensitive but sensible choices recapitulate biology
- there are bivariate outliers in PC1-PC2 view
 
 Road map
- yeast cell cycle
- compute distance to an interpretable prototype
- illustrate silhouette measure against random grouping
- define families of exemplars through trigonometric regression
- use F-statistics and parameter estimates to filter and discriminate patterns
 
- normalized tissue-specific expression
 
 Yeast cell cycle: phenotypic transitions

- Lee, Rinaldi et al., Science 2002
 
 Yeast cell cycle: regulatory model

- TF binding data added to expression patterns
 
 Raw trajectories for some of the genes in MCM cluster

- consistent with annotation to M/G1
 
 A pattern of interest (“prototype”, but not in the data)
- Define a basal oscillator to be a gene with expression varying with the cell cycle in a specific way
- for alpha-synchronized colony, cell cycle period is about 66 minutes
 
- Theoretical expression trajectory  
 
 Application of the distance concept
- What is the dimension of the space in which a yeast gene expression trajectory resides?
- How can we define the distance between a given gene’s trajectory and that corresponding to the basal oscillator pattern? Assumptions?
 
 
 Computing distances to basal oscillator pattern
bot = function(tim) sin(2*pi*tim/66)
bo = bot(alp$time)
d2bo = function(x) sqrt(sum((x-bo)^2))
suppressWarnings({ds = apply(uea, 1, d2bo)})
md = which.min(ds)
md
## YMR011W 
##    4411
summary(ds)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       2       4       4       4       4       6    1689
 
 The nearest gene
plot(alp$time, bo, type="l", xlab="time", ylab="sfe")
lines(alp$time, uea[md,], lty=2)
legend(38,.87, lty=c(1,2), legend=c("basal osc.", featureNames(alp)[md]))

 
 The distribution of distances
hist(ds, xlab="Euclidean distance to basal oscillator pattern")

- What should this look like if there is clustering?
 
 “Top ten!”
todo = names(sort(ds))[1:10]
plot(alp$time, bo, type="l", xlab="time", ylab="sfe")
for (md in todo) lines(alp$time, uea[md,], lty=2, col="gray")

#legend(38,.87, lty=c(1,2), legend=c("basal osc.", featureNames(alp)[md]))
 
 Is it a cluster?
- Recall the definition:
- Variability in feature measures exhibits structure
- For some grouping, between-group variation is larger than within-group variation
- Can we make this more precise?
 
 
 Definition from ?silhouette
   For each observation i, the _silhouette width_ s(i) is defined as
     follows:
     Put a(i) = average dissimilarity between i and all other points of
     the cluster to which i belongs (if i is the _only_ observation in
     its cluster, s(i) := 0 without further calculations).  For all
     _other_ clusters C, put d(i,C) = average dissimilarity of i to all
     observations of C.  The smallest of these d(i,C) is b(i) := \min_C
     d(i,C), and can be seen as the dissimilarity between i and its
     “neighbor” cluster, i.e., the nearest one to which it does _not_
     belong.  Finally,
                   s(i) := ( b(i) - a(i) ) / max( a(i), b(i) ).         
     
     ‘silhouette.default()’ is now based on C code donated by Romain
     Francois (the R version being still available as
     ‘cluster:::silhouette.default.R’).
 
 Realizations of an unstructured grouping scheme
- Form some arbitrarily chosen groups of size ten
- The code:
allf = featureNames(alp)
set.seed(2345)
scramble = function(x) sample(x, size=length(x), replace=FALSE) 
cands = scramble(setdiff(allf, todo))[1:40]
sc = split(cands, gids <- rep(2:5,each=10))
ml = lapply(sc, function(x) uea[x,])
 
 Trajectories from the arbitrary groups

 
 The silhouette plot

 
 Recap
- A target pattern was defined, using knowledge of the cell cycle period
- Expression patterns were transformed to conform to the dynamic range of the target pattern
- A distance function was defined and genes ordered by proximity to target
- A group of ten genes nearest to target trajectory was identified and compared to arbitrarily formed groups using silhouette widths
- Transformation, Distance, and Comparison are fundamental elements of all cluster analysis
- but a fixed target pattern is not so common in genomics
- compare handwriting or speech waveform analysis
 
 
 Another exemplar
- Give the mathematical definition of the hyperbasal oscillator pattern, that has value 1 at time 0 and returns to 1 at 66 minutes
- Which gene has expression trajectory closest to that of the hyperbasal oscillator?
- How many steps to determine the mean silhouette width for the ten genes closest to the hyperbasal pattern?
 
 Solution
hbot = function(tim) cos(2*pi*tim/66)
hbo = hbot(alp$time)
d2hbo = function(x) sqrt(sum((x-hbo)^2))
suppressWarnings({cds = apply(uea, 1, d2hbo)})
cmd = which.min(cds)
cmd
## YHR005C 
##    2572
summary(cds)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       1       3       4       4       4       6    1689
 
 The most hyperbasal gene
plot(alp$time, hbo, type="l", xlab="time", ylab="sfe")
lines(alp$time, uea[cmd,], lty=2)
legend(20,.87, lty=c(1,2), legend=c("basal osc.", featureNames(alp)[cmd]))

 
 “Top ten!”
ctodo = names(sort(cds)[1:10])
plot(alp$time, hbo, type="l", xlab="time", ylab="sfe")
for (md in ctodo) lines(alp$time, uea[md,], lty=2, col="gray")

 
 Silhouette continuation

 
 Caveats
- We ignored the natural units reported for expression
- The pure sinusoid need not be a reasonable trajectory model for any gene
- We arbitrarily thresholded to achieve groups of size 10
- Can we use the data to define groups of genes with similar expression patterns without so much conceptual intervention?
 
 Hierarchical clustering
- Subdivision of the genome into coexpressed groups is of general interest
- Agglomerative algorithms use the distance measure to combine very similar genes into new entities
- The process repeats until only one entity remains
 
 Filtering
- We would like to look for structure among expression patterns of genes exhibiting oscillatory behavior
- Thus we would like to filter away genes with non-periodic trajectories
- Trigonometric regression can help
- \(t\) is transformed to [0,1]; estimate \(s_{gj}\) and \(c_{gj}\) in \[ X_g(t) = \sum_{j=1}^J s_{gj} \sin 2j\pi t + \sum_{j=1}^J c_{gj} \cos 2j \pi t + e_g(t) \]
 
- We will cluster genes possessing relatively large \(F\) statistics for this model, fixing \(J=2\)
 
 limma for trigonometric regression fits
options(digits=2)
x = (alp$time %% 66)/66
mm = model.matrix(~sin(2*pi*x)+cos(2*pi*x)+sin(4*pi*x)+cos(4*pi*x)-1)
colnames(mm) = c("s1", "c1", "s2", "c2")
library(limma)
m1 = lmFit(alp, mm)
em1 = eBayes(m1)
topTable(em1, 1:4, n=5)
##            s1    c1     s2      c2  AveExpr  F P.Value adj.P.Val
## YIL129C -0.55 -0.75 -0.176 -0.1035  1.1e-03 30 2.6e-08   9.5e-05
## YJL159W  0.69  0.91 -0.059 -0.0074 -1.3e-17 30 3.1e-08   9.5e-05
## YMR011W  0.88  0.31  0.208  0.0368  1.7e-03 25 1.2e-07   2.0e-04
## YPL256C  1.20 -0.58  0.156 -0.3544 -5.6e-04 25 1.5e-07   2.0e-04
## YKR013W  1.04 -0.59  0.072  0.0085 -1.1e-03 24 1.7e-07   2.0e-04
 
 Tuning hclust: dendrogram structure
 Distance and fusion method selection
 
 
 Characteristic traces, raw expression data
 
 Summary on clustering
- Choice of object, representation, and distance: allow flexibility when substantive considerations do not dictate
- Choice of algorithm: qualitative distinctions exist (single vs complete linkage, for example)
- “figure of merit” – Jaccard similarity and silhouette are guides; silhouette is distance-dependent
- Consider how to validate or corroborate clustering results with TF binding data from the harbChIP package
- We’ve not considered divisive methods, self-organizing maps; see Hastie, Tibshirani and Friedman