PoisMS package contains implementation of Principal Curve-based methods for chromatin reconstruction. The input data is a contact map \(C\in\mathbf{Z}^{n\times n}\), a symmetric integer-valued matrix of contact counts between \(n\) binned genomic loci for a given chromosome. Here, we treat bins as equi-sized and equi-spaced and accordingly index as \(1,...,n\), this being the typical scenario; however, generalizations to alternate indexing (e.g. genomic coordinates) is straightforward. Then the goal is to use the contact matrix \(C\) to obtain spatial coordinates \(x_1, \ldots, x_n\in \mathbb{R}^3\) of genomic loci \(1, \ldots, n,\) respectively.
In our experiments we use Hi-C data for IMR90 cells for chromosome 20 and resolution 100kb obtained from the Gene Expression Omnibus (check the package for different Hi-C data available).
library(PoisMS)
data(IMR90_100kb_chr20)
C = data.matrix(IMR90_100kb_chr20)
n = ncol(C)
Usually, contact matrices are diagonally dominant so, for the sake of visual convenience, we plot the heatmap for \(\log(C)\).
library(fields)
C_log = log(C)
C_log[C == 0] = NA
heatmap = function(C, shift = 1, scale = 1, title = NULL){
#change the color scale
a = min(scale * C, na.rm = TRUE)
b = max(scale * C, na.rm = TRUE)
br = scale * (exp(seq(log(shift), log(b - a + shift), (log(b - a + shift) - log(shift))/100)) + a - shift)
#plot heatmap
par(mar = c(0, 2, 1, 2))
image.plot(C, xaxt = 'n', yaxt = 'n', nlevel = 100, breaks = sort(br), main = title)
}
heatmap(C_log)
Note that there are some unobserved loci near the centromere; we save the genomic coordinates of \(599\) observed loci and drop unobserved ones from the contact matrix.
index = which(colSums(C) != 0)
n_obs = length(index)
C = C[index, index]
C_log = log(C)
C_log[C == 0] = NA
par(mar = c(0, 2, 1, 2))
heatmap(C_log)
Principal Curve-based approaches model chromatin by a smooth curve, assuming \[x_1,\ldots,x_n\in \gamma\text{, where }\gamma\text{ is a smooth one-dimensional curve in $\mathbb{R}^3$}.\] We model each coordinate of \(\gamma\) by a cubic spline with \(k\) degrees-of-freedom. Thus the previous constraint can be written in matrix form as \(X=H\Theta.\) Here:
\(X\in\mathbf{R}^{n\times 3}\) is the matrix of genomic loci coordinates,
\(H\in\mathbf{R}^{n\times k}\) is the matrix of spline basis evaluations at genomic coordinates, i.e. \(H_{i\ell} = h_\ell(i)\) where \(h_1(t),\ldots,h_k(t)\) are cubic spline basis functions,
\(\Theta\) is the matrix of unknown parameters.
In the presence of unobserved genomic loci one can use the following function to construct matrix \(H\).
library(splines)
load_H = function(df, index){
n_knots = df - 2
knots = unique(seq(from = 1, to = max(index), length = n_knots))
knots = knots[-c(1,n_knots)]
H = bs(index, knots = knots, intercept = TRUE)
return(H)
}
Example of spline basis with \(25\) degrees-of-freedom evaluated at \(599\) points corresponding to the observed loci.
H = load_H(25, index)
par(mar = c(2, 5, 0, 2))
matplot(H, type = 'l', lwd = 2)
Principal Curve Metric Scaling (PCMS) solves (classical) MDS problem equipped with the smooth curve constraint: \[\text{minimize } \|Z-S(X)\|^2_F \text{ w.r.t. } \Theta\in\mathbb{R}^{k\times 3} \text{ subject to } X = H\Theta.\] Here \(S(X) = X X^T\) is the inner products matrix and \(Z\) is some symmetric matrix, e.g. the contact matrix after some proper transformation applied. An example of such a transformation could be the combination of power law, that converts contacts to distances, combined with double centering, that turns distances to inner products.
D = 1/(C+1)
Z = -D^2/2
Z = scale(Z, scale = FALSE, center = TRUE)
Z = t(scale(t(Z), scale = FALSE, center = TRUE))
par(mar = c(0, 2, 1, 2))
image.plot(Z, xaxt='n', yaxt = 'n', nlevel = 100)
The PCMS approach assumes \(H\) to be an orthogonal matrix. Orthogonalize \(H\) from previous part by means of QR decomposition and run PCMS(Z, H)
function to find the PCMS reconstruction.
H = qr.Q(qr(H))
solution_PCMS = PCMS(Z, H)
cat('The optimal loss value is', solution_PCMS$loss)
## The optimal loss value is 0.02516578
Plot the contact matrix approximation \(\hat Z = S(X) = XX^T\).
Z_hat = solution_PCMS$X %*% t(solution_PCMS$X)
par(mar = c(0, 2, 1, 2))
image.plot(Z_hat, xaxt='n', yaxt = 'n', nlevel = 100)
Plot projections of 3D reconstruction using visualize(..., type = 'projection')
.
visualize(solution_PCMS$X, index, type = 'projection')
One can also use to create an interactive 3D plot of the reconstruction.
visualize(solution_PCMS$X, index, type = '3D')
Weighted Principal Curve Metric Scaling (WPCMS) is an upgrade of the PCMS approach that solves the following optimization problem: \[\text{minimize } \|W*(C-D^2(X)+\beta)\|^2_F \text{ w.r.t. } \Theta\in\mathbb{R}^{k\times 3} \text{ and } \beta\in\mathbb{R} \text{ subject to } X = H\Theta.\] Here \(*\) refers to the Hadamard product, \(W\in[0,1]^{n\times n}\) is some symmetric matrix of weights, \(\beta\) is an intercept and \(D(X)\) corresponds to the matrix of pairwise distances.
To make the result to be compatible with Poisson Metric Scaling described below we apply the WPCMS technique to the log-transformed contact matrix \(\log(C)\). Let’s create the binary matrix of weights \(W\) with elements \(W_{ij} = 0\) if \(C_{ij} = 0\) and \(W_{ij} = 1\) otherwise. These weights will allow us to ignore \(-\infty\) in matrix \(\log(C)\). Note that if \(W\) is a binary matrix WPCMS treats the elements with zero weights as missing.
W = matrix(1, n_obs, n_obs)
W[C == 0] = 0
C_log[C == 0] = 0
par(mar = c(0, 2, 1, 2))
image.plot(W, xaxt='n', yaxt = 'n')
One can run WPCMS(Z, H, W, ...)
to find the WPCMS solution for the log-transformed contact matrix \(-\log(C)\).
solution_WPCMS = WPCMS(-C_log, H, W = W, eps = 1e-8, maxiter = 1000)
Summary for the WPCMS method.
cat(' It takes', solution_WPCMS$iter, 'iterations to converge.\n', 'The optimal intercept value is', solution_WPCMS$beta, 'and the achieved loss value is', solution_WPCMS$loss)
## It takes 101 iterations to converge.
## The optimal intercept value is 2.11469 and the achieved loss value is 0.5324334
The convergence plots for both loss and \(\beta\) are presented below.
solution_WPCMS$plot$objective
solution_WPCMS$plot$intercept
Next we vizualize the WPCMS solution.
D = as.matrix(dist(solution_WPCMS$X, diag = TRUE, upper = TRUE))
logL = -D^2 + solution_WPCMS$beta
heatmap(logL, 0.1, -1)
visualize(solution_WPCMS$X, index, type = 'projection')
visualize(solution_WPCMS$X, index, type = '3D')
Poisson Metric Scaling (PoisMS) models contact counts by a Poisson distribution: \[C_{ij}\sim Pois(\lambda_{ij}),~~\log(\lambda_{ij}) = - \| x_i - x_j\|^2 + \beta.\] Here \(\beta\in\mathbb{R}\) is the unknown intercept. The optimization problem is, therefore, to find minimum of the negative log-likelihood under the smooth curve constraint: \[\text{minimize } \sum_{1 \leq i,j \leq n} \left[e^{- \| x_i - x_j\|^2 + \beta} - C_{ij}\left(- \| x_i - x_j\|^2 + \beta\right) \right] \text{ w.r.t. } X \text{ subject to } X = H\Theta.\] Since the log-transformation is implicitly introduced in the model through \(\log(\lambda_{ij}) = - \| x_i - x_j\|^2 + \beta,\) we apply the PoisMS technique to the original data \(C\).
Run PoisMS(C, H)
function to calculate the PoisMS solution.
solution_PoisMS = PoisMS(C, H, eps_wpcms = 1e-7, maxiter = 100, eps_poisms = 1e-7, maxepoch = 1000)
Summary for the PoisMS method.
cat(' It takes', solution_PoisMS$epoch, 'iterations to converge.\n', 'The optimal intercept value is', solution_PoisMS$beta, 'and the achieved loss value is', solution_PoisMS$loss)
## It takes 7 iterations to converge.
## The optimal intercept value is 5.583499 and the achieved loss value is -50.37123
The convergence plots for both loss and \(\beta\) are presented below.
solution_PoisMS$plot$objective
solution_PoisMS$plot$intercept
Plot the corresponding contact matrix approximation.
D = as.matrix(dist(solution_PoisMS$X, diag = TRUE, upper = TRUE))
logL = -D^2 + solution_PoisMS$beta
heatmap(logL, 0.1, -1)
Plot the projection of the resulting 3D reconstruction.
visualize(solution_PoisMS$X, index, type = 'projection')
Create the 3D interactive plot.
visualize(solution_PoisMS$X, index, type = '3D')
The main hyperparameter for the PoisMS approach is the spline degrees-of-freedom \(\operatorname{df}\) controlling the reconstruction smoothness. Calculate the PoisMS solutions for different degrees-of-freedom values.
dfs = c(10, 15, 25, 50, 100, 200)
Xs = c()
betas = c()
for(df in dfs){
H = load_H(df, index)
H = qr.Q(qr(H))
poisms = PoisMS(C, H, eps_wpcms = 1e-7, maxiter = 100, eps_poisms = 1e-7, maxepoch = 1000)
Xs = rbind(Xs, data.frame(poisms$X, 'DF' = df))
betas = rbind(betas, data.frame('beta' = poisms$beta, 'DF' = df))
}
Plot the heatmaps.
for(df in dfs){
X = as.matrix(subset(Xs, DF == df)[,1:3])
beta = subset(betas, DF == df)$beta
D = as.matrix(dist(X, diag = TRUE, upper = TRUE))
logL = -D^2 + beta
heatmap(logL, 0.1, -1, paste('df =', df))
}
Plot the projections.
for(df in dfs){
X = as.matrix(subset(Xs, DF == df)[,1:3])
visualize(X, index, type = 'projection', title = paste('df =', df))
}
Low \(\operatorname{df}\) leads to the reconstructions capturing only general structure, high \(\operatorname{df}\) reconstructions lose smoothness property.
Let’s have a closer look at the 3D reconstructions obtained via PoisMS for \(\operatorname{df} = 200\).
visualize(as.matrix(subset(Xs, DF == 200)[,1:3]), index, type = '3D')
Vary degrees-of-freedom value and calculate the loss value for each PCMS reconstruction.
dfs = seq(5, 200, 5)
loss = c()
for(df in dfs){
H = load_H(df, index)
H = qr.Q(qr(H))
solution_PCMS = PCMS(Z, H)
loss = append(loss, solution_PCMS$loss)
}
The loss decreases with the growth of \(\operatorname{df}\).
library(ggplot2)
data = data.frame(dfs, loss)
plt = ggplot(data, aes(dfs, loss))+
geom_point(color = 'blue', size = 2)+
geom_line(color = 'blue')+
scale_x_continuous(breaks = seq(0, 200, 10))+
ylab('PCMS loss')+
xlab('df')
plt
Use segmented regression to determine the kink location (thus, optimal value of \(\operatorname{df}\)).
library(segmented)
LR <- lm(loss ~ dfs, data = data)
SR <- segmented(LR, seg.Z = ~ dfs, npsi = 1, tol = 1e-6, it.max = 1000)
kink = SR$psi[2]
prediction = predict(object = SR, newdata = data.frame('dfs' = c(min(dfs), kink, max(dfs))))
newdata = data.frame('dfs' = c(min(dfs), kink, max(dfs)), 'loss' = prediction)
plt+geom_line(newdata, mapping = aes(dfs, loss), colour = 'black', size = 0.8)+
geom_vline(xintercept = kink, color = 'red', size = 0.4, linetype = 'dashed')+
ggtitle(paste('Optimal df value = ', round(kink)))