Introduction

PoisMS package contains implementation of Principal Curve-based methods for chromatin reconstruction. The input data is a contact map \(C\in\mathbf{Z}^{n\times n}\), a symmetric integer-valued matrix of contact counts between \(n\) binned genomic loci for a given chromosome. Here, we treat bins as equi-sized and equi-spaced and accordingly index as \(1,...,n\), this being the typical scenario; however, generalizations to alternate indexing (e.g. genomic coordinates) is straightforward. Then the goal is to use the contact matrix \(C\) to obtain spatial coordinates \(x_1, \ldots, x_n\in \mathbb{R}^3\) of genomic loci \(1, \ldots, n,\) respectively.

Contact matrix

In our experiments we use Hi-C data for IMR90 cells for chromosome 20 and resolution 100kb obtained from the Gene Expression Omnibus (check the package for different Hi-C data available).

library(PoisMS)
data(IMR90_100kb_chr20)
C = data.matrix(IMR90_100kb_chr20)
n = ncol(C)

Usually, contact matrices are diagonally dominant so, for the sake of visual convenience, we plot the heatmap for \(\log(C)\).

library(fields)
C_log = log(C)
C_log[C == 0] = NA

heatmap = function(C, shift = 1, scale = 1, title = NULL){
  #change the color scale
  a = min(scale * C, na.rm = TRUE)
  b = max(scale * C, na.rm = TRUE)
  br = scale * (exp(seq(log(shift), log(b - a + shift), (log(b - a + shift) - log(shift))/100)) + a - shift)
  #plot heatmap
  par(mar = c(0, 2, 1, 2))
  image.plot(C, xaxt = 'n', yaxt = 'n', nlevel = 100, breaks = sort(br), main = title)
}

heatmap(C_log)

Note that there are some unobserved loci near the centromere; we save the genomic coordinates of \(599\) observed loci and drop unobserved ones from the contact matrix.

index = which(colSums(C) != 0)
n_obs = length(index)
C = C[index, index]
C_log = log(C)
C_log[C == 0] = NA
par(mar = c(0, 2, 1, 2))
heatmap(C_log)

Principal Curve

Principal Curve-based approaches model chromatin by a smooth curve, assuming \[x_1,\ldots,x_n\in \gamma\text{, where }\gamma\text{ is a smooth one-dimensional curve in $\mathbb{R}^3$}.\] We model each coordinate of \(\gamma\) by a cubic spline with \(k\) degrees-of-freedom. Thus the previous constraint can be written in matrix form as \(X=H\Theta.\) Here:

In the presence of unobserved genomic loci one can use the following function to construct matrix \(H\).

library(splines)
load_H = function(df, index){
  n_knots = df - 2
  knots = unique(seq(from = 1, to = max(index), length = n_knots))
  knots = knots[-c(1,n_knots)]
  H = bs(index, knots = knots, intercept = TRUE)
  return(H)
}

Example of spline basis with \(25\) degrees-of-freedom evaluated at \(599\) points corresponding to the observed loci.

H = load_H(25, index)
par(mar = c(2, 5, 0, 2))
matplot(H, type = 'l', lwd = 2)

PCMS

Principal Curve Metric Scaling (PCMS) solves (classical) MDS problem equipped with the smooth curve constraint: \[\text{minimize } \|Z-S(X)\|^2_F \text{ w.r.t. } \Theta\in\mathbb{R}^{k\times 3} \text{ subject to } X = H\Theta.\] Here \(S(X) = X X^T\) is the inner products matrix and \(Z\) is some symmetric matrix, e.g. the contact matrix after some proper transformation applied. An example of such a transformation could be the combination of power law, that converts contacts to distances, combined with double centering, that turns distances to inner products.

D = 1/(C+1)
Z = -D^2/2
Z = scale(Z, scale = FALSE, center = TRUE)
Z = t(scale(t(Z), scale = FALSE, center = TRUE))
par(mar = c(0, 2, 1, 2))
image.plot(Z, xaxt='n', yaxt = 'n', nlevel = 100)

The PCMS approach assumes \(H\) to be an orthogonal matrix. Orthogonalize \(H\) from previous part by means of QR decomposition and run PCMS(Z, H) function to find the PCMS reconstruction.

H = qr.Q(qr(H))
solution_PCMS = PCMS(Z, H)
cat('The optimal loss value is', solution_PCMS$loss)
## The optimal loss value is 0.02516578

Visualization

Plot the contact matrix approximation \(\hat Z = S(X) = XX^T\).

Z_hat =  solution_PCMS$X %*% t(solution_PCMS$X)
par(mar = c(0, 2, 1, 2))
image.plot(Z_hat, xaxt='n', yaxt = 'n', nlevel = 100)

Plot projections of 3D reconstruction using visualize(..., type = 'projection').

visualize(solution_PCMS$X, index, type = 'projection')

One can also use to create an interactive 3D plot of the reconstruction.

visualize(solution_PCMS$X, index, type = '3D')

WPCMS

Weighted Principal Curve Metric Scaling (WPCMS) is an upgrade of the PCMS approach that solves the following optimization problem: \[\text{minimize } \|W*(C-D^2(X)+\beta)\|^2_F \text{ w.r.t. } \Theta\in\mathbb{R}^{k\times 3} \text{ and } \beta\in\mathbb{R} \text{ subject to } X = H\Theta.\] Here \(*\) refers to the Hadamard product, \(W\in[0,1]^{n\times n}\) is some symmetric matrix of weights, \(\beta\) is an intercept and \(D(X)\) corresponds to the matrix of pairwise distances.

To make the result to be compatible with Poisson Metric Scaling described below we apply the WPCMS technique to the log-transformed contact matrix \(\log(C)\). Let’s create the binary matrix of weights \(W\) with elements \(W_{ij} = 0\) if \(C_{ij} = 0\) and \(W_{ij} = 1\) otherwise. These weights will allow us to ignore \(-\infty\) in matrix \(\log(C)\). Note that if \(W\) is a binary matrix WPCMS treats the elements with zero weights as missing.

W = matrix(1, n_obs, n_obs)
W[C == 0] = 0
C_log[C == 0] = 0
par(mar = c(0, 2, 1, 2))
image.plot(W, xaxt='n', yaxt = 'n')

One can run WPCMS(Z, H, W, ...) to find the WPCMS solution for the log-transformed contact matrix \(-\log(C)\).

solution_WPCMS = WPCMS(-C_log, H, W = W, eps = 1e-8, maxiter = 1000)

Summary for the WPCMS method.

cat(' It takes', solution_WPCMS$iter, 'iterations to converge.\n', 'The optimal intercept value is', solution_WPCMS$beta, 'and the achieved loss value is', solution_WPCMS$loss)
##  It takes 101 iterations to converge.
##  The optimal intercept value is 2.11469 and the achieved loss value is 0.5324334

The convergence plots for both loss and \(\beta\) are presented below.

solution_WPCMS$plot$objective

solution_WPCMS$plot$intercept

Next we vizualize the WPCMS solution.

D = as.matrix(dist(solution_WPCMS$X, diag = TRUE, upper = TRUE))
logL = -D^2 + solution_WPCMS$beta
heatmap(logL, 0.1, -1)

visualize(solution_WPCMS$X, index, type = 'projection')

visualize(solution_WPCMS$X, index, type = '3D')

PoisMS

Poisson Metric Scaling (PoisMS) models contact counts by a Poisson distribution: \[C_{ij}\sim Pois(\lambda_{ij}),~~\log(\lambda_{ij}) = - \| x_i - x_j\|^2 + \beta.\] Here \(\beta\in\mathbb{R}\) is the unknown intercept. The optimization problem is, therefore, to find minimum of the negative log-likelihood under the smooth curve constraint: \[\text{minimize } \sum_{1 \leq i,j \leq n} \left[e^{- \| x_i - x_j\|^2 + \beta} - C_{ij}\left(- \| x_i - x_j\|^2 + \beta\right) \right] \text{ w.r.t. } X \text{ subject to } X = H\Theta.\] Since the log-transformation is implicitly introduced in the model through \(\log(\lambda_{ij}) = - \| x_i - x_j\|^2 + \beta,\) we apply the PoisMS technique to the original data \(C\).

Run PoisMS(C, H) function to calculate the PoisMS solution.

solution_PoisMS = PoisMS(C, H,  eps_wpcms = 1e-7, maxiter = 100, eps_poisms = 1e-7, maxepoch = 1000)

Summary for the PoisMS method.

cat(' It takes', solution_PoisMS$epoch, 'iterations to converge.\n', 'The optimal intercept value is', solution_PoisMS$beta, 'and the achieved loss value is', solution_PoisMS$loss)
##  It takes 7 iterations to converge.
##  The optimal intercept value is 5.583499 and the achieved loss value is -50.37123

The convergence plots for both loss and \(\beta\) are presented below.

solution_PoisMS$plot$objective

solution_PoisMS$plot$intercept

Plot the corresponding contact matrix approximation.

D = as.matrix(dist(solution_PoisMS$X, diag = TRUE, upper = TRUE))
logL = -D^2 + solution_PoisMS$beta
heatmap(logL, 0.1, -1)

Plot the projection of the resulting 3D reconstruction.

visualize(solution_PoisMS$X, index, type = 'projection')

Create the 3D interactive plot.

visualize(solution_PoisMS$X, index, type = '3D')

Vary degrees-of-freedom

The main hyperparameter for the PoisMS approach is the spline degrees-of-freedom \(\operatorname{df}\) controlling the reconstruction smoothness. Calculate the PoisMS solutions for different degrees-of-freedom values.

dfs = c(10, 15, 25, 50, 100, 200)
Xs = c()
betas = c()
for(df in dfs){
    H = load_H(df, index)
    H = qr.Q(qr(H))
    poisms = PoisMS(C, H,  eps_wpcms = 1e-7, maxiter = 100, eps_poisms = 1e-7, maxepoch = 1000)
    Xs = rbind(Xs, data.frame(poisms$X, 'DF' = df))
    betas = rbind(betas, data.frame('beta' = poisms$beta, 'DF' = df))
}

Plot the heatmaps.

for(df in dfs){
  X = as.matrix(subset(Xs, DF == df)[,1:3])
  beta = subset(betas, DF == df)$beta
  D = as.matrix(dist(X, diag = TRUE, upper = TRUE))
  logL = -D^2 + beta
  heatmap(logL, 0.1, -1, paste('df =', df))
}  

Plot the projections.

for(df in dfs){
  X = as.matrix(subset(Xs, DF == df)[,1:3])
  visualize(X, index, type = 'projection', title = paste('df =', df))
}  

Low \(\operatorname{df}\) leads to the reconstructions capturing only general structure, high \(\operatorname{df}\) reconstructions lose smoothness property.

Let’s have a closer look at the 3D reconstructions obtained via PoisMS for \(\operatorname{df} = 200\).

visualize(as.matrix(subset(Xs, DF == 200)[,1:3]), index, type = '3D')

Pick optimal degrees-of-freedom value (PCMS)

Vary degrees-of-freedom value and calculate the loss value for each PCMS reconstruction.

dfs = seq(5, 200, 5)
loss = c()
for(df in dfs){
  H = load_H(df, index)
  H = qr.Q(qr(H))
  solution_PCMS = PCMS(Z, H)
  loss = append(loss, solution_PCMS$loss)
}  

The loss decreases with the growth of \(\operatorname{df}\).

library(ggplot2)
data = data.frame(dfs, loss)
plt = ggplot(data, aes(dfs, loss))+
  geom_point(color = 'blue', size = 2)+
  geom_line(color = 'blue')+
  scale_x_continuous(breaks =  seq(0, 200, 10))+
  ylab('PCMS loss')+
  xlab('df')
plt

Use segmented regression to determine the kink location (thus, optimal value of \(\operatorname{df}\)).

library(segmented)
LR <- lm(loss ~ dfs, data = data)
SR <- segmented(LR, seg.Z = ~ dfs, npsi = 1, tol = 1e-6, it.max = 1000)
kink = SR$psi[2]
prediction = predict(object = SR, newdata = data.frame('dfs' = c(min(dfs), kink, max(dfs))))
newdata = data.frame('dfs' = c(min(dfs), kink, max(dfs)), 'loss' = prediction)
plt+geom_line(newdata, mapping = aes(dfs, loss), colour = 'black', size = 0.8)+
  geom_vline(xintercept = kink, color = 'red', size = 0.4, linetype = 'dashed')+
  ggtitle(paste('Optimal df value = ', round(kink)))