Brown University Department of Biostatistics
Center for Statistical Sciences
Computation in Brain and Mind
Brown Institute for Brain Science
Brown Data Science Initiative
ABCD Research Group
We allow $|C_{11} | \lt | C_{12} |$ or $C \prec 0$
K-means/HC lead to block-diagonal correlation matrices (up to permutation)
Clustering based on $G$-Block
Generalizing $G$-Latent which requires $C\succ 0$
Minimum $G$ Partition
Theorem: $G^{\beta}(X)$ is the minimal partition induced by the equivalence $a\stackrel{G^{\beta}}{\sim} b$, which holds iff $\var(X_{a})=\var(X_{b})$ and $\cov(X_{a},X_{c})=\cov(X_{b},X_{c})$ for all $c\neq a,b$. Moreover, if the matrix of covariances $C$ corresponding to the partition $G(X)$ is positive semidefinite, then this is the unique minimal partition according to which $X$ admits a latent decomposition.
The minimal partition is unique under regularity conditions.
Gaussian copula: Kendall's tau transformed, $R_{ab} = \sin (\frac{\pi}{2}\tau_{ab})$
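As a concrete illustration, here is a minimal sketch (our own, not the authors' code) of this rank-based estimate: compute Kendall's tau for each pair of variables, then apply the sine transform.

```python
# Sketch of the copula correlation estimate: pairwise Kendall's tau mapped
# through R_ab = sin(pi/2 * tau_ab), consistent under a Gaussian copula.
import numpy as np
from scipy.stats import kendalltau

def copula_correlation(X):
    """Estimate the copula correlation matrix R from an (n, p) data matrix."""
    n, p = X.shape
    R = np.eye(p)
    for a in range(p):
        for b in range(a + 1, p):
            tau, _ = kendalltau(X[:, a], X[:, b])
            R[a, b] = R[b, a] = np.sin(0.5 * np.pi * tau)
    return R
```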
Second, a distance: the maximum difference of correlations $$d(a,b) := \max_{c\neq a,b}|R_{ac}-R_{bc}|$$
Third, group variables $a$, $b$ together if $d(a,b) = 0$
The enemy of my enemy is my friend!
Image credit: http://sutherland-careers.com/
Algorithm: Main Idea
Greedy: one cluster at a time, avoiding the NP-hard search over all partitions
Cluster variables together if the CORD metric satisfies $$ \max_{c\neq a,b}|\hat{R}_{ac}-\hat{R}_{bc}| \lt \alpha,$$ where $\alpha$ is a tuning parameter (see the sketch after this list)
$\alpha$ is chosen by theory or CV
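The greedy step can be sketched as follows; this is an illustrative simplification (function and variable names are ours), not the reference implementation:

```python
# Illustrative sketch of greedy CORD clustering. R is an estimated
# correlation matrix and alpha the threshold on the CORD metric.
import numpy as np

def cord_distance(R, a, b, idx):
    """CORD metric: max over c in idx (c != a, b) of |R[a,c] - R[b,c]|."""
    others = [c for c in idx if c not in (a, b)]
    if not others:
        return 0.0
    return max(abs(R[a, c] - R[b, c]) for c in others)

def cord_cluster(R, alpha):
    """Greedily peel off one cluster at a time."""
    remaining = list(range(R.shape[0]))
    groups = []
    while remaining:
        if len(remaining) == 1:
            groups.append([remaining.pop()])
            break
        # Find the closest pair among the remaining variables.
        d, a, b = min((cord_distance(R, a, b, remaining), a, b)
                      for i, a in enumerate(remaining)
                      for b in remaining[i + 1:])
        if d >= alpha:
            # No pair is close enough: the rest are singletons.
            groups.extend([v] for v in remaining)
            break
        # Pull in every remaining variable close to the anchor a.
        group = [c for c in remaining
                 if c in (a, b) or cord_distance(R, a, c, remaining) < alpha]
        groups.append(group)
        remaining = [v for v in remaining if v not in group]
    return groups
```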
Theory
Condition
Let $\eta \geq 0$ be given. Let $X$ be a zero-mean random vector with a Gaussian copula distribution with parameter $R$. $$ \begin{multline} \mathcal{R}(\eta) := \{R: \ d(a,b) := \max_{c\neq a,b}|R_{ac}-R_{bc}|>\eta \\ \textrm{for all}\ a\stackrel{G(X)}{\nsim}b\}. \end{multline} $$ Group separation condition: $R \in \mathcal{R}(\eta)$.
The signal strength $\eta$ needs to be sufficiently large.
Consistency
Theorem: Let $\tau=|\widehat R-R|_{\infty}$ and consider parameters $(\alpha,\eta)$ satisfying $$\begin{equation} \alpha\geq 2\tau\quad\textrm{and}\quad \eta\geq2\tau+\alpha. \end{equation}$$ Then our algorithm recovers $\widehat G=G(X)$ with high probability.
Our method recovers the exact clustering with high probability.
Minimax
Theorem: Let $P_{\Sigma}$ denote the likelihood based on $n$ independent observations of $X \stackrel{d}{=} \mathcal{N}(0,\Sigma)$. For any $$\begin{equation} 0\leq \eta \lt \eta^{*}:={\sqrt{\log(p(p-1)/2)}\over \sqrt{(2+e^{-1})n} +\sqrt{\log(p(p-1)/2)}}\,, \end{equation}$$ we have $$\inf_{\widehat G}\sup_{R \in \mathcal{R}(\eta)} P_{\Sigma}(\widehat G\neq G^{\beta}(X))\geq {1\over 2e+1}\geq {1\over 7} \,,$$ where the infimum is taken over all possible estimators $\widehat G$.
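For a sense of scale (a worked instance of the bound, our own numbers): with $p=100$ and $n=200$, $$\eta^{*} = \frac{\sqrt{\log(4950)}}{\sqrt{(2+e^{-1})\cdot 200}+\sqrt{\log(4950)}} \approx \frac{2.92}{21.76+2.92} \approx 0.12,$$ so no estimator can succeed uniformly over $\mathcal{R}(\eta)$ when the separation is below roughly $0.12$.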
Our method is minimax optimal.
Choosing Number of Clusters
Split data into 3 parts
Use part 1 of data to estimate clusters $\hat{G}$ for each $\alpha$
Use part 2 to compute the vector of between-variable differences $$ \delta^{(2)}_{ab} = \big(\hat{R}_{ac}^{(2)} - \hat{R}_{bc}^{(2)}\big)_{c \ne a, b}. $$
Use part 3 to compute the "CV" loss $$ \mbox{CV}(\hat{G}) = \sum_{a \lt b} \| \delta^{(3)}_{ab} - \delta^{(2)}_{ab}\, 1\{ a \mbox{ not clustered with } b \} \|^2_\infty. $$
Pick the $\alpha$ with the smallest loss above (sketched below)
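A sketch of this three-fold procedure, reusing the `copula_correlation` and `cord_cluster` sketches above (the helper names and the grid of $\alpha$ values are our own assumptions):

```python
# Sketch of the three-fold data-splitting selection of alpha.
import numpy as np

def cv_loss(R2, R3, groups, p):
    """CV(G-hat) = sum_{a<b} ||d3_ab - d2_ab * 1{a,b not co-clustered}||_inf^2."""
    label = np.empty(p, dtype=int)
    for k, g in enumerate(groups):
        for v in g:
            label[v] = k
    loss = 0.0
    for a in range(p):
        for b in range(a + 1, p):
            mask = np.ones(p, dtype=bool)
            mask[[a, b]] = False
            d3 = R3[a, mask] - R3[b, mask]
            d2 = R2[a, mask] - R2[b, mask]
            if label[a] == label[b]:  # indicator zeroes out the part-2 term
                d2 = np.zeros_like(d2)
            loss += np.max(np.abs(d3 - d2)) ** 2
    return loss

def select_alpha(X, alphas):
    """Split the rows into three folds; return alpha with smallest CV loss."""
    X1, X2, X3 = np.array_split(X, 3)
    R1, R2, R3 = (copula_correlation(Xi) for Xi in (X1, X2, X3))
    losses = [cv_loss(R2, R3, cord_cluster(R1, a), X.shape[1]) for a in alphas]
    return alphas[int(np.argmin(losses))]
```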
Theory for CV
Theorem: If either (i) $X$ is sub-Gaussian with correlation matrix $R$, or (ii) $X$ has a copula distribution with copula correlation matrix $R$, then $E[\mbox{CV}(G^*)] \lt E[\mbox{CV}(G)]$ for any $G\ne G^*$, where $G^*$ denotes the true partition.
Theoretical support for selecting $G^*$ from data.
Simulations
Setup
Generate from various $C$: block, sparse, negative
Compare:
Exact recovery of groups (theoretical tuning parameter)
Cross validation (data-driven tuning parameter)
CORD metric vs. (semi)parametric correlations (regardless of tuning)
Exact Recovery
Different models for $C = \cov(Z)$; theoretical tuning $\alpha = 2 n^{-1/2} \log^{1/2} p$
Vertical lines: theoretical sample size based on our lower bound
HC and K-means fail even when given the true $K$.
Data-driven choice of $\alpha$: Averaging
Introduce the block averaging operator: for $a\in G_{k}$ and $b\in G_{k^{\prime}}$, $\left[\varUpsilon\left(R,G\right)\right]_{ab} =$
$$\begin{cases} \left|G_{k}\right|^{-1}\left(\left|G_{k}\right|-1\right)^{-1}\sum_{i,j\in G_{k},i\ne j}R_{ij} & \mbox{if } a\ne b \mbox{ and } k=k^{\prime}\\ \left|G_{k}\right|^{-1}\left|G_{k^{\prime}}\right|^{-1}\sum_{i\in G_{k},j\in G_{k^{\prime}}}R_{ij} & \mbox{if } a\ne b \mbox{ and } k\ne k^{\prime}\\ 1 & \mbox{if }a=b. \end{cases}$$
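A minimal sketch of $\varUpsilon$, assuming `R` is a correlation matrix (unit diagonal) and `groups` is a list of index lists:

```python
# Sketch of the block-averaging operator Upsilon(R, G): within- and
# between-block entries of R are replaced by their block means.
import numpy as np

def block_average(R, groups):
    p = R.shape[0]
    A = np.eye(p)
    for k, Gk in enumerate(groups):
        for l, Gl in enumerate(groups):
            if k == l and len(Gk) > 1:
                sub = R[np.ix_(Gk, Gk)]
                # Mean of the off-diagonal entries within block k.
                mean = (sub.sum() - np.trace(sub)) / (len(Gk) * (len(Gk) - 1))
            elif k != l:
                # Mean over all between-block entries.
                mean = R[np.ix_(Gk, Gl)].mean()
            else:
                continue  # singleton block: only its diagonal entry
            for a in Gk:
                for b in Gl:
                    if a != b:
                        A[a, b] = mean
    return A
```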
Data-driven choice of $\alpha$: Prediction
Choose $\alpha$ by cross-validation, minimizing over a grid of $\alpha$: $$\min_\alpha \| \varUpsilon\left(\hat R,G_\alpha \right) - \hat{R}_{test}\|_F^2 $$
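Reusing the sketches above, this prediction-based selection might look like the following (the train/test split and grid are illustrative assumptions):

```python
# Sketch: pick alpha by minimizing the Frobenius prediction error of the
# block-averaged fit against a held-out correlation estimate.
import numpy as np

def select_alpha_prediction(X_train, X_test, alphas):
    R_train = copula_correlation(X_train)
    R_test = copula_correlation(X_test)
    errors = [np.linalg.norm(block_average(R_train, cord_cluster(R_train, a))
                             - R_test, 'fro') ** 2
              for a in alphas]
    return alphas[int(np.argmin(errors))]
```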
Cross Validation
Recovery % in red and CV loss in black.
CV selects the tuning constant to yield close to 100% recovery, as predicted by our theory (at least for $n>200$)
Cross Validation
Recovery % in red and CV loss in black.
CV selects the tuning to yield close to 100% recovery
Metric Comparison: Without Threshold
HC and Kmeans metrics yield more false discoveries (FD) as the threshold (or $K$) varies.
CORD is the better metric for clustering regardless of threshold
Real Data
fMRI Studies
[Figure: fMRI data layout — subjects 1 to ~100, each with ~4 sessions, each session a time series of ~200 scans]
This talk: one subject, two sessions (to test replicability)
Functional MRI
fMRI matrix: BOLD (blood-oxygen-level-dependent) signals from different brain regions
Variables: different brain regions
Sample: time series (after whitening or removing temporal correlations)
Clusters of brain regions
Two data matrices from two scan sessions (OpenfMRI.org)
Use Power's 264 regions/nodes
Test Prediction/Reproducibility
Find partitions using the first session data
Average each block correlation to improve estimation
Compare with the correlation matrix from the second scan: $$ \| Avg_{\hat{G}}(\hat{\Sigma}_1) - \hat{\Sigma}_2 \|$$
The difference is smaller if the clustering $\hat{G}$ is better.
Vertical lines: fixed (solid) and data-driven (dashed) thresholds
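Putting the pieces together, a sketch of this replicability check, reusing the earlier sketches (argument names are placeholders for the two time-by-region session matrices):

```python
# Sketch: cluster on session 1, block-average the session-1 correlation
# matrix, and compare with the session-2 estimate.
import numpy as np

def replicability_error(X_sess1, X_sess2, alpha):
    """Smaller return value = more reproducible clustering."""
    R1 = copula_correlation(X_sess1)   # session-1 correlation estimate
    R2 = copula_correlation(X_sess2)   # session-2 correlation estimate
    G_hat = cord_cluster(R1, alpha)    # partition from session 1 only
    return np.linalg.norm(block_average(R1, G_hat) - R2, 'fro')
```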