We allow $|C_{11} | \lt | C_{12} |$ or $C \prec 0$
K-means/hierarchical clustering implicitly targets block-diagonal correlation matrices (up to permutation)
Clustering based on $G$-Block
Generalizes the $G$-latent model, which requires $C\succ 0$
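For intuition, a minimal sketch (my own illustration, not the paper's code) of a $G$-block covariance with $\Sigma_{ab}=C_{k(a)k(b)}$ for $a\neq b$: here $C$ is not positive semidefinite and $|C_{11}|\lt|C_{12}|$, yet $\Sigma$ itself remains a valid covariance because its diagonal is free.

```python
import numpy as np

# Hypothetical 2-group example: group 1 = {0,1,2}, group 2 = {3,4,5}
labels = np.array([0, 0, 0, 1, 1, 1])

# C is symmetric but NOT positive semidefinite, and |C_11| < |C_12|
C = np.array([[0.3, 0.5],
              [0.5, 0.4]])
print(np.linalg.eigvalsh(C))       # one eigenvalue is negative

# G-block covariance: Sigma_ab = C_{k(a) k(b)} for a != b, free diagonal
Sigma = C[np.ix_(labels, labels)]
np.fill_diagonal(Sigma, 1.0)       # diagonal chosen large enough
print(np.linalg.eigvalsh(Sigma))   # Sigma is still positive definite
```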
Minimum $G$ Partition
Theorem: $G^{\beta}(X)$ is the minimal partition induced by the relation $a\stackrel{G^{\beta}}{\sim} b$ iff $\var(X_{a})=\var(X_{b})$ and $\cov(X_{a},X_{c})=\cov(X_{b},X_{c})$ for all $c\neq a,b$. Moreover, if the matrix of covariances $C$ corresponding to this partition is positive semidefinite, then it is the unique minimal partition according to which $X$ admits a latent decomposition.
The minimal partition is unique under regularity conditions.
First, for a Gaussian copula, estimate correlations via Kendall's tau: $R_{ab} = \sin (\frac{\pi}{2}\tau_{ab})$
Second, the CORD metric: the maximum difference of correlations $$\d(a,b) := \max_{c\neq a,b}|R_{ac}-R_{bc}|$$ (see the sketch below)
Third, group variables $a$, $b$ together if $\d(a,b) = 0$
The enemy of my enemy is my friend!
Image credit: http://sutherland-careers.com/
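A minimal sketch of these two ingredients (illustrative only; function names are mine): the Kendall's-tau sine transform for the copula correlation matrix, and the CORD distance $\d(a,b)$.

```python
import numpy as np
from scipy.stats import kendalltau

def copula_corr(X):
    """Copula correlation estimate: R_ab = sin(pi/2 * tau_ab)."""
    p = X.shape[1]
    R = np.eye(p)
    for a in range(p):
        for b in range(a + 1, p):
            tau, _ = kendalltau(X[:, a], X[:, b])
            R[a, b] = R[b, a] = np.sin(np.pi / 2 * tau)
    return R

def cord_distance(R, a, b):
    """CORD distance d(a,b) = max_{c != a,b} |R_ac - R_bc|."""
    mask = np.ones(R.shape[0], dtype=bool)
    mask[[a, b]] = False
    return np.max(np.abs(R[a, mask] - R[b, mask]))

# tiny usage example on made-up data
X = np.random.randn(200, 5)
R = copula_corr(X)
print(cord_distance(R, 0, 1))
```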
Algorithm: Main Idea
Greedy: build one cluster at a time, avoiding an NP-hard combinatorial search
Cluster variables $a$ and $b$ together if the CORD metric satisfies $$ \max_{c\neq a,b}|\hat{R}_{ac}-\hat{R}_{bc}| \lt \alpha,$$ where $\alpha$ is a tuning parameter (sketch after this list)
$\alpha$ is chosen by theory or CV
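A sketch of the greedy loop as I read it from the slides (not the authors' code; tie-breaking and the exact pull-in rule follow the CORD paper and may differ in detail): repeatedly take the closest remaining pair under $\d$, and if that pair is within $\alpha$, pull in every remaining variable within $\alpha$ of both seeds.

```python
import numpy as np

def cord_cluster(R, alpha):
    """Greedy CORD-style clustering sketch: one cluster at a time."""
    p = R.shape[0]

    def d(a, b):
        mask = np.ones(p, dtype=bool)
        mask[[a, b]] = False
        return np.max(np.abs(R[a, mask] - R[b, mask]))

    remaining = list(range(p))
    clusters = []
    while remaining:
        if len(remaining) == 1:
            clusters.append(remaining)
            break
        # closest remaining pair under the CORD distance
        a_star, b_star = min(
            ((a, b) for i, a in enumerate(remaining) for b in remaining[i + 1:]),
            key=lambda ab: d(*ab),
        )
        if d(a_star, b_star) > alpha:
            new_cluster = [a_star]  # singleton cluster
        else:
            new_cluster = [c for c in remaining
                           if c in (a_star, b_star)
                           or max(d(a_star, c), d(b_star, c)) <= alpha]
        clusters.append(new_cluster)
        remaining = [c for c in remaining if c not in new_cluster]
    return clusters
```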
Theory
Condition
Let $\eta \geq 0$ be given. Let $X$ be a zero-mean random vector with a Gaussian copula distribution with parameter $R$. $$ \begin{multline} \mathcal{R}(\eta) := \{R: \ \d(a,b) = \max_{c\neq a,b}|R_{ac}-R_{bc}|>\eta\quad \\ \textrm{for all}\ a\stackrel{G(X)}{\nsim}b\} \end{multline} $$ Group separation condition: $R \in \mathcal{R}(\eta)$.
The signal strength $\eta$ needs to be sufficiently large.
Consistency
Theorem: Let $\tau=|\widehat R-R|_{\infty}$ and consider parameters $(\alpha,\eta)$ satisfying $$\begin{equation} \alpha\geq 2\tau\quad\textrm{and}\quad \eta\geq2\tau+\alpha. \end{equation}$$ Then our algorithm returns $\widehat G=G(X)$ with high probability.
Our algorithm recovers the exact clustering with high probability.
Minimax
Theorem: Let $P_{\Sigma}$ denote the distribution of $n$ independent observations of $X \stackrel{d}{=} \mathcal{N}(0,\Sigma)$. For any \begin{equation}\label{etamin} 0\leq \eta \lt \eta^{*}:={\sqrt{\log(p(p-1)/2)}\over \sqrt{(2+e^{-1})n} +\sqrt{\log(p(p-1)/2)}}\,, \end{equation} we have $$\inf_{\widehat G}\sup_{R \in \mathcal{R}(\eta)} P_{\Sigma}(\widehat G\neq G^{\beta}(X))\geq {1\over 2e+1}\geq {1\over 7} \,,$$ where the infimum is taken over all possible estimators $\widehat G$.
Our method is minimax optimal.
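For a sense of scale (my numeric illustration; $n=100$, $p=50$ are made up), $\eta^*$ behaves like $\sqrt{\log p / n}$:

```python
import numpy as np

def eta_star(n, p):
    """Minimax separation threshold from the theorem above."""
    s = np.sqrt(np.log(p * (p - 1) / 2))
    return s / (np.sqrt((2 + np.exp(-1)) * n) + s)

print(eta_star(n=100, p=50))   # about 0.15: weaker separations cannot be recovered
```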
Choosing Number of Clusters
Split data into 3 parts
Use part 1 of data to estimate clusters $\hat{G}$ for each $\alpha$
Use part 2 to compute between-variable differences $$ \delta^{(2)}_{ab} = \big(\hat R_{ac}^{(2)} - \hat R_{bc}^{(2)}\big)_{c \ne a, b}. $$
Use part 3 to compute the analogous $\delta^{(3)}_{ab}$ and the "CV" loss $$ \mbox{CV}(\hat{G}) = \sum_{a \lt b} \| \delta^{(3)}_{ab} - \delta^{(2)}_{ab} 1\{ a \mbox{ not clustered w/ } b \} \|^2_\infty. $$
Pick the $\alpha$ with the smallest loss above
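A sketch of the loss computation (illustrative; $\delta_{ab}$ is the vector of differences over $c \neq a,b$ and the sup-norm is taken over $c$):

```python
import numpy as np

def cv_loss(R2, R3, clusters):
    """CV(G-hat) = sum_{a<b} || delta3_ab - delta2_ab * 1{a,b not clustered together} ||_inf^2."""
    p = R2.shape[0]
    label = np.empty(p, dtype=int)
    for k, grp in enumerate(clusters):
        label[grp] = k
    loss = 0.0
    for a in range(p):
        for b in range(a + 1, p):
            mask = np.ones(p, dtype=bool)
            mask[[a, b]] = False
            delta2 = R2[a, mask] - R2[b, mask]
            delta3 = R3[a, mask] - R3[b, mask]
            indicator = 0.0 if label[a] == label[b] else 1.0
            loss += np.max(np.abs(delta3 - indicator * delta2)) ** 2
    return loss

# pick the alpha whose partition (estimated on part 1) minimizes cv_loss(R2, R3, clusters)
```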
Theory for CV
Theorem: If either (i) $X$ is sub-Gaussian with correlation matrix $R$, or (ii) $X$ has a copula distribution with copula correlation matrix $R$, then $E[\mbox{CV}(G^*)] \lt E[\mbox{CV}(G)]$ for any $G\ne G^*$, where $G^*$ denotes the true partition.
Theoretical support for selecting $G^*$ from data.
Simulations
Setup
Generate from various $C$: block, sparse, negative
Compare:
Exact recovery of groups (theoretical tuning parameter)
Cross validation (data-driven tuning parameter)
CORD metric vs (semi)parametric correlation (regardless of tuning)
Exact Recovery
Different models for $C$ (the "$\cov(Z)$" block matrix), with $\alpha = 2 n^{-1/2} \log^{1/2} p$
Vertical lines: theoretical sample size based on our lower bound
HC and K-means fail even when given the true $K$.
Cross Validation
Recovery % in red and CV loss in black.
CV selects constants that yield close to 100% recovery, as predicted by our theory (at least for $n>200$)
Real Data
fMRI Studies
Data layout: each subject × session pair is a time series of ~200 time points (Time 1, 2, …, ~200), over roughly 100 subjects with ~4 sessions each (Sub 1 Sess 1, …, Sub $i$ Sess $j$, …, Sub ~100 Sess ~4).
This talk: one subject, two sessions (to test replicability)
Functional MRI
fMRI data matrix: BOLD signals from different brain regions
Variables: brain regions
Samples: the time series (after whitening or removing temporal correlations)
Clusters of brain regions
Two data matrices from two scan sessions (data from OpenfMRI.org)
Use Power's 264 regions/nodes
Test Prediction/Reproducibility
Find partitions using the first session data
Average the correlations within each block to improve estimation
Compare with the correlation matrix from the second scan: $$ \| \mathrm{Avg}_{\hat{G}}(\hat{\Sigma}_1) - \hat{\Sigma}_2 \|,$$ where $\mathrm{Avg}_{\hat{G}}$ denotes block-averaging with respect to $\hat{G}$ (see the sketch below).
The difference is smaller if the clustering $\hat{G}$ is better.
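A sketch of this comparison (my own illustration; whether the within-block diagonal is averaged, and which matrix norm is used, are choices I am assuming; `Sigma1_hat`/`Sigma2_hat` are placeholders for the session-1 and session-2 correlation estimates):

```python
import numpy as np

def block_average(S, clusters):
    """Replace each block of S (as defined by the partition) by its average value."""
    A = S.astype(float).copy()
    for gi in clusters:
        for gj in clusters:
            if gi is gj:
                # within-block: average the off-diagonal entries, keep the diagonal
                block = S[np.ix_(gi, gj)].astype(float)
                off = ~np.eye(len(gi), dtype=bool)
                if off.any():
                    sub = np.full(block.shape, block[off].mean())
                    np.fill_diagonal(sub, np.diag(block))
                    A[np.ix_(gi, gj)] = sub
            else:
                A[np.ix_(gi, gj)] = S[np.ix_(gi, gj)].mean()
    return A

# reproducibility score (smaller is better), e.g. with the Frobenius norm:
# score = np.linalg.norm(block_average(Sigma1_hat, clusters) - Sigma2_hat)
```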
Vertical lines: fixed (solid) and data-driven (dashed) thresholds
Massive edgewise regression: fit a regression for each pair $(j,k) \in \{1,\dotsc, p\}^2$
Essentially $O(p^2)$ regressions, one per connection
Limitations: multiple testing across $O(p^2)$ tests, and failure to account for dependence between regressions
Model and Method
Model
Find principal direction (PD) $\gamma \in \real^p$, such that:
$$ \log({\gamma}^\top\Sigma_{i}{\gamma})=\beta_{0}+x_{i}^\top{\beta}_{1}, \quad i =1,\dotsc, n$$
Example ($p=2$): PD1 has the largest variation but is not related to $x$
PCA selects PD1; ours selects PD2
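A toy illustration of this point (my simulation; all sizes and values are made up): the top eigenvector of the pooled covariance is PD1, even though only the variance along PD2 depends on the covariate.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 50, 200                         # units and time points per unit (made up)
x = rng.binomial(1, 0.5, size=n)       # a single binary covariate

pd1 = np.array([1.0, 0.0])             # large, covariate-unrelated variation
pd2 = np.array([0.0, 1.0])             # smaller, covariate-related variation
beta0, beta1 = 0.0, 1.0

covs = []
for i in range(n):
    var2 = np.exp(beta0 + beta1 * x[i])                 # log-linear model along pd2
    Sigma_i = 10.0 * np.outer(pd1, pd1) + var2 * np.outer(pd2, pd2)
    Y = rng.multivariate_normal(np.zeros(2), Sigma_i, size=T)
    covs.append(np.cov(Y.T))

pooled = np.mean(covs, axis=0)
print(np.linalg.eigh(pooled)[1][:, -1])   # close to +/-(1, 0): plain PCA picks PD1
```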
Advantages
Scalability: potentially for $p \sim 10^6$ or larger
Interpretation: covariate assisted PCA
Turn unsupervised PCA into supervised
Sensitivity: target those covariate-related variations
Covariate assisted SVD?
Applicability: other big data problems besides fMRI
Proposition: Under (C1), i.e., $H=\boldsymbol{\mathrm{I}}$ in the optimization problem, for any fixed $\boldsymbol{\beta}$, the solution for $\boldsymbol{\gamma}$ is the eigenvector corresponding to the minimum eigenvalue of the matrix
$$ \sum_{i=1}^{n}\frac{\Sigma_{i}}{\exp(x_{i}^\top\boldsymbol{\beta})} $$
We will focus on constraint (C2)
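A direct sketch of the proposition under (C1) (names are mine; `covs` is the list of per-unit covariance estimates $\Sigma_i$ and `X` the covariate matrix):

```python
import numpy as np

def gamma_given_beta(covs, X, beta):
    """For fixed beta, gamma is the eigenvector of the smallest eigenvalue of
    sum_i Sigma_i / exp(x_i' beta)  (Proposition, constraint (C1))."""
    M = sum(S / np.exp(x @ beta) for S, x in zip(covs, X))
    vals, vecs = np.linalg.eigh(M)     # eigenvalues in ascending order
    return vecs[:, 0]                  # bottom eigenvector
```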
Algorithm
Iteratively update $\beta$ and then $\gamma$ (see the sketch after this list)
We prove explicit update formulas for each step
Extension to multiple $\gamma$:
After finding $\gamma^{(1)}$, we will update $\Sigma_i$ by removing its effect
Search for the next PD $\gamma^{(k)}$, $k=2, \dotsc$
Impose orthogonality constraints so that $\gamma^{(k)}$ is orthogonal to all $\gamma^{(t)}$ for $t\lt k$
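A rough end-to-end sketch of the iteration (my simplification: the $\beta$ step below is an ordinary least-squares fit of $\log(\gamma^\top\Sigma_i\gamma)$ on $x_i$, standing in for the explicit likelihood updates derived in the paper; `X` is assumed to contain a column of ones for $\beta_0$):

```python
import numpy as np

def cap_one_direction(covs, X, n_iter=50):
    """Alternate a gamma step (bottom eigenvector) and a simplified beta step."""
    n, q = X.shape
    beta = np.zeros(q)
    for _ in range(n_iter):
        # gamma step (Proposition under (C1)): bottom eigenvector of the weighted sum
        M = sum(S / np.exp(x @ beta) for S, x in zip(covs, X))
        gamma = np.linalg.eigh(M)[1][:, 0]
        # beta step (simplified stand-in): regress log(gamma' Sigma_i gamma) on x_i
        y = np.log([gamma @ S @ gamma for S in covs])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    # For later directions gamma^(k), one option (my assumption, not necessarily the
    # paper's update) is to deflate each Sigma_i by projecting out gamma and rerun
    # the iteration under the orthogonality constraints.
    return gamma, beta
```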
Theory for $\beta$
Theorem:
Assume $\sum_{i=1}^{n}x_{i}x_{i}^\top/n\rightarrow Q$ as $n\rightarrow\infty$. Let $T=\min_{i}T_{i}$ and $M_{n}=\sum_{i=1}^{n}T_{i}$, where $T_{i}$ is the number of observations for unit $i$. Then, under the true $\boldsymbol{\gamma}$, we have
\begin{equation}
\sqrt{M_{n}}\left(\hat{\boldsymbol{\beta}}-\boldsymbol{\beta}\right)\overset{\mathcal{D}}{\longrightarrow}\mathcal{N}\left(\boldsymbol{\mathrm{0}},2 Q^{-1}\right),\quad \text{as } n,T\rightarrow\infty,
\end{equation}
where $\hat{\boldsymbol{\beta}}$ is the maximum likelihood estimator when the true $\boldsymbol{\gamma}$ is known.
Theory for $\gamma$
Theorem:
Assume $\Sigma_{i}=\Gamma\Lambda_{i}\Gamma^\top$, where $\Gamma=(\boldsymbol{\gamma}_{1},\dots,\boldsymbol{\gamma}_{p})$ is an orthogonal matrix and $\Lambda_{i}=\mathrm{diag}\{\lambda_{i1},\dots,\lambda_{ip}\}$ with $\lambda_{ik}\neq\lambda_{il}$ ($k\neq l$) for at least one $i\in\{1,\dots,n\}$. Suppose there exists $k\in\{1,\dots,p\}$ such that $\boldsymbol{\gamma}_{k}^\top\Sigma_{i}\boldsymbol{\gamma}_{k}=\exp(x_{i}^\top\boldsymbol{\beta})$ for all $i\in\{1,\dots,n\}$. Let $\hat{\boldsymbol{\gamma}}$ be the maximum likelihood estimator of $\boldsymbol{\gamma}_{k}$ as in Flury (1984). Then, under these assumptions, $\hat{\boldsymbol{\beta}}$ from our algorithm is a $\sqrt{M_{n}}$-consistent estimator of $\boldsymbol{\beta}$.
Simulations
PCA and common PCA do not find the covariate-related principal direction, because they do not model the covariates
Resting-state fMRI
Regression coefficients for Age, Sex, and Age*Sex
No statistically significant changes were found by massive edgewise regression