Variable Clustering via $G$-Models of Large Covariance Matrices
Xi (Rossi) Luo
Brown University Department of Biostatistics
Center for Statistical Sciences
Computation in Brain and Mind
Brown Institute for Brain Science
ABCD Research Group
Real-world cov matrices: may be non-sparse and have other structures
Clustering successful for Big Data Science (Donoho, 2015)
Exploratory Data Analysis (EDA) (Tukey, 1977)
Hierarchical clustering and Kmeans (Hartigan & Wong, 1979)
Mostly based on marginal/pairwise distances
Can we combine clustering and big cov estimation?
Example: S&P 100 Data
Daily returns from stocks in the S&P 100
Stocks listed in the Standard & Poor's 100 Index as of March 21, 2014
From January 1, 2006 to December 31, 2008
Each stock is a variable
Cov/Cor matrices (Pearson's or Kendall's tau)
Re-order stocks by clusters
Compare cov patterns with different clustering/ordering
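For concreteness, a minimal sketch (not the paper's code) of how the Pearson and Kendall's-tau correlation inputs above could be computed from the returns matrix; the $\sin(\pi\tau/2)$ transform is the standard Gaussian-copula conversion, assumed here rather than stated on the slides:

```python
# Sketch: Pearson and Kendall-tau based correlation matrices from a returns
# matrix X of shape (n_days, p_stocks).  Function name and details are mine.
import numpy as np
from scipy.stats import kendalltau

def correlation_matrices(X):
    """Return Pearson's R and a Kendall's-tau-based R for the columns of X."""
    p = X.shape[1]
    R_pearson = np.corrcoef(X, rowvar=False)

    tau = np.eye(p)
    for a in range(p):
        for b in range(a + 1, p):
            t, _ = kendalltau(X[:, a], X[:, b])
            tau[a, b] = tau[b, a] = t
    # Standard copula transform R_ab = sin(pi/2 * tau_ab) for a Gaussian copula.
    R_kendall = np.sin(np.pi / 2 * tau)
    return R_pearson, R_kendall
```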
Cor after Grouping by Clusters
Ours yields stronger off-diagonal tile patterns (black = 1).
Color bars: variable groups/clusters
Off-diagonal: correlations across clusters
Clustering Results
Industry: Home Improvement
  Ours: Home Depot, Lowe’s
  Kmeans: Home Depot, Lowe’s, Starbucks, Target
  Hierarchical Clustering: Home Depot, Lowe’s, Starbucks, Target, Costco, Wal-Mart, FedEx, United Parcel Service, Nike, McDonald’s
Industry: Telecom
  Ours: ATT, Verizon
  Kmeans: ATT, Verizon, Exelon, Comcast, Walt Disney, Time Warner
  Hierarchical Clustering: ATT, Verizon, Comcast, Walt Disney, Time Warner, AIG, Allstate, Metlife, American Express, Bank of America, Citigroup, US Bancorp, Wells Fargo, Capital One, Goldman Sachs, JP Morgan Chase, Morgan Stanley, Simon Property, General Electric
We allow $|C_{11}| \lt |C_{12}|$ or $C \prec 0$
Kmeans/HC lead to block-diagonal cor matrices (up to permutation)
Clustering based on $G$-Block
From the $G$-block structure we can read off $\cov(Z)$, which may be "negative" (not positive semidefinite)
Cov defined for semiparametric distributions
Clusters can contain singletons
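A minimal sketch of how such a $G$-block covariance can be assembled, assuming the latent decomposition $X_a = Z_{k(a)} + E_a$ with $C = \cov(Z)$ and independent noise $E$; the function name, partition, and constants below are illustrative only:

```python
# Sketch of a G-block covariance: Sigma[a, b] = C[k(a), k(b)] for a != b,
# Sigma[a, a] = C[k(a), k(a)] + noise variance of E_a.
import numpy as np

def g_block_cov(groups, C, noise_var):
    """Build the p x p covariance implied by cluster labels, C = cov(Z), and noise."""
    k = np.asarray(groups)                      # k[a] = cluster label of variable a
    Sigma = C[np.ix_(k, k)].astype(float)       # expand C to the variable level
    Sigma[np.diag_indices_from(Sigma)] += np.asarray(noise_var, dtype=float)
    return Sigma

# Example with |C_11| < |C_12|: within-cluster covariance (0.5) is smaller than
# the between-cluster covariance (0.6), a case Kmeans/HC are not designed for.
C = np.array([[0.5, 0.6],
              [0.6, 1.0]])
Sigma = g_block_cov(groups=[0, 0, 0, 1, 1], C=C, noise_var=[0.3] * 5)
```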
Minimum $G$ Partition
Theorem: Define $a\stackrel{G^{\beta}}{\sim} b$ iff $\var(X_{a})=\var(X_{b})$ and $\cov(X_{a},X_{c})=\cov(X_{b},X_{c})$ for all $c\neq a,b$. Then $G^{\beta}(X)$, the partition induced by this equivalence relation, is minimal. Moreover, if the matrix of covariances $C$ corresponding to the partition $G^{\beta}(X)$ is positive semidefinite, then this is the unique minimal partition according to which ${X}$ admits a latent decomposition.
We define the minimal cluster/partition.
The minimal partition is unique under conditions.
We will aim to recover the minimal partition (thus $K$).
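A toy illustration of the criterion (numbers are illustrative, not from the slides): with $p=3$ and $$\Sigma = \begin{pmatrix} 1 & 0.3 & 0.5 \\ 0.3 & 1 & 0.5 \\ 0.5 & 0.5 & 1 \end{pmatrix},$$ variables $1$ and $2$ satisfy $\var(X_{1})=\var(X_{2})$ and $\cov(X_{1},X_{3})=\cov(X_{2},X_{3})=0.5$, so $1\stackrel{G^{\beta}}{\sim}2$ even though $\cov(X_{1},X_{2})=0.3$ is modest, while variable $3$ remains a singleton.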
Third, group variables $a$, $b$ together if $\d(a,b) = 0$
Do not care about any pairwise distance between $a,b$
"The enemy of my enemy is my friend"
Algorithm: Main Idea
Greedy: one cluster at a time, avoiding an NP-hard combinatorial search
Cluster variables together if CORD metric $$\widehat \d(a,b) \lt \alpha$$ where $\alpha$ is a tuning parameter
$\alpha$ is chosen by theory or CV
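A minimal sketch of this greedy step (the seed choice and tie handling below are simplifications of my own, not the paper's exact algorithm):

```python
# Greedy CORD-style clustering sketch: group a and b when the CORD metric
# d(a,b) = max_{c != a,b} |R[a,c] - R[b,c]| falls below the threshold alpha.
import numpy as np

def cord_distance(R, a, b):
    """CORD metric d(a,b) = max over c != a,b of |R[a,c] - R[b,c]|."""
    mask = np.ones(R.shape[0], dtype=bool)
    mask[[a, b]] = False
    if not mask.any():                     # p == 2: no third variable to compare
        return 0.0
    return np.max(np.abs(R[a, mask] - R[b, mask]))

def cord_cluster(R, alpha):
    """One cluster at a time: absorb every variable within alpha of the seed."""
    p = R.shape[0]
    unassigned = list(range(p))
    clusters = []
    while unassigned:
        seed = unassigned[0]
        group = [b for b in unassigned
                 if b == seed or cord_distance(R, seed, b) < alpha]
        clusters.append(group)
        unassigned = [v for v in unassigned if v not in group]
    return clusters
```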
Theory
Condition
Let $\eta \geq 0$ be given. Let ${ X}$ be a zero-mean random vector with a Gaussian copula distribution with parameter $R$. $$ \begin{multline} \mathcal{R}(\eta) := \{R: \ \d(a,b) := \max_{c\neq a,b}|R_{ac}-R_{bc}|>\eta\quad \\ \textrm{for all}\ a\stackrel{G(X)}{\nsim}b\}. \end{multline} $$ Group separation condition: $R \in \mathcal{R}(\eta)$.
The signal strength $\eta$ is large.
Consistency
Theorem: Define $\tau=|\widehat R-R|_{\infty}$ and consider two parameters $(\alpha,\eta)$ satisfying $$\begin{equation} \alpha\geq 2\tau\quad\textrm{and}\quad \eta\geq2\tau+\alpha. \end{equation}$$ Then, applying our algorithm, we have $\widehat G=G(X)$ whp.
Ours recovers the exact clustering with high probability.
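For intuition (assuming the usual estimation rate $\tau \asymp \sqrt{\log(p)/n}$ for $|\widehat R-R|_{\infty}$): taking the smallest admissible $\alpha = 2\tau$ requires $$\eta \ \geq\ 2\tau + \alpha \ =\ 4\tau \ \asymp\ \sqrt{\frac{\log(p)}{n}},$$ the same order as the minimax threshold $\eta^{*}$ below.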
Minimax
Theorem: Let $P_{\Sigma}$ be the probability distribution of $n$ independent observations of ${ X} \stackrel{d}{=} \mathcal{N}(0,\Sigma)$. For any $$\begin{equation} 0\leq \eta < \eta^{*}:=\frac{0.6\sqrt{\frac{ \log(p)}{n}}}{1+0.6\sqrt{\frac{ \log(p)}{n}}} \end{equation}$$ we have $$\inf_{\widehat G}\sup_{R \in \mathcal{R}(\eta)} P_{\Sigma}(\widehat G\neq G^{\beta}(X))\geq {1\over 2e+1}\geq {1\over 7} \,,$$ where the infimum is taken over all possible estimators.
Group separation condition on $\eta$ is optimal.
Choosing Number of Clusters
Split data into 3 parts
Use part 1 of data to estimate clusters $\hat{G}$ for each $\alpha$
Use part 2 to compute between-variable differences $$ \delta^{(2)}_{ab} = \left( R_{ac}^{(2)} - R_{bc}^{(2)} \right)_{c \ne a, b}. $$
Use part 3 to compute the "CV" loss $$ \mbox{CV}(\hat{G}) = \sum_{a \lt b} \| \delta^{(3)}_{ab} - \delta^{(2)}_{ab} 1\{ a \mbox{ not clustered with } b \} \|^2_\infty. $$
Pick $\alpha$ with the smallest loss above
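A minimal sketch of this selection rule; the equal three-way split, the candidate grid, and the use of sample correlations are illustrative choices, and `cluster_fn(R, alpha)` stands in for the clustering step (e.g. the greedy sketch above):

```python
# Sketch of alpha selection by the 3-part "CV" loss described above.
import numpy as np

def delta(R, a, b):
    """Vector of differences (R[a, c] - R[b, c]) over all c != a, b."""
    mask = np.ones(R.shape[0], dtype=bool)
    mask[[a, b]] = False
    return R[a, mask] - R[b, mask]

def cv_loss(R2, R3, clusters):
    """CV(G) = sum_{a<b} || delta3_ab - delta2_ab * 1{a not clustered with b} ||_inf^2."""
    p = R2.shape[0]
    label = np.empty(p, dtype=int)
    for k, group in enumerate(clusters):
        label[group] = k
    loss = 0.0
    for a in range(p):
        for b in range(a + 1, p):
            d2 = delta(R2, a, b) if label[a] != label[b] else 0.0
            loss += np.max(np.abs(delta(R3, a, b) - d2)) ** 2
    return loss

def choose_alpha(X, alphas, cluster_fn):
    """Split rows of X into 3 parts; cluster on part 1, score on parts 2 and 3."""
    X1, X2, X3 = np.array_split(X, 3)
    R1, R2, R3 = (np.corrcoef(Xi, rowvar=False) for Xi in (X1, X2, X3))
    losses = [cv_loss(R2, R3, cluster_fn(R1, a)) for a in alphas]
    return alphas[int(np.argmin(losses))]
```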
Theory for CV
Theorem: If either: (i) $X$ is sub-Gaussian with correlation matrix $R$; or (ii) $X$ has a copula distribution with copula correlation matrix $R$, then we have $E[\mbox{CV}(G^*)] \lt E[\mbox{CV}(G)]$, for any $G\ne G^*$.
This shows that our CV will select $G^*$ consistently.
Simulations
Setup
Model $C = \cov(Z)$: positive semidefinite or "negative" (not positive semidefinite)
True $G^*$: singletons or no-singleton clusters
Simulate $X$ from $G$-block cov
Variable clustering using $X$
Compare with K-means or Hierarchical Clustering:
Exact recovery of groups
Cross validation loss and choosing $K$
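An illustrative end-to-end simulation in the spirit of this setup, reusing the `g_block_cov` and `cord_cluster` sketches above; all sizes, the matrix `C`, `alpha`, and the random seed are arbitrary choices, not the paper's settings:

```python
# Illustrative simulation: draw X from a G-block covariance, run the greedy
# CORD-style sketch, and check exact recovery of the true partition.
import numpy as np

def exact_recovery(estimated, truth):
    """True iff the two partitions coincide as sets of variable sets."""
    return {frozenset(g) for g in estimated} == {frozenset(g) for g in truth}

rng = np.random.default_rng(0)
truth = [list(range(0, 5)), list(range(5, 10)), list(range(10, 15))]   # true G*
groups = [0] * 5 + [1] * 5 + [2] * 5
C = np.array([[1.0, 0.4, 0.2],
              [0.4, 0.8, 0.3],
              [0.2, 0.3, 0.9]])                  # cov(Z)
Sigma = g_block_cov(groups, C, noise_var=[0.5] * 15)

X = rng.multivariate_normal(np.zeros(15), Sigma, size=1000)
R_hat = np.corrcoef(X, rowvar=False)
G_hat = cord_cluster(R_hat, alpha=0.2)
print(exact_recovery(G_hat, truth))              # typically True at this sample size
```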
Exact Recovery
Different models for $C = \cov(Z)$ and $G$
HC and Kmeans fail even when given the true $K$ and as $n \rightarrow \infty$
Our CORD methods recover both the true $G^*$ and $K$ as predicted by our theory.
Cross Validation
Recovery % in red and CV loss in black.
CV selects tuning constants that yield close to 100% recovery, as predicted by our theory (at least for $n > 200$)
Real Data
Functional MRI
fMRI data matrix: BOLD signals from different brain regions
Variable: different brain regions
Sample: time series (after whitening or removing temporal correlations)
Clusters of brain regions
Two data matrices from two scan sessions (OpenfMRI.org)
Use Power's 264 regions/nodes
Test Prediction/Reproducibility
Find partitions using the first session data
Average the cor within each block to improve estimation
Compare with the cor matrix from the second scan: $$ \| \mbox{Avg}_{\hat{G}}(\hat{\Sigma}_1) - \hat{\Sigma}_2 \| $$
Difference is smaller if clustering $\hat{G}$ is better
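A minimal sketch of this comparison; how the diagonal is treated and which norm $\|\cdot\|$ is used are not specified above, so the choices below (keep the original diagonal, Frobenius norm) are assumptions:

```python
# Sketch: block-average the session-1 matrix under a partition, then measure
# its distance to the session-2 matrix.
import numpy as np

def block_average(Sigma, clusters):
    """Replace every within/between-cluster block of Sigma by its mean value."""
    A = np.empty_like(Sigma, dtype=float)
    for gi in clusters:
        for gj in clusters:
            A[np.ix_(gi, gj)] = Sigma[np.ix_(gi, gj)].mean()
    np.fill_diagonal(A, np.diag(Sigma))          # keep the original diagonal
    return A

def between_session_loss(Sigma1, Sigma2, clusters):
    """|| Avg_G(Sigma1) - Sigma2 ||, with the Frobenius norm as an example."""
    return np.linalg.norm(block_average(Sigma1, clusters) - Sigma2)
```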
Vertical lines: fixed (solid) and data-driven (dashed) thresholds
Our CORD $\hat{G}$ leads to smaller between-session variability than HC and Kmeans for almost all $K$.