Network Communities and Variable Clustering via CORD
Xi (Rossi) Luo
Brown University Department of Biostatistics
Center for Statistical Sciences
Computation in Brain and Mind
Brown Institute for Brain Science
The ABCD Research Group
Statistical Learning and Data Science, June 7, 2016
Funding: NSF/DMS (BD2K) 1557467; NIH P20GM103645, P01AA019072, P30AI042853; AHA
Collaborators
Florentina Bunea
Cornell University
Christophe Giraud
Paris Sud University
Data Science Problem
We are interested in
large covariance matrices with many variables
a global property of certain joint distributions
Real-world covariance matrices may be non-sparse and have other structures
Clustering: successful for > 40 years and for data science (Donoho, 2015)
Exploratory Data Analysis (EDA) (Tukey, 1977)
Hierarchical clustering and K-means (Hartigan & Wong, 1979)
Usually based on
marginal/pairwise distances
Can clustering and large covariance estimation be combined?
Example: S&P 100 Data
Daily returns of stocks in the S&P 100
Stocks listed in the Standard & Poor's 100 Index as of March 21, 2014
Returns between January 1, 2006 and December 31, 2008
Each stock is a variable
Cov/Cor matrices (Pearson's or Kendall's tau)
Re-order stocks by clusters
Compare cov patterns with different clustering/ordering
Cor after Grouping by Clusters
Ours yields stronger off-diagonal tile patterns (black = 1).
Color bars: variable groups/clusters
Off-diagonal: correlations across clusters
Clustering Results
| Industry | Ours | Kmeans | Hierarchical Clustering |
| --- | --- | --- | --- |
| Telecom | ATT, Verizon | ATT, Verizon, Pfizer, Merck, Lilly, Bristol-Myers | ATT, Verizon |
| Railroads | Norfolk Southern, Union Pacific | Norfolk Southern, Union Pacific | Norfolk Southern, Union Pacific, Du Pont, Dow, Monsanto |
| Home Improvement | Home Depot, Lowe’s | Home Depot, Lowe’s, Starbucks | Home Depot, Lowe’s, Starbucks, Costco, Target, Wal-Mart, FedEx, United Parcel Service |
| $\cdots$ | $\cdots$ | $\cdots$ | $\cdots$ |
All methods yield 30 clusters.
Model
Problem
Let ${X} \in \mathbb{R}^p$ be a zero-mean random vector
Write $a\stackrel{G(X)}{\sim} b$ under the partition $G(X)$ if $\var(X_{a})=\var(X_{b})$ and $\cov(X_{a},X_{c})=\cov(X_{b},X_{c})$ for all $c\neq a,b$
Note that $a\stackrel{G^{\prime}(X)}{\sim} b$ if $G\leq G^{\prime}$
Multiple partitions can yield a $G$-block covariance
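For intuition, here is a small illustrative example (mine, not from the slides): take $p=4$ variables partitioned into $G=\{\{1,2\},\{3,4\}\}$. A $G$-block covariance then has the form
$$\Sigma=\begin{pmatrix}\sigma_{1} & c_{1} & d & d\\ c_{1} & \sigma_{1} & d & d\\ d & d & \sigma_{2} & c_{2}\\ d & d & c_{2} & \sigma_{2}\end{pmatrix},$$
so $\var(X_{1})=\var(X_{2})$ and $\cov(X_{1},X_{c})=\cov(X_{2},X_{c})$ for $c=3,4$, and similarly for the pair $\{3,4\}$; both groups satisfy the defining relation above.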
Minimum $G$ Partition
Theorem: $G^{\beta}(X)$ is the minimal partition induced by the relation $a\stackrel{G^{\beta}}{\sim} b$ iff $\var(X_{a})=\var(X_{b})$ and $\cov(X_{a},X_{c})=\cov(X_{b},X_{c})$ for all $c\neq a,b$. Moreover, if the matrix of covariances $C$ corresponding to the partition $G(X)$ is positive semidefinite, then this is the unique minimal partition according to which ${X}$ admits a latent decomposition.
The minimal partition is unique under regularity conditions.
First, estimate correlations under a Gaussian copula via transformed Kendall's tau: $R_{ab} = \sin (\frac{\pi}{2}\tau_{ab})$
Second, compute the CORD metric, the maximum difference of correlations (a code sketch follows this list): $$\d(a,b) := \max_{c\neq a,b}|R_{ac}-R_{bc}|$$
Third, group variables $a$, $b$ together if $\d(a,b) = 0$
"The enemy of my enemy is my friend"
Algorithm: Main Idea
Greedy: one cluster at a time, avoiding an NP-hard combinatorial search
Cluster variables together if the CORD metric satisfies $$\widehat \d(a,b) \lt \alpha$$ where $\alpha$ is a tuning parameter (see the sketch below)
$\alpha$ is chosen by theory or by cross-validation (CV)
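A sketch of one greedy pass, following my reading of this slide (take the closest remaining pair in CORD distance, grow a group around it, remove it, repeat); the paper's exact algorithm may differ in details:

```python
import numpy as np

def cord_cluster(d, alpha):
    """Greedy grouping on a p-by-p CORD distance matrix d with threshold alpha.
    Illustrative sketch only, not the authors' reference implementation."""
    remaining = list(range(d.shape[0]))
    groups = []
    while remaining:
        if len(remaining) == 1:
            groups.append(remaining)
            break
        sub = d[np.ix_(remaining, remaining)].astype(float)
        np.fill_diagonal(sub, np.inf)
        i, j = np.unravel_index(np.argmin(sub), sub.shape)
        a, b = remaining[i], remaining[j]
        if d[a, b] > alpha:
            # nothing left is close enough: remaining variables become singletons
            groups.extend([[v] for v in remaining])
            break
        # new group: a, b, and every c within alpha of both
        group = [c for c in remaining
                 if c in (a, b) or max(d[a, c], d[b, c]) <= alpha]
        groups.append(group)
        remaining = [v for v in remaining if v not in group]
    return groups
```

For example, `cord_cluster(d, alpha=2 * np.sqrt(np.log(p) / n))` uses the theoretical rate $\alpha = 2 n^{-1/2} \log^{1/2} p$ quoted in the simulations.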
Theory
Condition
Let $\eta \geq 0$ be given. Let ${X}$ be a zero-mean random vector with a Gaussian copula distribution with parameter $R$. $$\mathcal{R}(\eta) := \{R: \ \d(a,b) := \max_{c\neq a,b}|R_{ac}-R_{bc}|>\eta \quad \textrm{for all}\ a\stackrel{G(X)}{\nsim}b\}.$$ Group separation condition: $R \in \mathcal{R}(\eta)$.
In words: the between-cluster signal strength $\eta$ must be sufficiently large.
Consistency
Theorem: Define $\tau=|\widehat R-R|_{\infty}$ and consider parameters $(\alpha,\eta)$ satisfying $$\alpha\geq 2\tau\quad\textrm{and}\quad \eta\geq2\tau+\alpha.$$ Then our algorithm yields $\widehat G=G(X)$ with high probability.
Ours recovers the exact clustering with high probability.
Minimax
Theorem: Let $P_{\Sigma}$ denote the distribution of $n$ independent observations of ${X} \stackrel{d}{=} \mathcal{N}(0,\Sigma)$. For any $$0\leq \eta \lt \eta^{*}:={\sqrt{\log(p(p-1)/2)}\over \sqrt{(2+e^{-1})n} +\sqrt{\log(p(p-1)/2)}}\,,$$ we have $$\inf_{\widehat G}\sup_{R \in \mathcal{R}(\eta)} P_{\Sigma}(\widehat G\neq G^{\beta}(X))\geq {1\over 2e+1}\geq {1\over 7}\,,$$ where the infimum is taken over all possible estimators.
Group separation condition on $\eta$ is necessary.
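As an illustrative calculation (numbers mine, not from the slides): with $p = 100$ and $n = 500$,
$$\eta^{*}={\sqrt{\log(100\cdot 99/2)}\over \sqrt{(2+e^{-1})\cdot 500}+\sqrt{\log(100\cdot 99/2)}}\approx{2.92\over 34.4+2.92}\approx 0.08,$$
so no estimator can uniformly recover the partition once the separation $\eta$ falls below roughly $0.08$ on the correlation scale.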
Simulations
Setup
Generate data from various models for $C$: block, sparse, negative (see the sketch after this list)
Compare:
Exact recovery of groups (theoretical tuning parameter)
Cross validation (data-driven tuning parameter)
CORD metric vs. (semi)parametric correlation (regardless of tuning)
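A minimal sketch of generating data from one such setting under the latent decomposition $X_{a}=Z_{k(a)}+E_{a}$ with $C=\cov(Z)$ (the parameter values here are my own, not the slides'):

```python
import numpy as np

def simulate_block_model(n, labels, C, sigma2=1.0, seed=0):
    """Draw n samples of X, where X_a = Z_{labels[a]} + E_a, Z ~ N(0, C),
    and E has independent N(0, sigma2) entries."""
    rng = np.random.default_rng(seed)
    Z = rng.multivariate_normal(np.zeros(C.shape[0]), C, size=n)  # n x K latent signals
    E = rng.normal(scale=np.sqrt(sigma2), size=(n, len(labels)))
    return Z[:, labels] + E

# Example "block" C: 3 groups of 2 variables each
labels = np.array([0, 0, 1, 1, 2, 2])
C = 0.3 * np.ones((3, 3)) + 0.4 * np.eye(3)
X = simulate_block_model(n=200, labels=labels, C=C)
```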
Exact Recovery
Different models for $C = \cov(Z)$ and $\alpha = 2 n^{-1/2} \log^{1/2} p$
Vertical lines: theoretical sample size based on our lower bound
HC and K-means fail even when given the true $K$.
Data-driven choice of $\alpha$
Given a partition $G=\{G_{1},\dots,G_{K}\}$ and a correlation matrix $R$, we introduce a block averaging operator: for $a\in G_{k}$ and $b\in G_{k^{\prime}}$, $\left[\varUpsilon\left(R,G\right)\right]_{ab} =$
$$\begin{cases} \left|G_{k}\right|^{-1}\left(\left|G_{k}\right|-1\right)^{-1}\sum_{i,j\in G_{k},i\ne j}R_{ij} & \mbox{if } a\ne b \mbox{ and } k=k^{\prime}\\ \left|G_{k}\right|^{-1}\left|G_{k^{\prime}}\right|^{-1}\sum_{i\in G_{k},j\in G_{k^{\prime}}}R_{ij} & \mbox{if } a\ne b \mbox{ and } k\ne k^{\prime}\\ 1 & \mbox{if }a=b. \end{cases}$$
We choose $\alpha$ by cross-validation, minimizing over a grid of $\alpha$: $$\min_\alpha \| \varUpsilon\left(\hat R,G_\alpha \right) - \hat{R}_{test}\|_F^2$$ where $\hat R$ and $G_\alpha$ come from the training split and $\hat{R}_{test}$ from the held-out split
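A sketch of $\varUpsilon$ and the CV loss for one random split, building on the `cord_distance_matrix` and `cord_cluster` sketches above (all names are my own):

```python
import numpy as np

def block_average(R, labels):
    """The operator Upsilon(R, G): average R within each diagonal block
    (excluding the diagonal) and within each off-diagonal block; diagonal = 1."""
    A = np.empty_like(R)
    for k in np.unique(labels):
        for l in np.unique(labels):
            rows, cols = np.where(labels == k)[0], np.where(labels == l)[0]
            block = R[np.ix_(rows, cols)]
            if k == l:
                off = ~np.eye(len(rows), dtype=bool)
                A[np.ix_(rows, cols)] = block[off].mean() if off.any() else 1.0
            else:
                A[np.ix_(rows, cols)] = block.mean()
    np.fill_diagonal(A, 1.0)
    return A

def cv_loss(X, alpha, seed=0):
    """Split rows in half, cluster on one half, and score the block-averaged
    correlations against the correlations of the other half."""
    idx = np.random.default_rng(seed).permutation(X.shape[0])
    train, test = idx[: len(idx) // 2], idx[len(idx) // 2:]
    R_train, d_train = cord_distance_matrix(X[train])
    R_test, _ = cord_distance_matrix(X[test])
    labels = np.empty(X.shape[1], dtype=int)
    for g, members in enumerate(cord_cluster(d_train, alpha)):
        labels[members] = g
    return np.sum((block_average(R_train, labels) - R_test) ** 2)

# Example grid search:
# alphas = np.linspace(0.05, 0.5, 10)
# best_alpha = min(alphas, key=lambda a: cv_loss(X, a))
```

The squared Frobenius loss here mirrors the displayed objective; in practice one would average the loss over several splits.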
Cross Validation
Recovery % in red and CV loss in black.
CV selects the constants to yield close to 100% recovery.
Metric Comparison: Without Threshold
HC and Kmeans metrics yield more false discoveries (FD) as the threshold (or $K$) varies.
CORD is the better metric for clustering, regardless of the threshold.
Real Data
fMRI data: blood-oxygen-level dependent (BOLD) signals from each cube/voxel (a few millimeters on each side), with $10^5$ to $10^6$ voxels in total.
fMRI Studies
Data layout: subjects Sub 1, ..., Sub ~100; sessions Sess 1, ..., Sess ~4 per subject; each session is a time series of ~200 scans.
This talk: one subject, two sessions (to test replicability)
Functional MRI
fMRI matrix: BOLD from different brain regions
Variable: different brain regions
Sample: time series (after whitening or removing temporal correlations)
Clusters of brain regions
Two data matrices from two scan sessions (OpenfMRI.org)
Use Power's 264 regions/nodes
Test Prediction/Reproducibility
Find partitions using the first session data
Average the correlations within each block to improve estimation
Compare with the correlation matrix from the second scan (see the sketch below): $$ \| Avg_{\hat{G}}(\hat{\Sigma}_1) - \hat{\Sigma}_2 \|$$
Difference is smaller if clustering $\hat{G}$ is better
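A sketch of this check, reusing the helpers sketched earlier (`cord_distance_matrix`, `cord_cluster`, `block_average`); `X1`, `X2`, and `alpha` are hypothetical placeholders for the two session matrices and the chosen threshold:

```python
import numpy as np

# X1, X2: time-by-region matrices from sessions 1 and 2 (hypothetical names)
R1, d1 = cord_distance_matrix(X1)
R2, _ = cord_distance_matrix(X2)
labels = np.empty(R1.shape[0], dtype=int)
for g, members in enumerate(cord_cluster(d1, alpha)):  # partition from session 1 only
    labels[members] = g
# smaller loss = the session-1 partition better predicts the session-2 correlations
prediction_loss = np.linalg.norm(block_average(R1, labels) - R2, ord="fro")
```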
Vertical lines: fixed (solid) and data-driven (dashed) thresholds
Prediction comparison: ours gives a smaller prediction loss for almost all $K$