Brown University Department of Biostatistics
Center for Statistical Sciences
Computation in Brain and Mind
Brown Institute for Brain Science
The ABCD Research Group
The 25th ICSA Symposium, Atlanta, Georgia, June 13, 2016
Funding: NSF/DMS (BD2K) 1557467; NIH P20GM103645, P01AA019072, P30AI042853; AHA
Collaborators
Florentina Bunea
Cornell University
Christophe Giraud
Paris Sud University
Task fMRI
Subjects perform tasks while undergoing fMRI scanning
Specific parts of the brain are responsible for the task
After removing task effects, the data resemble "resting-state" data
Goal: which parts of the brain function together?
fMRI data: blood-oxygen-level dependent (BOLD) signals from each cube/voxel (~millimeters); $10^5$ to $10^6$ voxels in total
Data Matrix
Matrix $X_{n \times p}$, all columns standardized
$n$ time points, with temporal correlation removed, so rows are approximately i.i.d.
$p$ voxels, but with spatial correlation
Interested in
big spatial networks
Voxel level: a $10^6 \times 10^6$ cov matrix, but with limited interpretability
General Pipeline
This talk: how to do clustering with justification?
Big Picture
We are interested in
big cov with many variables
Global properties of certain joint distributions
Real-world cov: possibly non-sparse, with other structures
Clustering: successful for > 40 years, and for Data Science (Donoho, 2015)
Exploratory Data Analysis (EDA) (Tukey, 1977)
Hierarchical clustering and K-means (Hartigan & Wong, 1979)
Usually based on
marginal/pairwise distances
Can clustering and big cov estimation be combined?
Example
Example: S&P 100 Data
Daily returns of stocks in the S&P 100
Stocks listed in the Standard & Poor's 100 Index as of March 21, 2014
between January 1, 2006 and December 31, 2008
Each stock is a variable
Cov/Cor matrices (Pearson's or Kendall's tau)
Re-order stocks by clusters
Compare cov patterns under different clusterings/orderings (a computational sketch follows below)
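As a concrete illustration of this step, here is a minimal Python sketch (ours, not from the talk) that builds a Kendall's tau correlation matrix and reorders it by cluster labels so the tile patterns become visible; `kendall_cor_matrix` and `reorder_by_clusters` are hypothetical helper names.

```python
import numpy as np
from scipy.stats import kendalltau

def kendall_cor_matrix(X):
    """Pairwise Kendall's tau matrix for the columns of X (n x p)."""
    p = X.shape[1]
    tau = np.eye(p)
    for a in range(p):
        for b in range(a + 1, p):
            t, _ = kendalltau(X[:, a], X[:, b])
            tau[a, b] = tau[b, a] = t
    return tau

def reorder_by_clusters(R, labels):
    """Permute rows/columns so variables in the same cluster are adjacent."""
    order = np.argsort(labels, kind="stable")
    return R[np.ix_(order, order)], order
```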
Cor after Grouping by Clusters
Our method yields stronger off-diagonal tile patterns. Black = 1.
Color bars: variable groups/clusters
Off-diagonal: correlations across clusters
Clustering Results
Industry | Ours | Kmeans | Hierarchical Clustering
Telecom | ATT, Verizon | ATT, Verizon, Pfizer, Merck, Lilly, Bristol-Myers | ATT, Verizon
Railroads | Norfolk Southern, Union Pacific | Norfolk Southern, Union Pacific | Norfolk Southern, Union Pacific, Du Pont, Dow, Monsanto
Home Improvement | Home Depot, Lowe's | Home Depot, Lowe's, Starbucks | Home Depot, Lowe's, Starbucks, Costco, Target, Wal-Mart, FedEx, United Parcel Service
$\cdots$ | $\cdots$ | $\cdots$ | $\cdots$
All methods yield 30 clusters.
Model
Problem
Let ${X} \in \real^p$ be a zero mean random vector
Define the partition $G(X)$ by: $a\stackrel{G(X)}{\sim} b$ iff $\var(X_{a})=\var(X_{b})$ and $\cov(X_{a},X_{c})=\cov(X_{b},X_{c})$ for all $c\neq a,b$
Note that $a\stackrel{G(X)}{\sim} b$ implies $a\stackrel{G^{\prime}(X)}{\sim} b$ whenever $G\leq G^{\prime}$
There exist multiple partitions that yield a $G$-block cov (a toy example follows below)
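For concreteness, a toy example (ours, not from the slides): with $p=4$ and $G=\{\{1,2\},\{3,4\}\}$, a $G$-block covariance has equal variances within each group and equal covariances with every outside variable, $$\Sigma=\begin{pmatrix} \sigma^{2} & c_{11} & c_{12} & c_{12}\\ c_{11} & \sigma^{2} & c_{12} & c_{12}\\ c_{12} & c_{12} & \gamma^{2} & c_{22}\\ c_{12} & c_{12} & c_{22} & \gamma^{2} \end{pmatrix},$$ while the partition into singletons trivially satisfies the definition as well, hence the need for a minimal partition.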
Minimum $G$ Partition
Theorem: $G^{\beta}(X)$, the partition induced by $a\stackrel{G^{\beta}}{\sim} b$
iff $\var(X_{a})=\var(X_{b})$ and $\cov(X_{a},X_{c})=\cov(X_{b},X_{c})$ for all $c\neq a,b$, is minimal. Moreover, if the matrix of covariances $C$ corresponding to the partition $G(X)$ is positive semidefinite, then this is the unique minimal partition according to which ${X}$ admits a latent decomposition.
The minimal partition is unique under regularity conditions.
First, estimate correlations under a Gaussian copula via transformed Kendall's tau: $R_{ab} = \sin (\frac{\pi}{2}\tau_{ab})$
Second, maximum difference of correlation distances $$\d(a,b) := \max_{c\neq a,b}|R_{ac}-R_{bc}|$$
Third, group variables $a$ and $b$ together if $\d(a,b) = 0$ (a computational sketch of these steps follows below)
The enemy of my enemy is my friend!
Image credit: http://sutherland-careers.com/
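A minimal Python sketch of the first two steps (our own illustration; `tau_to_correlation` and `cord_distance_matrix` are hypothetical names), assuming a Kendall's tau matrix as input:

```python
import numpy as np

def tau_to_correlation(tau):
    """Gaussian-copula correlation from Kendall's tau: R_ab = sin(pi/2 * tau_ab)."""
    return np.sin(np.pi / 2 * tau)

def cord_distance_matrix(R):
    """CORD distances d(a,b) = max over c != a,b of |R[a,c] - R[b,c]|."""
    p = R.shape[0]
    D = np.zeros((p, p))
    for a in range(p):
        for b in range(a + 1, p):
            mask = np.ones(p, dtype=bool)
            mask[[a, b]] = False
            # Guard for the degenerate p = 2 case (no third variable c).
            d = np.max(np.abs(R[a, mask] - R[b, mask])) if mask.any() else 0.0
            D[a, b] = D[b, a] = d
    return D
```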
Algorithm: Main Idea
Greedy: one cluster at a time, avoiding an NP-hard search (sketch below)
Cluster variables together if CORD metric $$ \max_{c\neq a,b}|\hat{R}_{ac}-\hat{R}_{bc}| \lt \alpha$$ where $\alpha$ is a tuning parameter
$\alpha$ is chosen by theory or CV
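A sketch of the greedy loop, assuming the CORD distance matrix `D` from the previous snippet; the pair selection and tie-breaking details are our reading of the idea, not verbatim pseudocode from the talk:

```python
import numpy as np

def cord_cluster(D, alpha):
    """Greedy clustering: peel off one cluster at a time.

    D: p x p matrix of estimated CORD distances.
    alpha: threshold (tuning parameter).
    """
    p = D.shape[0]
    remaining = list(range(p))
    clusters = []
    while remaining:
        if len(remaining) == 1:
            clusters.append([remaining.pop()])
            break
        # Closest remaining pair under the CORD distance.
        sub = D[np.ix_(remaining, remaining)]
        np.fill_diagonal(sub, np.inf)
        i, j = np.unravel_index(np.argmin(sub), sub.shape)
        a, b = remaining[i], remaining[j]
        if D[a, b] > alpha:
            # No pair is close: all remaining variables become singletons.
            clusters.extend([[v] for v in remaining])
            break
        # New cluster: a, b, and every c close to both.
        group = [v for v in remaining
                 if v in (a, b) or max(D[a, v], D[b, v]) <= alpha]
        clusters.append(group)
        remaining = [v for v in remaining if v not in group]
    return clusters
```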
Theory
Condition
Let $\eta \geq 0$ be given. Let ${X}$ be a zero-mean random vector with a Gaussian copula distribution with parameter $R$. $$ \begin{multline} \mathcal{R}(\eta) := \{R: \ \d(a,b) := \max_{c\neq a,b}|R_{ac}-R_{bc}|>\eta\quad \\ \textrm{for all}\ a\stackrel{G(X)}{\nsim}b\} \end{multline} $$ Group separation condition: $R \in \mathcal{R}(\eta)$.
The signal strength $\eta$ needs to be large.
Consistency
Theorem: Define $\tau=|\widehat R-R|_{\infty}$ and consider tuning parameters $(\alpha,\eta)$ satisfying $$\alpha\geq 2\tau\quad\textrm{and}\quad \eta\geq2\tau+\alpha.$$ Then our algorithm recovers $\widehat G=G(X)$ with high probability (whp).
Ours recovers exact clustering with high probability.
Minimax
Theorem: Let $P_{\Sigma}$ denote the distribution of $n$ independent observations of ${ X} \stackrel{d}{=} \mathcal{N}(0,\Sigma)$. For any $$0\leq \eta \lt \eta^{*}:={\sqrt{\log(p(p-1)/2)}\over \sqrt{(2+e^{-1})n} +\sqrt{\log(p(p-1)/2)}}\,,$$ we have $$\inf_{\widehat G}\sup_{R \in \mathcal{R}(\eta)} P_{\Sigma}(\widehat G\neq G^{\beta}(X))\geq {1\over 2e+1}\geq {1\over 7} \,,$$ where the infimum is taken over all possible estimators $\widehat G$.
Our method is minimax optimal.
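To get a feel for the scale of this bound, a back-of-envelope plug-in (our own numbers, borrowing $p = 264$ regions and $n = 200$ time points from the fMRI setting later in the talk): $$\log\big(p(p-1)/2\big) \approx 10.5, \qquad \eta^{*} \approx \frac{\sqrt{10.5}}{\sqrt{2.37\times 200}+\sqrt{10.5}} \approx \frac{3.2}{21.8+3.2} \approx 0.13,$$ so at this sample size no method can recover clusters whose CORD separation is below roughly $0.13$.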
Simulations
Setup
Generate from various $C$: block, sparse, negative entries (a generation sketch follows after this list)
Compare:
Exact recovery of groups (theoretical tuning parameter)
Cross validation (data-driven tuning parameter)
CORD metric vs. (semi)parametric correlations (regardless of tuning)
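A possible way to generate such data (our sketch, using the latent decomposition $X_a = Z_{k(a)} + E_a$ implied by the model; `simulate_block_data` is a hypothetical name):

```python
import numpy as np

def simulate_block_data(n, group_sizes, C, noise_sd=1.0, rng=None):
    """Draw X with the latent decomposition X_a = Z_{k(a)} + E_a,
    so that cov(X) has a block structure with C = cov(Z)."""
    rng = np.random.default_rng() if rng is None else rng
    K = len(group_sizes)
    Z = rng.multivariate_normal(np.zeros(K), C, size=n)   # n x K latent factors
    labels = np.repeat(np.arange(K), group_sizes)          # group of each variable
    E = noise_sd * rng.standard_normal((n, labels.size))   # idiosyncratic noise
    return Z[:, labels] + E, labels
```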
Exact Recovery
Different models for $C = \cov(Z)$ and $\alpha = 2 n^{-1/2} \log^{1/2} p$
Vertical lines: theoretical sample size based on our lower bound
HC and Kmeans fail even when given the true $K$.
Data-driven choice of $\alpha$: Averaging
Introduce the block-averaging operator: for $a\in G_{k}$ and $b\in G_{k^{\prime}}$, $$\left[\varUpsilon\left(R,G\right)\right]_{ab} = \begin{cases} \left|G_{k}\right|^{-1}\left(\left|G_{k}\right|-1\right)^{-1}\sum_{i,j\in G_{k},i\ne j}R_{ij} & \mbox{if } a\ne b \mbox{ and } k=k^{\prime}\\ \left|G_{k}\right|^{-1}\left|G_{k^{\prime}}\right|^{-1}\sum_{i\in G_{k},j\in G_{k^{\prime}}}R_{ij} & \mbox{if } a\ne b \mbox{ and } k\ne k^{\prime}\\ 1 & \mbox{if }a=b. \end{cases}$$
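A direct Python transcription of this operator (our sketch; `block_average` is a hypothetical name):

```python
import numpy as np

def block_average(R, labels):
    """Block-averaging operator Upsilon(R, G): replace each within- and
    between-group block of R by its average; keep 1 on the diagonal."""
    out = np.empty_like(R)
    groups = np.unique(labels)
    for k in groups:
        rows = np.where(labels == k)[0]
        for kp in groups:
            cols = np.where(labels == kp)[0]
            block = R[np.ix_(rows, cols)]
            if k == kp:
                nk = len(rows)
                # Average the off-diagonal entries within the group.
                avg = ((block.sum() - np.trace(block)) / (nk * (nk - 1))
                       if nk > 1 else 0.0)
                out[np.ix_(rows, cols)] = avg
            else:
                out[np.ix_(rows, cols)] = block.mean()
    np.fill_diagonal(out, 1.0)
    return out
```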
Data-driven choice of $\alpha$: Prediction
Choose $\alpha$ via cross validation using minimization over a grid of $\alpha$: $$\min_\alpha \| \varUpsilon\left(\hat R,G_\alpha \right) - \hat{R}_{test}\|_F^2 $$
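Putting the pieces together, a cross-validation sketch (ours; it reuses the `block_average` helper above and takes user-supplied `cor_fn` / `cord_fn` for the correlation estimate and the clustering, with `cord_fn(R, alpha)` returning integer labels):

```python
import numpy as np

def choose_alpha_cv(X, alpha_grid, cord_fn, cor_fn, split=0.5, rng=None):
    """Pick alpha minimizing || Upsilon(R_train, G_alpha) - R_test ||_F^2."""
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    idx = rng.permutation(n)                 # random train/test split of rows
    n_tr = int(split * n)
    R_train = cor_fn(X[idx[:n_tr]])
    R_test = cor_fn(X[idx[n_tr:]])
    losses = []
    for alpha in alpha_grid:
        labels = cord_fn(R_train, alpha)     # cluster on the training half
        loss = np.sum((block_average(R_train, labels) - R_test) ** 2)
        losses.append(loss)
    return alpha_grid[int(np.argmin(losses))], losses
```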
Cross Validation
Recovery % in red and CV loss in black.
CV selects the tuning to yield close to 100% recovery
Metric Comparison: Without Threshold
HC and Kmeans metrics yield more false discoveries (FD) as the threshold (or $K$) varies.
Ours is a better metric for clustering, regardless of threshold
Real Data
fMRI Studies
Data layout: Sub 1, …, Sub ~100; Sess 1, …, Sess ~4 per subject; each (subject, session) series has ~200 time points.
This talk: one subject, two sessions (to test replicability)
Functional MRI
fMRI matrix: BOLD from different brain regions
Variables: different brain regions
Sample: time series (after whitening or removing temporal correlations)
Clusters of brain regions
Two data matrices from two scan sessions (OpenfMRI.org)
Use Power's 264 regions/nodes
Test Prediction/Reproducibility
Find partitions using the first session data
Average the correlations within each block to improve estimation
Compare with the cor matrix from the second scan $$ \| \varUpsilon\big(\hat{\Sigma}_1, \hat{G}\big) - \hat{\Sigma}_2 \|$$
The difference is smaller when the clustering $\hat{G}$ is better
Vertical lines: fixed (solid) and data-driven (dashed) thresholds
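As a final sketch, the reproducibility comparison itself (ours, reusing the hypothetical `block_average` helper from earlier):

```python
import numpy as np

def reproducibility_gap(R1, R2, labels):
    """Frobenius distance between the block-averaged session-1 correlations
    and the raw session-2 correlations; smaller means better clustering."""
    return np.linalg.norm(block_average(R1, labels) - R2, ord="fro")
```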