
Maximal Sparsity with Deep Networks: A Technical Analysis
"Explore maximal sparsity with deep networks in this insightful study by Bo Xin, Yizhou Wang, Wen Gao, and David Wipf from Peking University and Microsoft Research, Beijing. Learn about iterative algorithms, practical designs, applications, and the challenges faced in achieving maximal sparsity. Discover how deep learning offers a new perspective on this complex topic."
Presentation Transcript
Maximal Sparsity with Deep Networks?
Bo Xin (1,2), Yizhou Wang (1), Wen Gao (1), and David Wipf (2)
1 Peking University   2 Microsoft Research, Beijing
Outline
- Background and motivation
- Unfolding iterative algorithms
- Theoretical analysis
- Practical designs and empirical support
- Applications
- Discussion
Maximal Sparsity
The problem: min_x ||x||_0 subject to y = Φx, where ||x||_0 counts the number of non-zero entries of x.
Maximal Sparsity is NP-hard
The combinatorial problem min_x ||x||_0 s.t. y = Φx is NP-hard, and close approximations are highly non-convex. Practical alternatives: ℓ1-norm minimization, orthogonal matching pursuit (OMP), and iterative hard thresholding (IHT).
Pros and Cons
Numerous practical applications: feature selection [Cotter and Rao, 2001; Figueiredo, 2002], outlier removal [Candès and Tao, 2005; Ikehata et al., 2012], compressive sensing [Donoho, 2006], source localization [Baillet et al., 2001; Malioutov et al., 2005], and computer vision applications [Wright et al., 2009].
Fundamental weakness: if the Gram matrix Φ^T Φ has high off-diagonal energy, the estimate of x can be extremely poor.
Restricted Isometry Property (RIP)
A matrix Φ satisfies the RIP with constant δ_k[Φ] < 1 if
(1 − δ_k[Φ]) ||x||_2^2 ≤ ||Φx||_2^2 ≤ (1 + δ_k[Φ]) ||x||_2^2
holds for all {x : ||x||_0 ≤ k}. [Candès et al., 2006]
(Figure: geometric illustration contrasting a small RIP constant with a large RIP constant.)
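Computing δ_k exactly is intractable (it maximizes over all size-k supports), but a Monte Carlo lower bound is easy to sketch. The snippet below is illustrative only: the function name and parameters are mine, and it assumes Φ is scaled so that ||Φx||^2 ≈ ||x||^2 is the right comparison (e.g., iid N(0, 1/m) entries).

```python
import numpy as np

def rip_lower_bound(Phi, k, n_trials=2000, seed=0):
    """Monte Carlo lower bound on delta_k[Phi]: sample random size-k supports S
    and track how far the eigenvalues of Phi_S^T Phi_S stray from 1.
    The true constant maximizes over all supports, which is combinatorial."""
    rng = np.random.default_rng(seed)
    n = Phi.shape[1]
    delta = 0.0
    for _ in range(n_trials):
        S = rng.choice(n, size=k, replace=False)
        eigs = np.linalg.eigvalsh(Phi[:, S].T @ Phi[:, S])
        delta = max(delta, eigs.max() - 1.0, 1.0 - eigs.min())
    return delta

# Example: a 50 x 200 Gaussian dictionary with variance 1/50 typically shows a
# modest delta_5, consistent with the "small RIP constant" regime.
Phi = np.random.default_rng(1).normal(0.0, np.sqrt(1 / 50), size=(50, 200))
print(rip_lower_bound(Phi, k=5))
```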
Guaranteed Recovery with IHT
Suppose there exists some x* such that y = Φx*, ||x*||_0 ≤ k, and δ_3k[Φ] < 1/√32. Then the IHT iterations are guaranteed to converge to x*. [Blumensath and Davies, 2009]
In other words, only very small degrees of dictionary correlation can be tolerated.
Checkpoint
Thus far: maximal sparsity is NP-hard, and the practical alternatives suffer when the dictionary has a high RIP constant.
What's coming: a deep-learning-based perspective and its technical analysis.
Iterative Algorithms
Both IHT and ISTA repeat the same two steps. For t = 0, 1, 2, ...:
Linear operation: z^(t) = x^(t) + Φ^T (y − Φ x^(t)).
Nonlinear operation (IHT, hard thresholding): x_i^(t+1) = z_i^(t) if |z_i^(t)| is among the k largest magnitudes, and 0 otherwise.
Nonlinear operation (ISTA, soft thresholding): x_i^(t+1) = sign(z_i^(t)) (|z_i^(t)| − λ) if |z_i^(t)| > λ, and 0 otherwise.
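A minimal NumPy sketch of the two updates above, with the IHT step size fixed to one as in the unfolding that follows; all function and parameter names are mine.

```python
import numpy as np

def hard_threshold(z, k):
    """H_k: keep the k largest-magnitude entries of z, zero the rest."""
    out = np.zeros_like(z)
    keep = np.argsort(np.abs(z))[-k:]
    out[keep] = z[keep]
    return out

def soft_threshold(z, lam):
    """S_lam: shrink each entry toward zero by lam."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def iht(Phi, y, k, n_iter=200):
    """Iterative hard thresholding: x <- H_k(x + Phi^T (y - Phi x))."""
    x = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        x = hard_threshold(x + Phi.T @ (y - Phi @ x), k)
    return x

def ista(Phi, y, lam, n_iter=200):
    """Iterative soft thresholding for min 0.5||y - Phi x||^2 + lam ||x||_1,
    with step size 1/||Phi||_2^2 to guarantee descent."""
    step = 1.0 / np.linalg.norm(Phi, 2) ** 2
    x = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x + step * Phi.T @ (y - Phi @ x), step * lam)
    return x
```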
Deep Network = Unfolded Optimization?
Basic DNN template: each layer applies a linear filter and then a nonlinearity/threshold, x^(t+1) = f(W^(t) x^(t) + b^(t)).
Observation: many common iterative algorithms follow the exact same script.
Deep Network = Unfolded Optimization?
This observation underlies fast sparse encoders [Gregor and LeCun, 2010; Wang et al., 2015]. What's more?
Unfolded IHT Iterations
Unfolding IHT yields a network whose layers use the linear filter W = I − Φ^T Φ and bias b = Φ^T y, followed by the hard-thresholding nonlinearity as the activation.
Question 1: Is there an advantage to learning the weights?
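To make "learning the weights" concrete, here is a rough PyTorch sketch of IHT unrolled into a fixed number of layers, with the filter Ψ (initialized to Φ^T) made trainable. The class and variable names are my own; this is a sketch, not the authors' released implementation.

```python
import torch
import torch.nn as nn

def hard_threshold(z, k):
    # Keep the k largest-magnitude entries of each row, zero the rest.
    keep = torch.topk(z.abs(), k, dim=-1).indices
    mask = torch.zeros_like(z).scatter_(-1, keep, 1.0)
    return z * mask

class UnfoldedIHT(nn.Module):
    """IHT unrolled for n_layers steps, x <- H_k(x + Psi (y - Phi x)),
    with Psi shared across layers and trainable (initialized to Phi^T)."""
    def __init__(self, Phi, k, n_layers=10):
        super().__init__()
        self.register_buffer("Phi", Phi)          # m x n dictionary, fixed
        self.Psi = nn.Parameter(Phi.t().clone())  # n x m learnable filter
        self.k, self.n_layers = k, n_layers

    def forward(self, y):                         # y: batch x m
        x = y.new_zeros(y.shape[0], self.Phi.shape[1])
        for _ in range(self.n_layers):
            x = hard_threshold(x + (y - x @ self.Phi.t()) @ self.Psi.t(), self.k)
        return x
```

At initialization this reproduces plain IHT exactly; training on (y, x*) pairs is then free to move Ψ away from Φ^T, which is the kind of freedom Theorem 1 below exploits.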
Effects of Correlation Structure
Low correlation (easy Φ): Φ_uncor with iid N(0, ν) entries has a small RIP constant, δ_3k[Φ_uncor] < 1/√32.
High correlation (hard Φ): Φ_cor = Φ_uncor + Δ with Δ low rank has a large RIP constant, δ_3k[Φ_cor] ≥ 1/√32.
Performance Bound with Learned Layer Weights
Theorem 1: There always exist layer weights W and a bias b such that the effective RIP constant is reduced via
δ̂_3k[Φ] = inf_{Ψ, D} δ_3k[D Ψ Φ] ≤ δ_3k[Φ]
(effective RIP constant versus original RIP constant), where Ψ is arbitrary and D is diagonal. [Xin et al., 2016]
It is therefore possible to reduce high RIP constants.
Practical Consequences
Theorem 2: Suppose we have a correlated dictionary (large RIP constant) formed via Φ_cor = Φ_uncor + Δ, with Φ_uncor having iid N(0, ν) entries (small RIP constant) and Δ of sufficiently low rank. Then
E[ δ̂_3k[Φ_cor] ] ≤ E[ δ_3k[Φ_uncor] ].  [Xin et al., 2016]
So learned weights can undo low-rank correlations that would otherwise produce a high RIP constant.
Advantages of Independent Layer Weights (and Activations)
Question 2: Do independent weights (and activations) have the potential to do even better? Yes.
Theorem 3: With independent weights on each layer, it is often possible to obtain a nearly ideal RIP constant even when full-rank correlation structure is present. [Xin et al., 2016]
Advantages of Independent Layer Weights (and Activations)
Write the correlated dictionary as Φ = Φ_uncor + Σ_{i=1}^{r} u_i v_i^T, and use layer-specific updates
x^(t+1) = H_{k(t)}[ W_t x^(t) + b_t ]  for all t,
so that successive layers can adapt to different components of the correlation.
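Continuing the earlier PyTorch sketch under the same caveats, the per-layer update above would correspond to giving every layer its own trainable matrices and bias, each initialized so that the layer starts as a plain IHT step; hard_threshold is reused from that sketch, and the layer-dependent sparsity level k(t) is fixed to a single k here for brevity.

```python
import torch
import torch.nn as nn

class UnfoldedIHTPerLayer(nn.Module):
    """x <- H_k(W_t x + S_t y + b_t) with independent W_t, S_t, b_t per layer."""
    def __init__(self, Phi, k, n_layers=10):
        super().__init__()
        m, n = Phi.shape
        self.k = k
        # Initialize each layer at the plain IHT step W = I - Phi^T Phi, S = Phi^T, b = 0.
        self.W = nn.ParameterList([nn.Parameter(torch.eye(n) - Phi.t() @ Phi)
                                   for _ in range(n_layers)])
        self.S = nn.ParameterList([nn.Parameter(Phi.t().clone())
                                   for _ in range(n_layers)])
        self.b = nn.ParameterList([nn.Parameter(torch.zeros(n))
                                   for _ in range(n_layers)])

    def forward(self, y):                         # y: batch x m
        x = y.new_zeros(y.shape[0], self.W[0].shape[0])
        for W, S, b in zip(self.W, self.S, self.b):
            x = hard_threshold(x @ W.t() + y @ S.t() + b, self.k)
        return x
```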
Checkpoint
Thus far: idealized deep network weights exist that improve RIP constants.
What's coming: practical designs that facilitate success, empirical results, and applications.
Alternative Learning-Based Strategy
Treat sparse recovery as a multi-label DNN classification problem that estimates the support of x*. The main challenge is estimating supp[x*]; once the support is obtained, computing the actual coefficient values is trivial. An ℓ2-based loss would be unaware of this and expend undue effort matching coefficient magnitudes. Specifically, we learn to find supp[x*] using a multi-label softmax loss layer.
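A rough sketch of this two-stage idea. The slide names a multi-label softmax loss layer; the snippet below uses the closely related per-coefficient binary cross-entropy purely for illustration, and all function and variable names are mine.

```python
import torch
import torch.nn.functional as F

def support_loss(logits, support):
    """Multi-label loss on supp[x*]: each dictionary column is an independent
    on/off label, so coefficient magnitudes never enter the objective."""
    return F.binary_cross_entropy_with_logits(logits, support.float())

def refit_coefficients(Phi, y, support_est):
    """Once a support estimate is fixed, the coefficient values are recovered
    by least squares on the selected columns."""
    cols = support_est.nonzero(as_tuple=True)[0]
    x = torch.zeros(Phi.shape[1])
    sol = torch.linalg.lstsq(Phi[:, cols], y.unsqueeze(-1)).solution
    x[cols] = sol.squeeze(-1)
    return x
```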
Alternative Learning-Based Strategy
Architecture: adopt highway and gating mechanisms. Relatively deep nets are needed for challenging problems, and such designs help with information flow; our analysis suggests they are natural for challenging multi-scale sparse estimation problems.
Training-set philosophy: take the generative perspective, sampling x and setting y = Φx, rather than labeling observations with the output of some existing estimator. Training is unsupervised in the sense that the x are randomly generated, so ground-truth targets come for free.
Experiments
We generate Φ = Σ_i i^{-2} u_i v_i^T, where u_i and v_i are iid N(0,1): super-linearly decaying singular values, structured but still quite general.
Values of the nonzero entries of x: U-distribution, drawn from U(−0.5, 0.5) excluding [−0.1, 0.1]; N-distribution, drawn from N(+0.3, 0.1) or N(−0.3, 0.1).
Experiments: basic (train and test on U, i.e., U2U) and cross-distribution (U2N, N2U).
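A NumPy sketch of this data-generation recipe. Reading 0.1 as the standard deviation of the N-distribution and using the full rank min(m, n) in the sum are my assumptions; the transcript does not pin these down.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_dictionary(m, n):
    """Phi = sum_i i^{-2} u_i v_i^T with u_i, v_i iid N(0,1):
    super-linearly decaying spectrum, hence strongly correlated columns."""
    r = min(m, n)
    U = rng.standard_normal((m, r))
    V = rng.standard_normal((n, r))
    return (U / np.arange(1, r + 1) ** 2) @ V.T

def sample_sparse(n, k, dist="U"):
    """k-sparse x*: nonzeros from U(-0.5, 0.5) excluding [-0.1, 0.1] ('U'),
    or from N(+/-0.3, 0.1) with 0.1 read as the standard deviation ('N')."""
    x = np.zeros(n)
    idx = rng.choice(n, size=k, replace=False)
    signs = rng.choice([-1.0, 1.0], size=k)
    if dist == "U":
        x[idx] = signs * rng.uniform(0.1, 0.5, size=k)
    else:
        x[idx] = signs * rng.normal(0.3, 0.1, size=k)
    return x

# Generative training pairs: sample x*, then y = Phi x* (no solver in the loop).
Phi = make_dictionary(20, 100)
x_star = sample_sparse(100, k=3, dist="U")
y = Phi @ x_star
```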
Results
Strong estimation accuracy and robustness to the training distribution.
Question 3: Does deep learning really learn the ideal weights from the analysis? Hard to say, but it does achieve strong empirical performance for maximal sparsity.
Robust Surface Normal Estimation
Per-pixel model: y = L n + x, where y collects the raw observations under different lightings, L is the lighting matrix, n is the unknown surface normal, and x captures specular reflections, shadows, etc. (outliers).
Any sparse learning method can then be applied to estimate the outliers. [Ikehata et al., 2012]
Convert to a Sparse Estimation Problem
Project onto the null space of L^T: since Proj_Null(L^T) L = 0, the model gives ỹ = Proj_Null(L^T) y = Proj_Null(L^T) x. Then solve
min_x ||x||_0  s.t.  ỹ = Proj_Null(L^T) x,
where ||x||_0 is the number of nonzero elements. Once the outliers x are known, the normal can be estimated via n = (L^T L)^{-1} L^T (y − x). [Candès and Tao, 2004]
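A small NumPy sketch of this reduction; the function names are mine, and any sparse solver (the IHT above, or a learned network) can fill the middle step.

```python
import numpy as np

def null_projector(L):
    """Orthogonal projector onto Null(L^T): P = I - L (L^T L)^{-1} L^T,
    so that P L = 0 and P y = P x under the model y = L n + x."""
    return np.eye(L.shape[0]) - L @ np.linalg.solve(L.T @ L, L.T)

def normal_from_outliers(L, y, x):
    """Given estimated outliers x, recover the normal by least squares:
    n = (L^T L)^{-1} L^T (y - x)."""
    return np.linalg.solve(L.T @ L, L.T @ (y - x))

# Usage: P = null_projector(L); y_tilde = P @ y; estimate the sparse outliers x
# from y_tilde = P @ x with any sparse solver; then n = normal_from_outliers(L, y, x).
```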
Conclusions
First rigorous analysis of how unfolded iterative algorithms can be provably enhanced by learning. Detailed characterization of how different architecture choices affect performance.
Narrow benefit: the first ultra-fast method for obtaining optimal sparse representations with correlated designs (i.e., high RIP constants).
Broad benefit: general insights into why DNNs can outperform hand-crafted algorithms.
Network structure: residual LSTM (architecture figure).
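The transcript does not reproduce the architecture diagram, so the following is only a guess at what a residual LSTM block might look like: an LSTM cell applied per unrolled step with a skip connection around its gated update. The dimensions and wiring are assumptions, not the authors' specification.

```python
import torch
import torch.nn as nn

class ResidualLSTMBlock(nn.Module):
    """Gated update with a residual (skip) connection around it."""
    def __init__(self, dim):
        super().__init__()
        self.cell = nn.LSTMCell(dim, dim)

    def forward(self, x, state=None):
        h, c = self.cell(x, state)   # gates control information flow
        return x + h, (h, c)         # skip connection preserves the input path
```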