Sparse Factor Models for High Dimensional Data
- 2012-11-30 (Fri.), 10:30 AM
- Recreation Hall, 2F, Institute of Statistical Science
- Professor David Causeur
- Agrocampus – Applied Mathematics Department, France
Abstract
Analysis of data generated by high-throughput technologies has received increasing attention in the statistical literature, motivated especially by emerging challenges in systems biology, neuroscience and astronomy. Microarray technologies for genome analysis, brain imaging and electroencephalography share the common goal of providing a detailed overview of complex systems on a large scale. Statistical analysis of the resulting data usually aims at identifying key components of the whole system, mainly through large-scale significance testing, regression or supervised classification.
However, standard issues such as the control of error rates in multiple testing or model selection in classification turn out to be challenging in high-dimensional situations. For example, several papers (Leek and Storey, 2007, 2008; Friguet et al., 2009) have pointed out the negative impact of dependence among tests on the consistency of the ranking that results from multiple testing procedures in high dimension. These papers essentially show that unmodeled heterogeneity factors can induce an unexpected dependence across data, which generates high variability in the actual False Discovery Proportion and, more generally, reduces the efficiency of classical simultaneous testing methods.
Models for interaction networks among the components of a complex system often reveal some key components whose changes lead to variations in other connected components. This suggests that it is crucial to account for the system-wide dependence structure when selecting these key components. A sparse factor model is proposed to identify a low-dimensional linear kernel which captures data dependence. L1-penalized estimation algorithms are presented, and strategies are derived for module detection in Graphical Gaussian Models for networks and for model selection in supervised classification.
The properties are illustrated by issues in statistical genomics (see Blum et al., 2010) and analysis of ERP curves (see Causeur et al., 2012).
Keywords: high dimension, factor model, Graphical Gaussian Model, LASSO, selection stability, supervised classification.
References
[1] Blum, Y., Le Mignon, G., Lagarrigue, S. and Causeur, D. (2010). A factor model to analyse heterogeneity in gene expressions. BMC Bioinformatics, 11:368.
[2] Causeur, D., Chu, M.-C., Hsieh, S. and Sheu, C.-F. (2012). A factor-adjusted multiple testing procedure for ERP data analysis. Behavior Research Methods, 44(3), 635–643.
[3] Friguet, C., Kloareg, M. and Causeur, D. (2009). A factor model approach to multiple testing under dependence. Journal of the American Statistical Association, 104, 1406–1415.
[4] Leek, J.T. and Storey, J. (2007). Capturing heterogeneity in gene expression studies by Surrogate Variable Analysis. PLoS Genetics, 3, e161.
[5] Leek, J.T. and Storey, J. (2008). A general framework for multiple testing dependence. Proceedings of the National Academy of Sciences, 105, 18718–18723.