A Discussion of Partial Least Squares and Novel Applications to Ill-posed Problems
- 2013-01-14 (Mon.), 10:30 AM
- Recreation Hall, 2F, Institute of Statistical Science
- Dr. Andrew Woolston
- Centre for Epidemiology and Biostatistics, University of Leeds, UK
Abstract
Collinearity is a natural and unavoidable feature of most biological data. Small departures from independence can severely distort the interpretation of a model and the role of each covariate. Common ‘symptoms’ of collinearity include changes of sign, implausible magnitudes of the coefficients and inflated standard errors. However, the presence of such features is not dictated solely by the correlation structure of the covariates. The response variable plays a crucial role in moderating the impact of collinearity on a regression model and the conclusions formed from the analysis. Therefore, the covariance structure between the response and the covariates is vital to understanding and handling the effects of collinearity. However, there is sometimes a more terminal problem produced by a perfect collinearity in the data that ensures at least one redundant dimension. In this case, least squares regression is unable to provide a unique solution. Latent variable methods provide a solution to such ill-posed problems, and in particular we focus on the application of partial least squares (PLS). PLS produces components that summarize the observed variables to maximize the covariance of the predictors with the response, a feature particularly beneficial to reducing the effects of collinearity. By retaining fewer than the maximum number of components, PLS is able to bypass the seemingly terminal issue of a singular design matrix. We present two novel applications of PLS. The first is in a lifecourse setting, in which predictors are entered that generate a perfect collinearity. The practical challenge is to identify a critical period of growth in the lifecourse that dictates a later life health outcome. The second example looks at a ‘cross-omics’ analysis. In this case, the number of variables greatly outweighs the number of observations. Sparse methods of PLS provide a powerful dimension reduction and feature selection method for data that typically suffers from high dimensionality and collinearity.
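
The remark that PLS can work with a singular design matrix may be easier to follow with a small example. The sketch below is an illustration only, not the speaker's implementation: it uses a textbook NIPALS-style PLS1 fit written in plain Python, and the function names (`pls1_fit`, `pls1_predict`) and the toy measurements are hypothetical. The three predictors mimic a lifecourse setting in which an early measurement, a later measurement, and their difference (growth) are all entered, so the third column is an exact linear combination of the first two and least squares has no unique solution.

```python
def pls1_fit(X, y, n_components):
    """NIPALS-style PLS1 for a single response (illustrative sketch).

    Returns (beta, x_mean, y_mean), where beta applies to centred predictors.
    n_components should not exceed the rank of the centred design matrix.
    """
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # centred X
    f = [yi - y_mean for yi in y]                              # centred y
    rotations, loadings = [], []
    beta = [0.0] * p
    for _ in range(n_components):
        # Weight vector: direction of maximal covariance with the response.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(p)]
        norm = sum(v * v for v in w) ** 0.5
        if norm < 1e-12:  # nothing left to extract
            break
        w = [v / norm for v in w]
        # Scores, X loadings, and the regression of y on the scores.
        t = [sum(E[i][j] * w[j] for j in range(p)) for i in range(n)]
        tt = sum(v * v for v in t)
        load = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = sum(f[i] * t[i] for i in range(n)) / tt
        # Rotation r expresses the scores in terms of the original centred X.
        r = w[:]
        for r_prev, p_prev in zip(rotations, loadings):
            c = sum(a * b for a, b in zip(p_prev, w))
            r = [rj - c * rp for rj, rp in zip(r, r_prev)]
        # Deflate X and y by the extracted component.
        E = [[E[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        rotations.append(r)
        loadings.append(load)
        beta = [b + q * rj for b, rj in zip(beta, r)]
    return beta, x_mean, y_mean


def pls1_predict(row, beta, x_mean, y_mean):
    """Predict the response for one observation."""
    return y_mean + sum(b * (v - m) for b, v, m in zip(beta, row, x_mean))


# Hypothetical toy data: early size, later size, and growth = later - early.
z1 = [3.1, 3.4, 2.9, 3.8, 3.3, 3.6]
z2 = [9.8, 10.5, 9.1, 11.2, 10.9, 10.1]
X = [[a, b, b - a] for a, b in zip(z1, z2)]  # rank 2, so X'X is singular
y = [120.0, 126.0, 117.0, 131.0, 129.0, 122.0]

beta, x_mean, y_mean = pls1_fit(X, y, n_components=1)
print("PLS coefficients with one component:", beta)
```

Because every weight vector is built from the centred predictors themselves, the fitted coefficients stay in their row space. In this toy setup that means the coefficients satisfy `beta[0] - beta[1] + beta[2] = 0` up to rounding, which is how the sketch arrives at a single solution where least squares has infinitely many.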