A Cross-Validation Study for Reproducibility
- 2016-06-27 (Mon.), 10:30 AM
- Recreation Hall, 2F, Institute of Statistical Science
- Prof. Lo-Bin Chang
- The Ohio State University, USA
Abstract
In recent years, “reproducibility” has emerged as a key factor in evaluating applications of statistics to the biomedical sciences, for example predictors of disease phenotypes learned from high-throughput “omics” data. Among other factors, validation of such predictors entails comparing the reported error rates, usually estimated by standard cross-validation, to the accuracy observed on additional data collected from new studies. Unfortunately, the originally published error rates are frequently lower than those observed on the new data, and this discrepancy is then seen as a barrier to translational research. In this talk, I will provide a statistical formulation, in the large-sample limit, for studying this inconsistency through the gap between the error rate in cross-study validation (CSV) and that in ordinary randomized cross-validation (RCV). The theoretical results cohere with the trends observed in practice: for any number m of studies, the cross-study error rate exceeds that of ordinary randomized cross-validation; the latter (averaged) increases with m; and both converge to the optimal rate.
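
To make the CSV/RCV distinction concrete, here is a minimal simulation sketch. It is not the speaker's formulation: the one-feature data model, the additive per-study shift, the midpoint-threshold classifier, and all function names are illustrative assumptions. CSV holds out one whole study at a time, while RCV pools all samples and holds out random folds.

```python
import random

def simulate_studies(shifts, n_per_study, rng):
    """Toy data (an assumption): x = class center + study shift + N(0, 1) noise."""
    studies = []
    for shift in shifts:
        study = []
        for i in range(n_per_study):
            y = i % 2  # balanced labels within each study
            center = 1.0 if y == 1 else -1.0
            study.append((center + shift + rng.gauss(0.0, 1.0), y))
        studies.append(study)
    return studies

def fit_threshold(train):
    """Midpoint-of-class-means rule; returns (threshold, class 1 lies above?)."""
    xs0 = [x for x, y in train if y == 0]
    xs1 = [x for x, y in train if y == 1]
    mu0 = sum(xs0) / len(xs0)
    mu1 = sum(xs1) / len(xs1)
    return (mu0 + mu1) / 2.0, mu1 >= mu0

def error_rate(model, test):
    """Fraction of test samples the threshold rule misclassifies."""
    thr, class1_above = model
    wrong = sum(1 for x, y in test if int((x > thr) == class1_above) != y)
    return wrong / len(test)

def held_out_error(folds):
    """Train on all folds but one, test on the held-out fold, average over folds."""
    errors = []
    for k, test in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != k for s in fold]
        errors.append(error_rate(fit_threshold(train), test))
    return sum(errors) / len(errors)

def csv_error(studies):
    """Cross-study validation: each fold is one whole study."""
    return held_out_error(studies)

def rcv_error(studies, rng):
    """Randomized cross-validation: pool all samples, then split into m random folds."""
    pooled = [s for study in studies for s in study]
    rng.shuffle(pooled)
    m = len(studies)
    return held_out_error([pooled[i::m] for i in range(m)])

if __name__ == "__main__":
    rng = random.Random(0)
    studies = simulate_studies([-2.0, -1.0, 0.0, 1.0, 2.0], 1000, rng)
    print("CSV error:", csv_error(studies))
    print("RCV error:", rcv_error(studies, rng))
```

In this toy model the held-out study's shift never appears in the CSV training set, so the fitted threshold sits farther from that study's best cut point than the pooled RCV threshold does. That is one simple mechanism that can produce a CSV error rate above the RCV rate.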