Good prediction, especially in the context of big data, is important. Prediction on large data sets is typically carried out in two stages: variable selection, followed by pattern recognition (or prediction). Variable selection ordinarily proceeds by assessing how well each individual explanatory variable correlates with the dependent variable under a significance-based criterion. This practice neglects possible interactions among the explanatory variables and can therefore select less-predictive variables, because significance does not imply predictivity and important joint information may be omitted. When a subset of truly influential variables is identified, one may expect a noticeable increase in the correct prediction rate, in both simple and complex data. In the era of big data, however, high dimensionality and complicated interactions pose great difficulties for existing selection procedures.
We consider an alternative selection approach that directly measures a variable set's ability to predict (termed “predictivity”) via the I-score, without relying on cross-validation (CV). We argue that the I-score not only reflects the true amount of interaction among variables but can also be related to a lower bound on the correct prediction rate, and it does not overfit. The value of the I-score measures the amount of “influence” of the variable set under consideration. We suggest shifting the research agenda toward this new criterion for locating highly predictive variables, using the partition retention (PR) method with the I-score. PR proved effective in reducing prediction error from 30% to 8% on a long-studied breast cancer data set. Furthermore, we offer practical recommendations for determining whether a significant variable (or set of variables) is predictive or of no value for prediction, and we discuss when the two concepts of significance and predictivity converge.
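To make the notion of an influence score concrete, the sketch below implements one normalized form of the I-score that appears in the partition retention literature, I = Σ_j n_j² (Ȳ_j − Ȳ)² / (n s²), where the sum runs over the cells of the partition induced by the joint values of the candidate variables, n_j is the number of observations in cell j, Ȳ_j is the cell mean of the response, and s² is the sample variance of the response. The function name and the binary XOR toy data are illustrative assumptions, not the original formulation; exact normalizations vary across presentations.

```python
import numpy as np

def i_score(X, y):
    """I-score of a candidate variable set (illustrative normalization).

    X : (n, k) array of discrete explanatory variables; each distinct
        row value defines one cell of the induced partition.
    y : (n,) response vector.
    Returns sum_j n_j^2 (ybar_j - ybar)^2 / (n * s^2).
    """
    X = np.asarray(X)
    y = np.asarray(y, dtype=float)
    n = len(y)
    y_bar = y.mean()
    s2 = y.var()  # (1/n) * sum (y_i - y_bar)^2
    # Label each observation with the id of its joint cell.
    _, cell_ids = np.unique(X, axis=0, return_inverse=True)
    score = 0.0
    for j in np.unique(cell_ids):
        in_cell = cell_ids == j
        n_j = in_cell.sum()
        score += n_j**2 * (y[in_cell].mean() - y_bar) ** 2
    return score / (n * s2)

# Toy demonstration of why marginal significance misses interactions:
# y = x1 XOR x2, so neither variable predicts y alone, but together
# they determine it, and the pair's I-score is far larger than either
# single variable's or a pure-noise variable's.
rng = np.random.default_rng(0)
n = 2000
x1 = rng.integers(0, 2, n)
x2 = rng.integers(0, 2, n)
noise = rng.integers(0, 2, n)
y = (x1 ^ x2).astype(float)
pair_score = i_score(np.column_stack([x1, x2]), y)
single_score = i_score(x1.reshape(-1, 1), y)
noise_score = i_score(noise.reshape(-1, 1), y)
```

Note that because the score aggregates cell means of the joint partition, it rewards exactly the joint information that a marginal, one-variable-at-a-time significance screen discards.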