Abstract

Advances in molecular technology have shifted the development of new drugs toward precision medicine: identifying the right patients for the right treatment. The success of precision medicine lies in the development of a biomarker-based subgroup selection strategy that matches the disease with specific therapies for individual patients. This presentation covers three steps in developing a subgroup selection strategy: 1) biomarker identification, 2) subgroup selection, and 3) subgroup analysis to assess clinical utility. Biomarker identification fits interaction models to identify sets of potential prognostic and/or predictive biomarkers from a set of measured genomic variables. Subgroup selection develops a prediction model, based on the identified biomarkers, that partitions patients into subgroups that are homogeneous with respect to disease outcomes and/or responses to a specific treatment. Subgroup analysis evaluates the accuracy of patient treatment assignment and assesses the enhancement of treatment efficacy. The procedures are illustrated by simulations and analyses of cancer datasets. Major statistical issues and challenges will be discussed, including identification of prognostic and predictive biomarkers, false and true positives in biomarker identification with respect to predictive model development, safety biomarkers for drug-induced toxicity, the subgroup domain, and the clinical target variable.
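The distinction between prognostic biomarkers (affecting outcome regardless of treatment) and predictive biomarkers (modifying the treatment effect) can be illustrated with a simple interaction model. The sketch below is a minimal, hypothetical example of the kind of interaction-model fitting described above, using simulated data and an ordinary least-squares fit; the effect sizes and design are assumptions chosen for illustration, not the authors' actual procedure.

```python
import numpy as np

# Hypothetical illustration: a predictive biomarker shows up in the
# treatment-by-biomarker interaction term of a linear outcome model.
rng = np.random.default_rng(0)
n = 2000
trt = rng.integers(0, 2, n)   # 1 = treated, 0 = control
x = rng.integers(0, 2, n)     # 1 = biomarker-positive
# Simulated outcome: treatment helps only biomarker-positive patients,
# and the biomarker also has a small prognostic main effect (0.5).
y = 0.5 * x + 1.5 * trt * x + rng.normal(0, 1, n)

# Design matrix: intercept, treatment, biomarker, interaction.
X = np.column_stack([np.ones(n), trt, x, trt * x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta)  # beta[3], the interaction term, should be near 1.5
```

A large interaction coefficient (here `beta[3]`) flags the biomarker as predictive, which is the information a subgroup selection rule would then exploit.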

Abstract

Current practice for prediction problems generally includes using a significance-based criterion to evaluate which variables to use in a chosen model, and evaluating variables and models simultaneously for prediction using cross-validation or independent test data. Our recent work showed that significant variables are not necessarily predictive, and that strong predictors may not appear statistically significant at all. This leaves an important question: if not through statistical significance, how can we find highly predictive variables? In response, we suggest a “Partition Retention (PR)” approach for handling general big-data variable selection and classification (prediction) problems. PR alters standard statistical practice in big-data analysis, switching from significance-based modeling to seeking variables with high predictivity, a novel parameter of interest. We introduce the I-score, a statistic that can select variable sets with very high prediction rates and is closely related to a very useful lower bound on the predictivity.
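One commonly cited form of the I-score partitions the observations by the joint values of a candidate variable set and measures how far the cell means of the response deviate from the overall mean, weighted by squared cell sizes. The sketch below is an assumed, simplified version (normalizations vary across the literature), illustrated on simulated data where the response depends on an interaction of two variables that would look weak in marginal significance tests.

```python
import numpy as np

def i_score(X, y):
    """One common form of the influence (I-) score, up to normalization:
    partition observations by the joint values of the variable subset X,
    then sum n_j**2 * (ybar_j - ybar)**2 over partition cells, scaled by
    n * var(y). Exact normalizations differ across papers."""
    X = np.asarray(X)
    y = np.asarray(y, dtype=float)
    n = len(y)
    ybar = y.mean()
    # Group rows by their joint value pattern (one partition cell each).
    _, cell = np.unique(X, axis=0, return_inverse=True)
    score = 0.0
    for j in np.unique(cell):
        yj = y[cell == j]
        score += len(yj) ** 2 * (yj.mean() - ybar) ** 2
    return score / (n * y.var())

rng = np.random.default_rng(1)
n = 500
X = rng.integers(0, 2, size=(n, 3))
# y depends on the XOR interaction of columns 0 and 1, not column 2,
# so neither informative variable has a strong marginal effect.
y = (X[:, 0] ^ X[:, 1]).astype(float) + rng.normal(0, 0.5, n)

print(i_score(X[:, :2], y))   # informative pair: large score
print(i_score(X[:, 2:], y))   # noise variable: small score
```

The XOR structure is exactly the situation motivating the abstract: each variable alone is statistically insignificant, yet the pair is highly predictive and receives a large I-score.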

There are diverse scientific applications for which the PR approach would be useful: for example, formulating predictions about diseases from high-dimensional data such as gene datasets, and, in the social sciences, predicting text, terrorism, civil war, elections, and financial markets. We hope this opens up a new field of work focused on designing new statistics that measure predictivity.

Abstract

Computational performance is challenging today: datasets can be big, the computational complexity of analytic methods can be high, and computer hardware power can be limited. Divide & Recombine (D&R) is a statistical approach to meeting these challenges.

In D&R, the analyst divides the data into subsets by a D&R division method. Each analytic method is applied to each subset independently, without communication among subsets. The outputs of the analytic method are then recombined by a D&R recombination method. Sometimes the goal is one result for all of the data, such as a logistic regression fit; D&R theory and methods seek division and recombination methods that optimize statistical accuracy. Much more common in practice is a division based on the subject matter: the data are divided by conditioning on variables important to the analysis. In this case the outputs can be the final result, or further analysis, an analytic recombination, is carried out.
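The divide–apply–recombine pattern for the "one result for all the data" case can be sketched in a few lines. This is a toy, sequential illustration with an assumed example (ordinary least squares on simulated data, recombined by averaging subset estimates); the actual DeltaRho/datadr software runs the subset fits in parallel on a cluster.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 10_000, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(0, 1, n)

def ols(Xs, ys):
    # The analytic method applied to one subset: a least-squares fit.
    b, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    return b

# Divide: split the rows into r subsets (a random/replicate division).
r = 10
subsets = zip(np.array_split(X, r), np.array_split(y, r))
# Apply: run the analytic method on each subset independently.
fits = [ols(Xs, ys) for Xs, ys in subsets]
# Recombine: average the subset estimates to get the D&R estimate.
beta_dr = np.mean(fits, axis=0)

print(beta_dr)      # close to the all-data fit below
print(ols(X, y))
```

Because the subset fits never communicate, the apply step is embarrassingly parallel, which is what lets a Hadoop-style back end scale it across a cluster.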

D&R computation is mostly embarrassingly parallel, the simplest form of parallel computation. DeltaRho software is an open-source implementation of D&R. The front end is the R package datadr, a language for D&R that makes programming D&R simple. At the back end, running on a cluster, is a distributed database and parallel compute engine such as Hadoop, which spreads subsets and outputs across the cluster and executes the analyst's R and datadr code in parallel. The R package RHIPE provides communication between datadr and Hadoop. DeltaRho shields the analyst from managing the parallel computation and the database. D&R with DeltaRho can dramatically increase the data size and analytic computational complexity that are feasible in practice, whether the hardware power of the available cluster is small, medium, or large; the data can have a memory size larger than the physical memory of the cluster.

Abstract

Pairing serves as a way of lessening heterogeneity but pays the price of introducing more parameters into the model, which complicates the probability structure and makes inference more intricate. We employ the simpler structure of the parallel design to develop a robust score statistic for testing the equality of two multinomial distributions in paired designs. The test incorporates the within-pair correlation in a data-driven manner, without a full model specification. For paired binary data, the robust score statistic reduces to McNemar's test. We provide simulations and a real-data analysis to demonstrate the advantages of the robust procedure.
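The binary special case mentioned above is easy to make concrete. McNemar's test compares the two discordant cell counts of the paired 2x2 table, since only pairs that disagree between the two conditions carry information about a difference. The table values below are made up for illustration.

```python
# McNemar's chi-square statistic (1 df, no continuity correction)
# for paired binary data: b and c are the discordant pair counts.
def mcnemar_statistic(b, c):
    return (b - c) ** 2 / (b + c)

# Toy paired 2x2 table:        condition B
#                            yes       no
# condition A   yes         a=40     b=25
#               no          c=10     d=25
stat = mcnemar_statistic(25, 10)
print(round(stat, 3))  # (25-10)**2 / 35 = 6.429
```

Under the null of marginal homogeneity the statistic is approximately chi-square with one degree of freedom; here 6.43 exceeds the 5% critical value of 3.84, so the two conditions differ.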

Keywords: Paired design; Parallel design; Multinomial distribution; Robust score statistic; McNemar's test.
