Institute of Statistical Science, Academia Sinica [Seminar Feed]
Fri, 13 Dec 2019 18:08:13 +0800
Survival Analysis with Presence of Informative Censoring via Nonparametric Multiple Imputation Abstract

We propose a nonparametric multiple imputation approach to recover information for censored observations when analyzing survival data in the presence of informative censoring. A working shared frailty model is proposed to estimate the magnitude of informative censoring; it is used only to determine the size of the imputing risk set for each censored subject. We have shown that the distance between the posterior means of frailty is equivalent to the distance between the observed times. We therefore propose to use the observed times of subjects at risk to calculate the distance from each censored subject and so select an imputing risk set for that subject. In simulation, we have shown that the nonparametric multiple imputation approach produces survival estimates comparable to the targeted values and coverage rates comparable to the nominal 95% level, even with a high degree of informative censoring. We have also demonstrated the approach on ACTG-175 and developed an alternative sensitivity analysis for informative censoring based on the approach.
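The risk-set selection step described in this abstract can be illustrated with a minimal sketch. It is not the speakers' implementation: the function name `imputing_risk_set`, the argument `size` (which the abstract says comes from a working shared frailty model, but is simply passed in here), and the definition of "at risk" as a strictly larger observed time are all assumptions made for illustration.

```python
import numpy as np

def imputing_risk_set(obs_time, censored_index, size):
    """Return indices of the `size` at-risk subjects nearest in observed time.

    Hypothetical helper: `size` is assumed to be supplied by the caller.
    """
    obs_time = np.asarray(obs_time, dtype=float)
    t_c = obs_time[censored_index]
    # Assumption: a subject is at risk if its observed time exceeds t_c.
    at_risk = np.flatnonzero(obs_time > t_c)
    # Distance is measured on the observed-time scale.
    dist = np.abs(obs_time[at_risk] - t_c)
    order = np.argsort(dist, kind="stable")
    return at_risk[order[:size]]

# Usage: subject 0 is censored at time 2.0; pick its 2 nearest at-risk subjects.
times = [2.0, 5.0, 3.0, 9.0, 3.5, 1.0]
nearest = imputing_risk_set(times, censored_index=0, size=2)
```

In a full multiple imputation procedure, an event time would then be drawn for the censored subject from within this set, and the draw repeated to create several imputed data sets; that part is omitted here.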

Tue, 8 Oct 2019 18:31:49 +0800
Fast Algorithms for Detection of Structural Breaks in High Dimensional Data Abstract

    Many real time series data sets exhibit structural changes over time. It is then of interest to estimate both the (unknown) number of structural break points and the parameters of the statistical model employed to capture the relationships amongst the variables/features of interest. An additional challenge emerges in the presence of very large data sets, namely how to accomplish these two objectives in a computationally efficient manner. In this talk, we outline a novel procedure which leverages a block segmentation scheme (BSS) that reduces the number of model parameters to be estimated through a regularized least squares criterion. Specifically, BSS examines appropriately defined blocks of the available data, which, when combined with a fused-lasso-based estimation criterion, leads to significant computational gains without compromising statistical accuracy in identifying the number and location of the structural breaks. This procedure is further coupled with new local and global screening steps to consistently estimate the number and location of break points. The procedure is scalable to large, high-dimensional time series data sets and can provably achieve significant computational gains. It is further applicable to various statistical models, including regression, graphical models and vector-autoregressive models. Extensive numerical work on synthetic data supports the theoretical findings and illustrates the attractive properties of the procedure. Applications to neuroimaging data will also be discussed.
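To make the idea of a block-restricted fused lasso concrete, here is a generic sketch for a linear regression with time-varying coefficients. It is not necessarily the exact criterion used in the talk: the symbols $y_t$, $x_t$, $b_k$, $B_k$, $K$ and $\lambda$ are notation assumed here for illustration.

```latex
% Assumed setting: y_t = x_t^\top \beta_t + \varepsilon_t, t = 1, \dots, T,
% with \beta_t piecewise constant in t.
% The time index is partitioned into K consecutive blocks B_1, \dots, B_K,
% and \beta_t is held fixed at b_k within block B_k.
\[
  \min_{b_1, \dots, b_K}\;
  \sum_{k=1}^{K} \sum_{t \in B_k} \bigl( y_t - x_t^\top b_k \bigr)^2
  \;+\; \lambda \sum_{k=2}^{K} \lVert b_k - b_{k-1} \rVert_1
\]
% A nonzero difference b_k - b_{k-1} flags a candidate break near the
% boundary between blocks B_{k-1} and B_k.
```

Restricting changes to block boundaries means $K$ coefficient vectors are estimated instead of $T$, which is where the computational saving in such a scheme would come from.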

Tue, 8 Oct 2019 18:50:59 +0800
Prediction with Confidence – General Framework for Predictive Inference


    We propose a general framework for prediction in which a prediction is in the form of a distribution function, called ‘predictive distribution function’. This predictive distribution function is well suited for prescribing the notion of confidence under the frequentist interpretation and providing meaningful answers for prediction-related questions. Its very form of a distribution function also makes it a useful tool for quantifying uncertainty in prediction. A general approach under this framework is formulated and illustrated using the so-called confidence distributions (CDs). This CD-based prediction approach inherits many desirable properties of CD, including its capacity to serve as a common platform for directly connecting the existing procedures of predictive inference in Bayesian, fiducial and frequentist paradigms. We discuss the theory underlying the CD-based predictive distribution and related efficiency and optimality. We also propose a simple yet broadly applicable Monte-Carlo algorithm for implementing the proposed approach. This concrete algorithm together with the proposed definition and associated theoretical development provide a comprehensive statistical inference framework for prediction. Finally, the approach is demonstrated by simulation studies and a real project on predicting the volume of application submissions to a government agency. The latter shows the applicability of the proposed approach to even dependent data settings.


This is joint work with Jieli Shen (Goldman Sachs) and Minge Xie (Rutgers University).
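A Monte-Carlo scheme of the kind the abstract mentions can be sketched on a toy problem. It assumes i.i.d. normal data with known standard deviation `sigma`, for which a confidence distribution for the mean is N(ybar, sigma^2/n). The function name `predictive_draws` and this setting are assumptions for illustration, not the authors' algorithm as published.

```python
import numpy as np

def predictive_draws(y, sigma, n_draws, rng):
    """Draw from a CD-based predictive distribution (toy normal-mean case).

    Assumption: y_i ~ N(mu, sigma^2) i.i.d. with sigma known.
    """
    y = np.asarray(y, dtype=float)
    n = y.size
    # Step 1: draw the parameter from its confidence distribution.
    mu_star = rng.normal(y.mean(), sigma / np.sqrt(n), size=n_draws)
    # Step 2: draw a future observation given each parameter draw.
    return rng.normal(mu_star, sigma)

rng = np.random.default_rng(0)
draws = predictive_draws([1.0, 2.0, 3.0, 4.0], sigma=1.0, n_draws=200_000, rng=rng)
# Empirical quantiles of `draws` approximate prediction limits.
lower, upper = np.quantile(draws, [0.025, 0.975])
```

In this toy case the draws should follow N(ybar, sigma^2 (1 + 1/n)), so the empirical quantiles can be checked against the textbook normal prediction interval.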

Wed, 11 Dec 2019 17:29:57 +0800
AI, Big Data, and the Future Abstract

    Artificial Intelligence (or deep learning) and big data have attracted great attention in recent years. The availability of big data and advances in computational methods and capability have further accelerated the development of machine learning. There is no doubt that AI will affect every aspect of our lives in the future. In this talk, we discuss the following issues: (a) What is AI? (b) What is machine learning? (c) What role do data play in the development of AI? (d) What is the potential impact of AI? and (e) How should we prepare for the AI challenges? The talk will emphasize the value of data and the importance of statistical reasoning and methods in the development of smart AI.

Fri, 13 Dec 2019 14:51:06 +0800
Statistical methods research done as science rather than mathematics Abstract

    This paper is about how we study statistical methods. As an example, it uses the random regressions model, in which the intercept and slope of cluster-specific regression lines are modeled as a bivariate random effect. Maximizing this model's restricted likelihood often gives a boundary value for the random effect correlation or variances. We argue that this is a problem; that it is a problem because our discipline has little understanding of how contemporary models and methods map data to inferential summaries; that we lack such understanding, even for models as simple as this, because of a near-exclusive reliance on mathematics as a means of understanding; and that math alone is no longer sufficient. We then argue that as a discipline, we can and should break open our black-box methods by mimicking the five steps that molecular biologists commonly use to break open Nature's black boxes: design a simple model system, formulate hypotheses using that system, test them in experiments on that system, iterate as needed to reformulate and test hypotheses, and finally test the results in an "in vivo" system. We demonstrate this by identifying conditions under which the random-regressions restricted likelihood is likely to be maximized at a boundary value. Resistance to this approach seems to arise from a view that it lacks the certainty or intellectual heft of mathematics, perhaps because simulation experiments in our literature rarely do more than measure a new method's operating characteristics in a small range of situations. We argue that such work can make useful contributions including, as in molecular biology, the findings themselves and sometimes the designs used in the five steps; that these contributions have as much practical value as mathematical results; and that therefore they merit publication as much as the mathematical results our discipline esteems so highly.
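For readers unfamiliar with the random regressions model, it can be written out as below. The notation ($y_{ij}$, $x_{ij}$, $\beta$, $u$, $\Sigma$, $\rho$) is assumed here for illustration and may differ from the speaker's.

```latex
% Observation j in cluster i; cluster-specific intercept and slope.
\[
  y_{ij} = (\beta_0 + u_{0i}) + (\beta_1 + u_{1i})\, x_{ij} + \varepsilon_{ij},
  \qquad \varepsilon_{ij} \sim N(0, \sigma_e^2),
\]
\[
  \begin{pmatrix} u_{0i} \\ u_{1i} \end{pmatrix}
  \sim N_2\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix},\;
  \Sigma = \begin{pmatrix}
    \sigma_0^2 & \rho\,\sigma_0\sigma_1 \\
    \rho\,\sigma_0\sigma_1 & \sigma_1^2
  \end{pmatrix} \right).
\]
% Boundary values of the restricted-likelihood maximizer:
% \hat\rho = \pm 1, or \hat\sigma_0^2 = 0, or \hat\sigma_1^2 = 0.
```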

Mon, 14 Oct 2019 11:52:14 +0800
Automated learning of mixtures of factor analyzers with missing values


    The mixture of factor analyzers (MFA) model has emerged as a useful tool to perform dimensionality reduction and model-based clustering of heterogeneous data. In seeking the most appropriate number of factors (q) of an MFA model with the number of components (g) fixed a priori, a two-stage procedure is commonly implemented by first carrying out parameter estimation over a set of prespecified numbers of factors, and then selecting the best q according to certain penalized likelihood criteria. When the dimensionality of data grows higher, such a procedure can be computationally prohibitive. To overcome this obstacle, we develop an automated learning scheme, called the automated MFA (AMFA) algorithm, to effectively merge parameter estimation and selection of q into a one-stage algorithm. The proposed AMFA procedure, which allows for much lower computational cost, is also extended to accommodate missing values. Moreover, we explicitly derive the score vector and the empirical information matrix for calculating standard errors associated with the estimated parameters. The potential and applicability of the proposed method are demonstrated through a number of real datasets with genuine and synthetic missing values.


Keywords: automated learning; factor analysis; maximum likelihood estimation; missing values; model selection; one-stage algorithm
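The two-stage procedure that AMFA is meant to replace can be made concrete as follows. BIC is used only as one example of a penalized likelihood criterion, and the notation ($\pi_i$, $\mu_i$, $B_i$, $D_i$, $\mathcal{Q}$, $m(q)$) is assumed here; the parameter count applies to unrestricted component-specific loadings and diagonal error covariances.

```latex
% MFA density for a p-dimensional observation y, with g components and q factors:
\[
  f(y) = \sum_{i=1}^{g} \pi_i\, \phi_p\!\bigl( y;\ \mu_i,\ B_i B_i^\top + D_i \bigr),
  \qquad B_i \in \mathbb{R}^{p \times q},\quad D_i \text{ diagonal}.
\]
% Stage 1: for each q in a prespecified set \mathcal{Q}, maximize the
% log-likelihood to obtain \ell_{\max}(q).
% Stage 2: select q by a penalized criterion, for example
\[
  \hat{q} = \arg\min_{q \in \mathcal{Q}}
  \bigl\{ -2\,\ell_{\max}(q) + m(q) \log n \bigr\},
  \qquad
  m(q) = (g - 1) + g \Bigl( 2p + pq - \tfrac{q(q-1)}{2} \Bigr).
\]
```

Stage 1 requires one full model fit per candidate q, which is the cost that grows quickly with the dimension p.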

Mon, 9 Dec 2019 13:26:54 +0800
TBD

Fri, 29 Nov 2019 17:37:18 +0800
Thinking Outside the Cancer Cell: Cancer Genomics for Immunotherapy (Chinese title: Improving Immunotherapy with Cancer Cell Genomics)


Cancer immunotherapy has dramatically transformed the treatment landscape of advanced cancers. There are now seven FDA-approved immune checkpoint inhibitors targeting key immune checkpoint proteins, such as programmed death receptor-1 (PD-1), its ligand (PD-L1), or cytotoxic T-lymphocyte-associated protein 4 (CTLA-4). For some patients, these agents extend lifespan and provide durable benefits. However, the majority of patients receiving these treatments do not benefit from them: response rates in clinical trials range from 10% to 50%, and a fraction of patients suffer adverse immune toxicities. There is an urgent need for reliable predictors of immunotherapy response. Since the first report of cancer genome sequencing in 2006, we have gained a considerable understanding of the cell-autonomous effects (the effects induced upon cancer cells) caused by genomic alterations. Yet how these alterations affect the tumor microenvironment remains unclear. We studied the non-cell-autonomous effects of the same cancer genomic alterations, especially those affecting tumor-immune interactions. The results highlight cancer predisposition and driver genes that may serve as biomarkers for prioritizing patient sub-populations for treatment. Notably, they also suggest response and resistance mechanisms that can be targeted to improve cancer immunotherapy.

Fri, 13 Dec 2019 16:45:27 +0800
TBD

Tue, 26 Nov 2019 09:18:21 +0800