Institute of Statistical Science, Academia Sinica [Seminar Feed]
Phylogenetic analysis to explore the association between anti-NMDA receptor encephalitis and tumor based on microRNA biomarkers Abstract

    MicroRNA (miRNA) is a class of small non-coding RNAs that function in the epigenetic control of gene expression and can serve as useful disease biomarkers. Anti-NMDA receptor (anti-NMDAR) encephalitis is an acute autoimmune disorder. Some patients are found to have tumors, especially teratomas, and the disease occurs more often in females than in males. Most patients recover significantly after tumor resection, which suggests that the tumor may induce anti-NMDAR encephalitis. In this study, we review miRNA biomarkers associated with anti-NMDAR encephalitis and with the related tumors, respectively. To the best of our knowledge, no study in the literature has investigated the relationship between anti-NMDAR encephalitis and tumors through their miRNA biomarkers. We adopt a phylogenetic analysis to plot phylogenetic trees of these miRNA biomarkers. From the results, we may explain (i) why there is a relationship between these tumors and anti-NMDAR encephalitis, and (ii) why the disease occurs more often in females than in males. This sheds light on exploring the issue through miRNA intervention.
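
    The phylogenetic step can be sketched in miniature: cluster miRNA sequences by a pairwise distance and read the resulting tree. The example below uses hypothetical sequences and names, with Hamming distance and average-linkage clustering as crude stand-ins for the alignment-based distances and tree-building methods a real study would use.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, to_tree

# Hypothetical toy sequences; a real analysis would use the curated
# biomarker miRNAs collected in the literature review
seqs = {
    "miR-A": "UGAGGUAGUAGGUUGUAUAGUU",
    "miR-B": "UGAGGUAGUAGGUUGUAUGGUU",
    "miR-C": "CUAUACAAUCUACUGUCUUUC",
}
names = list(seqs)

# Encode the (truncated, equal-length) sequences and use Hamming distance
L = min(len(s) for s in seqs.values())
mat = np.array([[ord(c) for c in seqs[n][:L]] for n in names])
dist = pdist(mat, metric="hamming")

# Average-linkage clustering gives a simple rooted tree
tree = to_tree(linkage(dist, method="average"))

def newick(node):
    """Render the cluster tree in Newick notation."""
    if node.is_leaf():
        return names[node.id]
    return f"({newick(node.left)},{newick(node.right)})"

print(newick(tree))  # the two similar sequences cluster together
```

    A real study would replace the Hamming distance with an alignment-based or evolutionary distance before drawing biological conclusions from the tree.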

Wed, 11 Sep 2019 14:08:46 +0800
Comparison between the marginal hazard models and sub-distribution hazard models with an assumed copula Abstract

    For the analysis of competing risks data, three different types of hazard function have been considered in the literature, namely the cause-specific hazard, the sub-distribution hazard, and the marginal hazard function. Accordingly, medical researchers can fit three different types of Cox model to estimate the effect of covariates on each of these hazard functions. Many authors have studied the difference between the cause-specific hazard and the sub-distribution hazard. Comparative studies that include the marginal hazard function do not exist, owing to the difficulties related to its non-identifiability. In this paper, we adopt an assumed copula model to deal with the identifiability issue, making it possible to establish a relationship between the sub-distribution hazard and the marginal hazard function.

    We develop a model diagnostic tool for comparing the sub-distribution hazard and marginal hazard models. We then extend our comparative analysis to clustered competing risks data, which arise frequently in medical studies. To facilitate the numerical comparison, we implement the computing algorithm for marginal Cox regression with clustered competing risks data in the R package joint.Cox and check its performance via simulations. For illustration, we analyze two survival datasets from lung cancer and bladder cancer patients. This is joint work with Jia-Han Shih, Il-Do Ha, and Ralf Wilke.
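
    The role of the assumed copula can be sketched numerically: simulate dependent latent failure times from a Clayton copula (one common choice of assumed copula, used here only for illustration) and contrast the marginal probability for cause 1 with its sub-distribution counterpart, which counts cause 1 only when it occurs first.

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta = 200_000, 2.0  # theta > 0 gives positive Clayton dependence

# Sample (U1, U2) from a Clayton copula by conditional inversion
u1 = rng.uniform(size=n)
w = rng.uniform(size=n)
u2 = ((w ** (-theta / (1 + theta)) - 1) * u1 ** (-theta) + 1) ** (-1 / theta)

# Unit-exponential margins for the two latent cause-specific times
t1, t2 = -np.log(u1), -np.log(u2)

t0 = 1.0
# Marginal probability of cause-1 failure by t0 (identified here only
# because the copula linking the latent times is assumed known)
marg = np.mean(t1 <= t0)
# Sub-distribution counterpart: cause 1 occurs first and by t0
cif = np.mean((t1 <= t0) & (t1 <= t2))
print(marg, cif)
```

    Only the quantities combining both latent times depend on the copula; the marginal probability matches the unit-exponential value 1 - exp(-1) regardless of theta, which is exactly why an assumed copula is needed to link the two scales.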

Wed, 11 Sep 2019 13:47:10 +0800
A Heterogeneity Measure for Cluster Identification with Application to Disease Mapping Abstract

    Mapping of disease incidence has long been of importance to epidemiology and public health. In this paper, we consider identification of clusters of spatial units with elevated disease rates and develop a new approach that estimates the relative disease risk in association with potential risk factors and simultaneously identifies clusters corresponding to elevated risks. A heterogeneity measure is proposed to enable the comparison of a candidate cluster and its complement under a pair of complementary models. A quasi-likelihood procedure is developed for estimating the model parameters and identifying the clusters. An advantage of our approach over traditional spatial clustering methods is the identification of clusters that can have arbitrary shapes due to abrupt or non-contiguous changes while accounting for risk factors and spatial correlation. Asymptotic properties of the proposed methodology are established and a simulation study shows empirically sound finite-sample properties. The mapping and clustering of enterovirus 71 infection in Taiwan are carried out for illustration.
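
    The contrast between a candidate cluster and its complement can be sketched with standardised morbidity ratios on hypothetical data; the toy screen below omits the risk factors, spatial correlation, and quasi-likelihood estimation that the actual method accounts for.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical map of 10 spatial units: expected counts under a common
# baseline rate, with units 3, 4 and 7 forming a cluster of relative risk 2
expected = rng.uniform(20, 50, size=10)
risk = np.ones(10)
risk[[3, 4, 7]] = 2.0
observed = rng.poisson(risk * expected)

# Standardised morbidity ratio per unit, a crude screen for elevated units
smr = observed / expected

# Contrast a candidate cluster with its complement, echoing the idea of
# comparing a candidate cluster and its complement under paired models
cand = np.argsort(smr)[-3:]
rest = np.setdiff1d(np.arange(10), cand)
inside = observed[cand].sum() / expected[cand].sum()
outside = observed[rest].sum() / expected[rest].sum()
print(sorted(cand), round(inside, 2), round(outside, 2))
```

    Because the candidate set is defined by ranking rather than contiguity, this toy screen can flag arbitrarily shaped clusters, mirroring one advantage claimed for the proposed approach.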

Fri, 27 Sep 2019 10:07:29 +0800
Predicting the number of newly discovered rare species in an as-yet-unsurveyed sample Abstract

    In natural ecological communities, most species are rare and thus very likely to become extinct. As a consequence, the prediction and identification of rare species are of enormous value for conservation purposes. The main research question of interest is: how many of the species newly found in the next field survey will be rare? Using biodiversity information in an ecological sample, we developed an accurate estimator of the number of new rare species (e.g., singletons, doubletons, and tripletons) that will be found in an as-yet-unsurveyed sample. A semi-numerical study showed that the proposed Bayesian-weight estimator predicted the number of new rare species with low relative bias and low relative root mean squared error and, accordingly, high accuracy. In this talk, I will also use several conservation-oriented empirical applications to demonstrate the predictive power of the proposed method.
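
    A classical baseline for such predictions is the Good-Turing/Chao machinery, sketched below on hypothetical abundance data: estimate the pool of still-unseen species, then crudely project new discoveries in a small additional sample. This is not the Bayesian-weight estimator of the talk, which goes further and predicts how many of the newly found species will be rare.

```python
from collections import Counter

# Hypothetical abundance data: counts of individuals per observed species
abundances = [1] * 30 + [2] * 12 + [3] * 8 + [5] * 5 + [20] * 3 + [80] * 2
n = sum(abundances)
freq = Counter(abundances)
f1, f2 = freq[1], freq[2]  # singletons and doubletons

# Chao1-type estimate of the number of species still unseen
f0_hat = f1 ** 2 / (2 * f2)

# Good-Turing: the chance that the next sampled individual is a new species
p_new = f1 / n

# Crude expected number of new species in a small additional sample of m
# individuals (valid only for m much smaller than n, ignoring depletion)
m = 50
expected_new = m * p_new
print(f0_hat, round(expected_new, 2))
```

    The talk's estimator replaces this back-of-the-envelope projection with a properly weighted predictor that remains accurate for the rare-abundance classes themselves.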

Thu, 3 Oct 2019 17:45:48 +0800
Principal Sub-manifolds and Classification on Manifolds Abstract

    We will discuss the problem of finding principal components for multivariate datasets that lie on an embedded nonlinear Riemannian manifold within a higher-dimensional space. Our aim is to extend the geometric interpretation of PCA while capturing the non-geodesic forms of variation in the data. We introduce the concept of a principal sub-manifold, a manifold passing through the center of the data that, at any point, moves in the direction of highest curvature in the space spanned by the eigenvectors of local tangent-space PCA. We show that the principal sub-manifold yields the usual principal components in Euclidean space. We illustrate how to find, use, and interpret the principal sub-manifold, with which a classification boundary can be defined for datasets on manifolds.

Mon, 14 Oct 2019 17:04:40 +0800
On excursions inside an excursion Abstract

    The distribution of ranked heights of excursions of a Brownian bridge is given in a paper by Pitman and Yor (2001). In this talk, we consider excursions of a Brownian excursion above a random level. We study the maximum heights of these excursions as Pitman and Yor did for excursions of a Brownian bridge.
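
    The objects in question are easy to simulate: construct a discretised Brownian excursion from a Brownian bridge via Vervaat's transform, then extract the ranked maximum heights of its excursions above a level (a fixed level is used below for simplicity; the talk considers a random one).

```python
import numpy as np

rng = np.random.default_rng(3)
N = 10_000

# Discretised Brownian bridge on [0, 1]
W = np.cumsum(rng.standard_normal(N) / np.sqrt(N))
B = W - np.linspace(0, 1, N) * W[-1]

# Vervaat's transform: rotate the bridge about its minimum to obtain a
# (discretised) normalised Brownian excursion, which is nonnegative
k = np.argmin(B)
E = np.r_[B[k:], B[:k]] - B[k]

# Excursions of E above level a: maximal index runs where E > a
a = 0.3
above = (E > a).astype(int)
idx = np.flatnonzero(np.diff(np.r_[0, above, 0]))
starts, ends = idx[::2], idx[1::2]

# Ranked maximum heights of these excursions above the level
heights = sorted((E[s:e].max() - a for s, e in zip(starts, ends)),
                 reverse=True)
print(heights[:3])
```

    Repeating this over many sample paths gives a Monte Carlo picture of the ranked-heights distribution that the talk studies analytically.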

Mon, 14 Oct 2019 17:10:15 +0800
Unveiling ground truths from data containing vague truths: the story of cryo-electron microscopy Abstract

    Detailed interactions between biological molecules are fundamental to life, and their compromise may cause disease. Such interactions are exemplified by the famous antibody-antigen interactions and many others. Direct visualization of these interactions at atomic resolution, at the level of chemical bonds, was first made possible by protein X-ray crystallography, which depends on many technical advances, in particular synchrotron radiation and crystallization screens. Recent advances in low-temperature electron microscopy (cryo-EM) have fulfilled the long-awaited promise that protein structures can be revealed to near-atomic resolution without crystals. This means that the structure of a protein under its working conditions is now accessible. However, it has been a misconception that these detailed structures are directly available in the raw data, so that acquiring a powerful microscope would be sufficient. In this talk, I will first briefly review X-ray crystallography and the ground truths of protein structure established by it. I will then use a few detailed structures obtained here to illustrate the process of extracting ground truths from the very noisy cryo-EM data through correct “data averaging” by computation. As is evident, the challenges of reducing very noisy data present great opportunities for statisticians.
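
    The statistical core of "data averaging" can be shown in a few lines: averaging many aligned noisy copies of an image shrinks the noise by roughly the square root of the number of copies. The example uses a toy image, not real cryo-EM data, and assumes the copies are already aligned, which in practice is the hard part.

```python
import numpy as np

rng = np.random.default_rng(4)

# A tiny "ground truth" image and many noisy, pre-aligned copies of it
truth = np.zeros((16, 16))
truth[5:11, 5:11] = 1.0
sigma = 3.0  # noise much stronger than the signal, as in cryo-EM
copies = truth + sigma * rng.standard_normal((2000, 16, 16))

# Single-particle averaging: the mean image recovers the ground truth
avg = copies.mean(axis=0)

# Error of one raw image versus the averaged image
err_single = np.abs(copies[0] - truth).mean()
err_avg = np.abs(avg - truth).mean()
print(round(err_single, 3), round(err_avg, 3))
```

    With 2000 copies the residual noise falls by a factor of roughly sqrt(2000), about 45; misalignment, which this sketch assumes away, is what makes the real computation statistically interesting.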

Fri, 30 Aug 2019 17:57:47 +0800
Survival Analysis in the Presence of Informative Censoring via Nonparametric Multiple Imputation Abstract

    We propose a nonparametric multiple imputation approach to recover information from censored observations when analyzing survival data in the presence of informative censoring. A working shared-frailty model is proposed to estimate the magnitude of informative censoring; it is used only to determine the size of the imputing risk set for each censored subject. We show that the distance between the posterior means of the frailty is equivalent to the distance between the observed times. We therefore propose using the observed times of subjects at risk to compute distances from each censored subject and thereby select an imputing risk set for that subject. In simulations, the nonparametric multiple imputation approach produces survival estimates comparable to the target values, with coverage rates close to the nominal 95% level even under a high degree of informative censoring. We also demonstrate the approach on the ACTG-175 data and develop, based on it, an alternative sensitivity analysis for informative censoring.
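
    The imputing-risk-set idea can be sketched on hypothetical data: for each censored subject, select the few at-risk subjects with the nearest observed times (plain time distance stands in here for the frailty-based distance that the working model justifies), then draw an imputed event time from that set. A full implementation would instead impute from the conditional survival distribution and repeat the imputation multiple times.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical right-censored data: observed time and event indicator
time = np.array([2.1, 3.0, 3.4, 4.2, 5.0, 5.5, 6.1, 7.3, 8.0, 9.4])
event = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])

def impute(i, k=3, rng=rng):
    """Impute an event time for censored subject i from an imputing risk
    set of the k nearest later event times."""
    at_risk = np.flatnonzero((time > time[i]) & (event == 1))
    if at_risk.size == 0:
        return time[i]  # no later events observed: keep the censored time
    nearest = at_risk[np.argsort(np.abs(time[at_risk] - time[i]))[:k]]
    return rng.choice(time[nearest])

imputed = [impute(i) for i in np.flatnonzero(event == 0)]
print(imputed)
```

    Each imputed time necessarily exceeds the subject's censoring time, so downstream survival estimates treat the once-censored subjects as (imputed) events.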

Tue, 8 Oct 2019 18:31:49 +0800
Fast Algorithms for Detection of Structural Breaks in High Dimensional Data Abstract

    Many real time series datasets exhibit structural changes over time. It is then of interest to estimate both the (unknown) number of structural break points and the parameters of the statistical model employed to capture the relationships among the variables/features of interest. An additional challenge emerges in the presence of very large datasets, namely how to accomplish these two objectives in a computationally efficient manner. In this talk, we outline a novel procedure that leverages a block segmentation scheme (BSS), which reduces the number of model parameters to be estimated through a regularized least-squares criterion. Specifically, BSS examines appropriately defined blocks of the available data, which, when combined with a fused-lasso-based estimation criterion, leads to significant computational gains without compromising statistical accuracy in identifying the number and location of the structural breaks. This procedure is further coupled with new local and global screening steps to consistently estimate the number and location of break points. The procedure is scalable to large high-dimensional time series datasets and provably achieves significant computational gains. It is further applicable to various statistical models, including regression, graphical models, and vector autoregressive models. Extensive numerical work on synthetic data supports the theoretical findings and illustrates the attractive properties of the procedure. Applications to neuroimaging data will also be discussed.
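
    The block-screening idea can be sketched on a univariate mean-shift series: compare adjacent block means as a cheap global screen, then refine each flagged boundary locally. This toy version omits the fused-lasso estimation, the high-dimensional models, and the consistency machinery of the actual BSS procedure.

```python
import numpy as np

rng = np.random.default_rng(6)

# Piecewise-constant mean series with true breaks at t = 300 and t = 700
x = np.r_[rng.standard_normal(300),
          3 + rng.standard_normal(400),
          1 + rng.standard_normal(300)]

# Global screen: compare means of adjacent fixed-size blocks and keep
# only boundaries where the jump is large
b = 50
means = x.reshape(-1, b).mean(axis=1)
cands = np.flatnonzero(np.abs(np.diff(means)) > 1.0)

# Local refinement: inside each flagged two-block window, place the break
# where the two-sample mean difference is maximised
breaks = []
for c in cands:
    lo, hi = c * b, (c + 2) * b
    seg = x[lo:hi]
    scores = [abs(seg[:t].mean() - seg[t:].mean())
              for t in range(5, len(seg) - 5)]
    breaks.append(lo + 5 + int(np.argmax(scores)))
print(breaks)
```

    Because the screen touches only block summaries, its cost is a small fraction of scanning every candidate break point, which is the source of the computational gains the talk quantifies.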

Tue, 8 Oct 2019 18:50:59 +0800
Statistical methods research done as science rather than mathematics Abstract

    This paper is about how we study statistical methods. As an example, it uses the random regressions model, in which the intercept and slope of cluster-specific regression lines are modeled as a bivariate random effect. Maximizing this model's restricted likelihood often gives a boundary value for the random effect correlation or variances. We argue that this is a problem; that it is a problem because our discipline has little understanding of how contemporary models and methods map data to inferential summaries; that we lack such understanding, even for models as simple as this, because of a near-exclusive reliance on mathematics as a means of understanding; and that math alone is no longer sufficient. We then argue that as a discipline, we can and should break open our black-box methods by mimicking the five steps that molecular biologists commonly use to break open Nature's black boxes: design a simple model system, formulate hypotheses using that system, test them in experiments on that system, iterate as needed to reformulate and test hypotheses, and finally test the results in an "in vivo" system. We demonstrate this by identifying conditions under which the random-regressions restricted likelihood is likely to be maximized at a boundary value. Resistance to this approach seems to arise from a view that it lacks the certainty or intellectual heft of mathematics, perhaps because simulation experiments in our literature rarely do more than measure a new method's operating characteristics in a small range of situations. We argue that such work can make useful contributions including, as in molecular biology, the findings themselves and sometimes the designs used in the five steps; that these contributions have as much practical value as mathematical results; and that therefore they merit publication as much as the mathematical results our discipline esteems so highly.
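
    The paper's running example is easy to reproduce in miniature: simulate a small one-way random-effects "model system" and count how often the variance-component estimate lands on the zero boundary. An ANOVA-type (method-of-moments) estimator is used below as a simple proxy for restricted likelihood, since REML truncates the same quantity at zero.

```python
import numpy as np

rng = np.random.default_rng(7)

# A tiny model system: y_ij = mu + u_i + e_ij with few clusters and a
# small true variance ratio, conditions under which boundary estimates
# of the random-effect variance are common
m, k = 6, 5                    # clusters, observations per cluster
sigma_u2, sigma_e2 = 0.1, 1.0  # true between- and within-cluster variances

reps, hits = 2000, 0
for _ in range(reps):
    u = rng.normal(0, np.sqrt(sigma_u2), size=(m, 1))
    y = u + rng.normal(0, np.sqrt(sigma_e2), size=(m, k))
    # ANOVA estimator of sigma_u^2: (MSB - MSW) / k
    msb = k * y.mean(axis=1).var(ddof=1)
    msw = y.var(axis=1, ddof=1).mean()
    if (msb - msw) / k <= 0:
        hits += 1

print(hits / reps)  # fraction of simulated datasets hitting the boundary
```

    Even with the true variance strictly positive, a sizeable fraction of datasets produces a boundary estimate, which is exactly the kind of finding the five-step experimental programme described above is meant to explain rather than merely observe.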

Mon, 14 Oct 2019 11:52:14 +0800