We propose a nonparametric multiple
imputation approach to recover information for censored observations while
analyzing survival data in the presence of informative censoring. A working
shared frailty model is proposed to estimate the magnitude of informative
censoring, which is only used to determine the size of the imputing risk set for
each censored subject. We have shown that the distance between the posterior
means of frailty is equivalent to the distance between the observed times. We,
therefore, propose to use the observed times for subjects at risk to calculate
the distance from each censored subject to select an imputing risk set for each
censored subject. In simulations, we show that the nonparametric multiple
imputation approach produces survival estimates comparable to the targeted
values and coverage rates comparable to the nominal 95% level, even in a
situation with a high degree of informative censoring. We have also
demonstrated the approach on ACTG-175 and developed an alternative sensitivity
analysis based on the approach for informative censoring.
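
As a rough illustration of the risk-set selection described above, the sketch below imputes each censored subject once. It assumes the risk-set size `nn` has already been chosen (in the abstract this comes from the working shared frailty model, which is not reproduced here), measures distance by observed time among subjects still at risk, and draws an imputed time from a Kaplan-Meier estimate on that set. The function names `km_draw` and `impute_once`, the handling of ties, and the treatment of leftover probability mass are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def km_draw(times, events, rng):
    """Draw (time, event) from a Kaplan-Meier estimate on a small risk set.

    Assumption: ties are processed one subject at a time in sorted order.
    """
    order = np.argsort(times, kind="stable")
    t, d = times[order], events[order]
    n = len(t)
    surv = 1.0
    support, mass, is_event = [], [], []
    for i in range(n):
        if d[i] == 1:
            # n - i subjects are still at risk just before t[i]
            new_surv = surv * (1.0 - 1.0 / (n - i))
            support.append(t[i])
            mass.append(surv - new_surv)
            is_event.append(1)
            surv = new_surv
    if surv > 1e-12:
        # Assumption: leftover mass goes to the largest time, kept as censored
        support.append(t[-1])
        mass.append(surv)
        is_event.append(0)
    p = np.array(mass) / np.sum(mass)
    k = rng.choice(len(support), p=p)
    return support[k], is_event[k]

def impute_once(time, event, nn, rng):
    """Impute every censored subject once from its nn nearest at-risk subjects."""
    time_imp, event_imp = time.copy(), event.copy()
    for j in np.flatnonzero(event == 0):
        at_risk = np.flatnonzero(time > time[j])
        if at_risk.size == 0:
            continue  # nobody left at risk: leave the subject censored
        dist = np.abs(time[at_risk] - time[j])
        nearest = at_risk[np.argsort(dist, kind="stable")[:nn]]
        time_imp[j], event_imp[j] = km_draw(time[nearest], event[nearest], rng)
    return time_imp, event_imp
```

Under these assumptions, multiple imputation would repeat `impute_once` several times with different random draws, analyze each completed data set, and combine the results; the combining step is not shown.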

Many real time series data sets exhibit
structural changes over time. It is then of interest to estimate both the
(unknown) number of structural break points and the parameters of
the statistical model employed to capture the relationships amongst the
variables/features of interest. An additional challenge emerges in the presence
of very large data sets, namely how to accomplish these two objectives in a
computationally efficient manner. In this talk, we outline a novel procedure
which leverages a block segmentation scheme (BSS) that reduces the number of
model parameters to be estimated through a regularized least squares criterion.
Specifically, BSS examines appropriately defined blocks of the available data,
which when combined with a fused lasso based estimation criterion, leads to
significant computational gains without compromising on the statistical
accuracy in identifying the number and location of the structural breaks. This
procedure is further coupled with new local and global screening steps to
consistently estimate the number and location of break points. The procedure is
scalable to large, high-dimensional time series data sets and can provably
achieve significant computational gains. It is further applicable to various
statistical models, including regression, graphical models and
vector-autoregressive models. Extensive numerical work on synthetic data
supports the theoretical findings and illustrates the attractive properties of
the procedure. Applications to neuroimaging data will also be discussed.
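
To make the block idea concrete, one way to write a block-level fused lasso criterion for a linear regression with time-varying coefficients is sketched below. The notation (T observations, blocks of length b, K = T/b blocks, increments \theta_k, tuning parameter \lambda) is assumed here for illustration and may differ from the formulation used in the talk.

```latex
% Assumed notation: y_t is the response, x_t the covariate vector,
% block k covers t = (k-1)b + 1, ..., kb, and the regression
% coefficient in block k is \sum_{j \le k} \theta_j.
\[
  \hat{\theta} = \arg\min_{\theta_1, \dots, \theta_K}
  \frac{1}{T} \sum_{k=1}^{K} \sum_{t=(k-1)b+1}^{kb}
  \Bigl( y_t - x_t^{\top} \sum_{j=1}^{k} \theta_j \Bigr)^{2}
  + \lambda \sum_{k=2}^{K} \lVert \theta_k \rVert_1
\]
```

In this form the coefficients may change only at block boundaries, so K increment vectors are estimated rather than T, and a nonzero \hat{\theta}_k for k >= 2 flags block k as a candidate location for a break.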

Abstract

We propose a general framework for prediction in which a prediction takes
the form of a distribution function, called a ‘predictive distribution
function’. This predictive distribution function is well suited for prescribing
the notion of confidence under the frequentist interpretation and for providing
meaningful answers to prediction-related questions. Its form as a distribution
function also makes it a useful tool for quantifying uncertainty in prediction. A
general approach under this framework is formulated and illustrated using the
so-called confidence distributions (CDs). This CD-based prediction approach
inherits many desirable properties of CD, including its capacity to serve as a
common platform for directly connecting the existing procedures of predictive
inference in Bayesian, fiducial and frequentist paradigms. We discuss the theory
underlying the CD-based predictive distribution and related efficiency and
optimality. We also propose a simple yet broadly applicable Monte-Carlo
algorithm for implementing the proposed approach. This concrete algorithm,
together with the proposed definition and associated theoretical development,
provides a comprehensive statistical inference framework for prediction.
Finally, the approach is demonstrated by simulation studies and a real project
on predicting the volume of application submissions to a government agency. The
latter shows that the proposed approach applies even in dependent-data
settings.

This is joint work with Jieli Shen (Goldman
Sachs) and Minge Xie (Rutgers University).
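
One plausible reading of a CD-based Monte-Carlo scheme, sketched here for an i.i.d. normal sample with unknown mean and variance, is to draw the parameters from their confidence distributions and then draw a future observation from the model given those parameters. The function name `cd_predictive_draws` and the choice of the normal model are assumptions for illustration; the algorithm in the paper may differ in its details.

```python
import numpy as np

def cd_predictive_draws(y, n_draws, rng):
    """Monte-Carlo draws from a CD-based predictive distribution (normal model)."""
    y = np.asarray(y, dtype=float)
    n = y.size
    ybar, s2 = y.mean(), y.var(ddof=1)
    # CD draw for sigma^2: (n - 1) s^2 / chi-square with n - 1 degrees of freedom
    sigma2 = (n - 1) * s2 / rng.chisquare(n - 1, size=n_draws)
    # CD draw for mu given sigma^2: N(ybar, sigma^2 / n)
    mu = rng.normal(ybar, np.sqrt(sigma2 / n))
    # Future observation drawn from the model at the sampled parameters
    return rng.normal(mu, np.sqrt(sigma2))
```

In this normal setting the draws follow ybar + s * sqrt(1 + 1/n) * t with n - 1 degrees of freedom, which is the classical frequentist prediction distribution, so empirical quantiles of the draws can be read as prediction limits.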

Artificial intelligence (AI), or deep learning, and big data have attracted
great attention in recent years. The availability of big data and advances in
computational methods and capability further accelerate the development of
machine learning. There is no doubt that AI will affect every aspect of our
lives in the future. In this talk, we discuss the following issues: (a) What is
AI? (b) What is machine learning? (c) What role do data play in the development
of AI? (d) What is the potential impact of AI? (e) How should we prepare for
the AI challenges? The talk will emphasize the value of data and the importance
of statistical reasoning and methods in the development of smart AI.

This paper is about how we study
statistical methods. As an example, it uses the random regressions model, in
which the intercept and slope of cluster-specific regression lines are modeled
as a bivariate random effect. Maximizing this model's restricted likelihood
often gives a boundary value for the random effect correlation or variances. We
argue that this is a problem; that it is a problem because our discipline has
little understanding of how contemporary models and methods map data to
inferential summaries; that we lack such understanding, even for models as
simple as this, because of a near-exclusive reliance on mathematics as a means
of understanding; and that math alone is no longer sufficient. We then argue
that as a discipline, we can and should break open our black-box methods by
mimicking the five steps that molecular biologists commonly use to break open
Nature's black boxes: design a simple model system, formulate hypotheses using
that system, test them in experiments on that system, iterate as needed to
reformulate and test hypotheses, and finally test the results in an "in
vivo" system. We demonstrate this by identifying conditions under which
the random-regressions restricted likelihood is likely to be maximized at a
boundary value. Resistance to this approach seems to arise from a view that it
lacks the certainty or intellectual heft of mathematics, perhaps because
simulation experiments in our literature rarely do more than measure a new
method's operating characteristics in a small range of situations. We argue
that such work can make useful contributions including, as in molecular
biology, the findings themselves and sometimes the designs used in the five
steps; that these contributions have as much practical value as mathematical
results; and that therefore they merit publication as much as the mathematical
results our discipline esteems so highly.
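
For reference, the random regressions model discussed above can be written as follows. The symbols are assumed here for illustration and are not taken from the paper.

```latex
% Assumed notation: cluster i, observation j within cluster i.
\[
  y_{ij} = (\beta_0 + u_{0i}) + (\beta_1 + u_{1i})\, x_{ij} + \varepsilon_{ij},
  \qquad \varepsilon_{ij} \sim N(0, \sigma_e^2),
\]
\[
  \begin{pmatrix} u_{0i} \\ u_{1i} \end{pmatrix}
  \sim N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix},
  \begin{pmatrix} \sigma_0^2 & \rho\,\sigma_0\sigma_1 \\
                  \rho\,\sigma_0\sigma_1 & \sigma_1^2 \end{pmatrix} \right)
\]
```

In this notation a boundary value means that the restricted likelihood is maximized at \hat{\rho} = \pm 1, \hat{\sigma}_0^2 = 0, or \hat{\sigma}_1^2 = 0.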

Abstract

The mixture of factor analyzers (MFA) model has emerged as a useful tool to
perform dimensionality reduction and model-based clustering of heterogeneous
data. In seeking the most appropriate number of factors (q) of an MFA model
with the number of components (g) fixed a priori, a two-stage procedure is
commonly implemented by first carrying out parameter estimation over a set of
prespecified numbers of factors, and then selecting the best q according to
certain penalized likelihood criteria. As the dimensionality of the data
grows, such a procedure can be computationally prohibitive. To overcome this
obstacle, we develop an automated learning scheme, called the automated MFA
(AMFA) algorithm, to effectively merge parameter estimation and selection of q
into a one-stage algorithm. The proposed AMFA procedure, which allows for a much
lower computational cost, is also extended to accommodate missing values.
Moreover, we explicitly derive the score vector and the empirical information
matrix for calculating standard errors associated with the estimated
parameters. The potential and applicability of the proposed method are
demonstrated through a number of real datasets with genuine and synthetic missing
values.

Keywords: automated learning; factor
analysis; maximum likelihood estimation; missing values; model selection;
one-stage algorithm
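
As a reminder of the setting, an MFA density with g components and q factors can be written as below, together with one common penalized likelihood criterion used in a two-stage search over q. The notation and the choice of BIC as the example criterion are assumptions for illustration; the abstract does not state which criteria are used.

```latex
% Assumed notation: y is p-dimensional, \pi_i are mixing proportions,
% B_i is a p x q loading matrix, D_i is a diagonal covariance matrix,
% \phi_p is the p-variate normal density, n is the sample size, and
% m_q is the number of free parameters when q factors are used.
\[
  f(y; \Theta) = \sum_{i=1}^{g} \pi_i \,
  \phi_p\!\left( y;\ \mu_i,\ B_i B_i^{\top} + D_i \right),
  \qquad
  \mathrm{BIC}(q) = -2 \log L(\hat{\Theta}_q) + m_q \log n
\]
```

A two-stage procedure fits the model once for every candidate q and keeps the q with the smallest criterion value, which is the repeated fitting that a one-stage algorithm aims to avoid.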

Abstract

Cancer immunotherapy has dramatically transformed the
treatment landscape of advanced cancers. There are now seven FDA-approved immune
checkpoint inhibitors targeting key immune checkpoint proteins, such as
programmed death receptor-1 (PD-1), its ligand (PD-L1), or cytotoxic
T-lymphocyte-associated protein 4 (CTLA-4). For some patients, these agents
extend lifespan and provide durable benefits. However, the majority of patients
receiving these treatments do not benefit from them: response rates in clinical
trials range from 10% to 50%, and a fraction suffer from adverse immune toxicities.
There is an urgent need for reliable predictors of immunotherapy response.
Since the first report of cancer genome sequencing in 2006, we have gained a
considerable understanding of the cell-autonomous effects—the effects induced
upon cancer cells—caused by genomic alterations. Yet, how they affect the tumor
microenvironment remains unclear. We studied the non-cell-autonomous effects,
especially those affecting tumor-immune interactions, of the same cancer
genomic alterations. The results highlight cancer predisposition and driver
genes that may serve as biomarkers for prioritizing patient sub-populations for
treatment. Notably, they also suggest response and resistance mechanisms that
can be targeted to improve cancer immunotherapy.