Entropy Based Statistical Inference for Some HDLSS Genomic Models: UI Tests in a Chen-Stein Perspective
- 2010-05-24 (Mon.), 10:30 AM
- Auditorium, 2F, Tsai Yuan-Pei Memorial Hall
- Prof. Ming-Tien Tsai
- Institute of Statistical Science, Academia Sinica
Abstract
One of the scientific foci is to classify the K genes into two subsets: disease genes and non-disease genes. For HDLSS (high-dimensional, low-sample-size) categorical data models, the number of associated parameters increases exponentially with K, creating an impasse for adapting conventional discrete multivariate analysis or model selection tools. In this rather awkward environment, statistical appraisals are often based on marginal p-values, where the multiple hypothesis testing (MHT) problem can be handled with the original Fisher's method (developed nearly 80 years ago) along with various ramifications developed during the past 25 years or so. On the other hand, just as maximum likelihood is the dominant paradigm in statistics, Shannon entropy (1948) is the dominant paradigm in information and coding theory. For qualitative data models, the Gini-Simpson index (Gini, 1912; Simpson, 1949) and Shannon entropy are commonly used in dissimilarity and diversity analysis, economic inequality and poverty analysis, and genetic variation studies, as well as in many other fields. By the Lorenz curve, we can show that Shannon entropy appears to be more informative than the Gini-Simpson index. However, for HDLSS genomic models, we suspect that the information might not be fully captured in a pseudo-marginal setup (namely, the so-called multivariate version of Shannon entropy in the literature). To capture greater information, some new genuine multivariate analogues of Shannon entropy are proposed. The nested subset monotonicity prospect, along with the subgroup decomposability of the proposed new measures, is also exploited. Based on the proposed new Hamming-Shannon pooled measures, we incorporate the union-intersection principle of Roy (1953) and the Chen-Stein theorem (Chen, 1975) to formulate suitable statistical procedures for gene classification. The SARSCoV data set is appraised as an illustration. This is joint work with Prof. P.K. Sen.
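For readers less familiar with the classical quantities named in the abstract, the following minimal sketch computes the Shannon entropy and the Gini-Simpson index of a single categorical distribution, together with Fisher's combined p-value for independent marginal tests. It covers only these textbook definitions, not the new Hamming-Shannon pooled measures proposed in the talk. The function names and the example probability vectors are hypothetical, chosen purely for illustration, and the Fisher combination assumes independent p-values.

```python
import math

def shannon_entropy(p):
    """Shannon entropy -sum p_c log p_c (natural log); zero cells contribute 0."""
    return -sum(q * math.log(q) for q in p if q > 0)

def gini_simpson(p):
    """Gini-Simpson index 1 - sum p_c^2."""
    return 1.0 - sum(q * q for q in p)

def fisher_combined_pvalue(pvalues):
    """Fisher's method: -2 * sum(log p_i) is chi-square with 2k degrees of
    freedom under the null, ASSUMING the k p-values are independent.
    For even degrees of freedom the upper tail has a closed form:
    exp(-x/2) * sum_{j<k} (x/2)^j / j!."""
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # equals x / 2
    term, tail = 1.0, 0.0
    for j in range(k):
        if j > 0:
            term *= half_stat / j
        tail += term
    return math.exp(-half_stat) * tail

# Hypothetical category probabilities at one position (e.g. four nucleotides).
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.70, 0.10, 0.10, 0.10]

for name, p in [("uniform", uniform), ("skewed", skewed)]:
    print(name, round(shannon_entropy(p), 4), round(gini_simpson(p), 4))

# Hypothetical marginal p-values from three positions.
print(fisher_combined_pvalue([0.04, 0.20, 0.10]))
```

Both diversity measures are largest for the uniform distribution and shrink as the distribution concentrates on one category. A single p-value passed to `fisher_combined_pvalue` is returned unchanged, which serves as a quick sanity check on the closed-form tail.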