Classification of Imbalanced Data Set


In classification and prediction, when the class sizes in the training sample are unequal, the derived classification rule generally favors the majority (larger) class, which then has high sensitivity at the expense of the minority class. Standard classification algorithms are designed to minimize the number of incorrect predictions or, equivalently, to maximize the number of concordances. This criterion assumes equal misclassification costs. In many applications, such as clinical diagnostic tests for rare diseases, the interest is in correctly predicting minority-class samples (high sensitivity). Maximizing concordance may therefore be inappropriate when class sizes are imbalanced or misclassification costs are unequal. I will present two approaches that account for imbalanced class sizes: 1) adjusting the decision threshold, and 2) ensemble classifiers built with resampling techniques. Examples are given for illustration.
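
As a rough illustration of the two approaches, the following Python sketch uses a small made-up data set (eight majority samples labeled 0, two minority samples labeled 1). Everything in it is an assumption for illustration rather than the method of the talk: the toy values, the function names, the choice of random undersampling of the majority class as the resampling technique, and the nearest-centroid base classifier. The first part shows how lowering the decision threshold trades specificity for minority-class sensitivity; the second trains one simple classifier per balanced subsample and combines them by majority vote.

```python
import random

# Hypothetical toy data: one score/feature per sample, label 1 = minority class.
X = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.45, 0.40, 0.70]
Y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]


def sensitivity_specificity(scores, labels, threshold):
    """Predict class 1 when score >= threshold; return (sensitivity, specificity)."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def balanced_subsamples(labels, n_models, seed=0):
    """Index lists holding every minority sample plus an equally sized
    random draw (without replacement) from the majority class."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    return [minority + rng.sample(majority, len(minority))
            for _ in range(n_models)]


def fit_centroids(xs, ys):
    """Base classifier (assumed for illustration): the two class means."""
    x0 = [x for x, y in zip(xs, ys) if y == 0]
    x1 = [x for x, y in zip(xs, ys) if y == 1]
    return sum(x0) / len(x0), sum(x1) / len(x1)


def fit_ensemble(xs, ys, n_models=11, seed=0):
    """Fit one nearest-centroid model per balanced subsample."""
    models = []
    for idx in balanced_subsamples(ys, n_models, seed):
        models.append(fit_centroids([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def predict_ensemble(models, x):
    """Majority vote: each model votes 1 if x is closer to its minority mean."""
    votes = sum(1 for m0, m1 in models if abs(x - m1) < abs(x - m0))
    return 1 if 2 * votes > len(models) else 0


# Approach 1: the default 0.5 cut-off versus a lowered threshold.
default_rates = sensitivity_specificity(X, Y, 0.5)   # (0.5, 1.0)
adjusted_rates = sensitivity_specificity(X, Y, 0.4)  # (1.0, 0.875)

# Approach 2: ensemble over balanced (undersampled) training sets.
ensemble = fit_ensemble(X, Y)
```

On this toy data, moving the threshold from 0.5 to 0.4 raises minority-class sensitivity from 0.5 to 1.0 while specificity drops from 1.0 to 0.875. In practice the threshold and the number of ensemble members would be chosen on held-out data, according to the relative misclassification costs.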
