
Seminar Announcement


A Regression Tree Approach to Missing Values and Explainable AI

  • 2026-05-25 (Mon.), 10:30 AM
  • Auditorium, B1F, Institute of Statistical Science; tea reception at 10:10 AM.
  • The lecture will be held in person with simultaneous online streaming.
  • Prof. Wei-Yin Loh (羅偉賢 教授)
  • Department of Statistics, University of Wisconsin–Madison, USA

Abstract

Classification and regression tree models are unmatched for their interpretability, a feature that is lacking in "black-box" models such as those constructed by deep learning, tree ensembles, and gradient boosting. Yet tree models have been falling out of favor in recent years for two reasons.  First, the prediction accuracy of tree models tends to be lower than that of black-box models. In particular, tree models often have lower accuracy than random forest models. Consequently, forests have largely supplanted trees for prediction tasks. The second reason is equally important. Tree algorithms that use the CART (Breiman et al., 1984) approach are overly greedy in their search for a variable to split each node.  As a result, they have a propensity to select variables (e.g., categorical variables with large numbers of levels) that allow more splits. This makes interpretation of their tree structures problematic.
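The selection bias described above is easy to reproduce in a small simulation (a sketch of the phenomenon, not of CART itself): with a pure-noise response, a greedy exhaustive split search almost always prefers the predictor that offers more candidate splits, here a 16-level categorical variable over a 2-level one.

```python
import numpy as np

def best_split_sse_reduction(x_cat, y):
    """Greedy CART-style search on a categorical predictor: order the
    levels by mean response, scan the resulting binary partitions, and
    return the largest reduction in sum of squared errors (SSE)."""
    levels = np.unique(x_cat)
    order = sorted(levels, key=lambda l: y[x_cat == l].mean())
    total_sse = ((y - y.mean()) ** 2).sum()
    best = 0.0
    for cut in range(1, len(order)):
        left = np.isin(x_cat, order[:cut])
        yl, yr = y[left], y[~left]
        sse = ((yl - yl.mean()) ** 2).sum() + ((yr - yr.mean()) ** 2).sum()
        best = max(best, total_sse - sse)
    return best

rng = np.random.default_rng(0)
n, reps = 200, 200
wins_high_card = 0
for _ in range(reps):
    y = rng.normal(size=n)           # pure noise: neither predictor is informative
    a = rng.integers(0, 2, size=n)   # 2-level predictor (1 candidate split)
    b = rng.integers(0, 16, size=n)  # 16-level predictor (up to 15 candidate splits)
    if best_split_sse_reduction(b, y) > best_split_sse_reduction(a, y):
        wins_high_card += 1
print(wins_high_card / reps)  # close to 1: the 16-level variable nearly always wins
```

Because the many-level variable gets more chances to fit the noise, a greedy search selects it almost every time, even though both predictors are pure noise.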

This talk introduces some new features in the GUIDE algorithm (Loh, 2002), which is designed to be free of the selection bias of CART, not only in the selection of variables for splitting nodes, but also in its importance scores (importance scores of Breiman's (2001) random forest are biased). Unbiasedness is not, however, the only desirable property of GUIDE. Another unique feature is how GUIDE deals with missing values in the predictor variables. While most algorithms employ implicit imputation of missing values (such as CART's surrogate splits) or send them randomly to the left and right subnodes at each split (such as CTREE in the party and partykit R packages), GUIDE does not impute at all.  Therefore, it does not require missing at random (MAR) assumptions that are often difficult, if not impossible, to justify.  GUIDE treats missing values as qualitative information and its tree diagrams show explicitly where missing values go at every split.
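A minimal sketch of the no-imputation idea (an illustration only, not GUIDE's actual procedure): at a candidate split, both possible destinations for the missing values are evaluated, and the better assignment is written into the split rule itself, so missingness is treated as information rather than something to impute.

```python
import numpy as np

def split_with_missing(x, y, threshold):
    """Evaluate the split `x <= threshold`, trying both destinations for
    missing values, and return the better rule. The chosen direction
    becomes part of the split rule itself -- no imputation, no MAR
    assumption."""
    miss = np.isnan(x)
    base_left = (x <= threshold) & ~miss
    def sse(left_mask):
        yl, yr = y[left_mask], y[~left_mask]
        return ((yl - yl.mean()) ** 2).sum() + ((yr - yr.mean()) ** 2).sum()
    candidates = {"NA goes left": base_left | miss, "NA goes right": base_left}
    rule = min(candidates, key=lambda k: sse(candidates[k]))
    return rule, sse(candidates[rule])

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 300)
y = np.where(x <= 5, 0.0, 1.0) + rng.normal(scale=0.1, size=300)
# values go missing only where x is large (missing NOT at random)
miss_idx = rng.choice(np.where(x > 5)[0], size=40, replace=False)
x[miss_idx] = np.nan

rule, _ = split_with_missing(x, y, 5.0)
print(rule)  # "NA goes right": missing cases join the high-response branch
```

Because the missing cases carry high responses, the search routes them with the right subnode, and a tree diagram built this way can display that routing explicitly at the split.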

For a long time, the lower prediction accuracy of tree models versus black-box models seemed inevitable: you get either interpretability or accuracy, but not both. The current emphasis on "explainable AI" has renewed interest in algorithms that produce single-tree models with predictive accuracy on par with black-box models. The primary reason for traditional tree models having lower prediction accuracy is their restriction to splits on a single variable at a time, and to predicting the response in each terminal node with a constant. These restrictions were meant to improve interpretability. One way to soften these restrictions while retaining explainability and improving accuracy is to construct trees with linear splits and linear models in the nodes. These two ideas have been tried before but never together. They are now implemented in the GUIDE algorithm and software. Empirical evidence based on real data shows that these new GUIDE models have predictive accuracy comparable to or better than that of random forests, neural nets, and gradient-boosted trees. The new GUIDE models can approximately "explain" which variables are utilized and how they are utilized by a black-box model. Alternatively, they can be highly accurate "explainable" replacements for them.
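The two relaxations can be combined in a toy one-split "linear tree" (a hypothetical sketch, not GUIDE's search): the split is on a linear combination of the predictors, found here via the linear-discriminant direction between the low- and high-response halves, and each resulting node gets its own least-squares model instead of a constant.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400
X = rng.normal(size=(n, 2))
z = X[:, 0] + X[:, 1]  # true regime boundary: z = 0, an oblique (linear) split
y = np.where(z <= 0, 2.0 + X[:, 0], -2.0 + X[:, 1]) + rng.normal(scale=0.1, size=n)

# Linear split: LDA direction between the low- and high-response halves,
# a simple stand-in for a proper linear-split search.
hi = y > np.median(y)
Sw = np.cov(X[hi].T) + np.cov(X[~hi].T)  # pooled within-group scatter
w = np.linalg.solve(Sw, X[hi].mean(axis=0) - X[~hi].mean(axis=0))
left = X @ w <= np.median(X @ w)

def ols_fit_predict(Xn, yn):
    """Least-squares fit with intercept; returns in-sample predictions."""
    A = np.column_stack([np.ones(len(Xn)), Xn])
    beta, *_ = np.linalg.lstsq(A, yn, rcond=None)
    return A @ beta

pred = np.empty(n)
pred[left] = ols_fit_predict(X[left], y[left])    # linear model in each node
pred[~left] = ols_fit_predict(X[~left], y[~left])
rmse_tree = np.sqrt(np.mean((pred - y) ** 2))
rmse_global = np.sqrt(np.mean((ols_fit_predict(X, y) - y) ** 2))
print(rmse_tree, rmse_global)  # in-sample, per-node fits improve on the global fit
```

Even this crude two-node model outperforms a single global linear fit on the piecewise-linear target, while the fitted object remains a readable rule: one linear split plus two linear equations.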
For the online live stream, please click the link.

Last updated: 2026-03-25 16:58