Using Machine Learning to Model Information Complexity in Risk Adjustment

Wednesday, June 15, 2016: 12:40 PM
B26 (Stiteler Hall)

Author(s): Jeffrey S. McCullough; Tsan-Yao Huang

Discussant: Colleen M. Carey

Health care has undergone a transformation in both the scale and variety of data. Information technologies such as electronic health record (EHR) systems provide a rapidly expanding source of clinical data, while the scale of more traditional data sources, such as administrative claims, continues to rise. Fields such as computer science and statistics have developed a wide range of machine learning, or “big data,” analytic techniques. These techniques are particularly powerful for prediction in stationary environments and for signal detection in high-dimensional data. While they are adept at out-of-sample prediction, they are typically atheoretical and may produce biased parameter estimates.

We use these techniques to explore three issues related to clinical risk adjustment. First, we compare the accuracy of machine learning techniques to traditional algorithms for both in- and out-of-sample risk adjustment. Second, we compare the relative performance of these models across clinical conditions with more or less information complexity. Acute surgical conditions or acute myocardial infarctions (AMIs) may, for example, carry a high mortality risk while their outcomes depend on a fairly concise set of clinical information. Other conditions that require medical management, such as pneumonia, may depend on a broader set of clinical information. Modeling these conditions may also require more flexible specifications. Machine learning techniques may be especially beneficial for conditions with high information complexity. Third, we use the risk adjustment algorithms in a model of hospital quality with endogenous choice. We measure the effect of Critical Access Hospitals (CAHs) on quality while using machine learning techniques to allow for heterogeneous treatment effects. The models are identified using differential distances between the closest CAH and non-CAH hospitals.
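The differential-distance identification strategy above can be sketched in a few lines. This is an illustrative implementation, not the authors' code: the function name `differential_distance` and the planar-coordinate representation of patient and hospital locations are assumptions made for the example (applied work would typically use driving distance or travel time from geocoded addresses).

```python
import math

def differential_distance(patient_xy, cah_sites, non_cah_sites):
    """Distance from a patient to the nearest CAH minus the distance
    to the nearest non-CAH hospital. Negative values indicate the CAH
    is relatively closer, shifting the hospital-choice probability
    without (by assumption) directly affecting outcomes."""
    d_cah = min(math.dist(patient_xy, site) for site in cah_sites)
    d_non = min(math.dist(patient_xy, site) for site in non_cah_sites)
    return d_cah - d_non

# Hypothetical example: nearest CAH is 5 units away, nearest non-CAH 10.
diff = differential_distance((0.0, 0.0), [(3.0, 4.0)], [(6.0, 8.0)])
```

The sign and magnitude of this differential distance serve as the instrument: patients for whom a CAH is much closer than any alternative are more likely to select into CAH treatment for reasons plausibly unrelated to unobserved severity.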

These models are estimated using Medicare Fee-For-Service (FFS) administrative claims data. These data describe all hospitalizations and all ER visits for the 2011 FFS Medicare population. We compare traditional risk adjustment algorithms to a variety of machine learning techniques. We first explore penalized regressions such as the least absolute shrinkage and selection operator (Lasso). These models employ a penalty function to select relevant covariates from a high-dimensional data vector. We then explore a variety of regression-tree-based models, such as boosting, bagging, and random forests. Tree-based models may be more adept at handling non-linear inputs. The hospital selection and outcome models are estimated by recursive bivariate probit using a double-selection process to correct for bias (Belloni et al., 2012 and 2014).
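The model comparison described above can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' estimation code: the feature matrix, outcome process, and tuning values (`C=0.1`, `n_estimators=200`) are assumptions made for the example. An L1-penalized logistic regression stands in for the Lasso-style covariate selection, and a random forest stands in for the tree-based models, with out-of-sample AUC as the accuracy metric.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a high-dimensional claims feature vector.
rng = np.random.default_rng(0)
n, p = 2000, 50
X = rng.normal(size=(n, p))

# Mortality-like outcome with a linear term plus a non-linear interaction,
# the kind of structure tree-based models handle more flexibly.
logits = X[:, 0] + 0.5 * X[:, 1] * X[:, 2]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# L1 penalty shrinks irrelevant coefficients exactly to zero (covariate selection).
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X_tr, y_tr)
# Ensemble of bagged trees, better suited to non-linear inputs.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Out-of-sample discrimination for each risk adjustment model.
auc_lasso = roc_auc_score(y_te, lasso.predict_proba(X_te)[:, 1])
auc_forest = roc_auc_score(y_te, forest.predict_proba(X_te)[:, 1])
```

In practice each condition-specific cohort would be scored this way, allowing the out-of-sample gap between the penalized regression and the tree ensembles to be compared across conditions of differing information complexity.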

Preliminary results suggest that machine learning techniques provide more accurate risk adjustment for all clinical conditions. However, the gains are small (e.g., a 1 to 2 percentage point improvement) for conditions such as AMI but substantially larger (e.g., about 8 percentage points) for conditions such as pneumonia.