29
The Limitations of Case-Control Matching when Estimating Average Treatment Effects

Monday, June 23, 2014
Argue Plaza

Author(s): Aaron C Miller

Discussant:

Case-control matching is frequently used in the fields of medicine and epidemiology to estimate average treatment effects (ATEs). In theory, when matching is properly implemented, it should produce estimates of ATEs similar to those of multivariate regression. In practice, however, matching may suffer from limitations beyond those of basic regression analysis: exact matching can accommodate only a limited number of control variables before case observations must be eliminated, and the estimation does not directly evaluate the strength of those control variables.

We demonstrate the potential dangers of matching by estimating the length of stay (LOS) attributable to nosocomial Clostridium difficile infections (CDI). Increased LOS is the most costly outcome associated with healthcare-associated CDI. The existing literature has produced a wide range of estimates of CDI-attributable LOS. Many studies have relied on exact matching to estimate attributable LOS, and estimates produced using matching techniques are larger and more variable than those using regression. We compare estimates of attributable LOS from various regression and matching methods using the 2009 HCUP Nationwide Inpatient Sample.

First, we estimate attributable LOS using a variety of commonly used regression models. These methods produce fairly consistent estimates, ranging from 3.0 days with a gamma-based GLM with a log link to 3.55 days with OLS. Second, we estimate attributable LOS by matching patients on various sets of control variables that have been frequently used in the matching literature. Matching on these sets of controls produces estimates ranging from 6.0 to 7.3 days. Using such matching estimators forced us to eliminate anywhere from 36% to 47% of the CDI cases in the sample.
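To make the trade-off between the number of match variables and the number of discarded cases concrete, the following is a minimal sketch of an exact-matching estimator. It is not the authors' implementation: the function name `exact_match_estimate`, the field names (`cdi`, `los`, `age_group`, `sex`), and the toy records are all hypothetical, and the sketch assumes each case is compared with the mean LOS of every control that shares its exact covariate profile.

```python
from collections import defaultdict
from statistics import mean


def exact_match_estimate(rows, match_vars, treat="cdi", outcome="los"):
    """Mean outcome difference between cases and their exactly matched controls.

    rows: list of dicts (hypothetical record layout).
    match_vars: keys on which a case and a control must agree exactly.
    Returns (estimate, share_of_cases_dropped).
    """
    # Pool control outcomes by exact covariate profile.
    controls = defaultdict(list)
    for r in rows:
        if not r[treat]:
            controls[tuple(r[v] for v in match_vars)].append(r[outcome])

    diffs, n_cases = [], 0
    for r in rows:
        if r[treat]:
            n_cases += 1
            pool = controls.get(tuple(r[v] for v in match_vars))
            if pool:  # a case with no exact match is discarded
                diffs.append(r[outcome] - mean(pool))
    if not diffs:
        raise ValueError("no case has an exact match")
    return mean(diffs), 1 - len(diffs) / n_cases


# Toy data, invented purely for illustration.
rows = [
    {"cdi": 1, "age_group": "65+", "sex": "F", "los": 12},
    {"cdi": 0, "age_group": "65+", "sex": "F", "los": 6},
    {"cdi": 0, "age_group": "65+", "sex": "F", "los": 8},
    {"cdi": 1, "age_group": "18-44", "sex": "M", "los": 9},
    {"cdi": 0, "age_group": "18-44", "sex": "F", "los": 3},
]

one_var = exact_match_estimate(rows, ["age_group"])
two_vars = exact_match_estimate(rows, ["age_group", "sex"])
print(one_var)   # both cases matched
print(two_vars)  # the second case has no exact match and is dropped
```

In this toy example, adding `sex` as a second match variable leaves one of the two cases without an exact control, so it falls out of the estimate. This is the same mechanism that eliminates cases as the set of match variables grows.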

We then attempt to find an ideal set of control variables to match on using informed and naive search strategies. We first use regression to find the strongest predictors of LOS and iteratively match on variables in order of their predictive strength. Our informed matching process produces an estimate of 3.62 days while requiring only five control variables and omitting less than 1% of the CDI observations. However, the optimal match variables are different from those reported in the literature. Next, we implement a naive search by repeatedly drawing random sets of control variables. This naive search demonstrates that the control variables most frequently used in the matching literature for CDI produce estimates of the attributable LOS that are no better than those produced from randomly drawing a set of match variables.
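The two search strategies can be sketched roughly as follows. This is an illustration under stated assumptions, not the procedure used in the study: all function and field names are hypothetical, the data are invented, and predictive strength is approximated by the share of LOS variance explained by each variable on its own (the R^2 of an OLS fit on that variable's indicator dummies), computed over all records, rather than by the regression models the abstract describes.

```python
import random
from collections import defaultdict
from statistics import mean


def match_estimate(rows, match_vars, treat="cdi", outcome="los"):
    """Exact-matching estimate; returns (estimate or None, share of cases dropped)."""
    pools = defaultdict(list)
    for r in rows:
        if not r[treat]:
            pools[tuple(r[v] for v in match_vars)].append(r[outcome])
    diffs, n_cases = [], 0
    for r in rows:
        if r[treat]:
            n_cases += 1
            key = tuple(r[v] for v in match_vars)
            if key in pools:  # unmatched cases are discarded
                diffs.append(r[outcome] - mean(pools[key]))
    return (mean(diffs) if diffs else None), 1 - len(diffs) / n_cases


def r_squared(rows, var, outcome="los"):
    """Share of outcome variance explained by the group means of one variable."""
    ys = [r[outcome] for r in rows]
    grand = mean(ys)
    total = sum((y - grand) ** 2 for y in ys)
    groups = defaultdict(list)
    for r in rows:
        groups[r[var]].append(r[outcome])
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    return between / total if total else 0.0


def informed_search(rows, candidates, max_vars=5):
    """Match on 1, 2, ... variables, added in descending predictive strength."""
    ranked = sorted(candidates, key=lambda v: r_squared(rows, v), reverse=True)
    return [(ranked[:k], *match_estimate(rows, ranked[:k]))
            for k in range(1, min(max_vars, len(ranked)) + 1)]


def naive_search(rows, candidates, set_size, draws, seed=0):
    """Match on randomly drawn variable sets of a fixed size."""
    rng = random.Random(seed)
    results = []
    for _ in range(draws):
        subset = rng.sample(candidates, set_size)
        results.append((subset, *match_estimate(rows, subset)))
    return results


# Toy data, invented for illustration: severity predicts LOS, ward barely does.
rows = [
    {"cdi": 0, "severity": "low", "ward": "A", "los": 2},
    {"cdi": 0, "severity": "low", "ward": "B", "los": 4},
    {"cdi": 0, "severity": "high", "ward": "A", "los": 10},
    {"cdi": 0, "severity": "high", "ward": "B", "los": 12},
    {"cdi": 1, "severity": "high", "ward": "A", "los": 14},
    {"cdi": 1, "severity": "low", "ward": "B", "los": 6},
]
candidates = ["ward", "severity"]

informed = informed_search(rows, candidates)
naive = naive_search(rows, candidates, set_size=1, draws=10)
```

Each result is a tuple of the match variables used, the resulting estimate, and the share of cases dropped. Comparing the spread of the naive estimates with the estimate from a literature-based variable set gives a rough sense of whether that set does any better than a random draw.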

Our results demonstrate that, in practice, case-control matching can be severely limited by the need to specify an optimal set of control variables before analysis occurs. Moreover, we were able to obtain an optimal set of matched controls only after first using regression to determine the strongest predictors of LOS. Given that matching is frequently employed in settings where control variables are chosen before observations are gathered, this represents a significant limitation of case-control matching approaches. Our findings suggest that, while exact matching methods may be valid estimators of ATEs in theory, they may be inferior to regression in practice.