Specification in difference-in-differences models

Tuesday, June 24, 2014: 1:35 PM
Waite Phillips 207 (Waite Phillips Hall)

Author(s): Andrew M. Ryan

Discussant: Jason M. Hockenberry

The use of difference-in-differences models to estimate the effects of health care policies has exploded: the number of articles in PubMed in which “difference-in-differences” appears increased four-fold between 2007 and 2012. The ability of difference-in-differences methods to support accurate inferences about treatment effects, however, depends crucially on specification choices. Work by Bertrand and colleagues (2004) highlighted that conventional variance estimates of program effects in difference-in-differences models often result in high rates of type I error (false rejection of the null hypothesis). However, specification choices related to obtaining accurate point estimates in difference-in-differences models – including the choice of comparison groups, the choice of the pre-intervention time interval, and the handling of violations of the “parallel trends” assumption – have received scant attention in the literature.
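For reference, the canonical difference-in-differences contrast can be written as a regression with a treated-by-post interaction. The sketch below is illustrative only; the simulated panel and the variable names (quality, treated, post, hospital_id) are assumptions, not the study's data or code.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_hosp, years = 200, range(2004, 2010)

# Hypothetical hospital-by-year panel, 2004-2009.
panel = pd.DataFrame(
    [(h, y) for h in range(n_hosp) for y in years],
    columns=["hospital_id", "year"],
)
panel["treated"] = (panel["hospital_id"] < n_hosp // 2).astype(int)
panel["post"] = (panel["year"] >= 2007).astype(int)
# Simulated outcome with a +0.2 SD effect injected for treated hospitals after 2007.
panel["quality"] = rng.normal(size=len(panel)) + 0.2 * panel["treated"] * panel["post"]

# The treated x post interaction identifies the policy effect, provided the
# parallel-trends assumption holds; errors are clustered at the hospital level.
fit = smf.ols("quality ~ treated + post + treated:post", data=panel).fit(
    cov_type="cluster", cov_kwds={"groups": panel["hospital_id"]}
)
print(fit.params["treated:post"], fit.bse["treated:post"])
```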

We conduct a simulation experiment to evaluate the effects of specification choices in difference-in-differences models. We use hospital-level data on quality of care between 2004 and 2009 from Hospital Compare – Medicare’s public quality reporting program – to estimate the effect of an imaginary policy, initiated in 2007 and continuing through 2009, on quality of care. To do this, we assign hospitals to treatment and comparison groups under three alternative scenarios in which: 1) the probability of selection for treatment is completely random; 2) the probability of treatment is positively correlated with pre-intervention levels of quality; and 3) the probability of treatment is positively correlated with pre-intervention trends in quality. We then attribute one of three effects of the imaginary intervention on quality to hospitals assigned to treatment: no effect, a small effect (+0.2 standard deviations), or a large effect (+0.5 standard deviations).
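A minimal sketch of these three assignment mechanisms follows; the logistic selection rule and the hypothetical pre-period summaries (pre_level, pre_trend) are assumptions for illustration, not necessarily the study's actual assignment procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
n_hosp = 1000
pre_level = rng.normal(size=n_hosp)   # hypothetical mean quality, 2004-2006
pre_trend = rng.normal(size=n_hosp)   # hypothetical within-hospital slope, 2004-2006

def assign(score, strength=1.0):
    """Bernoulli treatment assignment with probability increasing in `score`."""
    p = 1.0 / (1.0 + np.exp(-strength * score))
    return (rng.random(n_hosp) < p).astype(int)

random_tx = assign(np.zeros(n_hosp))  # 1) purely random selection (p = 0.5)
level_tx = assign(pre_level)          # 2) selection on pre-intervention levels
trend_tx = assign(pre_trend)          # 3) selection on pre-intervention trends

# Injected treatment effects: none, +0.2 SD, or +0.5 SD of the quality measure.
effect_sizes = {"none": 0.0, "small": 0.2, "large": 0.5}
```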

We then estimate the impact of the imaginary policy with difference-in-differences under alternative model specifications. Specifications vary with respect to the data interval (using all six years or condensing to a single pre-post comparison), the comparison group (all non-treated hospitals or propensity-score-matched hospitals), and the method of inference (assuming i.i.d. errors, using the cluster-sandwich estimator, or using permutation tests). We capture the rate of false rejection (i.e., type I error), the rate of rejection under “small” and “large” program effects (i.e., statistical power), and the mean absolute deviation (MAD) between estimated program effects and their true value. These quantities are obtained from 600 simulation iterations.
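As one illustration of the inference and accuracy criteria, the sketch below shows a generic permutation test (placebo re-assignment of treatment) and the MAD calculation. The did_estimate argument is a hypothetical user-supplied function returning the treated-by-post coefficient for a given set of treated hospitals; it is not part of the study's code.

```python
import numpy as np

def permutation_pvalue(did_estimate, panel, treated_ids, all_ids, n_perm=1000, seed=0):
    """Two-sided permutation p-value for a difference-in-differences estimate."""
    rng = np.random.default_rng(seed)
    observed = did_estimate(panel, treated_ids)
    placebo = np.empty(n_perm)
    for i in range(n_perm):
        # Re-assign treatment labels at random, holding the number of treated fixed.
        fake_treated = rng.choice(all_ids, size=len(treated_ids), replace=False)
        placebo[i] = did_estimate(panel, fake_treated)
    # Share of placebo estimates at least as extreme as the observed estimate.
    return np.mean(np.abs(placebo) >= np.abs(observed))

def mean_absolute_deviation(estimates, true_effect):
    """MAD between estimated program effects and the injected true effect."""
    return np.mean(np.abs(np.asarray(estimates) - true_effect))
```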

Our results show that when hospitals are randomly assigned to treatment, alternative specifications have only a minor impact on rejection rates and estimator bias. However, the performance of alternative specifications varies dramatically when the probability of treatment is positively correlated with pre-intervention levels or trends. In these cases, the use of propensity score matching yields much more accurate point estimates: MAD values from these specifications are 3 to 10 times lower than from specifications using all non-treated hospitals as the comparison group. Also, the use of permutation tests yields type I error rates that are frequently 10 to 20 times lower than those from specifications assuming i.i.d. standard errors or using the cluster-sandwich estimator. These results support specifications for difference-in-differences models that incorporate matching to obtain more accurate point estimates and permutation tests to obtain better inference.