Specification in difference-in-differences models
We conduct a simulation experiment to evaluate the effects of specification choices in difference-in-differences models. We use hospital-level data on quality of care between 2004 and 2009 from Hospital Compare – Medicare’s public quality reporting program – to estimate the effect of an imaginary policy, initiated in 2007 and continuing through 2009, on quality of care. To do this, we assign hospitals to treatment and comparison groups under three alternative scenarios in which: 1) the probability of selection for treatment is completely random; 2) the probability of treatment is positively correlated with pre-intervention levels of quality; and 3) the probability of treatment is positively correlated with pre-intervention trends in quality. We then apply an effect of the imaginary intervention on quality to hospitals assigned to treatment: no effect, a small effect (+0.2 standard deviations), or a large effect (+0.5 standard deviations).
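As a rough illustration, the sketch below shows in Python how treatment assignment under the three scenarios and the injection of the imaginary effect could be implemented. The column names (hospital_id, year, quality), the noisy-score selection rule, and the 50% treated share are assumptions for exposition, not the procedure used in the study.

```python
# Illustrative sketch (not the study's code) of treatment assignment and
# effect injection for the simulated policy beginning in 2007.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def assign_treatment(df, scenario, share_treated=0.5):
    """Return a boolean treatment indicator per hospital.

    scenario: 'random' - selection is completely random
              'level'  - selection probability rises with pre-2007 quality levels
              'trend'  - selection probability rises with pre-2007 quality trends
    """
    pre = df[df.year < 2007]
    if scenario == 'random':
        hospitals = pre.hospital_id.unique()
        score = rng.normal(size=len(hospitals))
    else:
        if scenario == 'level':
            stat = pre.groupby('hospital_id').quality.mean()
        else:  # 'trend': within-hospital slope of quality on year
            stat = pre.groupby('hospital_id')[['year', 'quality']].apply(
                lambda g: np.polyfit(g.year, g.quality, 1)[0])
        hospitals = stat.index.to_numpy()
        score = ((stat - stat.mean()) / stat.std()).to_numpy() \
            + rng.normal(size=len(stat))
    cutoff = np.quantile(score, 1 - share_treated)
    return pd.Series(score >= cutoff, index=hospitals)

def add_effect(df, treated, effect_sd):
    """Add the imaginary policy effect (in SD units of quality) to treated
    hospitals in post-intervention years (2007-2009)."""
    out = df.copy()
    post = out.year >= 2007
    is_treated = out.hospital_id.map(treated).fillna(False).astype(bool)
    out.loc[post & is_treated, 'quality'] += effect_sd * df.quality.std()
    return out
```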
Using alternative modeling specifications, we then estimate the impact of the imaginary policy using difference-in-differences. Specifications vary with respect to the data interval (all six years of data versus a condensed pre-post comparison), the comparison group (all non-treated hospitals versus propensity-score-matched hospitals), and the method of obtaining inference (assuming i.i.d. errors, the cluster-sandwich estimator, or permutation tests). We capture the rate of false rejection when the true program effect is zero (i.e., type I error), the rate of rejection under “small” and “large” program effects (i.e., statistical power), and the mean absolute deviation (MAD) between estimated program effects and their true value. These parameters are obtained from 600 simulation iterations.
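A minimal sketch of one such specification follows: a two-way fixed-effects difference-in-differences regression with either i.i.d. or cluster-sandwich standard errors, and a simple permutation test that reassigns treatment labels across hospitals. The panel layout (columns hospital_id, year, quality, and a hospital-level treated flag), the use of statsmodels, and the function names are assumptions for illustration rather than the exact estimation code used in the study.

```python
# Illustrative DiD estimation and inference, assuming a panel `df` with
# columns hospital_id, year, quality, and a boolean treated indicator.
import numpy as np
import statsmodels.formula.api as smf

def did_estimate(df, cov_type='nonrobust'):
    """Two-way fixed-effects DiD: quality on treat x post plus hospital and year FE."""
    df = df.assign(post=(df.year >= 2007).astype(int),
                   treat=df.treated.astype(int))
    model = smf.ols('quality ~ treat:post + C(hospital_id) + C(year)', data=df)
    if cov_type == 'cluster':
        fit = model.fit(cov_type='cluster', cov_kwds={'groups': df.hospital_id})
    else:
        fit = model.fit()  # i.i.d. errors
    return fit.params['treat:post'], fit.pvalues['treat:post']

def permutation_pvalue(df, n_perm=500, seed=0):
    """Permutation test: reassign treatment across hospitals and compare the
    observed DiD estimate with the placebo distribution."""
    rng = np.random.default_rng(seed)
    obs, _ = did_estimate(df)
    hospitals = df.hospital_id.unique()
    n_treated = int(df.groupby('hospital_id').treated.first().sum())
    placebo = []
    for _ in range(n_perm):
        fake = set(rng.choice(hospitals, size=n_treated, replace=False))
        shuffled = df.assign(treated=df.hospital_id.isin(fake))
        est, _ = did_estimate(shuffled)
        placebo.append(est)
    return np.mean(np.abs(placebo) >= abs(obs))
```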
Our results show that when hospitals are randomly assigned to treatment, alternative specifications have a minor impact on rejection rates and estimator bias. However, the performance of alternative specifications varies dramatically when the probability of treatment is positively correlated with pre-intervention levels or trends in quality. In these cases, the use of propensity score matching resulted in much more accurate point estimates: MAD values from these specifications were 3 to 10 times lower than those from specifications using all non-treated hospitals as the comparison group. In addition, the use of permutation tests resulted in type I error rates that were frequently 10 to 20 times lower than those from specifications using i.i.d. standard errors or the cluster-sandwich estimator. These results support specifications for difference-in-differences models that include matching for more accurate point estimates and permutation tests for better inference.