Impact Evaluation - Counterfactual Evaluation Designs

Counterfactual analysis enables evaluators to attribute cause and effect between interventions and outcomes. The ‘counterfactual’ measures what would have happened to beneficiaries in the absence of the intervention, and impact is estimated by comparing counterfactual outcomes to those observed under the intervention. The key challenge in Impact Evaluation is that the counterfactual cannot be directly observed, but must be approximated with reference to a comparison group. There is a range of accepted approaches to determining an appropriate comparison group for counterfactual analysis, using either prospective (ex ante) or retrospective (ex post) evaluation designs. Prospective evaluations begin during the design phase of the intervention, involve the collection of baseline and end-line data from intervention beneficiaries (the ‘treatment group’) and non-beneficiaries (the ‘comparison group’), and may also involve the selection of individuals or communities into treatment and comparison groups. Retrospective evaluations are usually conducted after the implementation phase and may exploit existing survey data, although the best evaluations will collect data as close to baseline as possible to ensure comparability of treatment and comparison groups.
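
In the potential-outcomes notation standard in this literature (a formalization not used in the text above), the quantity being estimated and the role of the comparison group can be written as:

```latex
% Average treatment effect on the treated (ATT): D = 1 marks
% beneficiaries; Y(1), Y(0) are outcomes with and without the
% intervention. The second expectation is the unobservable
% counterfactual.
\mathrm{ATT} = \mathbb{E}\left[\,Y(1) \mid D=1\,\right] - \mathbb{E}\left[\,Y(0) \mid D=1\,\right]
% An evaluation design substitutes the comparison group's observed
% outcomes for the counterfactual term:
\mathbb{E}\left[\,Y(0) \mid D=1\,\right] \approx \mathbb{E}\left[\,Y(0) \mid D=0\,\right]
% The approximation is unbiased only when the comparison group
% resembles the treatment group in all outcome-relevant respects
% (i.e. no selection bias).
```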

There are five key principles relating to internal validity (study design) and external validity (generalizability) which rigorous Impact Evaluations should address: confounding factors, selection bias, spillover effects, contamination, and impact heterogeneity.

Confounding occurs where certain factors, typically relating to socio-economic status, are correlated with exposure to the intervention and, independently of exposure, are causally related to the outcome of interest. Confounding factors are therefore alternative explanations for an observed (possibly spurious) relationship between intervention and outcome.

Selection bias, a special case of confounding, occurs where intervention participants are non-randomly drawn from the beneficiary population and the criteria determining selection are correlated with outcomes. Unobserved factors that are associated with access to or participation in the intervention, and are causally related to the outcome of interest, may lead to a spurious relationship between intervention and outcome if unaccounted for. Self-selection occurs where, for example, more able or better-organized individuals or communities, who are more likely to have better outcomes of interest, are also more likely to participate in the intervention. Endogenous program selection occurs where individuals or communities are chosen to participate because they are seen as more likely to benefit from the intervention. Ignoring confounding factors can lead to omitted variable bias; in the special case of selection bias, the endogeneity of the selection variables can also cause simultaneity bias.

Spillover (referred to as contagion in the case of experimental evaluations) occurs when members of the comparison (control) group are affected by the intervention. Contamination occurs when members of treatment and/or comparison groups have access to another intervention which also affects the outcome of interest.

Impact heterogeneity refers to differences in impact by beneficiary type and context. High-quality Impact Evaluations will assess both the extent to which different groups (e.g. the disadvantaged) benefit from an intervention and the potential effect of context on impact. The degree to which results are generalizable will determine the applicability of lessons learned for interventions in other contexts.

Impact evaluation designs are identified by the type of methods used to generate the counterfactual and can be broadly classified into three categories – experimental, quasi-experimental and non-experimental designs – that vary in feasibility, cost, involvement during the design or after the implementation phase of the intervention, and degree of selection bias. White (2006) and Ravallion (2008) discuss alternative Impact Evaluation approaches.

  • Experimental design

Under experimental evaluations, treatment and comparison groups are selected randomly, and the comparison group is isolated from the intervention as well as from any other interventions which may affect the outcome of interest. These evaluation designs are referred to as randomized control trials (RCTs), and in experimental evaluations the comparison group is called a control group. When randomization is implemented over a sufficiently large sample and there is no contagion from the intervention, the only average difference between treatment and control groups is that the latter does not receive the intervention. Random sample surveys, in which the sample for the evaluation is chosen on a random basis, should not be confused with experimental evaluation designs, which require the random assignment of the treatment.
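
To make the distinction concrete, the following is a minimal sketch (not from the source; the sample size, effect size and variable names are illustrative assumptions) of random assignment followed by a difference-in-means impact estimate:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical sample of 1,000 individuals with a baseline outcome.
n = 1000
baseline = rng.normal(loc=50.0, scale=10.0, size=n)

# Random assignment: each individual has a 50% chance of treatment.
# This is what distinguishes an RCT from a random *sample* survey.
treated = rng.random(n) < 0.5

# Simulate end-line outcomes with an assumed true effect of +5.0
# for the treatment group (purely illustrative).
endline = baseline + rng.normal(0.0, 5.0, size=n) + 5.0 * treated

# With randomization, the simple difference in means is an unbiased
# estimate of the average treatment effect.
impact = endline[treated].mean() - endline[~treated].mean()
print(f"Estimated impact: {impact:.2f}")
```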

The experimental approach is often held up as the ‘gold standard’ of evaluation, and it is the only evaluation design which can conclusively account for selection bias in demonstrating a causal relationship between intervention and outcomes. Randomization and isolation from other interventions may not be practicable in the realm of social policy and may be ethically difficult to defend, although there can be opportunities to use natural experiments. Bamberger and White (2007) highlight some of the limitations of applying RCTs to development interventions. Methodological critiques have been made by Scriven (2008), on account of the biases introduced because social interventions cannot be triple-blinded, and by Deaton (2009), who points out that in practice the analysis of RCTs falls back on the regression-based approaches it seeks to avoid and is therefore subject to the same potential biases. Other problems include the often heterogeneous and changing contexts of interventions, logistical and practical challenges, difficulties with monitoring service delivery, access to the intervention by the comparison group, and changes in selection criteria and/or the intervention over time. Partly for these reasons, it has been estimated that RCTs are applicable to only 5 per cent of development finance.

  • Quasi-experimental design

Quasi-experimental approaches can remove bias arising from selection on observables and, where panel data are available, from selection on time-invariant unobservables. Quasi-experimental methods include matching, differencing, instrumental variables and the pipeline approach, and are usually implemented through multivariate regression analysis.

If selection characteristics are known and observed, they can be controlled for to remove the bias. Matching involves comparing program participants with non-participants who have similar observed selection characteristics. Propensity score matching (PSM) uses a statistical model to calculate the probability of participating on the basis of a set of observable characteristics, and matches participants and non-participants with similar probability scores. Regression discontinuity design exploits a decision rule determining who does and does not get the intervention, comparing outcomes for those just on either side of the cut-off.
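
As an illustration, here is a minimal propensity score matching sketch in Python (the simulated data, coefficients and scikit-learn usage are assumptions for illustration, not part of the source):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

# Hypothetical data: X holds observed selection characteristics
# (e.g. age, income, education), d marks program participation,
# y is the outcome of interest.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
d = (X @ np.array([0.8, -0.5, 0.3]) + rng.normal(size=500)) > 0
y = X @ np.array([1.0, 1.0, 1.0]) + 2.0 * d + rng.normal(size=500)

# Step 1: model the probability of participation (the propensity
# score) from observed characteristics.
score = LogisticRegression().fit(X, d).predict_proba(X)[:, 1]

# Step 2: match each participant to the non-participant with the
# closest propensity score (1-nearest-neighbour matching).
nn = NearestNeighbors(n_neighbors=1).fit(score[~d].reshape(-1, 1))
_, idx = nn.kneighbors(score[d].reshape(-1, 1))

# Step 3: estimate impact as the mean outcome difference between
# participants and their matched comparators.
att = (y[d] - y[~d][idx.ravel()]).mean()
print(f"Matched estimate of impact: {att:.2f}")
```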

Difference-in-differences, or double differences, uses data collected at baseline and end-line for both intervention and comparison groups, and can account for selection bias under the assumption that the unobservable factors determining selection are fixed over time (time-invariant).
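
A minimal worked example (the group means below are hypothetical numbers chosen for illustration):

```python
# Difference-in-differences with illustrative group means.
treatment = {"baseline": 40.0, "endline": 55.0}
comparison = {"baseline": 42.0, "endline": 49.0}

# First difference: change over time within each group.
change_treatment = treatment["endline"] - treatment["baseline"]    # 15.0
change_comparison = comparison["endline"] - comparison["baseline"]  # 7.0

# Second difference: the comparison group's change proxies what the
# treatment group would have experienced without the intervention
# (the assumption that selection effects are fixed over time).
impact = change_treatment - change_comparison
print(f"Double-difference estimate: {impact:.1f}")  # 8.0
```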

Instrumental variables estimation accounts for selection bias by modelling participation using factors (‘instruments’) that are correlated with selection but affect the outcome only through participation, thus isolating the aspects of program participation which can be treated as exogenous.
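
A sketch of two-stage least squares on simulated data (the data-generating process and effect sizes are illustrative assumptions) shows how an instrument recovers the effect that naive regression misses:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Hypothetical setup: u is an unobserved confounder driving both
# participation d and outcome y; z is an instrument that shifts
# participation but affects y only through d. For simplicity d is
# treated as a continuous participation intensity.
u = rng.normal(size=n)
z = rng.normal(size=n)
d = 0.7 * z + u + rng.normal(size=n)
y = 2.0 * d - 1.5 * u + rng.normal(size=n)

def ols(X, y):
    """Least-squares coefficients, with an intercept column added."""
    X = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Naive OLS of y on d is biased by the confounder u.
print("OLS estimate: ", ols(d, y)[1])

# Two-stage least squares: first predict d from the instrument,
# then regress y on the predicted (exogenous) part of d.
d_hat = np.column_stack([np.ones(n), z]) @ ols(z, d)
print("2SLS estimate:", ols(d_hat, y)[1])  # close to the true 2.0
```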

The pipeline approach (stepped-wedge design) uses beneficiaries already chosen to participate in a project at a later stage as the comparison group. The assumption is that as they have been selected to receive the intervention in the future they are similar to the treatment group, and therefore comparable in terms of outcome variables of interest. However, in practice, it cannot be guaranteed that treatment and comparison groups are comparable and some method of matching will need to be applied to verify comparability.

  • Non-experimental design

Non-experimental Impact Evaluations are so called because they do not involve a comparison group that lacks access to the intervention. Instead, they compare the intervention group before and after implementation. Interrupted time-series (ITS) evaluations require multiple data points on treated individuals both before and after the intervention, while before-versus-after (or pre-test post-test) designs require only a single data point before and after. Post-test analyses use data from the intervention group after the intervention only. Non-experimental designs are the weakest form of evaluation design: to demonstrate a causal relationship between intervention and outcomes convincingly, the evaluation must show that any likely alternative explanations for the outcomes are irrelevant. There remain applications for which this design is relevant, for example calculating time savings from an intervention that improves access to amenities. In addition, there may be cases where non-experimental designs are the only feasible option, such as universally-implemented programmes or national policy reforms for which no isolated comparison groups are likely to exist.
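
As an illustration of ITS, here is a segmented-regression sketch on simulated data (the series length, break point and effect sizes are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical monthly outcome series: 24 points before and 24 after
# the intervention, with an assumed level shift of +4 at the break.
t = np.arange(48)
post = (t >= 24).astype(float)
y = 10.0 + 0.2 * t + 4.0 * post + rng.normal(0.0, 1.0, size=48)

# Segmented (interrupted time-series) regression: intercept, secular
# trend, level change at the intervention, and post-intervention
# trend change.
X = np.column_stack([np.ones(48), t, post, post * (t - 24)])
coef = np.linalg.lstsq(X, y, rcond=None)[0]
print(f"Estimated level change at intervention: {coef[2]:.2f}")
print(f"Estimated trend change after intervention: {coef[3]:.3f}")
```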
