Saturday, March 12, 2022

Reading Groups: Scaling in Social and Behavioural Science

During reading groups with students here at LSE, we will discuss papers from a number of emerging literatures at the interface of behavioural science and neighbouring disciplines. I am making the papers and reading lists available on this blog as they may be of interest to a wider readership.

Clearly, many disciplines grapple with issues of scaling and, more generally, systemic effects. I am happy to reflect that in this reading list where I can make meaningful connections, and suggestions for other papers are welcome. Many of the papers on the ethics reading list address related questions, such as institutional legitimacy, and discussion interlinking these themes is also welcome.

John List's recent book "The Voltage Effect" summarises a range of factors that can lead to differential success when ideas and interventions are scaled up.

Countless enterprises fall apart the moment they scale; their positive results fizzle, they lose valuable time and money, and the great electric charge of potential that drove them early on disappears. In short, they suffer a voltage drop. Yet success and failure are not about luck - in fact, there is a rhyme and reason as to why some ideas fail and why some make it big. Certain ideas are predictably scalable, while others are predictably destined for disaster. In The Voltage Effect, University of Chicago economist John A. List explains how to identify the ideas that will be successful when scaled, and how to avoid those that won't. Drawing on his own original research, as well as fascinating examples from the realms of business, government, education, and public health, he details the five signature elements that cause voltage drops, and unpacks the four proven techniques for increasing positive results - or voltage gains - and scaling great ideas to their fullest potential. By understanding the science of scaling, we can drive change in our schools, workplaces, communities, and society at large. Because a better world can only be built at scale.

The AER paper "From Proof of Concept to Scalable Policies: Challenges and Solutions, with an Application" by Banerjee and a wide group of co-authors is one of the most useful papers I have read on the issues involved in scaling up evidence from trials. As well as discussing practical issues across the scaling process, it relates these nicely to the types of parameters being estimated in trials.

The promise of randomized controlled trials is that evidence gathered through the evaluation of a specific program helps us—possibly after several rounds of fine-tuning and multiple replications in different contexts—to inform policy. However, critics have pointed out that a potential constraint in this agenda is that results from small "proof-of-concept" studies run by nongovernment organizations may not apply to policies that can be implemented by governments on a large scale. After discussing the potential issues, this paper describes the journey from the original concept to the design and evaluation of scalable policy. We do so by evaluating a series of strategies that aim to integrate the nongovernment organization Pratham's "Teaching at the Right Level" methodology into elementary schools in India. The methodology consists of reorganizing instruction based on children's actual learning levels, rather than on a prescribed syllabus, and has previously been shown to be very effective when properly implemented. We present evidence from randomized controlled trials involving some designs that failed to produce impacts within the regular schooling system but still helped shape subsequent versions of the program. As a result of this process, two versions of the programs were developed that successfully raised children's learning levels using scalable models in government schools. We use this example to draw general lessons about using randomized control trials to design scalable policies.

The BPP paper "Successfully scaled solutions need not be homogenous" provides an account of scaling based on machine learning and micro-level heterogeneity. 

Al-Ubaydli et al. point out that many research findings experience a reduction in magnitude of treatment effects when scaled, and they make a number of proposals to improve the scalability of pilot project findings. While we agree that scalability is important for policy relevance, we argue that non-scalability does not always render a research finding useless in practice. Three practices ensuring (1) that the intervention is appropriate for the context; (2) that heterogeneity in treatment effects is understood; and (3) that the temptation to try multiple interventions simultaneously is avoided can allow us to customize successful policy prescriptions to specific real-world settings.

PNAS paper "Scaling up behavioral science interventions in online education" also very useful on the theme of large-scale iterative field trials. 

Online education is rapidly expanding in response to rising demand for higher and continuing education, but many online students struggle to achieve their educational goals. Several behavioral science interventions have shown promise in raising student persistence and completion rates in a handful of courses, but evidence of their effectiveness across diverse educational contexts is limited. In this study, we test a set of established interventions over 2.5 y, with one-quarter million students, from nearly every country, across 247 online courses offered by Harvard, the Massachusetts Institute of Technology, and Stanford. We hypothesized that the interventions would produce medium-to-large effects as in prior studies, but this is not supported by our results. Instead, using an iterative scientific process of cyclically preregistering new hypotheses in between waves of data collection, we identified individual, contextual, and temporal conditions under which the interventions benefit students. Self-regulation interventions raised student engagement in the first few weeks but not final completion rates. Value-relevance interventions raised completion rates in developing countries to close the global achievement gap, but only in courses with a global gap. We found minimal evidence that state-of-the-art machine learning methods can forecast the occurrence of a global gap or learn effective individualized intervention policies. Scaling behavioral science interventions across various online learning contexts can reduce their average effectiveness by an order-of-magnitude. However, iterative scientific investigations can uncover what works where for whom. 
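
A quick back-of-the-envelope illustration of the dilution point above (the numbers are mine and purely illustrative, not taken from the paper): if an intervention raises completion rates only in the subset of courses where the relevant condition holds, averaging over all courses shrinks the measured effect dramatically.

```python
# Illustrative dilution of a conditional treatment effect (numbers are assumptions, not from the paper).
effect_where_condition_holds = 5.0   # percentage-point gain in courses with a global achievement gap
share_of_courses_with_gap = 0.10     # assumed fraction of courses where that condition holds

pooled_average_effect = effect_where_condition_holds * share_of_courses_with_gap
print(f"effect pooled over all courses: {pooled_average_effect:.1f} percentage points")  # 0.5 pp, an order of magnitude smaller
```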

The PNAS paper by Milkman et al., "A megastudy of text-based nudges encouraging patients to get vaccinated at an upcoming doctor’s appointment", reports a large field trial comparing 19 text-message nudges head to head.

Many Americans fail to get life-saving vaccines each year, and the availability of a vaccine for COVID-19 makes the challenge of encouraging vaccination more urgent than ever. We present a large field experiment (N = 47,306) testing 19 nudges delivered to patients via text message and designed to boost adoption of the influenza vaccine. Our findings suggest that text messages sent prior to a primary care visit can boost vaccination rates by an average of 5%. Overall, interventions performed better when they were 1) framed as reminders to get flu shots that were already reserved for the patient and 2) congruent with the sort of communications patients expected to receive from their healthcare provider (i.e., not surprising, casual, or interactive). The best-performing intervention in our study reminded patients twice to get their flu shot at their upcoming doctor’s appointment and indicated it was reserved for them. This successful script could be used as a template for campaigns to encourage the adoption of life-saving vaccines, including against COVID-19.

Deaton and Cartwright's "Understanding and misunderstanding randomized controlled trials" gives a very strong summary of the wide range of points both authors have made in critiquing the recent literature on randomised trials in policy applications.

Randomized Controlled Trials (RCTs) are increasingly popular in the social sciences, not only in medicine. We argue that the lay public, and sometimes researchers, put too much trust in RCTs over other methods of investigation. Contrary to frequent claims in the applied literature, randomization does not equalize everything other than the treatment in the treatment and control groups, it does not automatically deliver a precise estimate of the average treatment effect (ATE), and it does not relieve us of the need to think about (observed or unobserved) covariates. Finding out whether an estimate was generated by chance is more difficult than commonly believed. At best, an RCT yields an unbiased estimate, but this property is of limited practical value. Even then, estimates apply only to the sample selected for the trial, often no more than a convenience sample, and justification is required to extend the results to other groups, including any population to which the trial sample belongs, or to any individual, including an individual in the trial. Demanding ‘external validity’ is unhelpful because it expects too much of an RCT while undervaluing its potential contribution. RCTs do indeed require minimal assumptions and can operate with little prior knowledge. This is an advantage when persuading distrustful audiences, but it is a disadvantage for cumulative scientific progress, where prior knowledge should be built upon, not discarded. RCTs can play a role in building scientific knowledge and useful predictions but they can only do so as part of a cumulative program, combining with other methods, including conceptual and theoretical development, to discover not ‘what works’, but ‘why things work’.
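
One of the paper's central points, that randomization balances covariates only in expectation and not in any single trial, is easy to see in a small simulation. The sketch below is mine, not the authors'; the sample size and covariate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 50          # a small trial of the kind common in pilot studies (illustrative)
reps = 10_000   # hypothetical re-randomizations of the same design

# A standardized prognostic covariate (e.g. baseline income), drawn fresh for each hypothetical trial.
x = rng.normal(size=(reps, n))

# Assign exactly half of each sample to treatment, at random.
assignment = rng.permuted(np.tile(np.r_[np.ones(n // 2), np.zeros(n // 2)], (reps, 1)), axis=1)

# Covariate imbalance (treated mean minus control mean) in each single randomization.
treated_mean = (x * assignment).sum(axis=1) / assignment.sum(axis=1)
control_mean = (x * (1 - assignment)).sum(axis=1) / (1 - assignment).sum(axis=1)
imbalance = treated_mean - control_mean

print(f"mean imbalance across randomizations: {imbalance.mean():+.3f}")                               # ~0: balanced in expectation
print(f"share of single trials with |imbalance| > 0.25 SD: {(np.abs(imbalance) > 0.25).mean():.2f}")  # far from zero
```

Across many re-randomizations the imbalance averages to roughly zero, but in any one trial of this size a sizeable covariate difference often survives, which is exactly why the unbiasedness of the ATE estimate is of limited comfort for a single small study.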

"RCTs to Scale: Comprehensive Evidence From Two Nudge Units". Fascinating paper by DellaVigna and Linos pooling date from many policy trials and comparing effect sizes to those in academic literature. 

Nudge interventions have quickly expanded from academic studies to larger implementation in so-called Nudge Units in governments. This provides an opportunity to compare interventions in research studies, versus at scale. We assemble a unique data set of 126 RCTs covering 23 million individuals, including all trials run by two of the largest Nudge Units in the United States. We compare these trials to a sample of nudge trials in academic journals from two recent meta-analyses. In the Academic Journals papers, the average impact of a nudge is very large—an 8.7 percentage point take-up effect, which is a 33.4% increase over the average control. In the Nudge Units sample, the average impact is still sizable and highly statistically significant, but smaller at 1.4 percentage points, an 8.0% increase. We document three dimensions which can account for the difference between these two estimates: (i) statistical power of the trials; (ii) characteristics of the interventions, such as topic area and behavioral channel; and (iii) selective publication. A meta-analysis model incorporating these dimensions indicates that selective publication in the Academic Journals sample, exacerbated by low statistical power, explains about 70 percent of the difference in effect sizes between the two samples. Different nudge characteristics account for most of the residual difference.
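
The selective-publication mechanism the authors highlight is also easy to illustrate with a toy simulation (my own sketch; the true effect, standard error, and publication rule below are illustrative assumptions, not numbers from the paper). When trials are underpowered and only statistically significant estimates get published, the published average can be several times the true effect.

```python
import numpy as np

rng = np.random.default_rng(1)

true_effect = 1.5   # true take-up effect in percentage points (assumed)
se = 2.0            # standard error of each trial's estimate, i.e. low power (assumed)
n_trials = 100_000  # hypothetical trials

estimates = rng.normal(true_effect, se, size=n_trials)

# Crude publication rule: only positive estimates significant at the conventional 5% level (z > 1.96) appear in journals.
published = estimates[estimates / se > 1.96]

print(f"true effect:                    {true_effect:.1f} pp")
print(f"mean over all trial estimates:  {estimates.mean():.1f} pp")   # ~1.5 pp: the full set of trials is unbiased
print(f"mean over 'published' trials:   {published.mean():.1f} pp")   # roughly 5 pp: selection plus low power inflates it
```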

Chater and Loewenstein's recent working paper "The i-Frame and the s-Frame: How Focusing on Individual-Level Solutions Has Led Behavioral Public Policy Astray" advances the view that the development of behavioural public policy over the last 20 years has, to an extent, been captured by a focus on micro-interventions that have not delivered on their promise of transformative change.

An influential line of thinking in behavioral science, to which the two authors have long subscribed, is that many of society’s most pressing problems can be addressed cheaply and effectively at the level of the individual, without modifying the system in which individuals operate. Along with, we suspect, many colleagues in both academic and policy communities, we now believe this was a mistake. Results from such interventions have been disappointingly modest. But more importantly, they have guided many (though by no means all) behavioral scientists to frame policy problems in individual, not systemic, terms: to adopt what we call the “i-frame,” rather than the “s-frame.” The difference may be more consequential than those who have operated within the i-frame have understood, in deflecting attention and support away from s-frame policies. Indeed, highlighting the i-frame is a long-established objective of corporate opponents of concerted systemic action such as regulation and taxation. We illustrate our argument, in depth, with the examples of climate change, obesity, savings for retirement, and pollution from plastic waste, and more briefly for six other policy problems. We argue that behavioral and social scientists who focus on i-level change should consider the secondary effects that their research can have on s-level changes. In addition, more social and behavioral scientists should use their skills and insights to develop and implement value-creating system-level change.
