Why It's Hard to Learn from the Learning Sciences
“The case against science is straightforward: much of the scientific literature, perhaps half, may simply be untrue. Afflicted by studies with small sample sizes, tiny effects, invalid exploratory analyses, and flagrant conflicts of interest, together with an obsession for pursuing fashionable trends of dubious importance, science has taken a turn towards darkness.” (Horton, 2015).
Recent years have witnessed a rebirth among education companies. Like a phoenix rising from the ashes, large edtech companies have been reborn as learning science companies. No longer designing products based solely on expert intuition and market demands, edtech companies are increasingly embracing the importance of evidence-informed product design and development.
Just a handful of years ago you’d be hard pressed to find any reference to the learning sciences or educational research on major education company websites, but visit any company homepage now and things look very different.
McGraw-Hill has rebranded itself “The Learning Sciences Company,” announcing that its products are ‘powered by learning science’ and that product design decisions are informed by a distinguished research council in conjunction with comprehensive research syntheses.
Macmillan now refers to itself as the “Learning Insights Company.” It claims to be translating education research and cognitive science into practical blueprints for edtech and creating ‘design principles’ based on learning science research to guide its product development.
And Pearson has devoted significant resources to building its own learning research expertise. It has worked to codify and publicize its own collection of Learning Design Principles, and its Efficacy & Research team is engaging in the monumentally challenging work of conducting externally audited research to demonstrate the efficacy of its learning products.
And this is great. Who could argue that students and teachers were better off when edtech companies eschewed research and neglected to measure the learning impact of their products? This nascent industry-wide commitment to informing product design and development with research on learning is laudable and a critical step on the path to improving outcomes for all learners. Nothing I say here should be taken as suggesting otherwise.
But it’s also the case that the impact on learner outcomes resulting from this push in edtech to synthesize, incorporate, and conduct learning research is commensurate with the quality of that research. And the underlying assumption behind edtech companies’ recent efforts to publicize their alignment with the learning sciences is that the available research base is informative, reliable, and trustworthy. If it weren’t, then it’s unclear how much is to be gained from designing products “powered by learning science.”
Developments in recent years, however, have raised serious questions about the quality and reliability of research produced in education and the broader social sciences.
Houston, We Have a Reproducibility Problem
If you closely follow research in the social and biomedical sciences, then you’re undoubtedly aware of the reproducibility and methodology crisis sweeping across their myriad disciplines. This is a storm that has grown to envelop fields as far-reaching as biology, genetics, neuroscience, and medicine.
If you’re unfamiliar with the unfolding quagmire, I’d encourage you to take a brief detour to catch up.
Psychology’s Replication Crisis Is Real
Most scientists ‘can’t replicate studies by their peers’
The Experiments Are Fascinating. But Nobody Can Repeat Them.
Thus far, educational research has largely avoided the scrutiny directed at other fields. But if we’re honest, it’s only a matter of time. This is because many of the methodological issues that blight research in fields like psychology and economics are endemic to education research as well. These concerns, which include reliance on small sample sizes, weak theorizing, flexible data analysis, and poor measurement, are issues that afflict education research in spades. There’s no small hint of irony that as cries of a reliability crisis approach a crescendo in the social and biomedical sciences, edtech companies are increasingly promoting their efforts to align themselves with research relying heavily on practices that are increasingly viewed as fundamentally flawed.
And with a recent review suggesting the rate of replication studies in education hovers slightly above one-tenth of one percent, we’re sitting on a ticking time bomb (Makel & Plucker, 2014).
This should be troubling to all of us in education.
If the learning science literature, which edtech companies increasingly rely on for product development guidance, is littered with untrustworthy and unreplicable findings, and the internal research conducted by edtech companies reflects the same problematic practices found in academia, then there is a legitimate concern that edtech companies will be led astray. Unlike academic researchers, who are strongly incentivized to increase their publication count and pursue novel findings, edtech companies can’t afford to waste time and money synthesizing and generating studies of dubious veracity and reliability. We have an obligation to ensure we do everything we can to get things right for the learners who use learning products!
Fulfilling this obligation will require greater awareness of the shortcomings of traditional research practices in the learning sciences; it demands that we accurately assess the quality of existing research and conduct studies where we’re confident the outputs are both informative and trustworthy. This increased awareness comes at a heavy cost though: acknowledging that much of the research reported in the learning sciences is likely wrong.
This realization is discouraging — coming just as excitement about incorporating learning science research into edtech is reaching a zenith — but it must be recognized if we are to avoid fooling ourselves about the likely impact of ‘evidence-based’ product decisions on learner outcomes.
The Vicious Circle of Bad Research
Critiques of research practices in the social and psychological sciences are not new. Trenchant criticisms of problematic statistical and methodological practices abound — written by scholars with greater erudition and expertise than myself. However, I’ve yet to come across a comprehensive effort to tie these problematic practices together into a cohesive and intuitive framework. I believe there’s much to be learned from mapping the interdependencies of problematic research practices and understanding how these practices amplify each other in ways that are difficult to appreciate in isolation.
The figure below illustrates what I call the Vicious Circle of Bad Research. Although each academic discipline has its own idiosyncratic research practices, opportunities, and threats (see, Ioannidis, 2018), I believe this circle offers a useful and generalizable framework depicting typical research practices in much of the social sciences, including education. In this article I discuss each element of the Vicious Circle of Bad Research in turn, describing how the pieces combine in a self-perpetuating cycle to support what Smaldino and McElreath refer to as “the natural selection of bad science” (2016, p.2).
Ultimately, my goal in describing the vicious circle of bad research is for readers to walk away with a better grasp of why much of the published research across the social and biomedical sciences (including the learning sciences) is increasingly viewed as degenerative, unreliable, and — perhaps most importantly — unlikely to self-correct without deep and systemic changes.
With so many topics to cover, some readers may find my discussion of each topic unsatisfyingly brief. If so, I encourage you to refer to the papers cited within each section to learn more. And if you’d like a gentle introduction to many of the ideas I discuss here, check out Jordan Ellenberg’s brilliantly written book (2015).
Before we get started, one last thing to note. While I provide many examples of the problematic practices discussed in this paper, you’ll notice that most of them are attributed to the educational psychologist Richard Mayer. This was intentional and done for two reasons. First, Richard Mayer is, I think rightfully, regarded as a paragon of rigor in the learning sciences, and the quality of his research is higher than most published studies in the field. Second, his research — particularly on the topic of multimedia learning — is highly regarded and deeply influential in edtech, being profusely cited by educational companies. Thus my intention in repeatedly citing Richard Mayer is not to highlight his work as a bad example, but rather to illustrate how even the most accomplished researchers in the learning sciences routinely fool themselves using the standard practices detailed in this paper.
With that out of the way, let the tour of the Vicious Circle begin.
A Tour of the Vicious Circle of Bad Research
Weak Theory
Learning science research is plagued by weak and non-existent theories. The literature is awash in grand hypothesizing and strong claims, but there is often little to show in terms of robust testable predictions. Instead, theories in the learning sciences are typically evaluated using weak directional predictions, ruling out at most 50% of possible difference scores (Dienes, 2008). The following types of predictions are commonplace: “…we predict that students who play the game with self-explanation, explanative feedback, or both will perform better” (Mayer & Johnson, 2010) and “…students receiving low-interest details should perform better on the transfer test” (Mayer et al., 2008). The consequence of such weak theorizing in education is that any positive finding, no matter how small or with what sub-group, is often interpreted as supporting the researcher’s underlying theory (a logically flawed conclusion, as we’ll discuss in more detail later). And because of the verbal imprecision of researchers’ hypotheses, even experimental results that appear to conflict with a researcher’s underlying theory can easily be spun as consistent, given sufficient storytelling skill and ad-hoc appeals to moderating or increasingly subtle effects (see, Ashton, 2013; Coyne, 2017). (For an example of increasingly elaborate storytelling masquerading as a coherent research program, see: Growth Mindset: The Perils of a Good Research Story)
And while precise point estimates like those found in the hard sciences may be unrealistic in education, it is exceedingly rare to see articles with any meaningful predictions at all. For instance, rarely do researchers articulate intervals of expected effects, lower bounds of minimally interesting effects, forms of expected relationships between variables (e.g., exponential or logarithmic), or even anticipatory patterns with covariates (for more examples, see Edwards & Berry, 2010). This is problematic because improving our understanding of the world requires theories strong enough to forbid some states of the world and allow others (Meehl, 1990).
As Smaldino and McElreath write:
“A good theory specifies precise predictions that provide precise tests, and more than one model is usually necessary” (2016).
What typically passes for theories in the learning sciences, however, lacks the quantitative precision necessary to be meaningfully and rigorously evaluated (see, Rodgers, 2010). Learning researchers’ verbal predictions of ordinal relationships (better/worse than) are difficult to falsify, unlike numerical or functional predictions. As a result, theories in the learning sciences are often “more vampirical than empirical — unable to be killed by mere evidence” (Freese, 2007).
Given the weak state of theorizing in education, which leaves learning researchers unable or unwilling to construct quantitative models that generate testable predictions from their theories, what is a researcher to do?
The answer is found, paradoxically, in not testing the researcher’s theory at all.
N(il)HST
Rather than testing their own theory, or even articulating one in most cases, learning researchers instead gauge the compatibility between observed data and a straw man theory. This substitute theory, which a researcher doesn’t actually believe, has the important benefit of coming pre-installed with a very precise statistical prediction that can be easily tested: an intervention or treatment has no effect.
The process of testing the theory of no (or null) effect is called null hypothesis significance testing (NHST) and the procedure is straightforward:
1. Assume your intervention has no effect. (Although differences other than null (or zero) can be evaluated, in practice this is rarely done — leading many to refer to the practice as nil hypothesis significance testing (Cohen, 1994). Hence the title of this section.)
2. Measure any observed impact resulting from your intervention.
3. Evaluate whether the magnitude of the observed treatment effect is ‘unexpected’ (conventionally less than a 1/20 or 5% chance) under the assumption that any difference from zero effect is due entirely to chance.
4. If the magnitude of the observed difference is larger than what is to be expected, reject the null hypothesis that the intervention had no effect; otherwise, fail to reject.
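To make the ritual concrete, here is a minimal sketch of the procedure in Python (using scipy; the group sizes, means, and scores are invented purely for illustration, not drawn from any study):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical experiment: 30 learners per condition, scores on a 0-100 test.
control = rng.normal(loc=70, scale=15, size=30)    # assume no special treatment
treatment = rng.normal(loc=75, scale=15, size=30)  # assume some true benefit

# Steps 1-3: assume the null (no difference) and ask how surprising the
# observed difference would be if only chance were at work.
t_stat, p_value = stats.ttest_ind(treatment, control)

# Step 4: the binary NHST verdict at the conventional 5% cutoff.
alpha = 0.05
decision = "reject the null" if p_value < alpha else "fail to reject the null"
print(f"t = {t_stat:.2f}, p = {p_value:.3f} -> {decision}")
```

Notice that nothing in this procedure ever evaluates the researcher’s own theory; the only hypothesis given a precise test is the one nobody believes.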
Despite its ubiquity in learning science research, this procedure is deeply uninformative and generations of statisticians and methodologists have impugned the practice as being intellectually vacuous and impeding the accumulation of knowledge (see, Kline, 2004; Schmidt, 1996).
Numerous books and an avalanche of articles have been written about the shortcomings of NHST, so we won’t rehash them here. But here is a brief enumeration of several key problems:
First, NHST is really a conceptual sleight of hand enabling researchers to statistically evaluate the predictions of a precisely defined theory they don’t believe (zero effect) and thereby avoid the task of articulating and deriving meaningful predictions from their preferred theory (see, Gigerenzer, Krauss, & Vitouch, 2004). Researchers then take a successful rejection of the null hypothesis to entail strong positive evidence in favor of their own theory, a logically unjustified move as we’ll see in the section on Misinterpreted Evidence.
Second, the null hypothesis of no effect, even if successfully rejected, is typically not an interesting finding. We already know that any intervention with learners is going to produce some difference at some level of precision; the only question is whether enough participants were used to detect it. As John Tukey succinctly observes, “asking ‘Are the effects different?’ is foolish” (1991, p. 100). This also leads to a paradox described by Meehl (1967): stronger research designs actually lead to weaker tests of theories when using NHST, because as samples grow and measurement precision improves, even trivial deviations from zero become statistically significant.
Third, NHST encourages a binary accept/reject mindset (see, Amrhein, Greenland, & McShane, 2019; McShane & Gal, 2017). Is there a significant effect or not? This is reflected in the ubiquitous ‘what works’ approach to education research — does an intervention or product work or not? And researchers’ preoccupation with attaining statistical significance, rather than model-building or accurately estimating an intervention’s impact, tacitly encourages poor research design decisions that make subsequent study findings difficult or impossible to interpret.
We expand on this last problem next.
Poor Design
The recent crisis in medicine and the social sciences has prompted the realization that if the primary goal of researchers is to find statistically significant effects using NHST, then we should expect the research literature to be filled with studies exhibiting methodologically perverse choices. This is because it is shockingly easy to reject null hypotheses at conventional cutoff values using questionable research practices (QRPs) in conjunction with poor research designs (see, Loewenstein & Prelec, 2012; Simmons, Nelson, & Simonsohn, 2011).
For example, consider the lamentable lack of concern with precision or power in most studies in the learning sciences. As Jerzy Neyman notes, “Obviously, an experiment designed [with low power] is not worth performing” (1977). But rarely do researchers identify the number of participants needed to have a high likelihood of finding a minimal effect size of interest or conduct any power analysis at all (Lakens, 2017a). Sample sizes are simply taken as a given, despite typically being so small as to make the probability of detecting any reasonable effect in a study akin to flipping a coin, or worse.
The nine studies that Mayer cites as providing evidence for the temporal contiguity principle of multimedia learning, for example, range in their number of participants from 24–144, with a median of 60 (Mayer, 2014; p. 305). At this size, a between-subjects study has roughly a 27% probability of detecting an effect at the lower bounds of being “substantively important” (WWC, 2014, p.23).
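If you want to check a power figure like this yourself, a rough sketch follows (Python with statsmodels; I’m assuming a simple two-group design, a two-sided test, and d = 0.25 as the “substantively important” lower bound mentioned above — the exact number depends on how a study’s total sample splits across conditions):

```python
# A rough check on statistical power for a simple two-group comparison.
# The per-group sizes below are illustrative only.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
d = 0.25  # lower bound of a "substantively important" effect (WWC, 2014)

for n_per_group in (12, 30, 60, 72):
    power = analysis.power(effect_size=d, nobs1=n_per_group, alpha=0.05)
    print(f"n = {n_per_group} per group -> power = {power:.2f}")

# The flip side: participants needed per group for 80% power at d = 0.25.
needed = analysis.solve_power(effect_size=d, power=0.80, alpha=0.05)
print(f"~{needed:.0f} participants per group needed for 80% power")
```

Run with numbers in this range, the calculation makes the basic point: with a few dozen participants per condition, a study has far less than a coin flip’s chance of detecting an effect at that threshold.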
And yet, despite almost universally low power in the learning sciences, nearly every article implausibly reports a statistically significant effect! How is this possible? It turns out that conducting brief studies with small numbers of participants is attractive in the social sciences because they are “cheap and can be farmed for significant results, especially when hypotheses only predict differences from the null…” (Smaldino & McElreath, 2016). In addition, underpowered studies consistently find significant effects because they exhibit uncorrected multiple comparisons and exploration of multiple hypotheses (see, Gelman & Loken, 2013, Maxwell, 2004).
Another poor design choice resulting from researchers’ reliance on NHST is how little attention is often given to the issue of measurement.
If a researcher’s goal is to find a statistically significant effect, then concerns about the quality and noise of experimental measures are often swept aside or ignored (Gelman, 2017). For example, educational researchers often claim to be measuring the impact of interventions on learning, yet rarely is ‘learning’ meaningfully operationalized and observations are collected using instruments of unknown validity and reliability (see, Cheung & Slavin, 2016). Claims of learning improvements are frequently based on student grades or ad-hoc researcher-created assessments where it’s not at all obvious that study designs/instruments actually capture anything we really care about. (Tim McKay humorously and aptly describes student grades as “aggregated performance measures of unrecorded tasks, meant to estimate unknown outcomes, quantified on ill-defined scales” (2017).)
For instance, a recent paper makes the following bold claim: “Overall, the two experiments provide consistent evidence that redesigning multimedia lessons to incorporate emotional design principles significantly improves learning outcomes” (emphasis added; Mayer & Estrella, 2014). A close look at the study methodology, however, reveals that participants were required to view 8 PowerPoint slides in a lab, spend a total of 3–5 minutes studying them, and then take an immediate post-test using a researcher-created assessment (no reliability/validity information available).
Perhaps the learning construct measured in this research paper has some theoretical value, but is it any conception of learning that we actually care about or can extrapolate outside the confines of the study into a classroom or educational product? Surely not.
Alarmingly, reviews of the learning research literature reveal the ubiquity of these methodological decisions. Most educational psychology studies involve interventions that are brief (less than 1 day), employ assessments administered immediately after the intervention, focus on simple recall rather than complex learning or transfer, expend little effort to evaluate treatment integrity, and record multiple outcome measures creating many opportunities for researchers to identify positive findings when sifting through results (see, Hsieh et al., 2005).
As we’ll see, the consequence of these poor design choices is that even when statistically significant results are found, the findings are often misleading, uninformative, or simply wrong.
Uninformative Results
Given consistently low power, data untethered to theory, and poor measurements, it is no surprise there is often little to be learned from published studies in the learning sciences. These studies were designed to capitalize on chance and noise, a fact that becomes clearer when we look beneath the grand pronouncements of statistical significance.
Consider two key issues.
While the reporting of effect sizes in the social sciences has improved slightly in recent years, they are still reported at depressingly low rates (see, McMillan & Foley, 2011; Peng et al., 2013). But even rarer is the reporting of researchers’ uncertainty with respect to their estimates of an intervention’s effect (Thompson, 2002). One reason for this may be that it would reveal “embarrassingly large” degrees of uncertainty, given researchers’ use of imprecise measurements and small sample sizes (Cohen, 1994). In fact, it’s not atypical to find statistically significant effects that are consistent with impacts ranging in size from minuscule to absurdly large. (The most common measure of effect size in the learning sciences is probably Cohen’s d, which is a standardized way of quantifying the difference in mean scores between two groups. Traditionally, d=.2 is considered a small effect, d=.5 a medium-sized effect, and d=.8 a large effect.)
For instance, you’ll often find research in the learning sciences excitedly highlighting a large main effect for an intervention (d = .65), but ignoring estimate uncertainty showing the observed data are consistent with the effect being nearly non-existent (d = .02) to utterly massive (d = 1.07) (e.g., Mayer, 2004). And because these studies are so imprecise in their estimates, they are practically immune to falsification, functioning primarily to add noise to the existing literature (see, Lakens & Morey, 2017).
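To get a feel for how wide these intervals are at typical sample sizes, here is a small sketch that attaches an approximate 95% confidence interval to Cohen’s d using the standard large-sample formula for its standard error (the means, SDs, and group sizes below are hypothetical, not taken from any particular study):

```python
import math

def cohens_d_with_ci(mean1, mean2, sd1, sd2, n1, n2, z=1.96):
    """Standardized mean difference with an approximate 95% CI.

    Uses the pooled SD and the common large-sample approximation for the
    standard error of d; values are illustrative, not exact small-sample math.
    """
    pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    d = (mean1 - mean2) / pooled_sd
    se = math.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    return d, d - z * se, d + z * se

# A hypothetical 30-vs-30 study with a seemingly healthy observed effect.
d, lo, hi = cohens_d_with_ci(mean1=78, mean2=70, sd1=15, sd2=15, n1=30, n2=30)
print(f"d = {d:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

With 30 learners per group, even a seemingly healthy observed effect comes with an interval stretching from essentially nothing to enormous.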
Furthermore, because learning scientists typically report findings conditional on having broken through the statistical significance threshold, observed impacts are typically large overestimates or even in the wrong direction (see, Button et al., 2013; Cheung & Slavin, 2016; Gelman & Carlin, 2014). These inflated estimates should be obvious to researchers when they report gargantuan effect sizes, but given the lack of attention researchers give to interpreting what reported effect sizes mean in practical terms they’re commonly reported without comment (Ellis, 2010).
Consider the recent volume edited by Richard Mayer in which he summarizes dozens of research findings related to various principles of multimedia learning — an area of research that has deeply influenced edtech — citing average effect sizes (again, Cohen’s d) across multiple studies with magnitudes of: 0.86, 1.10, 0.75, and 1.22 (2014). Effects of this magnitude are hardly realistic.
To put these numbers into perspective, reported effect sizes for men weighing more than women approximate d=0.59; effect sizes for people who like eggs tending to eat more egg salad, d=1.09 — effects of this magnitude are “so potent that they are easily detected through casual observation” (Pashler et al., 2016). It strains credulity to think modifying brief instructional text to include more personalized pronouns (e.g., Mayer & Fennell, 2004) will have a similar or greater effect on student learning. In fact, effect sizes in education rarely exceed d=0.2 when studies are of high quality and large numbers of participants are included (Kraft, 2018).
But aren’t these mere quibbles about the uncertainty or size of reported effects? Isn’t the important takeaway from research showing a statistically significant result that there is a ‘real’ effect and clear evidence in favor of a researcher’s theory?
Although learning scientists often present their research as though this were the case, the reality is different.
Misinterpreted Evidence
The sequence of steps that we’ve outlined thus far, starting with poor theory, moving to the use of NHST, which subsequently encourages poor design choices and produces often uninterpretable results, leaves researchers in a quandary.
There are many questions a researcher might be interested in answering when conducting research, including: What is the probability my hypothesis is true? What is the effect of my intervention? How strong is the evidence in favor of my hypothesis? Is an observed effect ‘real’ or merely a chance event?
Unfortunately, none of these questions can be meaningfully answered at the end of the process we’ve outlined thus far. But researchers are a persistent bunch and, as Jacob Cohen eloquently observes, “out of desperation, nevertheless believe that it does!” (1994, p. 997). Consider the common practice of interpreting a statistically significant effect as indicating that an observed effect is “real”. This interpretation is reflected in the definition of statistical significance found on the website of the US Department of Education’s What Works Clearinghouse (WWC):
“The likelihood that a finding is due to chance rather than a real difference. The WWC labels a finding statistically significant if the likelihood that the difference is due to chance is less than five percent (p = 0.05).” [emphasis added]
Another example is found in the highly influential book e-Learning and the Science of Instruction authored by Ruth Clark and Richard Mayer. They define statistical significance as indicating:
“there is less than 5% chance it is not correct to say that the difference…reflects a real difference between the two groups.” (emphasis added, 2016, p.58).
Both of these definitions assert that achieving statistical significance means the probability of the null hypothesis (no effect) being true is less than 5%, and, correspondingly, the probability of the alternative hypothesis (real effect) being true is greater than 95%. This interpretation of statistical significance, while intellectually seductive and widespread, is incorrect.
It should be obvious that something is flawed with this definition when we stop to consider the many statistically significant research findings found in the literature that are almost certainly false — e.g., articles published on extrasensory perception (see, Bem, 2011). Is the hypothesis that precognition exists true — with probability greater than 95% — given a statistically significant effect? Of course not. This just demonstrates that we can often observe unlikely outcomes even when we are almost 100% sure the null hypothesis (no effect) is true (Lakens, 2017b).
The logical flaw in the reasoning of the WWC and Clark & Mayer is that p values are about the probability of data, not about hypotheses. A p value is calculated assuming the null hypothesis is true and that any observed deviation from the null hypothesis is entirely due to chance. Consequently, it doesn’t make sense to interpret p values as indicating the probability of the null hypothesis being false or, conversely, the researcher’s alternative hypothesis being true (Greenland et al., 2016). A statistically significant finding does not indicate a researcher’s hypothesis is likely or that an observed effect is “real” — in fact, a researcher’s hypothesis might not even be remotely plausible despite having achieved significance (see, Leppink, Winston, & O’Sullivan, 2016).
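One way to convince yourself of this is a quick simulation of a field where most tested interventions don’t actually work. The base rate, effect size, and sample sizes below are invented for illustration; the point is simply that the share of false positives among ‘significant’ findings depends on those quantities, not on the .05 threshold alone.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

n_studies = 20_000
prior_true = 0.10      # assume only 10% of tested interventions really work
true_effect = 0.4      # standardized effect when an intervention does work
n_per_group = 30

false_null_rejections = 0
total_rejections = 0

for _ in range(n_studies):
    effect_is_real = rng.random() < prior_true
    shift = true_effect if effect_is_real else 0.0
    control = rng.normal(0, 1, n_per_group)
    treatment = rng.normal(shift, 1, n_per_group)
    _, p = stats.ttest_ind(treatment, control)
    if p < 0.05:
        total_rejections += 1
        if not effect_is_real:
            false_null_rejections += 1

# Under these assumptions this share is typically far above 5%.
print(f"Share of 'significant' findings where the null was actually true: "
      f"{false_null_rejections / total_rejections:.0%}")
```

Under assumptions like these, over half of the ‘significant’ findings come from interventions with no effect at all — a far cry from the 95% certainty implied by the definitions above.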
But surely the rejection of the null hypothesis at least provides good evidence for a researcher’s preferred hypothesis, right? Not quite.
This line of reasoning exhibits what Rouder and colleagues refer to as belief in “the free lunch of inference” (2016). After observing results that are unexpected under the null hypothesis (i.e., p < .05), researchers want to conclude that the null is unlikely “without consideration of a well-specified alternative” (Rouder et al., 2016, p. 523). This desire for a free inferential lunch is evident whenever researchers suggest that observing a low p-value indicates, as Mayer and Estrella write, “moderate-to-strong evidence” in favor of their preferred hypotheses (2014, p. 16).
The problem is that “evidence is always relative in the sense that data constitute evidence for or against a hypothesis, relative to another hypothesis” (Johansson, 2011, p. 115). But a p value is always calculated relative to a single hypothesis (the null), so it can’t provide a measure of evidence in favor of or against this hypothesis. To be sure, an observation might be very unlikely under the null hypothesis, but unless a researcher offers a statistical prediction from an alternative model to compare observed results against, it’s possible the evidence is even less likely under their preferred alternative. When it comes to life and inference, there’s no free lunch.
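Here is a toy illustration of what ‘relative evidence’ means, comparing how probable a just-significant result is under the null versus two specific alternatives (the standard error and candidate effect sizes are made up, and the normal approximation to the sampling distribution of d is assumed):

```python
from scipy.stats import norm

se = 0.25          # hypothetical standard error of the observed effect (d units)
d_obs = 1.96 * se  # an observed effect landing exactly at p = .05 (two-sided)

hypotheses = {
    "null (d = 0)": 0.0,
    "modest effect (d = 0.5)": 0.5,
    "large effect (d = 1.0)": 1.0,
}

# Likelihood of the observed effect under each hypothesis.
likelihoods = {name: norm.pdf(d_obs, loc=d, scale=se) for name, d in hypotheses.items()}

for name, like in likelihoods.items():
    print(f"{name:>24}: likelihood = {like:.3f}")

# Evidence is a comparison between hypotheses, not a property of one of them.
print("modest vs null ratio:", round(likelihoods["modest effect (d = 0.5)"] / likelihoods["null (d = 0)"], 1))
print("large  vs null ratio:", round(likelihoods["large effect (d = 1.0)"] / likelihoods["null (d = 0)"], 1))
```

In this example the ‘significant’ result is about seven times more probable under the modest alternative than under the null, yet slightly more probable under the null than under the large-effect alternative — the same data, different evidential verdicts depending on the comparison.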
Impoverished Literature
The upshot of everything we’ve discussed thus far is that it’s hard to learn from research in the learning sciences. Surveying the available literature, there is little reason to think that many reported findings are replicable, there is too much uncertainty and poor measurement to reasonably estimate intervention effects in realistic learning conditions, and the lack of focus on building and comparing meaningful models makes it difficult to accumulate knowledge. As learning researchers, we are faced with a literature polluted with overestimated, unreliable, and unfalsifiable findings that serve largely to encourage future underpowered and underdeveloped studies (see, Button et al., 2013).
Thus beginning the cycle of bad research all over again.
It also means that many of the research studies cited by edtech companies to justify product design choices and drive product improvement may provide little reliable empirical guidance. This is true despite the allure of having been published in peer-reviewed journals and using rigorous experimental designs.
And insofar as edtech companies conduct internal research that emulates standard research practices in the learning sciences (i.e., small sample sizes, poor measurement, flexible data analyses, passing pilot studies off as confirmatory research, selection on significance, avoidance of practical significance, and lack of model building), we should expect resultant research findings will be equally unreliable and uninformative.
How much time, money, and effort will be wasted conducting and synthesizing learning research that is essentially “dead on arrival” (Gelman, 2011, p. 38) — incapable of telling us anything we really want to know?
At this point, scientifically savvy readers may be thinking, “Sure, these are important considerations when evaluating isolated research studies, but that’s why I only trust findings reported in meta-analyses and multiple-study papers!” Rather than taking a single experimental result as gospel, meta-analyses aggregate intervention effects across many studies while attempting to account for publication bias, and multiple-study papers require authors to demonstrate that effects can be consistently replicated across a series of studies.
Unfortunately, without safeguards for properly conducting individual research studies, meta-analyses and multiple-study articles can simply amplify the biases reflected in single studies and produce equally misleading, conflicted, and unreliable findings (see, Ioannidis, 2016; Lakens, Hilgard, & Staaks, 2016; Schimmack, 2012). Fans of John Hattie and his meta-meta-analyses should take heed (see, Slavin, 2018).
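As a toy illustration of how selection on significance propagates into a meta-analysis, the sketch below simulates many small studies of a modest true effect, ‘publishes’ only the significant ones, and pools them with naive inverse-variance weights (all numbers are invented; the point is only the direction of the bias):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

true_d = 0.10        # assumed modest true effect
n_per_group = 30
n_studies = 2_000

published_d, published_w = [], []

for _ in range(n_studies):
    control = rng.normal(0, 1, n_per_group)
    treatment = rng.normal(true_d, 1, n_per_group)
    _, p = stats.ttest_ind(treatment, control)
    if p >= 0.05:
        continue  # non-significant studies stay in the file drawer

    # Observed standardized effect and its approximate variance.
    pooled_sd = np.sqrt((control.var(ddof=1) + treatment.var(ddof=1)) / 2)
    d = (treatment.mean() - control.mean()) / pooled_sd
    var_d = (2 * n_per_group) / (n_per_group ** 2) + d ** 2 / (4 * n_per_group)
    published_d.append(d)
    published_w.append(1 / var_d)

pooled = np.average(published_d, weights=published_w)
print(f"True effect: d = {true_d}")
print(f"Naive meta-analytic estimate from 'published' studies only: d = {pooled:.2f}")
```

Under these assumptions the pooled estimate lands several times larger than the true effect, before any of the other biases discussed above even enter the picture.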
What Now?
At this point we’re left with a deeply challenging question, “What do those of us in education do once we’ve acknowledged the flaws endemic to the learning science literature?” This is a question I’ve been thinking about a lot recently, and while I don’t presume to have anything resembling a satisfying answer, I’ll leave you with some thoughts.
First, the obvious suggestion. I think those of us in edtech need to adopt better standards and practices when conducting research. It’s too easy to fool ourselves about the impact of our learning products when conducting small exploratory studies, using flexible data analyses, employing poor measures, and selecting on statistical significance. Yet I see this type of research conducted and shared ALL the time. The second, and more difficult, challenge is how we ought to judge and make decisions based on the existing (but flawed) learning science literature. I think there are (at least) two important things we can do immediately.
From a transparency/learning perspective, if the studies contributing to a learning principle are too small/noisy/variable to tell us anything meaningful about its likely impact on students, we should openly acknowledge that uncertainty and admit we simply don’t know whether, or to what degree, a product feature based on this principle improves learning outcomes. Rather than obfuscating and skirting uncertainty and variability by (as is typically the case) deferring to a list of supporting published studies that have crossed the magical p<0.05 threshold, we should openly admit our ignorance and try to rectify it through future research efforts.
From the perspective of needing to make a decision, I understand how an edtech company might decide that the effort to incorporate a learning principle into its educational product (e.g., updating images in an e-text so they are consistent with the spatial contiguity principle of multimedia learning) is worthwhile despite uncertainty about its effect, given its relatively low cost, potential benefit, and congruence with more established theories. However, this decision should take the uncertainty of a research finding into account rather than simply gloss over it. The same goes for the other factors discussed in this blog. All these factors should be part of a holistic decision matrix used to inform development priorities. While edtech companies need to make product decisions — despite the interpretative challenges posed by the existing learning science literature — these decisions should be informed by a critical assessment of research quality, reliability, and informativeness.
Overall, I believe the appropriate response to what I’ve outlined here isn’t despair, but abundant skepticism, intellectual modesty, and a commitment to gradual improvement. There is reason for optimism as awareness of problematic research practices continues to grow in the social and biomedical sciences. But the edtech community must change course soon if it is to avoid wandering into the wasteland currently inhabited by great swaths of biomedical research — a field recently exposed for wasting billions of dollars on sloppy studies and churning out mountains of intellectual detritus (see, Harris, 2017; Freedman, Cockburn, & Simcoe, 2015).
The edtech community has a unique opportunity to reshape how research in the learning sciences is conducted and evaluated — to break free from the vicious circle of bad research — let’s not miss our chance to seize it. Learners everywhere are depending on us!
References
Amrhein, V., Greenland, S., & McShane, B. (2019). Scientists rise up against statistical significance. Nature. Retrieved from: https://www.nature.com/articles/d41586-019-00857-9
Ashton, J. C. (2013). Experimental power comes from powerful theories — the real problem in null hypothesis testing (response to Ioannidis). Nature Reviews Neuroscience, 14(5), 365–376.
Bem, D. J. (2011), Feeling the Future: Experimental Evidence for Anomalous Retroactive Influences on Cognition and Affect. Journal of Personality and Social Psychology,100, 407–425.
Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S. J., & Munafò, M. R. (2013). Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5), 365–376.
Cheung, A. C. K., & Slavin, R. E. (2016). How Methodological Features Affect Effect Sizes in Education. Educational Researcher, 45(5), 283–292.
Clark, R. C., & Mayer, R. E. (2016). E-learning and the science of instruction: Proven guidelines for consumers and designers of multimedia learning. John Wiley & Sons.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003.
Coyne, J. C. (2017). A bad abstract is good enough to be published in Journal of Experimental Psychology: General. Quick Thoughts, 16 Mar. Retrieved from: http://www.coyneoftherealm.com/2017/03/16/a-bad-abstract-is-good-enough-to-be-published-in-journal-of-experimental-psychology-general/
Dienes, Z. (2008). Understanding psychology as a science: An introduction to scientific and statistical inference. Palgrave Macmillan.
Edwards, J. R., & Berry, J. W. (2010). The Presence of Something or the Absence of Nothing: Increasing Theoretical Precision in Management Research. Organizational Research Methods, 13(4), 668–689.
Ellenberg, J. (2015). How not to be wrong: The power of mathematical thinking. Penguin.
Ellis, P. D. (2010). The Essential Guide to Effect Sizes: Statistical Power, Meta-Analysis, and the Interpretation of Research Results. Cambridge University Press.
Freedman, L. P., Cockburn, I. M., & Simcoe, T. S. (2015). The economics of reproducibility in preclinical research. PLoS biology, 13(6), e1002165.
Freese, J. 2007. The problem of predictive promiscuity in deductive applications of evolutionary reasoning to intergenerational transfers: Three cautionary tales. In Caring and Exchange Within and Across Generations, ed. A. Booth et al. Washington, D.C.: Urban Institute Press.
Gelman, A. (2017). Null hypothesis significance testing is incompatible with incrementalism. Unpublished manuscript.
Gelman, A. (2011). Ethics and statistics. Chance, 24(4), 51–54.
Gelman, A., & Carlin, J. (2014). Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors. Perspectives on Psychological Science, 9(6), 641–651.
Gigerenzer, G., Krauss, S., & Vitouch, O. (2004). The Null Ritual. What You Always Wanted to Know About Significance Testing but Were Afraid to Ask. In The Sage handbook of quantitative methodology for the social sciences (pp. 391–408).
Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., & Altman, D. G. (2016). Statistical Tests, P-values, Confidence Intervals, and Power: A Guide to Misinterpretations. The American Statistician, 15(53), 1–31.
Harris, R. (2017). Rigor mortis: How sloppy science creates worthless cures, crushes hope, and wastes billions. Basic Books.
Horton, R. (2015). What is medicine's 5 sigma? Lancet, 385(9976), 1380.
Hsieh, P.-H., Acee, T., Chung, W.-H., Hsieh, Y.-P., Kim, H., Thomas, G. D., … Robinson, D. H. (2005). Is educational intervention research on the decline? Journal of Educational Psychology, 97(4), 523–529.
Ioannidis, J. P. A. (2018). Meta-research: Why research on research matters. PLOS Biology, 16(3), e2005468.
Ioannidis, J. P. A. (2016). The mass production of redundant, misleading, and conflicted systematic reviews and meta-analyses. The Milbank Quarterly, 94, 485–514.
Johansson, T. (2011). Hail the impossible: P-values, evidence, and likelihood. Scandinavian Journal of Psychology, 52(2), 113–125.
Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23, 524–532.
Kline, R. B. (2004). Beyond significance testing: Reforming Data Analysis Methods in Behavioral Research. Washington, DC: American Psychological Association.
Kraft, M. A. (2018). Interpreting Effect Sizes of Education Interventions. Brown University Working Paper, (December), 1–28.
Lakens, D. (2017a). How a power analysis implicitly reveals the smallest effect size you care about. Blog. Retrieved from: http://daniellakens.blogspot.com/2017/05/how-power-analysis-implicitly-reveals.html
Lakens, D. (2017b). Understanding common misconceptions about p-values. Blog. Retrieved from: http://daniellakens.blogspot.com/2017/12/understanding-common-misconceptions.html
Lakens, D., Hilgard, J., & Staaks, J. (2016). On the reproducibility of meta-analyses: six practical recommendations. BMC Psychology, 4(24), 1–10.
Leppink, J., Winston, K., & O'Sullivan, P. (2016). Statistical significance does not imply a real effect. Perspectives on Medical Education, 5(2), 122–124. doi:10.1007/s40037-016-0256-6
Levin, J. R. (2004). Random thoughts on the (In)credibility of educational-psychological intervention research. Educational Psychologist, 39(3), 37–41.
Makel, M. C., & Plucker, J. A. (2014). Facts are more important than novelty: Replication in the education sciences. Educational Researcher, 43(6), 304–316.
Mayer, R. E. (Ed.). (2014). The Cambridge handbook of multimedia learning (2nd ed.). Cambridge University Press.
Mayer, R. E., & Estrella, G. (2014). Benefits of emotional design in multimedia instruction. Learning and Instruction, 33, 12–18.
Mayer, R. E., & Fennell, S. (2004). A personalization effect in multimedia learning: Students learn better when words are in conversational style rather than formal style. Journal of Educational Psychology, 96(2), 389–395.
Mayer, R. E., Griffith, E., Jurkowitz, I. T. N., & Rothman, D. (2008). Increased interestingness of extraneous details in a multimedia science presentation leads to decreased learning. Journal of Experimental Psychology. Applied, 14(4), 329–339.
Mayer, R. E., & Johnson, C. I. (2010). Adding Instructional Features that Promote Learning in a Game-Like Environment. Journal of Educational Computing Research, 42(3), 241–265.
McMillan, & Foley. (2011). Reporting and discussing effect size: Still the road less traveled? Practical Assessment, Research & Evaluation, 16(14).
McKay, Tim. (2017). Why Learning Analytics? UC Berkeley Learning Analytics Conference Keynote Address. Retrieved from: https://www.youtube.com/watch?v=OkmvHAR2ea0&t=1912s
McShane, B. B., & Gal, D. (2017). Statistical significance and the dichotomization of evidence. Journal of the American Statistical Association, 112(519), 885–908.
Morey, R. D., & Lakens, D. (2016). Why most of psychology is statistically unfalsifiable. Unpublished manuscript.
Meehl, P. E. (1990). Why summaries of research on psychological theories are often uninterpretable. Psychological Reports, 66(1), 195–244.
Meehl, P. E. (1967). Theory-Testing in Psychology and Physics: A Methodological Paradox. Philosophy of Science, 34, 103–115.
Pashler, H., Rohrer, D., Abramson, I., Wolfson, T., & Harris, C. R. (2016). A Social Priming Data Set With Troubling Oddities. Basic and Applied Social Psychology, 38(1), 3–18.
Peng, C. Y. J., Chen, L. T., Chiang, H. M., & Chiang, Y. C. (2013). The Impact of APA and AERA Guidelines on Effect Size Reporting. Educational Psychology Review, 25(2), 157–209.
Rodgers, J. L. (2010). The epistemology of mathematical and statistical modeling. A quiet methodological revolution. American Psychologist, 65(1), 1–12.
Rouder, J. N., Morey, R. D., Verhagen, J., Province, J. M., & Wagenmakers, E. J. (2016). Is There a Free Lunch in Inference? Topics in Cognitive Science, 8(3), 520–547.
Schimmack, U. (2012). The Ironic Effect of Significant Results on the Credibility of Multiple-Study Articles. Psychological Methods, 17(4), 551–566.
Schmidt, F. L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers. Psychological Methods, 1(2), 115–129.
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359–1366.
Simpson, A. (2017). The misdirection of public policy: Comparing and combining standardised effect sizes. Journal of Education Policy, 32(4), 450–466.
Slavin, R. (2018). John Hattie is wrong. Blog post. Retrieved from: https://robertslavinsblog.wordpress.com/2018/06/21/john-hattie-is-wrong/
Smaldino, P. E., & McElreath, R. (2016). The natural selection of bad science. Royal Society Open Science, 3.
Thompson, B. (2002). What Future Quantitative Social Science Research Could Look Like: Confidence Intervals for Effect Sizes. Educational Researcher, 31(3), 25–32.
Tukey, J. W. (1991) The philosophy of multiple comparisons. Statistical Science, 6, 100–116.
What Works Clearinghouse. (2014). WWC procedures and standards handbook (Version 3.0). U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance, What Works Clearinghouse.