“Power” – I remember this expression making me and my classmates giggle in our undergrad introductory stats course, much as other expressions had made us giggle in high school sex ed. But I am not here for giggling. Instead, I want to share two things about power that I have understood (which actually work astonishingly well not only for statistics, but also for real life).
(1) Power is important (and one should understand the inner workings of all things important).
(2) Power is, nevertheless, not everything (as life is full of compromise).
For me, (1) paraphrases the fact that there is a theoretically appropriate thing to do (namely, to run well-powered studies). This one is pretty straightforward, but I nevertheless cover it in the first section of this post: In my experience, what should be done is reasonably well understood, but why it is important for both the individual researcher and the research community is worth some reiteration. (2) paraphrases the fact that there can also be a practically appropriate thing to do (thus, once practicalities are thrown into the mix, the theoretically appropriate thing might not work anymore and the next-best possible solution needs to be found). I will talk about this in the second section, and this is a topic that is very close to my heart: the practicalities of implementing good research practices in our daily research life.
Luckily enough, this topic is also close to the hearts of many of my department colleagues, leading to the collaborative organization of a meeting series to discuss such practical aspects, most recently about the issues outlined above (slides). In some sense, the choice of this topic was based on a motivated grad student’s struggle: Imagine this student, who runs a pilot or a meta-analysis to do a prospective power analysis, finding out he would need to run 230 subjects in his main experiment. This he won’t be able to do, unless he wants to spend half of his three-year European PhD and use up a quarter of the grant money allocated to his project on this one experiment. I will come to suggestions for this grad student in section (2), after covering some basics in (1).
(1) On why power is important
(You can skip this part if you already know everything about beauty, sex and power.)
You might all have seen a figure like the one above, and since there are many wonderful resources that explain the full thing, for instance this one I adapted the figure from, or this absolutely brilliant one, I won’t go over it in detail. Today’s focus is the light blue area, which is POWER, or the likelihood that you find an effect when there actually is one. Power gets higher the bigger your effect size is and the more participants you test (I also won’t go into detail on how to calculate power, but you can for instance check out pwr for R or Gpower, or this online calculator for infant studies).

So what happens if you have low power? It means that you reduce your chances of finding your underlying true effect. That’s bad for you and for science in general for several reasons. Consider first scenario 1, under which you find no effect, which is likely enough because, remember, the likelihood that you find an effect is low. First and foremost, that is unfortunate for you because you didn’t find something you wanted to find although it is actually there. But in addition, it can easily guide you down an alleyway towards bad science. If you decide to interpret your finding as a true null result and report on it, that’s highly problematic, because the likelihood of finding an effect was low to begin with, so your conclusion stands on very shaky legs. If, instead, you interpret your finding as some kind of failure and decide to bury it in your file drawer, then you contribute to skewed reporting, which in turn contributes to the overestimation of true effects. So I hope you see how your underpowered null result is nothing less than a straight path to misery.

Now let’s consider scenario 2, under which you do find an effect although it was unlikely, which basically means you were lucky. Lucky is good! Is it? Why not! You can publish your result, which means fame and glory and funding opportunities. And that might just be the end of the story for you, and that’s alright.
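To make the relationship between effect size, sample size, and power concrete, here is a minimal Python sketch (stdlib only) that approximates the power of a two-sided, two-sample test via a normal approximation. The function name and the illustrative numbers are mine, not from any of the tools mentioned above; for real study planning you would use pwr, Gpower, or similar.

```python
from math import sqrt
from statistics import NormalDist

def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample test for a
    standardized effect size d (Cohen's d), normal approximation."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)        # e.g. ~1.96 for alpha = .05
    ncp = d * sqrt(n_per_group / 2)          # expected test statistic under H1
    # probability the test statistic lands beyond either critical value
    return (1 - z.cdf(z_crit - ncp)) + z.cdf(-z_crit - ncp)

# More participants -> more power, for the same medium effect (d = 0.5):
print(round(power_two_sample(0.5, 64), 2))   # ~0.81
print(round(power_two_sample(0.5, 20), 2))   # ~0.35
```

Note how, for a medium effect, about 64 participants per group gets you to the conventional 80% power, while 20 per group leaves you with roughly a one-in-three chance of detecting an effect that is really there.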
However, in case you ever want to replicate your study or build on it by running a variant with a slight modification, you might not be so lucky this second time. This might force you to modify your initial theory, or lead you to bury these replication data. And even if you never touched that experiment again, someone else might think of building on it and not find anything. That might waste several months of a grad student’s precious time, and eventually lead to an attack on your previous findings. So unless nobody ever tries to replicate or build on your data, this outcome can also lead to long-term misery.
In sum, running an underpowered study, if that is an end in itself (as opposed to, say, an initial step), can be disadvantageous for yourself and for your field in general. Which brings us to part (2), in which I will summarize some possible practical solutions for cases in which your power runs out.
(2) On why power is, nevertheless, not everything
Coming back to our motivated grad student, who doesn’t have the power (not even 80% of it). Similarly, you could think of cases with clinical populations, in which a power analysis would be futile since the number of participants tested will depend on the number of patients that can be recruited in a given time window, a number that will certainly be lower than what a power analysis recommends (and the control group will be matched to this). What should one do in these cases?
Our advice as meeting series organizers (which we are delighted to discuss) is to combine sequential analysis, with an upper limit based on practical constraints, and preregistration. I will go through these two concepts quickly for those not familiar with them before I come to the conclusion.
(2.1) Sequential analysis
Sequential analysis is a great concept for making high-powered studies more practicable. In essence, you decide on an upper limit for your sample size (ideally informed by power analysis; if that is not possible, by your practical constraints), and you decide on points of interim analysis. Thus, you might decide on a maximum n of 100, and to perform two interim analyses, at 33 and 66 participants. Two practices are crucial for this to work. First, you need to account for the fact that you are doing multiple comparisons, and thus control the Type I error rate (the probability of finding an effect that isn’t there), much as you would control for it when performing multiple statistical tests. For instance, you could use the Bonferroni method: divide your alpha level by the number of tests (3 in this case) and judge an outcome as significant only when its p-value falls below .0167. Thus, you perform your statistical test after 33 participants and stop testing if the outcome is significant under this criterion; if not, you do the same at 66 and, again depending on the outcome, at 100. There are other, more flexible ways to do it, for instance by using spending functions. You can read this great practical primer by Daniel Lakens, which includes step-by-step instructions (using the GroupSeq package in R). The second crucial practice is the element of pre-defining what you will do, which brings me to preregistration.
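The interim-analysis logic can be sketched in a few lines of Python. This is a toy version under simplifying assumptions of my own (a one-sample z-test with known unit variance, Bonferroni-corrected looks, made-up data); real designs would use spending functions via GroupSeq or similar.

```python
from math import sqrt
from statistics import NormalDist, mean

def sequential_test(data, looks, alpha=0.05):
    """Run a one-sample z-test (sigma assumed known = 1) at each interim
    look, with a Bonferroni-corrected per-look threshold.
    Returns (n at stopping, whether the result was significant)."""
    alpha_per_look = alpha / len(looks)        # e.g. .05 / 3 ~ .0167
    z = NormalDist()
    for n in looks:
        sample = data[:n]
        z_stat = mean(sample) * sqrt(n)        # z = mean / (sigma / sqrt(n))
        p = 2 * (1 - z.cdf(abs(z_stat)))       # two-sided p-value
        if p < alpha_per_look:
            return n, True                     # stop early: significant
    return looks[-1], False                    # ran to the maximum n

# A strong effect stops at the first look; pure noise runs to n = 100.
strong = [1.0, 0.8] * 50                       # mean ~ 0.9
null = [0.3, -0.3] * 50                        # mean ~ 0
print(sequential_test(strong, looks=[33, 66, 100]))   # (33, True)
print(sequential_test(null, looks=[33, 66, 100]))     # (100, False)
```

The efficiency gain is exactly what the text describes: when the effect is strong, you stop after 33 participants instead of testing all 100, while the corrected threshold keeps the overall Type I error rate at or below .05. (Bonferroni is conservative here; spending-function boundaries buy back some of that power.)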
In a nutshell, preregistration means deciding on design and methods a priori and documenting them on a platform like OSF. This practice narrows down researchers’ degrees of freedom, which is incredibly important. For instance, having decided on a way to exclude trials beforehand prevents researchers from trying out multiple criteria (e.g., 2 SDs, 2.5 SDs, 3 SDs from the mean) until they find a significant result, and having decided on a sample size prevents them from running a significance test after each tested participant until they reach significance, a practice referred to as p-hacking. It also prevents you from revising your hypothesis post hoc (and selling it as your a priori one), a practice known as HARKing. Of course, preregistration does not mean that you are not free to explore your data; it means that it is very clear to the reader which aspects of your report are confirmatory and which are exploratory.
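As a toy illustration of the exclusion-criterion example: a preregistered rule is simply one whose parameters are fixed before seeing the data. The function, the 2.5 SD cutoff, and the reaction times below are all hypothetical.

```python
from statistics import mean, stdev

def exclude_outliers(rts, criterion_sd=2.5):
    """Apply a preregistered exclusion rule: drop trials more than
    `criterion_sd` standard deviations from the mean. The criterion is
    fixed in advance, not tuned until the result comes out significant."""
    m, s = mean(rts), stdev(rts)
    return [rt for rt in rts if abs(rt - m) <= criterion_sd * s]

# Hypothetical reaction times (ms) with one implausibly slow trial:
rts = [420, 455, 430, 401, 447, 433, 462, 418, 3000]
print(exclude_outliers(rts))   # the 3000 ms trial is dropped
```

The point is not the particular cutoff but that it is committed to beforehand; trying 2, 2.5, and 3 SDs and reporting whichever yields p < .05 is exactly the garden of forking paths preregistration closes off.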
Thus, using sequential analysis can make research more efficient and practicable by opening up the possibility of stopping earlier than suggested by power analysis, while preregistration ensures you are not digging around in your data. Let me finish by explaining why these two practices can improve the situation when you need to set your maximum n lower than recommended by power analysis. First, they do not solve the power problem per se. If you can only test a certain number of participants, and power analysis tells you you’d need to test more, then what you get is what it is: an underpowered study. But these practices provide you with a way to avoid other questionable research practices while upping the efficiency of your testing. Another important aspect to consider is that increasing sample size is not the only way to increase power (think design or population changes). And last but not least, it is important to consider whether a project is worth the resources it would require if run properly.
Below, you see an attempt at a flowchart for making stopping rule decisions, created collaboratively with Christina Bergmann, Alejandrina Cristia, and Sho Tsuji, and improved based on comments in our department meeting.
Feel empowered and happy stopping!
Special thanks go to our motivated grad student for posing the questions cited above and in the slides to us.