Beginning: Statistics are misused and abused, sometimes even unintentionally, in both scientific and business settings. Alex Reinhart, author of the book “Statistics Done Wrong: The Woefully Complete Guide,” talks about the most common errors people make when trying to figure things out using statistics, and what happens as a result. He shares practical insights into how both scientists and business analysts can make sure their statistical tests have high enough power, how they can avoid “truth inflation,” and how to overcome multiple comparisons problems.
Ginette: In 2009, neuroscientist Craig Bennett undertook a landmark experiment in a Dartmouth lab. A high tech fMRI machine was used on test subjects, who were “shown a series of photographs depicting human individuals in social situations with a specified emotional valence” and asked “to determine what emotion the individual in the photo must have been experiencing.” Would it be found that different parts of the brain were associated with different emotional associations? In fact, it was. The experiment was a success. The results came in showing brain activity changes for the different tasks, and the p-value came out to 0.001, indicating a significant result.
The problem? The only participant was a 3.8 pound 18-inch mature Atlantic salmon, who was “not alive at the time of scanning.”
Ginette: I’m Ginette.
Curtis: And I’m Curtis.
Ginette: And you are listening to Data Crunch.
Curtis: A podcast about how applied data science, machine learning, and artificial intelligence are changing the world.
Ginette: Data Crunch is produced by the Data Crunch Corporation, an analytics training and consulting company.
Ginette: This study was real. It was real data, robust analysis, and an actual dead fish. It even has an official-sounding scientific study name: “Neural correlates of interspecies perspective taking in the post-mortem Atlantic Salmon.”
Craig Bennett did the experiment to show that statistics can be dangerous territory. They can be abused and misleading—whether or not the experimenter has nefarious intentions. Still, statistics are a legitimate and powerful tool to discover actual truths and find important insights, so they cannot be ignored.
It becomes our task to wield them correctly, and to be careful when accepting or rejecting statistical assertions we come across.
Today we talk to Alex Reinhart, author of the book “Statistics Done Wrong: The Woefully Complete Guide.” Alex is an expert on how to do statistics wrong. And incidentally, how to do them right.
Alex: We end up using statistical methods in science and in business to answer questions, often very simple questions, of just “does this intervention or this treatment or this change that I made, does it have an effect?” Often in a difficult situation, because there are many things going on, you know, if you’re doing a medical treatment there are many different reasons that people recover in different times, and there’s a lot of variation, and it’s hard to predict these things. If you’re doing an A-B test on a website, your visitors are all different. Some of them will want to buy your product or whatever it is, and some of them won’t, and so there’s a lot of variation that happens naturally, and we’re always in the position of having to ask, “This thing or change I made or intervention I did, does it have an effect, and can I distinguish that effect from all the other things that are going on?” And this leads to a lot of problems, so statistical methods exist to help you answer that question by seeing how much variation is there naturally, and this effect I saw, is it more than I would have expected had my intervention not worked or not done anything, but it doesn’t give you certainty. It gives us nice words, like “statistically significant,” which sounds important, but it doesn’t give you certainty. You’re often asking the question, “Is this effect that I’m seeing from my experiment, would it have been unlikely to see this effect had my treatment not worked or not done anything?” So I observed what this medical treatment . . . more people improved when they got this treatment than when they didn’t. Or more people bought this product when I did this version of the website than when I didn’t, but is that just because we got lucky and got the people who like to spend money, or is it because of the actual effect? Well, you have to be able to rule out those other things, and we never can do so with certainty.
We can just say, “Well, it would have been pretty unlikely to see this result had my treatment had no effect.” And this leads to all kinds of problems because people want to see certainty when there is none: misinterpreting the results to imply that they’re more certain than they are, or running small new experiments, you know, running the A-B test for a shorter time period, or a medical trial that has too few subjects in it, thinking that, “Yeah, I’ll be able to tell whether it works from that,” when there’s not enough data there for you to distinguish between the competing explanations for the results, and so that in various ways leads to all kinds of problems, both in scientific research and in data science practice when people are using it to answer business questions.
Curtis: Let’s touch on this scientific research really quick, just because I think it’s so interesting. You think that scientific research is really well done, and a lot of it is, but according to your book, there’s also a lot out there that is maybe not as well done as you’d think or the statistics are not as good as they should be.
Alex: There’s a fair chunk of scientific research that suffers from several different problems that are related. So one problem is in a lot of scientific research, getting people or subjects into your experiment costs money and time, so doing very large experiments is difficult. So, you see this in medicine and psychology and things that use human subjects a lot. But you need a lot of people if you’re looking for a small effect, like in psychology, you know, you do some intervention and see if it changes how people behave or how they interact, and the effects are often very small. You’re looking for like a ten percent difference between things, or something, and there’s a lot of variation between people, and so to reliably say that, “Yes, I have detected this small effect, and I’m sure that it’s because of the intervention and not because of luck or because of other things,” you need a lot of test subjects. And if you don’t have a lot of test subjects, then either you’re going to say, “Well, I didn’t find a statistically significant result. I couldn’t rule out the other explanations.” So some people then misinterpret that to say, “I am sure that this intervention had zero effect,” when the whole problem is, you don’t have enough data, so how can you reach such a conclusion? You don’t have enough precision to say that. Or you get really lucky: you run the experiment and just by luck you happen to get a very large effect when on average you would have gotten a much smaller one, then when you get the large effect, that one will be statistically significant, because it’s big enough to distinguish from chance, so the results you end up reporting as statistically significant end up being exaggerated. They’re too large because the correct estimate of the effect wouldn’t have been statistically significant, so that has an effect in a lot of fields. People . . . so this is statistical power: how powerful is my experiment?
Is it able to detect the kinds of effects I’m looking for? In a lot of scientific fields, the statistical power of experiments is often quite low for the types of effects they’re looking for, like less than 50%, meaning less than half the time you would actually get a significant result even if the effect does exist, but then this combines with another problem. So you think that scientists would just say, “Gosh, we’re never finding anything, so we should get more experimental subjects and fix our experiments.” But they do find things, and often the problem is there’s flexibility in the hypothesis they’re testing. Maybe they measured several outcome variables or they did several different treatments or they divided the people up into groups. You know, is there an effect based on age or religion or this or that or all these things. So there are lots of different hypotheses for them to test, lots of different outcomes they can check on, and the more you check on, the more likely you are to get lucky and find one that seems to be statistically significant, and so then you see studies that maybe tried 20 different things and got lucky, and even though their study was underpowered, they found one that was significant but is most likely an overestimate of the truth, and then they report that and that becomes, you know . . . you see a news article about how scientists have shown something or another. I call that truth inflation, and I’m not sure if there’s a standard accepted term for that. But it’s a significant problem in areas like psychology and some parts of medicine. Neuroscience suffered from this for quite a while, and I hope they’re getting to grips with it now. When they’re doing all those brain scanning studies, where they say, “Oh, well, we’ve scanned brains where people do this task,” and they discover that this part of the brain is associated with this skill or ability or type of thinking or something.
A lot of those suffered from problems like that because putting people in brain scanners is really expensive. They’re just now starting to understand how to deal with it and how to design their experiments and so on.
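The truth inflation Alex describes can be illustrated with a small simulation. This is a minimal sketch, not anything from the book: the effect size (0.2 standard deviations), the group size (20), the known standard deviation of 1, and the function names `run_experiment` and `truth_inflation` are all assumptions chosen for illustration. It runs many small experiments on a real but small effect, keeps only the statistically significant ones, and compares their average estimate to the true effect.

```python
import math
import random
import statistics


def run_experiment(rng, true_effect, n_per_group):
    """Simulate one two-group experiment; return (estimated effect, p-value)."""
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    treated = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
    diff = statistics.fmean(treated) - statistics.fmean(control)
    # Assumption: the standard deviation (1.0) is known, so a z-test applies.
    se = math.sqrt(2.0 / n_per_group)
    p_value = math.erfc(abs(diff / se) / math.sqrt(2))  # two-sided
    return diff, p_value


def truth_inflation(true_effect=0.2, n_per_group=20,
                    n_experiments=5000, alpha=0.05, seed=1):
    """Return (power, mean estimate among significant experiments)."""
    rng = random.Random(seed)
    results = [run_experiment(rng, true_effect, n_per_group)
               for _ in range(n_experiments)]
    significant = [diff for diff, p in results if p < alpha]
    power = len(significant) / n_experiments
    mean_significant = (statistics.fmean(significant)
                        if significant else float("nan"))
    return power, mean_significant


power, mean_significant = truth_inflation()
print(f"fraction of experiments that were significant: {power:.2f}")
print(f"true effect: 0.20, average 'significant' estimate: {mean_significant:.2f}")
```

In a setup like this only a small fraction of the experiments reach significance, and the ones that do report an effect well above the true value, which is the exaggeration Alex is pointing at.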
Curtis: Got it, and this, it sounds, I mean, this is something that’s common in business as well, right? I’m assuming you see this where, you know, analysts are trying to find something in the data; they’re doing data mining, and this kind of problem arises in business. Are there things that you’ve seen that people can do to either increase the power of what they’re doing or avoid this problem of finding something that is not actually there?
Alex: Yeah, there are a few different things that you can do. One is just sort of awareness, realizing that if you poke through your data long enough, you’re going to find something, and so you need to be aware of that: “Well, I have been kind of torturing this data, so I should double-check whether this thing I found is real.” Maybe you run your experiment longer, or whatever. There are ways of controlling for how many tests you’ve run and how many things you’ve fiddled with to essentially raise the standard of what counts as statistically significant. Say, “Well, since I’ve tried so many things, the thing I find has to be very definitive if I’m going to believe it.”
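Alex does not name a specific method for raising the standard of significance. One common choice is the Bonferroni correction, which divides the significance threshold by the number of tests. The sketch below is an illustration under assumptions: the function name `familywise_error_rate` is hypothetical, the 20 tests echo the "20 different things" mentioned earlier, and p-values are drawn as uniform random numbers, which is how they behave when no effect exists and the test statistic is continuous.

```python
import random


def familywise_error_rate(n_tests=20, alpha=0.05, n_sims=10000,
                          bonferroni=False, seed=2):
    """Fraction of simulated studies with at least one false positive.

    Every hypothesis is truly null, so any "significant" result is an error.
    """
    rng = random.Random(seed)
    # Bonferroni: each individual test must clear alpha / n_tests.
    threshold = alpha / n_tests if bonferroni else alpha
    false_alarms = 0
    for _ in range(n_sims):
        # Assumption: under the null hypothesis, p-values are uniform on [0, 1).
        p_values = [rng.random() for _ in range(n_tests)]
        if min(p_values) < threshold:
            false_alarms += 1
    return false_alarms / n_sims


uncorrected = familywise_error_rate(bonferroni=False)
corrected = familywise_error_rate(bonferroni=True)
print(f"chance of at least one false positive, no correction: {uncorrected:.2f}")
print(f"chance of at least one false positive, Bonferroni:    {corrected:.2f}")
```

Without a correction, the chance of at least one spurious finding across 20 independent tests is 1 - 0.95^20, or about 64 percent. With the correction it drops back to roughly the intended 5 percent, at the cost of needing stronger evidence for each individual test.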
There’s also, in business if you’re running things like A-B tests and other experiments, where you’re doing an experiment on your website or your service or whatever, good experimental design can actually have a big effect here, where you design your experiment to reduce the amount of outside variation that you see so that it’s easier to see the effect that you’re looking for, so that might mean something like, suppose your website has different audiences, and those audiences are more or less likely to pay for the service or whatever it is you’re measuring. If you stratify your experiment and compare within audiences and then combine those comparisons, you’ve reduced a lot of the variation that would have otherwise been there, and you’ve made it easier to detect the thing you’re looking for.
And then there are also ways of, before you even run an experiment, saying, “Well, I’m looking for an effect of this size, and I know from historical data on my website that I see about this much variation from day to day, from person to person, so just how long do I need to run this experiment to tell if there’s an effect or not, of a size that’s however big is important to my business?” And you can then calculate that, and then commit in advance and say, “I’m going to run the experiment that long, and I’m not going to peek before it finishes.” Because it turns out, particularly with things like A-B tests, people like to peek before they’re finished, and then say, “Oh look! It’s significant now because, you know, with the 460 people who’ve visited the website so far, there’s a significant difference.” But if you keep peeking when different numbers of people have visited, you’re increasing your opportunities for error. It’s just like running many experiments and picking the one that got you the result you wanted when you keep peeking like that.
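The cost of peeking can be checked with a simulated A-A test, where both versions of the site are identical, so every "significant" result is a false alarm. This is a rough sketch under assumptions: a 10 percent conversion rate, 2,000 visitors per version, a peek after every 100 visitors, a pooled two-proportion z-test, and the hypothetical function names `p_value` and `peeking_false_positive_rates`.

```python
import math
import random


def p_value(successes_a, n_a, successes_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (successes_b / n_b - successes_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))


def peeking_false_positive_rates(rate=0.10, n_per_group=2000, peek_every=100,
                                 n_sims=500, alpha=0.05, seed=3):
    """Return (rate if you stop at any significant peek, rate at the final look).

    Both groups share the same conversion rate, so there is no real effect.
    n_per_group should be a multiple of peek_every so the last peek is the
    planned end of the experiment.
    """
    rng = random.Random(seed)
    any_peek = 0
    final_only = 0
    for _ in range(n_sims):
        conversions_a = conversions_b = 0
        ever_significant = False
        p = 1.0
        for visitor in range(1, n_per_group + 1):
            conversions_a += rng.random() < rate
            conversions_b += rng.random() < rate
            if visitor % peek_every == 0:
                p = p_value(conversions_a, visitor, conversions_b, visitor)
                if p < alpha:
                    ever_significant = True
        any_peek += ever_significant
        final_only += p < alpha  # p now holds the result of the last look
    return any_peek / n_sims, final_only / n_sims


with_peeking, without_peeking = peeking_false_positive_rates()
print(f"false positive rate when stopping at any significant peek: {with_peeking:.2f}")
print(f"false positive rate when only the final look counts:       {without_peeking:.2f}")
```

Looking only once, at the planned end, keeps the false positive rate near the nominal 5 percent, while declaring victory at the first significant peek pushes it several times higher, which is why committing to a run length in advance matters.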
Curtis: Interesting. And this, this is really good stuff. What would you recommend for members of our audience that don’t necessarily have, you know, a lot of statistical background? A lot of people moving into the field haven’t had formal training in these kinds of things. Where can they look, or what do you recommend they do to really understand how to do this well?
Alex: Yeah, that’s a tricky question. Speaking as someone who now teaches statistics, it’s tempting to say, “Well, you know, do a statistics degree,” except many of these practical questions end up appearing for a day in the curriculum or something, and aren’t extensively covered, so a lot of the tools that you might be using to do experiments and analyze data, like online A-B testing tools, for example, some of them do have things like power calculators that say, “Well, I’m looking for an effect this big, how many people do I need to run through this experiment before I could detect that reliably?” And those calculators are already made and available online, and so you don’t have to know all the math and all the details of how it’s derived, but just knowing, you know, before I do my experiment, I should figure out how to calculate that. Find some resource that can help me calculate that. That makes a big difference.
Just being aware that these issues exist and knowing I should look up in the software I’m using, or the service I’m using for doing this experiment, I should look up and see if they provide a way of calculating these things and then correcting for multiple comparisons, checking the power before I run it, figuring out the sample size. And then even if you don’t know the math, one thing you can often do is simulate. You can say, “Well, I have all this data.” If you’re a data scientist coming into it from the programming side, the computer science side, you can simulate. You can say, “Well, you know, on a typical day, say 10 percent of the visitors to my website would subscribe or click the thing that I’m measuring. I’ll just simulate a bunch of people, 10 percent of them randomly do it, and I’ll run whatever hypothesis test or procedure it is that I want to do, and I’ll see whether it comes out statistically significant. And then I’ll try simulating that the ones who see the new version of the website, they click it 12 percent of the time, and I’ll run that simulation and see if I get a statistically significant result.” And you run that simulation a hundred times, and you see what fraction of the time you get a significant result, and that’s your statistical power, and then you can fiddle with your simulation, and say, “Oh, no, it looks like I don’t have a large enough sample size. I wonder how big it would have to be?” And you don’t need to derive all the statistics theory and methods. It’ll still give you a pretty good estimate of what you would need.
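The simulation Alex walks through translates fairly directly into code. The sketch below uses his 10 percent and 12 percent click rates; everything else is an assumption made for illustration, including the pooled two-proportion z-test as the hypothesis test, the sample sizes tried, and the function names `two_proportion_p_value` and `simulate_power`.

```python
import math
import random


def two_proportion_p_value(successes_a, n_a, successes_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (successes_b / n_b - successes_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_power(rate_a, rate_b, n_per_group, n_sims=500, alpha=0.05, seed=0):
    """Fraction of simulated A-B tests that come out statistically significant."""
    rng = random.Random(seed)
    significant = 0
    for _ in range(n_sims):
        # Each simulated visitor clicks independently with the given probability.
        clicks_a = sum(rng.random() < rate_a for _ in range(n_per_group))
        clicks_b = sum(rng.random() < rate_b for _ in range(n_per_group))
        p = two_proportion_p_value(clicks_a, n_per_group, clicks_b, n_per_group)
        if p < alpha:
            significant += 1
    return significant / n_sims


# Try a few sample sizes to see how many visitors per version are needed.
for n in (1000, 2000, 4000):
    estimate = simulate_power(0.10, 0.12, n, n_sims=200)
    print(f"{n} visitors per version: estimated power {estimate:.2f}")
```

Setting both rates to the same value, for example `simulate_power(0.10, 0.10, 1000)`, gives the false positive rate instead, which should land near the chosen `alpha`. More simulated runs than the hundred Alex mentions give a steadier estimate at the cost of run time.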
Curtis: Got it. That’s a great practical method. And speaking of resources, I want to also talk a little bit about your book, “Statistics Done Wrong: The Woefully Complete Guide,” which I think is really amazing. Writing a book is a big undertaking, so I want to know what inspired you to do that, and how the experience was, and what your purpose, your aim was.
Alex: Yeah, so the book was kind of an interesting or unusual choice, I guess. So back . . . at the time of the book idea, the genesis of the book, I was a physics major, an undergraduate physics major at the University of Texas, and I took, we had to take these seminar classes. One of the ones I took was a public speaking seminar, and so you had to give a 25-minute presentation, and people give you feedback and comments and so on, and my presentation ended up being kind of accidentally about misinterpretations of P values, and I didn’t really know anything about P values. I had never done a statistical hypothesis test or calculated a P value, but I had come across some papers while I was doing research for the presentation and found, “Oh, look at how many scientists misuse this,” and I thought it was really interesting. And so after that, I said, “Well, gee, maybe I’ll have to take a statistics class,” and I got pretty lucky that it was a very good statistics class. Because often, when you tell people that you’re a statistician, they go, “Oh, I hated my stats class in college.” But it was a great statistics class and actually motivated me to go to graduate school in statistics, and during this time, I was reading more, after I’d done the presentation, I was reading more about the ways people misuse statistics. And that presentation that I wrote, I sort of added things to as I learned more. And at some point, it hit 10,000 words of text that I had written in there, as sort of notes for the presentation, and I went, “You know, 10,000 words.
I should do something with this.” And I think at that point, I was in my first semester of graduate school, so I reached out to No Starch Press, who publish a lot of books on R programming and Python and a bunch of technical books, just to see whether they were even remotely interested in the idea, and they said yes, and so it was my first semester of graduate school that I signed the contract, and about a year and a bit later, March 2015, the book finally came out. I was working on the index for the book while sitting in my time series class. Don’t tell my professor. So it did develop during graduate school while I was learning all these things, which gave me kind of an interesting perspective because I’m writing these things that are hard to learn and easy to get confused about right after I learned them, so I knew how easy it was to be confused about them because I had just tried to learn them. That made it kind of easy to write for the desired audience of people who are just learning it.
And, yeah, it was kind of an unusual choice to write it before you start, start writing it before you even get into school for statistics, but then I continued while I was learning all these things, read an incredible amount of papers, too many papers, on these things, and tried to compile it all together into one resource.
Curtis: Got it, yeah. That’s super interesting, and the book is very interesting and insightful. Now we’re coming up on time here. I’ll give you the last word here: are you adding to the book? What are you doing now? What are you researching? And maybe we can leave it there.
Alex: Yeah, so I’m thinking about adding to the book. We’ll see what happens there, if there ends up being a second edition. So I finished graduate school, and now I’m faculty and work on . . . one of my areas of research, I work in statistical research in crime prediction, but I also work on understanding how people learn statistics now, and we’re doing experiments with undergraduate statistics students to figure out how is it that students learn statistics, what are the misconceptions, where do they come from, and how do we improve our teaching to fix that? We’ve been doing a whole bunch of interviews with students as they solve problems to figure out how it is that students think about statistics. We’re discovering that things that we teach in statistics classes, students do not hear the things that we say in the way we expect them to. It’s actually heavily inspired by what I had learned about physics education when I had been a physics major, because in intro physics classes, when you say “force,” the thing that students imagine the force is, is very different from what professors imagine forces to be. So there’s not even a common vocabulary there. So students come out of the class thinking very different things than the professor intended them to think. And physics went through a very difficult time trying to solve that problem. I’m hoping that we can achieve the same thing in statistics, realizing what it is that students aren’t hearing when we say it, and figuring out how we change the way we teach, change the activities we have students do, the work we have them do, so that when you take an intro statistics class, (a) people don’t come out of it hating it, and (b) when people do come out of it, they’ve grasped the core idea of statistics more successfully.
Curtis: I hope you are successful with that, ’cause I think it’s never been needed more than now with having people understand what statistics are and how to use them, so I’m excited to see what you guys come up with.
Ginette: A special thanks to Alex Reinhart for taking the time to talk to us, and you can find his book “Statistics Done Wrong: The Woefully Complete Guide” on Amazon, or at statisticsdonewrong.com.
If you liked the show you can show us some love by dropping us a review on iTunes or wherever you listen to your podcasts, and as a final note, you can always find analytics training and consulting services on our site, datacrunchcorp.com.
We’ll see you next month!
“Loopster” Kevin MacLeod (incompetech.com)
Licensed under Creative Commons: By Attribution 3.0 License