Type-1 and Type-2 Error Rates Explained

In frequentist hypothesis testing there are two types of error rates, aptly referred to as type-1 and type-2 error rates. An error rate is the rate at which a particular kind of incorrect conclusion is expected to occur. Type-1 errors occur when the null hypothesis is rejected despite the fact that it was true. For example, let's say that in objective reality there is no difference between a signup form with three fields and a signup form with two fields. The experimenter (not knowing this objective truth) designs an experiment to test whether two-field signup forms lead to more signups than three-field forms. Here the null hypothesis is that there is no difference, with the alternative hypothesis being that the two-field form is better than three fields (or just different, depending on whether the test is one- or two-sided). You run the experiment and the results say that the two-field form was indeed better. We know that can't be true because of the objective truth that both were the same. This is a type-1 error, and unfortunately we won't know when it has occurred because we don't get these objective truths handed down to us from some deity; if we did, we wouldn't need to experiment in the first place! Statistical analysis is not guaranteed to be accurate 100% of the time. Since we're only looking at a sample of the total population and extrapolating results to the population as a whole, there will be times when our best predictions don't line up with reality. Type-1 errors can most intuitively be thought of as false positives, because the positive result (that two fields were better than three in our example) did not align with reality, i.e. it was false.
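To make this concrete, here's a minimal simulation sketch (in Python with NumPy and SciPy, using made-up visitor counts and conversion rates) of what a 5% type-1 error rate looks like in practice: both forms convert at exactly the same rate, yet roughly one test in twenty still comes back "significant."

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05          # type-1 error rate we are willing to accept
n_experiments = 10_000
n_visitors = 2_000    # visitors per variant in each experiment (illustrative)
true_rate = 0.10      # both forms convert at 10% -- the null is actually true

false_positives = 0
for _ in range(n_experiments):
    a = rng.binomial(n_visitors, true_rate)   # signups with three fields
    b = rng.binomial(n_visitors, true_rate)   # signups with two fields
    # chi-squared test on the 2x2 signup/no-signup table
    table = [[a, n_visitors - a], [b, n_visitors - b]]
    _, p, _, _ = stats.chi2_contingency(table, correction=False)
    if p < alpha:
        false_positives += 1

# Expect roughly 0.05: about one "winner" in twenty, even though nothing changed.
print(f"False positive rate: {false_positives / n_experiments:.3f}")
```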

Type-2 errors, as you may have guessed, occur when the opposite happens. Type-2 errors can be intuitively thought of as false negatives. If we take the previous example and switch things up a bit, we can construct a hypothetical scenario in which a type-2 error occurs. Let's say the objective truth handed down to us instead says that two fields really are better than three for signup conversions. An experimenter runs the same test and the result says there is no difference between two fields and three (i.e. we fail to reject the null hypothesis). Again, we as the readers with perfect information know this doesn't line up with objective truth. Does the distinction make sense? Does it now follow why these can intuitively be thought of as false negatives? I know the terminology is difficult to grasp at first, but don't let the statistical jargon dissuade you. The concepts are more important to understand, and when you think of them in a way that makes sense you build intuition; you can always refer to the definition later as needed. Eventually, you won't even need to do that.
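And here is the mirror image: a sketch with illustrative numbers where two fields truly do convert a little better than three, yet at a modest sample size the test often fails to detect it. Every one of those misses is a type-2 error.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
alpha = 0.05
n_experiments = 10_000
n_visitors = 2_000          # visitors per variant (illustrative)
rate_three_fields = 0.10    # hypothetical true conversion rates:
rate_two_fields = 0.11      # two fields really is better here

false_negatives = 0
for _ in range(n_experiments):
    a = rng.binomial(n_visitors, rate_three_fields)
    b = rng.binomial(n_visitors, rate_two_fields)
    table = [[a, n_visitors - a], [b, n_visitors - b]]
    _, p, _, _ = stats.chi2_contingency(table, correction=False)
    if p >= alpha:          # we failed to detect a difference that really exists
        false_negatives += 1

# How often this happens depends on the size of the real effect and the
# sample size; with a small lift and modest traffic it can be very common.
print(f"False negative rate: {false_negatives / n_experiments:.3f}")
```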

Now that you know what type-1 and type-2 errors are, let's talk about what the common error rates are and what they mean for your experiment and overall testing strategy. Some A/B testing platforms will let you choose your type-1 and type-2 error rates, and others make a default assumption and use it for all experiments. At Engauge, we do a combination of the two. We assume a default of a 5% type-1 and a 1% type-2 error rate, but allow advanced users to change the defaults on their own. The way to interpret this assumption is that for every twenty Engauge experiments run where there is truly no difference between variants, one (because 1/20 = 5%) will return a false positive. Additionally, one of every one hundred experiments run where a real difference exists will return a false negative. We set the defaults to have lower error rates than what could be considered the industry standards of 10% and 5% for type-1 and type-2 error rates, respectively. We believe the standard error rates are too high, and because of our smart algorithms we can default to stricter standards while delivering results just as fast, if not faster, than traditional A/B testing platforms. Despite Engauge's superior error rate defaults, we see again that A/B testing is not a perfect system. We are using statistics to make the best guess possible, not looking into a crystal ball to see the future.
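If you like seeing the arithmetic spelled out, here's a tiny sketch of how those defaults translate into the "one in twenty" and "one in a hundred" figures, along with the "confidence" and "power" terms you'll see elsewhere. The numbers are just the defaults described above; this is not Engauge's actual code.

```python
alpha = 0.05   # type-1 error rate: false positive rate when there is no real difference
beta = 0.01    # type-2 error rate: false negative rate when a real difference exists

confidence = 1 - alpha   # 0.95, usually quoted as "95% confidence"
power = 1 - beta         # 0.99, usually quoted as "99% power"

# Out of 20 experiments where the null is actually true, expect about
# alpha * 20 = 1 false positive; out of 100 experiments where a real
# difference exists, expect about beta * 100 = 1 false negative.
print(confidence, power, alpha * 20, beta * 100)
```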

The error rate you as the experimenter are willing to accept is something you should carefully consider. The tradeoff is always between speed of results and definitiveness of results. If you want a really quick turnaround time and are willing to accept the greater risk of a high error rate, you could set a type-1 error rate as high as 20%. Now one in every five experiments where no real difference exists will return a false positive. You will learn things about your users faster, but each result you get back will be highly suspect. Or you could use a type-1 error rate of 1% and likely end up waiting quite a while for the results. Since we're thinking about this from a business perspective, and not a medical one where lives are on the line, it makes sense to find a sane middle ground where you can have a good deal of confidence in your results without having to wait until you run out of runway to get them. We think the sweet spot is 5%, but we welcome you to adjust that to your liking now that you understand the tradeoffs you will be making in doing so.
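To put some numbers on that tradeoff, here's a rough sketch using the standard textbook normal-approximation sample size formula for comparing two conversion rates. The 10% baseline and 12% target are purely illustrative, the type-2 error rate is held at our 1% default, and Engauge's own algorithms may size experiments differently; the point is simply that a stricter type-1 error rate means more visitors, and therefore more waiting.

```python
from scipy.stats import norm

def visitors_per_variant(p1, p2, alpha, beta):
    """Rough normal-approximation sample size per variant for a
    two-sided two-proportion test (standard textbook formula)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(1 - beta)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2

# Hypothetical 10% baseline conversion, hoping to detect a lift to 12%,
# holding the type-2 error rate at 1% while varying the type-1 rate.
for alpha in (0.20, 0.10, 0.05, 0.01):
    n = visitors_per_variant(0.10, 0.12, alpha=alpha, beta=0.01)
    print(f"alpha={alpha}: ~{n:,.0f} visitors per variant")
```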

So get out there and start testing, and don't be afraid to turn that dial to the error rates you think are right for you. Thanks for reading, and follow us on Twitter for more information on A/B testing, Engauge, and the math it's all based on.