“How many participants should you run in a usability study?”
How many times have you heard that question?
How many different answers have you heard?
After you sift through the unhelpful ones, probably the most common answer you’ve heard is five. You might have also heard that these “magic 5” users can uncover 85% of a product’s usability issues. Is that true? Are five enough, too few, or too many?
How can you know? Can you really know?
Or are we just resigned to hearing the most dogmatic voices on social media? What are the alternatives?
Perhaps we should average the advice of others or make our lives easier by sidestepping the question altogether.
We’ve seen both approaches taken. But is there a better way to find sample sizes?
And is there a single sample size that is right for all usability studies?
One Size Does Not Fit All: Define the Study Type
You probably know the answer: One sample size does not fit all studies. Not much of a surprise there. But there is a way to get to a sample size that doesn’t involve democracy or demagoguery.
The first step in finding a sample size is to define the study type. For the purposes of sample size estimation, there are three types of usability studies: Problem Discovery, Estimation, and Comparison (Table 1).
| # | Type | Purpose | Example | Formative or Summative |
|---|---|---|---|---|
| 1 | Problem Discovery | Finding Problems and/or Insights | What are the usability problems for the check-out flow? | Formative |
| 2 | Estimation | Estimating a Value/Parameter | What is the SUS score for all users of a product? | Summative |
| 3 | Comparison | Making a Comparison | Is there a difference in SUS scores or is the score above average? | Summative |
In contrast to the focus on measurements taken during summative user research (study types 2 and 3), the goal of problem discovery usability studies (type 1) is to discover and enumerate the problems that users have when performing tasks with a product. It’s considered a formative type of evaluation.
So, what’s the sample size for each study type? 5, 50, 100?
One Size Does Not Fit All Even Within Study Types
While defining the study type helps narrow the proper approach to sample size estimation, it still doesn’t warrant recommending one number. Because there’s math involved, it’s understandable that people seek a simple single number. We’ve been trained to find a single answer to simple math problems: 2+2 always equals 4. The square root of 9 is always 3. The answer is determined because there aren’t any variables—life is great!
As soon as you introduce variables, however, things get more complicated. The hypotenuse of a right triangle is always equal to the square root of the sum of the squares of the two other sides (a² + b² = c²), but the actual length of the hypotenuse depends on the length of the two sides.
The methods for finding sample sizes for summative studies are typically taught in university statistics classes. Those methods include several variables whose values can differ from study to study, including alpha and beta decision criteria (which control the long-run probability of Type I and Type II errors), the standard deviation of the metric, and the smallest difference that you need to detect to make the necessary decisions (i.e., the critical difference). Changing any of these variables will change the sample size needed to meet the requirements.
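To make the dependence on those variables concrete, here is a minimal sketch using the textbook normal-approximation formula for a two-sided test, n = ((z_alpha + z_beta) × s / d)². This formula and the function name `comparison_sample_size` are assumptions added for illustration, not something taken from this article; methods based on the t distribution give slightly larger values.

```python
from math import ceil
from statistics import NormalDist


def comparison_sample_size(sd, critical_difference, alpha=0.05, power=0.80):
    """Approximate sample size for detecting a difference of a given size.

    Hypothetical helper: uses the normal approximation for a two-sided test,
    n = ((z_alpha + z_beta) * sd / critical_difference) ** 2, rounded up.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # controls Type I errors
    z_beta = NormalDist().inv_cdf(power)           # power = 1 - beta (Type II)
    n = ((z_alpha + z_beta) * sd / critical_difference) ** 2
    return ceil(n)


# Same standard deviation, but a smaller critical difference to detect.
print(comparison_sample_size(sd=10, critical_difference=5))
print(comparison_sample_size(sd=10, critical_difference=2.5))
```

Under this approximation, halving the critical difference roughly quadruples the required sample size, which is why no single number can cover every summative study.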
Problem discovery sample sizes use a less familiar approach. We’ve discussed in previous articles the mathematics commonly used to derive sample sizes for formative problem discovery usability studies and how well that math matches reality.
So, what is the formula for finding sample sizes for problem discovery studies?
Sample Size Formula for Discovery Studies
You don’t need to fully understand the derivation of the formula to use it, but it helps to know what its elements mean. It has only two: n and p.
P(at least once) = 1 − (1 − p)^n
Here, p is the likelihood that a problem (or event) occurs in the tested population, and n is the sample size. Together, they determine the probability of seeing the problem at least once in a formative usability study with n participants.
Technical note: We manipulated the binomial probability formula to get to 1 − (1 − p)^n, but there are other ways to arrive at this formula, including the Poisson probability formula and capture-recapture models.
The formula above computes the probability of detecting a problem given a sample size and its frequency in the population. It can be rearranged using algebra to solve for the sample size.
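As a minimal sketch of that calculation, the function below evaluates one minus (1 − p) raised to the power n. The name `discovery_probability` is hypothetical, and the two p values are just illustrative inputs.

```python
def discovery_probability(p: float, n: int) -> float:
    """Probability of seeing a problem at least once among n participants.

    p is the likelihood that the problem occurs for any one participant.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - p) ** n


# Five participants, one common problem and one less common problem.
for p in (0.31, 0.10):
    print(f"p = {p:.2f}, n = 5 -> {discovery_probability(p, 5):.1%}")
```

With five participants, a problem that affects 31% of users is very likely to show up at least once, while one that affects 10% of users will be missed more often than not.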
Because n is an exponent in the formula, it’s necessary to use logarithms to manipulate the formula to focus on the sample size instead of the probability of discovering the event of interest at least once. The resulting formula is:
n = ln(1 − P(at least once))/ln(1 − p)
Don’t worry too much about the formula other than to note that it shows that the sample size for a discovery study is driven by the discovery goal (P(at least once)) and how likely an event is to happen during the discovery (p).
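For readers who do want to see the algebra, the rearrangement can be sketched in a few steps. Here P is shorthand (an assumption of this sketch) for P(at least once), and ln is the natural logarithm.

```latex
\begin{align*}
P &= 1 - (1 - p)^n \\
(1 - p)^n &= 1 - P \\
n \ln(1 - p) &= \ln(1 - P) \\
n &= \frac{\ln(1 - P)}{\ln(1 - p)}
\end{align*}
```

Both logarithms are negative whenever P and p are strictly between 0 and 1, so their ratio is a positive sample size.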
As mentioned above, in the best-known rule of thumb for usability study sample sizes, the “magic number 5,” the claim is that five participants are enough for the discovery of 85% of usability problems (strictly speaking, 85% of the problems that are available for discovery given the constraints of the study regarding the sampled population and tasks).
Why 85%?
Nothing is inherently right or wrong with a discovery goal of 85%. It deviates from the more expected convention of 95% or 90% used in confidence intervals, but like a confidence level, the discovery goal can take any value from 1% to 99%. So, where did 85% originally come from?
Several early investigations into using these formulas to predict problem discovery rates as a function of sample size (e.g., Virzi, 1990; Nielsen & Landauer, 1993) reported finding that four or five participants discover 80–85% of the problems in large-sample usability studies. Over time, these findings became the simplified “magic number 5” rule.
An early test of the simple goal of 85% discovery was an economic ROI simulation published in 1994 (by Jim Lewis) that estimated the costs associated with running additional participants, fixing problems, and failing to discover problems in formative usability studies. Although all the independent variables influenced the sample size at the maximum ROI, the variable with the broadest influence was the average likelihood of problem discovery (p), which also had the strongest influence on the percentage of problems discovered at the maximum ROI. The results indicated that, when the target value of p is small (e.g., 10%), practitioners should plan to discover about 86% of the problems available for discovery in the study. When p is greater (e.g., 25–50%), the appropriate goal is about 98% discovery.
Things get trickier determining how often events of interest occur during the study. A common estimate of that likelihood is 31%. But where did that come from?
Why 31%?
In the research Jakob Nielsen and Thomas Landauer published in 1993, which was the basis of their recommendation for running formative usability studies with five participants, the value they computed for the likelihood of problem occurrence was .31.
This was the average of the problem discovery rates reported in 11 usability studies they had conducted or had acquired from other researchers at the time (including one from Jim Lewis; see Figure 1 for the correspondence between Nielsen and Lewis in 1991). When they used their version of 1 − (1 − p)^n and graphed the expected percentage of discovery for sample sizes from 1 to 15 and p = 31%, their estimated discovery rate was 85% when n = 5.
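The shape of that curve can be reproduced with a short sketch that tabulates expected discovery for sample sizes 1 through 15 at p = .31. The function name `discovery_curve` is hypothetical, and this uses the simple binomial-based formula rather than Nielsen and Landauer's exact version.

```python
def discovery_curve(p: float, max_n: int) -> dict:
    """Expected discovery rate for each sample size from 1 to max_n."""
    return {n: 1.0 - (1.0 - p) ** n for n in range(1, max_n + 1)}


curve = discovery_curve(0.31, 15)

print(" n  expected discovery")
for n, rate in curve.items():
    print(f"{n:2d}  {rate:6.1%}")
```

At n = 5 the simple formula gives about 84.4%, in line with the roughly 85% figure, and each additional participant adds less than the one before.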
If you plug .85 and .31 into the sample size formula, you get:
n = ln(1 − .85)/ln(1 − .31) = (−1.897)/(−0.371) = 5.11
So, math supports running five participants in a discovery study if (1) the discovery goal is 85% and (2) the probability of the occurrence of an event of interest is 31%. (You can also use our online calculator, which will do the math for you.)
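The same arithmetic can be sketched in a few lines of code. The function name `discovery_sample_size` is hypothetical, and whether to round the result up or to the nearest whole participant is left as a judgment call rather than something the formula decides.

```python
from math import ceil, log


def discovery_sample_size(goal: float, p: float) -> float:
    """Unrounded n = ln(1 - goal) / ln(1 - p).

    goal is the desired probability of seeing an event at least once;
    p is the probability that the event occurs for one participant.
    """
    if not 0.0 < goal < 1.0 or not 0.0 < p < 1.0:
        raise ValueError("goal and p must be strictly between 0 and 1")
    return log(1.0 - goal) / log(1.0 - p)


raw = discovery_sample_size(0.85, 0.31)
print(f"unrounded n = {raw:.2f}")
print(f"rounded up  = {ceil(raw)}")
```

The unrounded result is just over five. Running five participants gives slightly less than 85% expected discovery, while rounding up guarantees the goal is met.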
But as mentioned above, one size does not fit all. What if, in your research context, you need to discover more or fewer than 85% of the events of interest, and what if their probability of occurrence is less or greater than 31%?
In those cases, you need a size chart, analogous to shopping for a men’s dress shirt to fit a given neck size and sleeve length (desired discovery rate and problem likelihood). We’ll publish that size chart in a future article.
Summary and Discussion
How many participants do you need for a usability study?
It depends first on the study type. There are three study types: discovery, estimation, and comparison. Sample size estimation for discovery studies uses a different mathematical approach than the one used for estimation and comparison studies.
It still depends within study types. Don’t rely on averaging together recommendations or looking for a single number that will always work even when focusing within a study type such as discovery.
What about the “magic number 5”? The controversial claim based on the research of Nielsen and Landauer that “five is enough” turns out to sometimes be true, but only for a limited range of research contexts.
What about any other magic number? Because the appropriate sample size for discovery studies depends on two factors, no one magic number will be appropriate for all research contexts. In fact, there is no magic number for sample sizes for any type of usability study, formative or summative.
Use the formula for problem discovery. The problem discovery formula can be used to find the sample size based on the expected likelihood of problem occurrence (p) and the desired likelihood of seeing a problem at least once (the discovery goal). You can also use the online calculator.
Parameters have defaults but should be changed when necessary to fit the research needs. The typical discovery goal is 85%, but this can be increased or decreased depending on the context. The parameter of 31% for the probability of problem occurrence came from an average across datasets from the 1990s. It’s not a bad place to start, but it shouldn’t be the only value for this parameter. Using values of 10%, 20%, and even 5% may make sense depending on how important it is to discover uncommon problems.
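To see how much the choice of p matters, the sketch below holds the discovery goal at 85% and varies p across the values mentioned above. The helper name `discovery_sample_size` and the choice to round up are assumptions made for illustration.

```python
from math import ceil, log


def discovery_sample_size(goal: float, p: float) -> float:
    """Unrounded n = ln(1 - goal) / ln(1 - p)."""
    return log(1.0 - goal) / log(1.0 - p)


GOAL = 0.85  # discovery goal held constant for this comparison

# Round up so the goal is met or exceeded (a conservative assumption).
needed = {p: ceil(discovery_sample_size(GOAL, p)) for p in (0.31, 0.20, 0.10, 0.05)}

print("  p    participants needed")
for p, n in needed.items():
    print(f"{p:4.0%}   {n}")
```

As p drops from 31% to 5%, the required sample size grows from a handful of participants to several dozen, so the value of p deserves as much thought as the discovery goal.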
If there isn’t a magic number, should we give up on sample size estimation for formative usability studies? Giving up on magic numbers doesn’t mean you have to give up on sample size estimation for formative usability studies (or any other type of discovery study). You just need to be able to make decisions about (1) how rare an event you need to be able to detect at least once and (2) what percentage of those events you need to discover in the study.
Bottom line: It would be nice if this process were simpler, but unfortunately, one sample size does not fit all research requirements. Fortunately, there is a mathematical model that can guide UX professionals to make reasoned decisions about sample size requirements for formative usability studies.