On the number of people required for a usability evaluation

How many people are needed for a usability study? The question comes up time and time again, and the answers differ. This time, Hwang and Salvendy have tried to answer it with a meta-analysis of the literature published since 1990. Their inclusion criteria were:

  1. the usability evaluation used one of three methods: think-aloud, heuristic evaluation or cognitive walkthrough
  2. the study reported the number of participants in the evaluation (users or evaluators) and the overall discovery rate of errors.

Out of the 102 usability evaluation experiments found, only 27 satisfied the inclusion criteria. Hwang and Salvendy then performed a linear regression analysis on the data, and tried to estimate the number of people needed to detect 80% of the usability problems. The results: 9 for think-aloud, 8 for heuristic evaluation and 11 for cognitive walkthrough. This leads them to propose 10±2 as a rule of thumb.

I have a few problems with this approach. The ‘how many users’ question is a very logical question to ask, both when planning a study and when interpreting the results. But the statistical analyses are just numbers, and to determine the real value of a study one should not forget the content.

  • Not all errors are equal. While 80% detection sounds good, issue severity should not be neglected. Hwang and Salvendy mention that the cognitive walkthrough method is good at finding critical issues, but less adequate for detecting minor flaws. Knowing that, I wouldn’t choose to increase the number of evaluators to 11 (!), but would rather combine a smaller early cognitive walkthrough with another evaluation method later, maybe even on a fixed prototype. If 2 or 3 evaluators using the cognitive walkthrough method can point to some of the severe issues, this is valuable enough in itself.
  • Issues are not always errors. A usability evaluation can help discover potential problems, but the way these issues influence the user experience and user behavior in real life may not be the same as in the user study.
  • The usefulness of your results depends on how they are used. Using the results wisely to improve the next design iteration of a system is useful. Using them to formulate specific questions to be answered with other methods (think of A/B testing for example), or to decide on what data to collect, makes sense too. But using them to calculate a magic number to report to stakeholders (based on ‘few issues found = good system’) makes less sense.

Besides, what does that 80% number mean? It refers to 80% of the total number of issues hidden in a system, but that total is not a real, measurable quantity. Okay, when the number of distinct issues found is plotted against the number of participants used, the curve does flatten and an upper limit can be estimated. But even with very large groups, there is still a chance that one more participant will find something that all of the others overlooked. In addition, the characteristics of the participants and the protocol used also influence which types of issues will be found easily.
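
To make the flattening curve a bit more concrete, here is a minimal sketch in Python. It is not the regression Hwang and Salvendy performed. It assumes a simplified model in which every participant independently finds any given issue with the same probability p, so that n participants are expected to find a share of 1 - (1 - p)^n of the issues. The value p = 0.15 and the function names are hypothetical, chosen only for illustration; real issues differ in how easy they are to spot, which is exactly the caveat raised above.

```python
import math


def expected_fraction_found(n, p):
    """Expected share of issues found by n participants.

    Assumption (not from the paper): each participant finds any given
    issue independently with the same probability p.
    """
    return 1.0 - (1.0 - p) ** n


def participants_needed(target, p):
    """Smallest n whose expected share of issues found reaches target.

    Solves 1 - (1 - p)**n >= target for n, with 0 < target < 1 and 0 < p < 1.
    """
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


if __name__ == "__main__":
    p = 0.15  # hypothetical per-participant detection probability

    # The curve rises quickly at first and then flattens out.
    for n in (1, 3, 5, 10, 20, 40):
        share = expected_fraction_found(n, p)
        print(f"{n:>2} participants: {share:.1%} of issues expected")

    print("participants for 80%:", participants_needed(0.80, p))
```

Under this assumed model the expected share keeps creeping towards 100% as participants are added but never quite gets there, which mirrors the point that one more participant can always turn up something new. With a different assumed p, the number of participants needed for 80% changes considerably, so the output says more about the assumption than about any real system.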

However, I think that in practice the 10±2 guideline should work pretty well, especially for the think-aloud case with non-expert users. With a very small group of users, it is often difficult to say if the findings will generalize across the real user base. On the other hand, using a very large number of people is costly in time and money, and does not add much value, since there will be lots of repetition in your findings.

Hwang, W., & Salvendy, G. (2010). Number of people required for usability evaluation. Communications of the ACM, 53(5). DOI: 10.1145/1735223.1735255
