Reviewed by Thoralf Mildenberger (ZHAW)

  • Paul D. Ellis, The Essential Guide to Effect Sizes: Statistical Power, Meta-Analysis, and the Interpretation of Research Results. Cambridge University Press, Cambridge 2010. Link to book on publisher’s website.

In the last few years, statistical hypothesis testing – with the p-value still being THE standard for reporting results in many fields of science – has increasingly been criticized. Many researchers have even called for abandoning the “NHST” (Null Hypothesis Significance Testing) approach altogether. I think this goes too far, as many problems are due to misapplication of the techniques and – perhaps even more importantly – misinterpretation of the results. There is also no consensus on how to replace hypothesis testing with a better methodology: some of the more moderate critics suggest using confidence intervals, but while these are often more informative, they are essentially equivalent to hypothesis tests and share some of the same problems. This makes it all the more important to highlight difficulties in the correct application and interpretation of statistical methodology.

One of the most common errors in interpreting the p-value (apart from mistaking it for a probability of the null hypothesis) is mistaking this measure of statistical evidence for a measure of the size of an effect – a highly significant test (very small p-value) only means that there is strong evidence for the existence of an effect, not necessarily evidence for a large effect. For small samples, one can only hope to detect quite large effects, while for very large samples, even tiny and practically irrelevant effects become “statistically significant”. So in many applications the question is not only whether some effect exists, but how large it is – and often also what it means in terms of real-world measures such as the number of lives saved per year, the increase in monetary income, etc.
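
The sample-size point can be made concrete with a small simulation. The following sketch (in Python; purely illustrative and not taken from the book) draws two very large samples whose means differ by a practically negligible 0.05 standard deviations – the difference is nevertheless highly “statistically significant”, while the standardized effect size stays far below even a “small” effect of 0.2:

```python
import math
import random
import statistics

random.seed(42)

def two_sample_z(x, y):
    """Large-sample two-sided z-test for a difference in means."""
    z = (statistics.fmean(x) - statistics.fmean(y)) / math.sqrt(
        statistics.variance(x) / len(x) + statistics.variance(y) / len(y))
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))  # normal CDF
    return z, p

def cohens_d(x, y):
    """Standardized mean difference, using a pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled = math.sqrt(((nx - 1) * statistics.variance(x)
                        + (ny - 1) * statistics.variance(y)) / (nx + ny - 2))
    return (statistics.fmean(x) - statistics.fmean(y)) / pooled

n = 100_000                                            # very large groups
treatment = [random.gauss(0.05, 1) for _ in range(n)]  # true effect: 0.05 SD
control = [random.gauss(0.00, 1) for _ in range(n)]

z, p = two_sample_z(treatment, control)
d = cohens_d(treatment, control)
print(f"p = {p:.2e}, Cohen's d = {d:.3f}")  # tiny p, negligible effect size
```

With 100,000 observations per group the test all but guarantees a “significant” result, yet d ≈ 0.05 is practically irrelevant; with, say, 20 per group, the same true effect would almost never be detected.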

Paul D. Ellis’ “Essential Guide to Effect Sizes”, although already published a few years ago, is still a very good and readable introduction to the topic. The book consists of three parts. The first part gives a general introduction to the problem and introduces the two main types of effect size measures: differences between means standardized by a measure of spread (Cohen-type measures) on the one hand and correlations on the other. It also contains a critical discussion of Cohen’s rough classification of effect sizes into small, medium and large.
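
The two families of measures are closely related: for equal group sizes, a standardized mean difference d can be converted into a (point-biserial) correlation via r = d / √(d² + 4). A minimal sketch (Python, not from the book), mapping Cohen’s rough benchmarks for d onto the correlation scale:

```python
import math

def d_to_r(d):
    """Convert Cohen's d to a correlation r (assumes equal group sizes)."""
    return d / math.sqrt(d * d + 4)

for d in (0.2, 0.5, 0.8):  # Cohen's small, medium, large benchmarks
    print(f"d = {d} -> r = {d_to_r(d):.2f}")  # roughly 0.10, 0.24, 0.37
```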

The second part of the book is about the power of statistical tests, i.e., the probability that a given procedure detects an effect that is actually there. Power generally depends on both the size of the effect and the sample size, so it is crucial to think about whether the chosen experiment even has a reasonable chance of detecting the true effects the experimenter is interested in. Yet the power of much published research in many fields remains very low – a problem that has been known for decades with little improvement.
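
How power depends on effect size and sample size is easy to illustrate with the usual normal approximation for a two-sided, two-sample comparison at α = 0.05 (a rough sketch, not the exact t-test calculation):

```python
import math

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def approx_power(d, n_per_group):
    """Approximate power of a two-sided two-sample test at alpha = 0.05.

    Normal approximation: under the alternative, the test statistic is
    roughly normal with mean d * sqrt(n / 2) and unit variance.
    """
    z_crit = 1.96                            # critical value, alpha = 0.05
    shift = d * math.sqrt(n_per_group / 2)   # noncentrality
    return (1 - norm_cdf(z_crit - shift)) + norm_cdf(-z_crit - shift)

print(approx_power(0.5, 64))  # ~0.80: the classic n for a "medium" effect
print(approx_power(0.5, 20))  # ~0.35: badly underpowered
```

Roughly 64 observations per group are needed for the conventional 80% power at a “medium” effect of d = 0.5; with 20 per group, the chance of detecting the same true effect drops to about one in three.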

The third part of the book is about a related but different topic: meta-analyses, i.e., combining the results of existing studies. Meta-analyses are performed mainly in psychology (where they originated), medicine and various areas of social research as well as marketing research, but they are not widely known outside these fields. Ellis gives a short introduction to the main techniques and approaches as well as some practical advice, especially on how to avoid biases (some of which are at least in part due to the general problems with significance testing).
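
The workhorse of a basic meta-analysis is the inverse-variance weighted average: each study’s effect estimate is weighted by the reciprocal of its squared standard error, so that more precise studies count more. A minimal fixed-effect sketch (Python, with made-up study numbers):

```python
import math

def fixed_effect_meta(effects, ses):
    """Fixed-effect (inverse-variance weighted) pooled estimate."""
    weights = [1 / se ** 2 for se in ses]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    return pooled, pooled_se

# three hypothetical studies reporting a standardized mean difference
effects = [0.30, 0.45, 0.12]
ses = [0.15, 0.20, 0.10]

est, se = fixed_effect_meta(effects, ses)
lo, hi = est - 1.96 * se, est + 1.96 * se
print(f"pooled d = {est:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

A real analysis would of course also check heterogeneity between studies (e.g. via a random-effects model) and guard against the publication biases Ellis discusses.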

There are only a few formulas, and software implementations are mentioned in the chapter end notes without going into much detail. The remarks on software only concern packages like SPSS and stand-alone tools for special calculations; I did not find a single mention of R in the whole book. Nevertheless, the end-of-chapter notes are full of references to the literature, and there is an extensive (although surely not exhaustive) bibliography.

The book is short and non-technical, making it useful both for researchers in the fields where these methods are most relevant (medicine, psychology and the social sciences) and for data scientists looking for a primer on the main ideas; the latter group will of course want to move on to the more technical literature afterwards. It should also be read by a wider audience of scientists and professionals, as it clears up some of the confusion about hypothesis testing, non-reproducibility and related issues. It certainly deserves a larger readership!

Paul D. Ellis also maintains a website with an effect size FAQ based on the book.

[personally signed contributions reflect the opinions of their authors and not necessarily those of datalab]