Calculating an optimal sample size or identifying the power of a sample size

Justin Scott


We have a hypothesis we want to test, and we want our conclusions to be generalizable to our population or sub-population of interest. Ideally, we would test everyone, as this would give the most accurate results, but it would also be the most uneconomical approach. Instead, we take a sample of the population of interest.

Small samples are vulnerable to the influence or bias of any one individual. We want a sample that is sufficiently large to be representative of the population of interest.

We need to be able to identify if our hypothesis was correct or not. Typically, we are testing a difference between two or more groups or cohorts. If the test statistics (frequencies, proportions, means, or medians) of the groups are far apart, it is easy to detect a difference and only a small sample size is required. If they are close together, it is harder to detect a difference and a large sample size is required.

Different results will come back for different individual patients or samples. The more variable the results, the "fuzzier" the picture and the larger the sample required to control for this variation.

A sample that is too small will not be representative of the population of interest and may be too vulnerable to the variation of individual results. A sample that is too large will be uneconomical and, if the intervention is invasive or ineffectual, may cause undue stress or pain to subjects or patients. A sample size calculation finds the optimal balance between these two extremes.

What follows is an example of deriving the optimal sample size and power for the t-test: comparing the means of two groups of observations. Other types of test follow a similar logic. You no longer have to derive the values by hand, as this can now be done by software, so the mathematics is provided for background information only.

Hypothesis Testing

The null hypothesis (H0), e.g. the gold standard or control, is tested against the alternative, e.g. a new or cheaper treatment. There are four possible scenarios.

                      H0 True             H0 False
Reject H0             Type I Error        Correct rejection
Fail to Reject H0     Correct decision    Type II Error
Type I error
The probability of rejecting H0 when H0 is true. This is also called the significance level, denoted by α.
Type II error
The probability of failing to reject H0 when H0 is false. This is denoted by β. The power of a test is 1 - β: the probability of correctly rejecting H0 when it is false, which also translates as the probability of correctly accepting the alternative hypothesis H1 when it is true.

By convention, a statistical test in a clinical environment is designed to have a significance level of 5% and a power of 80% or, in larger well-funded studies, 90%.

Sample Size and Power

Optimal Sample Size

In general terms, the optimal sample size is related to the variation and the difference of the test statistics.

Sample size: n \approx \left(\frac{\textrm{Pooled Variation}}{\textrm{Mean Difference}}\right)^{2}

Figure 1: Example of t-test comparison

Technically, it is the sample size that will be able to detect a specified difference, for a specified variation, at a specified significance level (e.g. α=0.05) and power (e.g. 80%). It is the sample size at which we can be confident that we have constrained our Type I and Type II Errors to an acceptable degree.
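For reference, a commonly used approximation for a two-group comparison of means with equal group sizes (a standard textbook result, quoted here as background rather than taken from Figure 1) is:

n \approx \frac{2(z_{1-\alpha/2}+z_{1-\beta})^{2}\sigma^{2}}{\delta^{2}} \textrm{ per group}

where σ is the common standard deviation, δ is the mean difference, and the z terms are standard normal quantiles. With α = 0.05 and 80% power, 2(z_{1-α/2}+z_{1-β})² = 2(1.96 + 0.84)² ≈ 15.7, so roughly n ≈ 15.7 σ²/δ² per group.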

Calculation of the optimal sample size entails an appropriate formula and a number of assumptions and estimations. This is now done by software, with the user supplying the required inputs.
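As an illustration, the same calculation can be sketched in Python with the statsmodels package (one free alternative to commercial tools such as nQuery Advisor). The mean difference and standard deviation below are illustrative assumptions only:

    # Sketch: required sample size for a two-sample t-test (illustrative values)
    from statsmodels.stats.power import TTestIndPower

    mean_difference = 5.0   # assumed clinically relevant difference
    pooled_sd = 10.0        # assumed pooled standard deviation
    effect_size = mean_difference / pooled_sd   # standardised difference (Cohen's d = 0.5)

    analysis = TTestIndPower()
    n_per_group = analysis.solve_power(effect_size=effect_size,
                                       alpha=0.05,        # significance level
                                       power=0.80,        # desired power
                                       alternative='two-sided')
    print(f"Required sample size per group: {n_per_group:.1f}")   # approximately 64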


Power

Power is the confidence we can have in our test; it is its repeatability. If we were to repeat the test 100 times with different random samples, how many times would we expect it to give us the correct result rather than a false one? The standard for clinical research is a power of 80% and, in larger better-funded projects, 90%. Think of it as you would an exam result.

The power of a trial or statistical test is the probability that it correctly rejects the null hypothesis (H0) when the null hypothesis is false, i.e. the probability of not committing a Type II error (which has probability β).

Power = P(\textrm{reject }H_{0}|H_{0}\textrm{ is false}) = 1 - \beta


  • P denotes probability
  • the expression reads: the probability of rejecting H0, given that H0 is false

In general terms, the power is related to the sample size, the variation, and the difference of the test statistics.

Power: \pi \approx \textrm{difference} \times \sqrt{\frac{n}{\textrm{variation}}}

Technically, it is the power of the test to detect a specified difference, for a specified variation, at a specified significance level (e.g. α=0.05).
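Continuing the sketch above, the reverse calculation (power at a fixed sample size) can be obtained from the same statsmodels class; the numbers are again illustrative assumptions:

    # Sketch: power of a two-sample t-test at a fixed sample size (illustrative values)
    from statsmodels.stats.power import TTestIndPower

    effect_size = 0.5    # assumed standardised mean difference (Cohen's d)
    n_per_group = 40     # sample size we are constrained to

    power = TTestIndPower().power(effect_size=effect_size,
                                  nobs1=n_per_group,
                                  alpha=0.05,
                                  alternative='two-sided')
    print(f"Power with {n_per_group} per group: {power:.2f}")   # roughly 0.60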

Figure 2. The graph places the two hypotheses scenarios side by side. The null hypothesis (H0) is on the left and the alternative (H1) is on the right.
abbrev: S.E. = standard error; σ = standard deviation; µ = mean; δ = mean difference; Z = test statistic measuring the strength of the difference

Figure 2 shows the distributions of the two hypotheses side by side. The null hypothesis (H0) is on the left and the alternative hypothesis (H1) is on the right. The power is the green area under the alternative hypothesis (the right-most area) as a proportion of its overall area.

Standard Error

The standard error (SE) is the standard deviation of the mean rather than of the individual observed results. More rigorously, the standard error is the square root of the variance of a statistic. For example, the standard error of the mean of a sample of n observations taken from a population with variance \sigma^{2} is \frac{\sigma}{\sqrt{n}}. The same term is used for the corresponding sample estimate \frac{s}{\sqrt{n}}. The standard error is always smaller than the standard deviation.
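A short numerical sketch of this relationship, using illustrative simulated data: the sample standard deviation stays roughly constant as n grows, while the standard error of the mean shrinks as 1/\sqrt{n}.

    # Sketch: standard deviation vs. standard error of the mean (simulated data)
    import numpy as np

    rng = np.random.default_rng(1)
    for n in (10, 100, 1000):
        sample = rng.normal(loc=50.0, scale=10.0, size=n)
        s = sample.std(ddof=1)        # sample standard deviation
        se = s / np.sqrt(n)           # standard error of the mean
        print(f"n={n:5d}  SD={s:6.2f}  SE={se:6.2f}")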

Sample size vs. Power

Sample size and power are two sides of the same coin. When we are calculating an optimal sample size, we typically set the power to 80%. When practical constraints fix the sample size, we instead calculate the power of the test at that sample size.

Figure 3. Example of Sample Size (x-axis) versus Power (y-axis). Image derived from nQuery Advisor (v7.0) Statistical Solutions Ltd.

Increasing the sample size has a diminishing return: larger numbers give smaller gains in power. At some point, further increasing the sample size becomes cost-prohibitive.
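The diminishing return can be sketched numerically with the same statsmodels power calculation used above (the effect size of 0.5 is an illustrative assumption):

    # Sketch: power as the sample size doubles, showing diminishing gains
    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()
    for n in (10, 20, 40, 80, 160, 320):
        p = analysis.power(effect_size=0.5, nobs1=n, alpha=0.05)
        print(f"n per group = {n:3d}  power = {p:.2f}")
    # Power rises steeply at first (about 0.18 at n=10, 0.60 at n=40)
    # and then flattens, approaching 1.0 with ever smaller gains.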


Estimating a sample size or the power of a study lies somewhere between rigorous science and the divinatory art of numerology.

Sample size and power calculations require estimation of the likely or required statistics before the study or trial has taken place. How can you know what these statistics will be before you have completed your study? The standard approaches are, in order of preference:

  • conduct a pilot study beforehand
  • use retrospective clinical data
  • conduct a literature review
  • guesstimate
  • guess

This author frowns upon the last two options.

The sample size calculated should be based on the numbers you expect at the end of the study, not at the start.

The number of subjects or patients that you have at the end of a study can be substantially less than the number recruited at the start. If you do not take into account drop-outs, increasing study fatigue, and missing values, then your study will end up underpowered and less capable of answering your primary hypothesis question.

Computer-derived sample sizes need to be inflated to take this into account.

Your clinic or literature review may allow you to estimate a missing value percentage (MVP). Otherwise, assume an MVP of 10%.

For a single time measurement study:

n\prime = \frac{n}{(1-\textrm{MVP})}

For a repeated measure study:

n\prime = \frac{n}{(1-\textrm{MVP})(1-\textrm{DR})^{t}}


  • DR is the estimated drop-out rate (e.g. 5% per time point)
  • t is the number of time points
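A minimal sketch of both adjustments in Python, using the default MVP of 10% and an assumed drop-out rate of 5% over four time points (the helper names are hypothetical):

    # Sketch: inflating a calculated sample size for missing values and drop-outs
    def inflate_single(n, mvp=0.10):
        """Single time-point study: n' = n / (1 - MVP)."""
        return n / (1.0 - mvp)

    def inflate_repeated(n, mvp=0.10, dr=0.05, t=4):
        """Repeated-measures study: n' = n / [(1 - MVP) * (1 - DR)^t]."""
        return n / ((1.0 - mvp) * (1.0 - dr) ** t)

    n = 64  # e.g. the per-group size from the earlier t-test sketch
    print(f"Single time point: recruit about {inflate_single(n):.0f} per group")      # ~71
    print(f"Repeated measures, 4 visits: recruit about {inflate_repeated(n):.0f}")    # ~87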

Alternatively, use the original sample size calculation but make it clear in the proposal that this is not the recruitment number but the end-point number. You then have the flexibility to recruit until you eventually reach this target.

Some genetic studies involve rare cases (e.g. low-frequency alleles). If these are of interest, then the sample size calculation needs to focus on detecting differences between these low-frequency cases, not on the overall number of cases. The common-case frequencies are ignored for the sample size calculation.

Statistical significance does not mean clinical significance. The more samples, subjects, or patients you have, the smaller the difference between the statistics that you will be able to detect as statistically significant. At some point this difference becomes clinically meaningless.
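To illustrate with assumed numbers: a difference of only 0.02 standard deviations, which is unlikely to matter clinically, is detected with near-certainty once each group contains 100,000 subjects.

    # Sketch: a clinically trivial effect becomes statistically detectable at huge n
    from statsmodels.stats.power import TTestIndPower

    p = TTestIndPower().power(effect_size=0.02, nobs1=100_000, alpha=0.05)
    print(f"Power to detect a 0.02 SD difference with 100,000 per group: {p:.2f}")  # ~0.99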