Sample Size Calculation

Sample size calculation has always been an important topic in medical and nursing research, and reviewers frequently ask about it. The need for sample size planning generally comes down to four concerns: cost, feasibility, science, and ethics.

While we acknowledge the importance of all these considerations in sample size calculation, we address only the scientific concern in this exposition.

The following examples, slightly modified from their original studies, are used for illustration.

Example 7: A study on workplace violence against nurses

Example 8: RCT on a supportive intervention for fatigue

Example 9: RCT on bed-chair pressure sensors

Note that software is available for sample size calculation. Therefore, our intention is not to deliver explicit calculation methods but to describe principles that transfer across situations and make the use of different software easier.

When is it Needed?

For confirmatory studies, where we have a specific research hypothesis to test, sample size calculation is mandatory and should be properly described. A good description of a sample size calculation is one that enables a person with reasonable statistical proficiency to reproduce the calculation.

For exploratory and pilot studies, care should be taken not to put undue emphasis on sample size calculation, although an educated estimate is still useful. Without a specific hypothesis to test, it is impossible to perform a formal sample size calculation. In particular, pilot studies typically do not yet have clearly defined key outcomes and do not possess prior data for sample size calculation. Indeed, these studies are used to check feasibility and to provide information for the proper planning of future studies.

The Key

Many flaws in sample size calculation arise because the main analysis has not been specified. Thus, the key to starting a sample size calculation is planning the main analysis. Planning the analysis naturally leads us to look for the relevant information, which includes at least:

1. Main study objective(s). There can be many objectives in a study. However, sample size is calculated based only on the main study objective(s). If you have difficulty sorting out the main objective(s) of a study, the study is likely exploratory.
2. Main outcome(s). Similarly, there can be many outcome measures in a study. Sample size is calculated based only on the main outcome(s).
3. Design. This refers to how data are collected from each study subject.
4. Method of analysis. The method of analysis can be broadly classified as Parameter Estimation or Significance Testing. It is influenced by the preceding items. The two types of analysis result in different ways of calculating sample size.
5. Level of significance. This refers to the maximum chance of committing a false positive error. Very often, it is set to 0.05.

Note that this is not the only information we need for sample size calculation. However, only once we have a proper plan for the main analysis are we ready to look for the additional information and the method required for calculating a sample size.

Consequently, when describing the sample size calculation, we need to state the method of analysis that we plan to perform. This should be consistent with the analysis plan, which is often written in a separate section. This sounds like an obvious practice, yet inconsistencies are not uncommon in real study protocols and even in published reports!

Parameter Estimation

There are studies whose main objective is to estimate a certain population parameter of interest.

Although it is quite common to calculate a sample size based on the use of a significance test, the use of a significance test is not required in studies aiming for parameter estimation. Rather, the analysis plan is the computation of a confidence interval at a certain level of confidence.

Now, we are ready to look for additional information to calculate a sample size. Recall that the half-width of a confidence interval measures the error of using a sample estimate for an unknown true parameter value (see the figure below).

Sample size can be calculated so that the maximum error is within our tolerable limit, say e, up to a certain confidence level. This forms the basis for calculating the sample size.

How the sample estimate and its standard error (SE) are computed depends on what we want to estimate. The sample size, denoted by n, for estimating an unknown parameter using a 95% confidence interval is shown in the table below.

Thus, the values we need before we can calculate the sample size are

e, the maximum tolerable error
This is a quantity specified by the researcher. Certainly, the larger it is, the smaller the sample size.
An estimate of SD (when estimating population mean) or p (when estimating population proportion). They may be obtained by several means:
- from pilot data
- from the literature of studies that measured the same outcome on similar subjects
- by conservative guess, e.g., the sample size is the largest when p is taken as 0.5

Note that the sample size formula given in the above table gives the minimum required sample size. When the result is not a whole number, we should round up to the next integer.
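For reference, the reasoning above can be written out as a short derivation. This is a sketch under stated assumptions: it uses the standard large-sample (normal approximation) forms with simple random sampling, SE = SD/sqrt(n) for a mean and SE = sqrt(p(1-p)/n) for a proportion, and 1.96 as the multiplier for 95% confidence. The subscripted names are introduced here for illustration only.

```latex
% Require the maximum error (CI half-width) to be at most e:
%   1.96 \times \mathrm{SE} \le e
% Estimating a population mean, with \mathrm{SE} = \mathrm{SD}/\sqrt{n}:
n_{\text{mean}} \;\ge\; \left(\frac{1.96\,\mathrm{SD}}{e}\right)^{2}
% Estimating a population proportion, with \mathrm{SE} = \sqrt{p(1-p)/n}:
n_{\text{proportion}} \;\ge\; \frac{1.96^{2}\,p\,(1-p)}{e^{2}}
```

Because p(1-p) is largest at p = 0.5, taking p = 0.5 gives the most conservative (largest) sample size for a proportion.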

A sample paragraph describing the sample size calculation for parameter estimation in the workplace violence study is:

The main study objective was to estimate the prevalence of workplace violence against nurses in Hong Kong. Based on a conservative estimated prevalence of 50% and an error of not more than 5% by using a 95% confidence interval, we need 385 subjects.
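The figure of 385 in the sample paragraph can be checked with a few lines of Python. This is a minimal sketch assuming the usual normal-approximation formula and simple random sampling. The function names are hypothetical and chosen for illustration only.

```python
import math
from statistics import NormalDist


def z_for_confidence(confidence=0.95):
    """Two-sided standard normal multiplier, about 1.96 for 95% confidence."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)


def n_for_proportion(p, e, confidence=0.95):
    """Smallest n so that the CI half-width for a proportion is at most e."""
    z = z_for_confidence(confidence)
    return math.ceil(z ** 2 * p * (1 - p) / e ** 2)


def n_for_mean(sd, e, confidence=0.95):
    """Smallest n so that the CI half-width for a mean is at most e."""
    z = z_for_confidence(confidence)
    return math.ceil((z * sd / e) ** 2)


# Workplace violence example: conservative p = 0.5, tolerable error e = 0.05
print(n_for_proportion(0.5, 0.05))  # 385
```

Trying a less conservative guess, such as p = 0.3, gives a smaller n, which illustrates why p = 0.5 is the safe choice when no prior estimate is available.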

Significance Testing

Sample size calculation for studies with the objective of testing a relationship or difference may vary widely. We do not attempt to cover all situations but to describe the general principles and things to avoid.

Again, the first thing we need to work out is the main analysis plan; the required information is listed in The Key above. In particular, the choice of statistical test, although it can be difficult, should be made carefully.

Examples of analysis plans

The main analysis plan affects what additional information we need to gather before calculating the sample size. Nevertheless, for controlled studies, the commonly required information includes:

Effect size
Anticipated variability
Power

Effect size

It is the smallest difference between the comparison groups that can be considered clinically meaningful, i.e., the minimal clinically important difference (MCID). Deciding on an effect size is generally more of a clinical decision. For example, 5 mmHg can be an effect size for blood pressure, given that the measurement error of blood pressure is often 5 mmHg. If one is unsure, one may look up randomized controlled studies or study protocols published in reputable journals and see the effect sizes used in their sample size calculations.

In general, it can be relatively difficult to decide an effect size for self-reported outcomes because they do not have direct clinical interpretation. Determination of their effect sizes or MCIDs would need to be benchmarked with another outcome with direct clinical interpretation. However, this is only feasible when one can afford the time and resources.

Sometimes, the effect size used for sample size calculation gets mixed up with the observed effect size. Both are effect sizes, but they are conceptually different. The observed effect size is only available after the collected data have been analyzed. It may be too small to be discernible, or it may be large. For example, in an RCT on hypertension, an observed effect size of 1 mmHg is too small to be considered clinically important, whereas an observed effect size of 10 mmHg is large, but a smaller one such as 8 mmHg is also of clinical importance. Therefore, the observed effect size is different from the MCID. Consequently, using the observed effect size from a previous study for sample size calculation may result in a sample size that is smaller or far larger than what is actually needed.

So, why do we need to specify an effect size in sample size calculation? Think of how many subjects we would need to detect a zero difference! (Answer: an infinitely large number.) Indeed, we are often not interested in a very tiny difference. For instance, you won't bother to ask me to pay back a loan of only one cent! Instead, we want to ensure an adequate sample size to detect only a difference of discernible size.

When the main outcome is continuous, the effect size divided by the SD is called the standardized effect size. One should check carefully, especially when using computer software, whether the effect size to be entered is standardized or not.

Anticipated variability

Depending on the measurement scale of the main outcome, a different measure of anticipated variability is required.

When the main outcome of interest is quantitative, we need to have an estimate of

the SD of the outcome

This information can often be obtained from a pilot study. Alternatively, we may make the best use of the literature. Note that although we may not have much information on the test treatment, there should be documented information on the conventional treatment.

When the main outcome is binary, we need to determine

the anticipated proportion in the conventional group.

We may again make use of a pilot study or the literature. Note that when the proportion is 0.5, the variability is the highest and thus the sample size is the largest.

Power

This is often set at 80%, which is also the minimum acceptable level.

However, regulatory approval often requires substantial evidence of the effectiveness of an investigational intervention. Substantial evidence may be demonstrated by consistent results across similar studies with minimal statistical errors. When one is working on a study that is unlikely to be repeated, such as one in subjects with a rare disease, a higher power of, say, 90% is preferable to improve the chance of regulatory success.

Now, after gathering the above information, i.e., effect size, anticipated variability, and power, we are ready to look for a method of calculating the sample size. Although one may use a calculator when an explicit formula exists, it is preferable to use a computer or online program so that one can determine exactly what additional information may be needed.

A sample paragraph for the sample size calculation in the supportive intervention example, where means are compared, is:

The main study objective was to determine if the supportive intervention reduces fatigue compared with usual care in patients prior to the 4th cycle of chemotherapy. Based on a pilot study, the anticipated SD was 32. In order to have 80% power to detect a difference of at least 5 units between the two groups of patients using a two-independent-samples t-test, with a maximum 5% chance of committing a false positive error, we need 644 subjects in each group.
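The 644 per group quoted above can be reproduced approximately in Python. This is a sketch, not the method used by any particular software: it assumes a two-sided test, equal group sizes, the normal-approximation formula, and a commonly used additive correction (z squared over 4) for the t-test. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group_means(delta, sd, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided, two-independent-samples t-test.

    delta: smallest difference worth detecting, in raw units
    sd:    anticipated common SD of the outcome
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    d = delta / sd                                   # standardized effect size
    n_normal = 2 * (z_alpha + z_beta) ** 2 / d ** 2  # normal approximation
    n_t = n_normal + z_alpha ** 2 / 4                # approximate t-test correction
    return math.ceil(n_t)


# Fatigue example: SD = 32, difference of 5 units, alpha = 0.05, power = 80%
print(n_per_group_means(delta=5, sd=32))  # 644
```

Note that the function takes the raw difference and the SD separately and standardizes internally, which makes explicit the distinction between raw and standardized effect sizes. Dedicated software based on the noncentral t distribution may differ from this approximation by a subject or so.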

A sample paragraph for the sample size calculation in the example on bed-chair pressure sensors, where proportions are compared, is:

The sample size calculation was based on the primary comparison of the proportion of subjects with physical restraint between patients using the bed-chair pressure sensors and patients not using any pressure sensors. A previous study reported that the proportion with physical restraint among patients not using pressure sensors was 30%. Then, in order to have 80% power to detect a reduction of at least 5 percentage points (i.e., from 30% to 25%) by Fisher's exact test, with a maximum 5% chance of committing a false positive error, we need 1289 subjects per group.
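A rough check of this figure is possible with closed-form approximations. The sketch below assumes the reduction is from 30% to 25%, a two-sided test, and equal group sizes. It uses the normal-approximation formula for two proportions, optionally followed by a continuity correction, which is often used as a stand-in for Fisher's exact test. It is not the exact calculation that produced 1289, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group_proportions(p1, p2, alpha=0.05, power=0.80, continuity=False):
    """Approximate per-group n for comparing two independent proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    diff = abs(p1 - p2)
    n = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
         + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / diff ** 2
    if continuity:
        # Continuity correction, approximating an exact-test calculation
        n = n / 4 * (1 + math.sqrt(1 + 4 / (n * diff))) ** 2
    return math.ceil(n)


# Pressure sensor example: 30% without sensors vs. 25% with sensors
print(n_per_group_proportions(0.30, 0.25))                   # 1251
print(n_per_group_proportions(0.30, 0.25, continuity=True))  # close to the quoted 1289
```

The uncorrected formula gives a somewhat smaller number than the exact test, while the continuity-corrected version lands within a few subjects of the quoted figure. This is one reason the planned test should be stated alongside the sample size.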

Note the difference in sample size between the two examples. In general, categorical outcomes demand a larger sample size than continuous outcomes.

Common Misconceptions

There are misconceptions in sample size calculation that are worth noting:

Sample size calculation guarantees treatment success or statistical significance.
An appropriate sample size calculation guarantees scientific rigor only. Events during the conduct of the study, such as loss to follow-up, missing values, and unexpectedly large variability, may reduce the effective sample size. Moreover, the treatment may simply not be effective!
Use of small, medium, and large effect sizes
For comparing means between two groups, Cohen (1988) defined the effect size as small, medium, and large when the standardized effect size is 0.2, 0.5, and 0.8, respectively. These values were obtained from extensive surveys of the literature. However, they have been misused in sample size calculation, since the sample sizes at these effect sizes are pre-determined. With a 5% level of significance and 80% power, the sample size is 394 per group for a small effect size, 64 per group for a medium effect size, and 26 per group for a large effect size, irrespective of all other factors. This practice pays no regard to whether the effect size and SD are meaningful. It is recommended to determine the effect size and SD separately (Lenth, 2006).
Post hoc power analysis
One may have been advised to conduct a post hoc power analysis when the results are negative (non-significant), in order to determine whether the non-significant result is due to an inadequate sample size. Post hoc power analysis is generally not recommended because the power computed from a non-significant observed difference is necessarily low (whereas the power computed from a significant difference is necessarily high). Rather, it is recommended to consider the corresponding confidence interval and the clinical significance of the size of the effect (Levine & Ensom, 2001). If the confidence interval falls entirely within the range in which a difference is not of clinical significance, we can conclude that there is no discernible effect. Otherwise, no conclusion can be made and a larger sample is needed.
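To illustrate the second misconception, the pre-determined sample sizes for Cohen's small, medium, and large effect sizes can be reproduced with a short sketch. It assumes a two-sided two-sample t-test, equal group sizes, and the same normal-approximation formula with an additive t-test correction. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n for a standardized effect size d."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2 + z_alpha ** 2 / 4)


# Cohen's conventional values fix the sample size regardless of the outcome
for label, d in [("small", 0.2), ("medium", 0.5), ("large", 0.8)]:
    print(label, n_per_group(d))  # small 394, medium 64, large 26

# Specifying the raw difference and SD separately ties n to the actual outcome:
# the same 5-unit difference needs very different n when the SD is 32 or 16
print(n_per_group(5 / 32), n_per_group(5 / 16))
```

Once a conventional label is chosen, nothing about the outcome, its units, or its variability enters the calculation, which is the concern raised above.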

Complications

We have gone through the basic principles of sample size calculation. However, complications can arise in practice that make the calculation difficult. The common ones are:

Multiplicity
It arises when we have more than one statistical comparison, for example, when there are multiple primary outcomes or more than two groups to be compared. Ignoring multiplicity will inflate the false positive error rate, the most undesirable consequence in most research studies.
A simple way to get around this is to use the Bonferroni adjustment. That is, we first count the number of comparisons to be made, say m. Then, we use alpha/m as the level of significance in the sample size calculation for each comparison, where alpha is the overall level of significance. The largest sample size is taken as the sample size for the study. Therefore, the more comparisons there are, the larger the sample size.
Repeated measurements
In this situation, we need to be careful in formulating what is being compared. Most of the difficulty encountered in the course of sample size calculation stems from the lack of a workable objective.
An objective of comparing the difference between two groups with repeated measurements is certainly not specific enough, since there can be many types of differences between the two groups, such as a difference in slope, at a particular time point, in the peak value over time, in the lowest value over time, in the time to achieve a good response, etc.
Once a specific objective is formulated, we may adopt the previous approaches in calculating the sample size. For example, if we focus on comparing the slope of change between two groups, we just need to obtain the relevant information on the slope and calculate the sample size based on a two-independent-samples t-test.
Cluster randomization
It occurs when we randomly allocate a group of individuals, rather than one individual at a time, to receive a certain treatment. This is sometimes desirable when the outcomes of individuals in a group may interfere with each other. For example, we may want to randomly allocate a whole class of students to receive a certain flu vaccine, since students in a class are expected to have more contact with one another. Ignoring this design in the sample size calculation will lead to an under-powered study (Hayes & Bennett, 1999).
In general, it is not straightforward to obtain a sample size for studies using cluster randomization. For a continuous outcome, one needs to estimate the intra-cluster correlation coefficient. For a binary outcome, one needs to know the coefficient of variation of the true proportions between clusters in a group.
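Two of the complications above can be sketched numerically. The code below is illustrative only. For multiplicity, it computes the approximate factor by which a two-group sample size grows when alpha is replaced by alpha/m, under the normal approximation. For cluster randomization with a continuous outcome, it uses the widely used design effect 1 + (cluster size - 1) x ICC, which assumes equal cluster sizes and is not stated in the text above. The function names, the cluster size of 20, and the ICC of 0.05 are all hypothetical.

```python
import math
from statistics import NormalDist


def bonferroni_inflation(m, alpha=0.05, power=0.80):
    """Approximate factor by which n grows when alpha is replaced by alpha/m."""
    z_beta = NormalDist().inv_cdf(power)
    z_plain = NormalDist().inv_cdf(1 - alpha / 2)
    z_adjusted = NormalDist().inv_cdf(1 - alpha / (2 * m))
    return ((z_adjusted + z_beta) / (z_plain + z_beta)) ** 2


def design_effect(cluster_size, icc):
    """Inflation factor for cluster randomization, assuming equal cluster sizes."""
    return 1 + (cluster_size - 1) * icc


base_n = 644  # per-group n from the fatigue example, with no complications

# Two primary comparisons instead of one (Bonferroni, m = 2)
n_two_comparisons = math.ceil(base_n * bonferroni_inflation(m=2))

# Hypothetical clusters of 20 subjects with an assumed ICC of 0.05
n_clustered = math.ceil(base_n * design_effect(cluster_size=20, icc=0.05))

print(n_two_comparisons, n_clustered)
```

Even a modest intra-cluster correlation can nearly double the required number of subjects, which is why ignoring the clustered design leaves a study under-powered.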