Statistics in Practice
Why Statistics
Statistics has enormous application in medicine, nursing, and everyday life, because we often need to make decisions under uncertainty. For example, we may not know whether a treatment works when we need to decide whether to take it, and we may not know whether we will pass a professional examination before we sit it.
Hence the need for statistics: a tool that helps us quantify and manage uncertainty so that we can make decisions with the best use of the available information. This is consistent with the current paradigm of evidence-based practice.
Running through such a process ...
Given: The infant mortality rate of a hospital in Hong Kong was 20 deaths per 1000 live births in the past year.
Question. Would you suggest that people give birth in this hospital? (No / Yes / Not sure)
Response. Not sure! We need to know what rate to expect, e.g., the overall infant mortality rate in Hong Kong.
Additional information: The infant mortality rate in Hong Kong was 6 deaths per 1000 live births.
Question. Now, would you suggest that people give birth in this hospital? (No / Yes / Not sure)
Response. Not sure! The hospital may just have been unlucky (e.g., more maternal morbidity) or may have had only a few live births in the past year (e.g., just 2 deaths out of 100 live births).
Additional information: There is no evidence that the live births in this hospital differed from those in other hospitals. Moreover, the 95% confidence intervals for the infant mortality rates of the hospital and of Hong Kong were (10, 30) and (5.5, 6.5) deaths per 1000 live births, respectively.
Question. Now, would you suggest that people give birth in this hospital? (No / Yes / Not sure)
Response. No, better not!
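For illustration, intervals of the kind quoted above can be approximated with a normal approximation to the Poisson death count. This is only a rough sketch (exact Poisson intervals are preferable for small counts), and the function name and figures are ours, not from the example:

```python
import math

def rate_ci_per_1000(deaths, births, z=1.96):
    """Approximate 95% CI for a mortality rate per 1000 live births,
    using a normal approximation to the Poisson death count."""
    rate = 1000 * deaths / births
    se = 1000 * math.sqrt(deaths) / births  # SE of the rate per 1000
    return rate - z * se, rate + z * se

# Hospital: 20 deaths in 1000 live births -> roughly (11, 29) per 1000,
# close to the (10, 30) quoted above.
lo, hi = rate_ci_per_1000(20, 1000)

# A hospital with 2 deaths in only 100 live births has the same rate (20)
# but a far wider interval, reflecting the small number of births.
lo_small, hi_small = rate_ci_per_1000(2, 100)
```

The second call illustrates why the number of live births mattered in the earlier response: the same rate from fewer births carries much more uncertainty.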
Moreover, a proper study design is essential for deriving a valid conclusion. A poorly designed study may not be rescued at the analysis stage.
Here is an example modified from a real study:
Given: A study was conducted to examine whether calcium intake is a risk factor for high blood pressure. The investigator decided to adopt a case-control study design. Specifically, the investigator sampled two groups of subjects: Group A consisted of hypertensive patients, and Group B consisted of subjects with neither hypertension nor calcium intake.
Question. Can we determine whether calcium intake is a risk factor for high blood pressure? (No / Yes / Not sure)
Response. No!
Given: The investigator then sampled another group (C) of subjects, none of whom had hypertension.
Question. Now, is there a way to do the analysis? (No / Yes / Not sure)
Response. Yes, by comparing Groups A and C. If statistically more subjects in Group A had calcium intake than in Group C, we may say that calcium intake is a risk factor for high blood pressure.
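A comparison of Groups A and C in a case-control design is usually summarised by an odds ratio from a 2×2 table. The sketch below uses entirely hypothetical counts; the function and figures are ours, not from the study:

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and approximate 95% CI (Woolf's method) for a 2x2 table:
    a = exposed cases,    b = unexposed cases,
    c = exposed controls, d = unexposed controls."""
    or_ = (a * d) / (b * c)
    se_log = math.sqrt(1/a + 1/b + 1/c + 1/d)
    lo = math.exp(math.log(or_) - z * se_log)
    hi = math.exp(math.log(or_) + z * se_log)
    return or_, lo, hi

# Hypothetical: 40 of 100 cases (Group A) had calcium intake,
# versus 25 of 100 controls (Group C).
or_, lo, hi = odds_ratio_ci(40, 60, 25, 75)
```

A confidence interval that excludes 1 would suggest an association between the exposure and the disease; formal analysis would, of course, also involve a significance test and adjustment for confounders.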
Phases of Analysis
A statistical analysis may be viewed as a process in four phases:

Data preparation / Clinical data management

Descriptive statistics

Inferential statistics

Presentation of results
Data preparation / Clinical data management
Clinical data management is a vital vehicle in clinical studies: it ensures the integrity and quality of the data transferred from the study subjects to a computer database system (Fong, 2001). Poor data quality during data collection generally results from poor management and study design. The most irresponsible source of error, however, is the assumption that the data come error-free.
Statisticians are certainly frustrated by poor data, and data with excessive missing values, outliers, and errors may lead to biased conclusions that mislead the study investigators and the public. To minimize data errors, it is best to prepare as early as the design stage. Data collection forms should be well designed; in particular, the use of referential questions should be minimized. Moreover, all data forms should be trial-tested on the people who will eventually complete them before they are administered.
Data errors may also arise from data miswritten on the collection forms. It is therefore good practice to monitor the data, checking for discrepancies between the source data and the data recorded on the forms. Having said that, this is only possible when we have enough resources!
Before entering data from paper forms into a computer file, it is ideal to construct a screen input form with real-time data checking. This can be done in software such as SAS, Epi Info, and Microsoft Access. Otherwise, standardized templates created in SPSS, Microsoft Excel, Google Sheets, or other spreadsheet software may be used. Whichever method is used for data entry, it should be pilot-tested.
The data entry clerk should be briefed and given the liberty to raise questions during data entry. Categorical data such as gender, race, and educational level are often coded before being entered into a database in order to ease analysis, so a coding dictionary or equivalent should be prepared. In addition, provision should be made for entering data exactly as recorded on the data collection forms; unnecessary calculation by the data entry clerk, whether by hand, by electronic device, or in their head, should be avoided. For instance, for quality of life questionnaires, even though we only use the summary scores for assessment rather than the responses to individual items, it is better to enter the item responses and calculate the summary scores later in a computer program.
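As a small illustration of the last point, a summary score is best derived in code from the entered item responses rather than computed by the data entry clerk. The questionnaire here is hypothetical (five items, a simple sum); real instruments have their own scoring rules:

```python
def summary_score(items):
    """Sum of item responses; returns None if any item is missing,
    so missingness stays visible instead of being guessed at entry."""
    if any(i is None for i in items):
        return None
    return sum(items)

complete = summary_score([3, 4, 2, 5, 4])     # all items answered
incomplete = summary_score([3, None, 2, 5, 4])  # one item missing
```

Keeping the item responses also lets us recompute the score later if a scoring error is found, which is impossible if only hand-computed totals were entered.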
Completing data entry is not the end of data management but the point at which we start to face various data issues. Although few datasets, if any, can be perfectly cleaned, we should attempt data cleaning as far as possible. Again, we should not assume the data come error-free.
In general, data errors may occur in different ways. Common data errors, and how they may be detected, include:

Illogical data. A common method is a range check, to see whether there are values outside the plausible range. Others include checking for relational conflicts, such as the existence of pregnancy results for men, and for outliers, i.e., values that are much larger or much smaller than the rest.

Dates. A data type that often contains errors. Subjects may have missing or implausible days, months, or years. Moreover, although we are moving away from the Y2K bug (a year entered as 03 being interpreted as 1903!), there is still a chance of encountering it when only the last two digits of a year are entered.

Duplicates. It is in fact no surprise to find a subject's data entered more than once, especially in studies where more than one person performs the data entry.

Missing values. These may be due to data missing from the data collection forms or to data that were simply not entered. We should therefore also be conscious of the second source of missing values.
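The checks above can be scripted. The sketch below runs a range check, a relational-conflict check, a duplicate check, and a missing-value check over a few hypothetical records; the field names and the plausible age range are illustrative assumptions:

```python
def find_issues(rows, age_range=(0, 120)):
    """Flag common data errors: duplicate ids, missing values,
    out-of-range values, and relational conflicts."""
    issues, seen_ids = [], set()
    for r in rows:
        if r["id"] in seen_ids:
            issues.append((r["id"], "duplicate id"))
        seen_ids.add(r["id"])
        if r["age"] is None:
            issues.append((r["id"], "missing age"))
        elif not age_range[0] <= r["age"] <= age_range[1]:
            issues.append((r["id"], "age out of range"))
        if r["sex"] == "M" and r["pregnant"] == "Y":
            issues.append((r["id"], "pregnancy result for a man"))
    return issues

records = [
    {"id": 1, "sex": "M", "age": 34,   "pregnant": "Y"},  # relational conflict
    {"id": 2, "sex": "F", "age": 210,  "pregnant": "N"},  # out-of-range value
    {"id": 2, "sex": "F", "age": 28,   "pregnant": "N"},  # duplicate id
    {"id": 3, "sex": "F", "age": None, "pregnant": "N"},  # missing value
]
issues = find_issues(records)
```

In practice such flags are reported back for checking against the source documents rather than "fixed" automatically.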
Eventually, we do hope to have a decent set of data for subsequent analysis.
Descriptive statistics
This is the time when we want to characterize our sample. The choice of descriptive statistics depends on the data types.
Nominal data, or ordinal data with few levels (say <5), should be summarized by frequency and percentage. Ordinal data with many levels, and quantitative data, may be summarized by either the median and interquartile range (IQR) or the mean and standard deviation (SD). Ideally, if a Normal distribution appears to fit the data, the mean and SD are preferable, as they are often more easily handled mathematically. In practice, it is worth at least a rough check: examine the difference between the mean and the median. A large difference suggests reporting the median and IQR; otherwise, report the mean and SD.
In any case, we also need to report, for each variable, the number of subjects with values obtained. Conclusions about variables with many missing values are rarely trustworthy, no matter what analysis is performed.
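The mean-vs-median check can be sketched with the Python standard library. The data here are hypothetical (lengths of hospital stay, in days), and what counts as a "large" difference remains a matter of judgement:

```python
import statistics

def describe(values):
    """Both kinds of summary for a quantitative variable, so the
    mean and the median can be compared before choosing which to report."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "median": statistics.median(values),
        "iqr": q3 - q1,
    }

# A skewed variable: one long stay pulls the mean well above the median.
d = describe([2, 3, 3, 4, 5, 6, 30])
# mean (~7.6) far from median (4) -> report the median and IQR instead.
```

Note that `n` is reported alongside the summaries, in line with the point above about missing values.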
Inferential statistics
Since we usually operate on samples rather than the populations of interest, we need to infer from the results derived from a sample to the population. The statistical methods to be used vary widely across situations, and one needs good knowledge as well as experience to be confident and competent in formulating an analysis strategy.
The p-value is inevitably an important statistical tool for decision making. Loosely speaking, a p-value is the probability, if the null hypothesis were true, of observing results at least as extreme as those obtained. We therefore reject the null hypothesis when this probability is acceptably small.
The American Statistical Association (ASA) has issued six statements on the p-value (Wasserstein & Lazar, 2016):

P-values can indicate how incompatible the data are with a specified statistical model.

P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.

Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.

Proper inference requires full reporting and transparency.

A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.

By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
Presentation of results
Presentation of your findings may influence the way they are interpreted. Altman & Bland (1996) discussed some poor presentations of numerical data.
In particular, p-values are often poorly presented. It is preferable to report the numerical value of a p-value to three decimal places, which minimizes ambiguity when comparing it with the level of significance (e.g., against 0.05, a p-value of 0.046 versus one of 0.053).
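A small helper illustrating the three-decimal-place convention; the "<0.001" floor for very small values is a common journal style, assumed here rather than prescribed by the text:

```python
def format_p(p):
    """Report a p-value to three decimal places; very small values
    are conventionally floored at '<0.001' rather than shown as 0.000."""
    return "<0.001" if p < 0.001 else f"{p:.3f}"

format_p(0.046)   # '0.046' -> clearly below 0.05
format_p(0.053)   # '0.053' -> clearly above 0.05
format_p(0.0004)  # '<0.001'
```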
In general, there is no definite required number of decimal places for a descriptive statistic. Nevertheless, it has been suggested that summary measures such as mean, median, standard deviation, etc. should not be given to more than one extra decimal place over the raw data.
For correlations, it is generally sufficient to have at most 2 decimal places.
Of course, we also need to comply with the requirements of the journal in which we want to report our results!
A Painful Reality
Although the four phases of analysis described above often occur sequentially, in practice they are better viewed as a cyclic process.
Our experience tells us that we may have to go back to data cleaning whenever new data errors are identified. Worse still, we may, with a sudden instinct, discover a new data error while presenting our results! We then have to go back to the beginning of the process and redo everything (see the figure below).
So, which stage of the process takes most of the time? For reference, in one project we conducted, we spent around two weeks on data cleaning but only two days on the actual analysis.