J Korean Med Sci. 2024 Jan;39(3):e35. doi: 10.3346/jkms.2024.39.e35.

Data Distribution: Normal or Abnormal?

Affiliations
  • 1Past President, World Association of Medical Editors (WAME)
  • 2Editorial Consultant, The Lancet
  • 3Associate Editor, Frontiers in Epidemiology

Abstract

Determining whether the frequency distribution of a given data set follows a normal distribution is among the first steps of data analysis. Visual examination of the data, commonly with a Q-Q plot, is accepted by many scientists but is considered subjective, and therefore unacceptable, by other researchers. The one-sample Kolmogorov-Smirnov test with Lilliefors correction (for a sample size ≥ 50) and the Shapiro-Wilk test (for a sample size < 50) are common statistical tests for checking the normality of a data set quantitatively. Because parametric tests, which assume that the data distribution is normal (Gaussian, bell-shaped), are generally more powerful than their non-parametric counterparts when that assumption holds, we commonly use transformations (e.g., log-transformation or Box-Cox transformation) to bring the frequency distribution of non-normally distributed data close to a normal distribution. Herein, I show how to work with these statistical methods in practice by examining real data sets.
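As a rough illustration of the Kolmogorov-Smirnov test with Lilliefors correction mentioned above, the following Python sketch computes the test statistic against a normal distribution whose mean and standard deviation are estimated from the sample, and approximates the p-value by Monte Carlo simulation. The function names, the simulation-based p-value, and the synthetic data are assumptions made for this example; they are not the article's own code or data. Statistical packages usually rely on tabulated or approximated critical values instead. For samples smaller than 50, a Shapiro-Wilk implementation such as `scipy.stats.shapiro` would be used.

```python
import random
from statistics import NormalDist, mean, stdev


def lilliefors_statistic(values):
    """Largest gap between the empirical CDF and a normal CDF whose
    mean and SD are estimated from the same sample."""
    xs = sorted(values)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = fitted.cdf(x)
        # The empirical CDF jumps from (i - 1)/n to i/n at x.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def lilliefors_p_value(values, n_sim=2000, seed=1):
    """Monte Carlo p-value (an assumption of this sketch): the share of
    simulated normal samples of the same size whose statistic is at
    least as large as the observed one."""
    rng = random.Random(seed)
    n = len(values)
    observed = lilliefors_statistic(values)
    hits = 0
    for _ in range(n_sim):
        # The statistic does not depend on location or scale, so
        # standard normal samples are sufficient.
        sim = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if lilliefors_statistic(sim) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_sim + 1)


if __name__ == "__main__":
    # Synthetic, positively skewed data (not taken from the article).
    rng = random.Random(42)
    sample = [rng.lognormvariate(0.0, 1.0) for _ in range(150)]
    d, p = lilliefors_p_value(sample, n_sim=500)
    print(f"D = {d:.3f}, Monte Carlo p = {p:.3f}")
```

A small p-value suggests that the sample departs from a normal distribution. The estimate becomes more precise as `n_sim` grows, at the cost of run time.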

Keywords

Biostatistics; Statistical Distributions; Data Analysis; Normal Distribution; Epidemiologic Methods

Figures

  • Fig. 1 Frequency distribution and Q-Q plot. (A) Frequency distribution of HBs Ag measured in 150 study participants taken from a previous study [13] (bell-shaped gray curve) along with the fitted normal distribution (having the same mean and standard deviation). (B) The Q-Q plot of the data implies that the distribution can be assumed to be normal. HBs Ag = hepatitis B surface antigen.

  • Fig. 2 Frequency distribution and Q-Q plot. (A) The frequency distribution of the PSA measured in 150 study participants taken from a previous study [20] (the highly positively skewed gray curve) along with the fitted normal distribution (having the same mean and standard deviation). (B) The Q-Q plot of the data also implies that the data does not have a normal distribution. PSA = prostate-specific antigen.

  • Fig. 3 Output of the one-sample Kolmogorov-Smirnov test with Lilliefors correction and the Shapiro-Wilk test from IBM® SPSS® Statistics ver. 26. Because the sample size was 150, the result of the first test is used. df = degrees of freedom.

  • Fig. 4 The same graphs as those in Fig. 2 after log-transformation of the PSA, that is, when log(PSA) is used instead of PSA. (A) The frequency distribution (gray curve) is now much closer to a normal distribution and (B) the points in the Q-Q plot lie close enough to the straight line to retain the assumption that the data distribution is normal. PSA = prostate-specific antigen.

  • Fig. 5 Frequency distributions before and after transformation. (A) The frequency distribution of SARS-CoV-2 IgG level measured in 40 study participants taken from a previous study [21] (gray curve). (B) Frequency distribution of the same data set after a Box-Cox transformation with λ = −1 (Eq. 2). SARS-CoV-2 = severe acute respiratory syndrome coronavirus 2, IgG = immunoglobulin G.
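
The transformations named in the captions of Fig. 4 and Fig. 5 can be sketched in a few lines of Python. Eq. 2 is not reproduced on this page, so the sketch assumes the standard one-parameter Box-Cox form, y = (x^λ − 1)/λ for λ ≠ 0 and y = ln(x) for λ = 0, which makes the log-transformation the special case λ = 0. The helper names and the synthetic values are illustrative assumptions and are not the study data.

```python
import math
from statistics import mean, pstdev


def box_cox(x, lam):
    """Standard one-parameter Box-Cox transform (assumed form of Eq. 2)."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(x)  # the log-transformation is the lam = 0 case
    return (x ** lam - 1.0) / lam


def skewness(values):
    """Population skewness: mean of the cubed standardized deviations."""
    m = mean(values)
    s = pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


if __name__ == "__main__":
    # Synthetic right-skewed values whose reciprocals are evenly spaced,
    # so that lam = -1 is a suitable choice for this toy data set.
    raw = [1.0 / (0.2 + 0.02 * k) for k in range(41)]
    transformed = [box_cox(x, -1) for x in raw]
    print("skewness before:", round(skewness(raw), 3))
    print("skewness after: ", round(skewness(transformed), 3))
```

With λ = −1 the transform reduces to 1 − 1/x. In practice λ is usually estimated from the data, for example by maximum likelihood, rather than fixed in advance.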


References

1. Lang TA, Altman DG. Basic statistical reporting for articles published in biomedical journals: the SAMPL Guidelines. In: Smart P, Maisonneuve H, Polderman A, editors. Science Editors’ Handbook. Exeter, UK: European Association of Science Editors;2013.
2. Habibzadeh F. Statistical data editing in scientific articles. J Korean Med Sci. 2017; 32(7):1072–1076. PMID: 28581261.
3. Misra DP, Zimba O, Gasparyan AY. Statistical data presentation: a primer for rheumatology researchers. Rheumatol Int. 2021; 41(1):43–55. PMID: 33201265.
4. Habibzadeh F. Common statistical mistakes in manuscripts submitted to biomedical journals. Eur Sci Ed. 2013; 39(4):92–94.
5. Habibzadeh F. How to report the results of public health research. J Public Health Emerg. 2017; 1:90.
6. Altman DG, Bland JM. Statistics notes: the normal distribution. BMJ. 1995; 310(6975):298. PMID: 7866172.
7. Shatz I. Assumption-checking rather than (just) testing: the importance of visualization and effect size in statistical diagnostics. Behav Res Methods. Forthcoming. 2023; DOI: 10.3758/s13428-023-02072-x.
8. Barker LE, Shaw KM. Best (but oft-forgotten) practices: checking assumptions concerning regression residuals. Am J Clin Nutr. 2015; 102(3):533–539. PMID: 26201816.
9. Casson RJ, Farmer LD. Understanding and checking the assumptions of linear regression: a primer for medical researchers. Clin Exp Ophthalmol. 2014; 42(6):590–596. PMID: 24801277.
10. Nielsen EE, Nørskov AK, Lange T, Thabane L, Wetterslev J, Beyersmann J, et al. Assessing assumptions for statistical analyses in randomised clinical trials. BMJ Evid Based Med. 2019; 24(5):185–189.
11. Hu Y, Plonsky L. Statistical assumptions in L2 research: a systematic review. Second Lang Res. 2019; 37(1):171–184.
12. Hoekstra R, Kiers HA, Johnson A. Are assumptions of well-known statistical techniques checked, and why (not)? Front Psychol. 2012; 3:137. PMID: 22593746.
13. Habibzadeh F, Roozbehi H. No need for a gold-standard test: on the mining of diagnostic test performance indices merely based on the distribution of the test value. BMC Med Res Methodol. 2023; 23(1):30. PMID: 36717791.
14. Sawilowsky SS. Misconceptions leading to choosing the t test over the Wilcoxon Mann-Whitney test for shift in location parameter. J Mod Appl Stat Methods. 2005; 4(2):598–600.
15. Lilliefors HW. On the Kolmogorov-Smirnov test for normality with mean and variance unknown. J Am Stat Assoc. 1967; 62(318):399–402.
16. Mishra P, Pandey CM, Singh U, Gupta A, Sahu C, Keshri A. Descriptive statistics and normality tests for statistical data. Ann Card Anaesth. 2019; 22(1):67–72. PMID: 30648682.
17. Gross J, Ligges U. nortest: Tests for Normality. R package version 1.0-4. The Comprehensive R Archive Network. 2015.
18. Wickham H. ggplot2: Elegant Graphics for Data Analysis. New York, NY, USA: Springer-Verlag;2016.
19. Limpert E, Stahel WA, Abbt M. Log-normal distributions across the sciences: keys and clues. Bioscience. 2001; 51(5):341–352.
20. Habibzadeh F, Habibzadeh P, Yadollahie M, Roozbehi H. On the information hidden in a classifier distribution. Sci Rep. 2021; 11(1):917. PMID: 33441644.
21. Habibzadeh F, Habibzadeh P, Yadollahie M, Sajadi MM. Determining the SARS-CoV-2 serological immunoassay test performance indices based on the test results frequency distribution. Biochem Med (Zagreb). 2022; 32(2):020705. PMID: 35799990.
22. Box GE, Cox DR. An analysis of transformations. J R Stat Soc B. 1964; 26(2):211–252.
23. Millard SP. EnvStats: An R Package for Environmental Statistics. New York, NY, USA: Springer;2013.
24. Lee DK. Data transformation: a focus on the interpretation. Korean J Anesthesiol. 2020; 73(6):503–508. PMID: 33271009.
Copyright © 2024 by Korean Association of Medical Journal Editors. All rights reserved.     E-mail: koreamed@kamje.or.kr