Healthc Inform Res. 2025 Jan;31(1):16-22. doi: 10.4258/hir.2025.31.1.16.

Feature Selection for Hypertension Risk Prediction Using XGBoost on Single Nucleotide Polymorphism Data

Affiliations
  • 1 Department of Informatics Engineering, Faculty of Computer Science, Brawijaya University, Malang, Indonesia
  • 2 Department of Biology, Faculty of Mathematics and Natural Sciences, Brawijaya University, Malang, Indonesia
  • 3 Department of Statistics, Faculty of Mathematics and Natural Sciences, Brawijaya University, Malang, Indonesia
  • 4 Department of Histology, Faculty of Medicine, Maranatha Christian University, Bandung, Indonesia

Abstract


Objectives
Hypertension, commonly known as high blood pressure, is a prevalent and serious condition that affects a large share of adults worldwide. It is a chronic condition that, if left untreated, can lead to severe complications, including kidney disease, heart disease, and stroke. This study aims to develop a feature selection model based on the XGBoost algorithm to identify specific single nucleotide polymorphisms (SNPs) that can serve as biomarkers of hypertension risk.
Methods
We used high-dimensional genetic variation data (i.e., SNPs) as markers of hypertension to build a classifier for risk prediction. The data came from the OpenSNP dataset, which includes 19,697 SNPs from 2,052 samples. Feature selection was performed with extreme gradient boosting (XGBoost), an ensemble machine learning method that builds its model incrementally, adjusting weights over a series of steps.
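The idea of boosting-based feature selection summarized in the Methods can be illustrated with a small sketch. This is not the authors' pipeline and not XGBoost itself: it is a simplified gradient boosting loop with single-split trees (decision stumps) under squared-error loss, without regularization or second-order terms, written only to show how features can be ranked by the split gain they accumulate across boosting steps. The toy genotype data (coded 0/1/2), the choice of SNPs 3 and 7 as the informative ones, and all function names and parameter values are assumptions made for illustration.

```python
import random


def best_stump(X, residuals):
    """Return (gain, feature, threshold, left_value, right_value) for the
    single split that most reduces the squared error of the residuals,
    or None if no valid split exists."""
    n = len(X)
    total = sum(residuals)
    baseline = total * total / n
    best = None
    for j in range(len(X[0])):
        for thr in (0, 1):  # genotypes are coded 0/1/2, so split on x <= thr
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            n_left, n_right = len(left), n - len(left)
            if n_left == 0 or n_right == 0:
                continue
            s_left = sum(left)
            s_right = total - s_left
            # Reduction in sum of squared errors achieved by this split.
            gain = s_left ** 2 / n_left + s_right ** 2 / n_right - baseline
            if best is None or gain > best[0]:
                best = (gain, j, thr, s_left / n_left, s_right / n_right)
    return best


def boosted_gain_importance(X, y, rounds=15, learning_rate=0.3):
    """Fit stumps to the current residuals, one per round, and accumulate
    each feature's split gain as its importance score."""
    mean_y = sum(y) / len(y)
    pred = [mean_y] * len(y)
    importance = [0.0] * len(X[0])
    for _ in range(rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = best_stump(X, residuals)
        if stump is None or stump[0] <= 1e-12:
            break  # nothing left to gain from a single split
        gain, j, thr, left_value, right_value = stump
        importance[j] += gain
        pred = [
            p + learning_rate * (left_value if row[j] <= thr else right_value)
            for row, p in zip(X, pred)
        ]
    return importance


def make_toy_genotypes(n_samples=400, n_snps=10, seed=0):
    """Hypothetical data: random genotypes; the label depends only on SNPs 3 and 7."""
    rng = random.Random(seed)
    X = [[rng.randint(0, 2) for _ in range(n_snps)] for _ in range(n_samples)]
    y = [1 if row[3] + row[7] >= 2 else 0 for row in X]
    return X, y


X, y = make_toy_genotypes()
importance = boosted_gain_importance(X, y)
ranked = sorted(range(len(importance)), key=lambda j: importance[j], reverse=True)
selected = ranked[:2]  # keep the two highest-ranked SNP indices
print("Selected SNP indices:", sorted(selected))
```

In this toy setting the two SNPs that drive the label should dominate the ranking, while the uninformative SNPs accumulate little or no gain. A real analysis would rely on the XGBoost library's own importance measures and a held-out evaluation rather than this simplified loop.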
Results
The experiments identified 292 SNPs that yielded high performance, with an F1-score of 98.55%, precision of 98.73%, recall of 98.38%, and overall accuracy of 98%. In predicting hypertension risk, the XGBoost feature selection method outperformed other representative feature selection methods, namely genetic algorithms, analysis of variance, chi-square, and principal component analysis.
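As a quick consistency check on the reported metrics, the F1-score can be recomputed as the harmonic mean of precision and recall. The snippet below is only a sketch using the standard F1 definition and the percentages quoted in the Results; the function name is an illustrative choice.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units in, same units out)."""
    return 2 * precision * recall / (precision + recall)


precision, recall = 98.73, 98.38  # percentages reported in the Results
f1 = f1_score(precision, recall)
print(f"F1 = {f1:.2f}%")
```

With these inputs the computed value rounds to 98.55%, which matches the reported F1-score.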
Conclusions
We developed a model for predicting hypertension from an SNP dataset. The XGBoost feature selection method effectively handled the high dimensionality of the SNP data and identified significant features as biomarkers. The results indicate high performance in predicting the risk of hypertension.

Keyword

Single Nucleotide Polymorphism; Hypertension; Prediction Methods, Machine; Genetics; Machine Learning

Figure

  • Figure 1. Block diagram of the general proposed method. XGBoost: extreme gradient boosting, ANOVA: analysis of variance, PCA: principal component analysis, GA: genetic algorithm, SNP: single nucleotide polymorphism, GWAS: genome-wide association studies.

  • Figure 2. Area under the curve (AUC) of the XGBoost feature selection and classifier.

  • Figure 3. Loss function of the XGBoost feature selection method and classifier.

