Feature Selection for Hypertension Risk Prediction Using XGBoost on Single Nucleotide Polymorphism Data

Muflikhah, Lailil; Fatyanosa, Tirana Noor; Widodo, Nashi; Perdana, Rizal Setya; Solimun; Ratnawati, Hana

Healthc Inform Res. 2025 Jan;31(1):16-22. doi: 10.4258/hir.2025.31.1.16.

Affiliations

¹Department of Informatics Engineering, Faculty of Computer Science, Brawijaya University, Malang, Indonesia
²Department of Biology, Faculty of Mathematics and Natural Sciences, Brawijaya University, Malang, Indonesia
³Department of Statistics, Faculty of Mathematics and Natural Sciences, Brawijaya University, Malang, Indonesia
⁴Department of Histology, Faculty of Medicine, Maranatha Christian University, Bandung, Indonesia

KMID: 2564348
DOI: http://doi.org/10.4258/hir.2025.31.1.16

Abstract

Objectives
Hypertension (high blood pressure) is a prevalent chronic condition that affects a large share of the adult population worldwide. If left untreated, it can lead to severe health complications, including kidney problems, heart disease, and stroke. This study aims to develop a feature selection model based on the XGBoost algorithm to identify single nucleotide polymorphisms (SNPs) that can serve as biomarkers for hypertension risk.
Methods
We built a classifier for hypertension prediction from high-dimensional genetic variation data, using SNPs as markers for hypertension in patients. We used the OpenSNP dataset, which includes 19,697 SNPs from 2,052 samples. Extreme gradient boosting (XGBoost), an ensemble machine learning method that builds its model incrementally by adjusting weights over a series of steps, was employed for feature selection.
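The idea behind boosting-based feature selection can be illustrated with a small, self-contained sketch. It is not the authors' pipeline and does not call the XGBoost library: it is a simplified stand-in that boosts depth-1 trees (stumps) with a logistic loss and scores each SNP by the total XGBoost-style regularized split gain it contributes. The genotype coding (0/1/2), the synthetic data, the function names, and all hyperparameters are assumptions made for illustration only.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def boosted_stump_importance(X, y, n_rounds=20, lr=0.3, lam=1.0):
    """Boost depth-1 trees and return the accumulated split gain per feature."""
    n, d = len(X), len(X[0])
    margin = [0.0] * n          # current raw (log-odds) prediction per sample
    importance = [0.0] * d      # total gain credited to each feature
    for _ in range(n_rounds):
        p = [sigmoid(m) for m in margin]
        g = [pi - yi for pi, yi in zip(p, y)]   # first-order gradients
        h = [pi * (1.0 - pi) for pi in p]       # second-order gradients
        G, H = sum(g), sum(h)
        best = None  # (gain, feature, threshold, left weight, right weight)
        for j in range(d):
            for t in (0, 1):    # candidate split: genotype <= t versus > t
                GL = sum(g[i] for i in range(n) if X[i][j] <= t)
                HL = sum(h[i] for i in range(n) if X[i][j] <= t)
                GR, HR = G - GL, H - HL
                gain = (GL * GL / (HL + lam) + GR * GR / (HR + lam)
                        - G * G / (H + lam))
                if best is None or gain > best[0]:
                    best = (gain, j, t, -GL / (HL + lam), -GR / (HR + lam))
        gain, j, t, w_left, w_right = best
        importance[j] += gain
        for i in range(n):
            margin[i] += lr * (w_left if X[i][j] <= t else w_right)
    return importance


def select_top_k(importance, k):
    """Return the indices of the k features with the largest total gain."""
    ranked = sorted(range(len(importance)),
                    key=lambda j: importance[j], reverse=True)
    return sorted(ranked[:k])


# Synthetic example (hypothetical data): 30 SNPs, of which only
# columns 3 and 17 determine the label.
rng = random.Random(0)
n_samples, n_snps = 200, 30
X = [[rng.randint(0, 2) for _ in range(n_snps)] for _ in range(n_samples)]
y = [1 if row[3] + row[17] >= 2 else 0 for row in X]

importance = boosted_stump_importance(X, y)
selected = select_top_k(importance, k=2)
print("Selected SNP columns:", selected)
```

With real data, one would more likely fit the XGBoost library's `XGBClassifier` and rank SNPs by its `feature_importances_` attribute; the cutoff that leads to a particular number of retained SNPs is a modelling choice that the abstract does not specify.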
Results
Feature selection identified 292 SNPs that yielded high predictive performance: an F1-score of 98.55%, precision of 98.73%, recall of 98.38%, and overall accuracy of 98%. The XGBoost feature selection method outperformed other representative feature selection methods (genetic algorithms, analysis of variance, chi-square, and principal component analysis) in predicting hypertension risk.
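For readers less familiar with these metrics, the short sketch below shows how precision, recall, and F1-score relate to one another. It also checks that the reported F1-score is the harmonic mean of the reported precision and recall. The confusion-matrix counts in the second part are hypothetical and serve only to exercise the functions; the study's actual counts are not given in the abstract.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are true positives."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of actual positives that are correctly predicted."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Consistency check on the values reported in the abstract.
reported_precision = 0.9873
reported_recall = 0.9838
implied_f1 = f1_score(reported_precision, reported_recall)
print(f"F1 implied by reported precision and recall: {implied_f1:.2%}")

# Hypothetical confusion-matrix counts, for illustration only.
tp, fp, fn = 90, 10, 30
example_p = precision(tp, fp)
example_r = recall(tp, fn)
example_f1 = f1_score(example_p, example_r)
print(f"precision={example_p:.3f} recall={example_r:.3f} F1={example_f1:.3f}")
```

The implied value agrees with the reported F1-score of 98.55% to two decimal places, so the three reported figures are mutually consistent.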
Conclusions
We developed a model for predicting hypertension from SNP data. The XGBoost feature selection method effectively handled the high dimensionality of the SNP data and identified significant features as biomarkers. The resulting model predicted hypertension risk with high performance.

Keywords

Single Nucleotide Polymorphism; Hypertension; Prediction Methods, Machine; Genetics; Machine Learning

Figure

Figure 1 Block diagram of the general proposed method. XGBoost: extreme gradient boosting, ANOVA: analysis of variance, PCA: principal component analysis, GA: genetic algorithm, SNP: single nucleotide polymorphism, GWAS: genome-wide association studies.
Figure 2 Area under the curve (AUC) of the XGBoost feature selection and classifier.
Figure 3 Loss function of the XGBoost feature selection method and classifier.

