Korean J Leg Med. 2019 Aug;43(3):97-105. doi: 10.7580/kjlm.2019.43.3.97

Classification of Common Relationships Based on Short Tandem Repeat Profiles Using Data Mining

Affiliations
  • 1. Department of Statistics, Korea University, Seoul, Korea. jael@korea.ac.kr
  • 2. Product Development HQ, Dong-A ST, Seoul, Korea.
  • 3. Department of Forensic Medicine, Seoul National University College of Medicine, Seoul, Korea.
  • 4. Forensic Science Division 2, Supreme Prosecutor's Office, Seoul, Korea.

Abstract

We reviewed previous studies on identifying familial relationships with 22 short tandem repeat (STR) markers. For parent-child and full-sibling relationships, these markers yield high discrimination power and a relatively accurate cut-off value. For uncle-nephew and cousin pairs, however, the likelihood ratio (LR) method has low discrimination power. We therefore compared the LR ranking method with data mining techniques (e.g., logistic regression, linear discriminant analysis, diagonal linear discriminant analysis, diagonal quadratic discriminant analysis, K-nearest neighbor, classification and regression trees, support vector machines, random forest [RF], and penalized multivariate analysis) that can be applied to identify familial relationships, and we provide a guideline for choosing the most appropriate model in a given situation. RF was more accurate than the other methods, with accuracies of 99.99% for parent-child, 99.44% for full siblings, 90.34% for uncle-nephew, and 79.69% for first cousins.
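
To make the LR method named in the abstract concrete, the following is a minimal sketch of a textbook kinship likelihood ratio at a single autosomal STR locus, comparing a claimed relationship against "unrelated". It is not the authors' implementation. It assumes Hardy-Weinberg equilibrium, no mutation, no population-substructure correction, and independent loci. The function names and the allele frequencies in the usage note are hypothetical; the identity-by-descent (IBD) coefficients are the standard values for these four relationships.

```python
from math import prod

# IBD coefficients (k0, k1, k2): probability that two relatives share
# 0, 1, or 2 alleles identical by descent at an autosomal locus.
IBD = {
    "parent-child": (0.0, 1.0, 0.0),
    "full siblings": (0.25, 0.5, 0.25),
    "uncle-nephew": (0.5, 0.5, 0.0),
    "first cousins": (0.75, 0.25, 0.0),
}


def genotype_prob(g, freq):
    """Population probability of an unordered genotype under Hardy-Weinberg."""
    a, b = g
    return freq[a] ** 2 if a == b else 2 * freq[a] * freq[b]


def prob_given_one_ibd(g1, g2, freq):
    """P(g2 | g1, exactly one allele shared IBD)."""
    c, d = g2
    total = 0.0
    for shared in g1:  # each allele of g1 is the IBD allele with probability 1/2
        if c == d:
            p = freq[c] if shared == c else 0.0
        else:
            p = (freq[d] if shared == c else 0.0) + (freq[c] if shared == d else 0.0)
        total += 0.5 * p
    return total


def locus_lr(g1, g2, relationship, freq):
    """LR of the claimed relationship versus unrelated at one locus."""
    k0, k1, k2 = IBD[relationship]
    p2 = genotype_prob(g2, freq)
    same = sorted(g1) == sorted(g2)  # with two IBD alleles the genotypes must match
    return k0 + k1 * prob_given_one_ibd(g1, g2, freq) / p2 + k2 * (1.0 if same else 0.0) / p2


def profile_lr(pairs, relationship, freqs):
    """Combined LR: product of per-locus LRs (assumes independent loci)."""
    return prod(locus_lr(g1, g2, relationship, f) for (g1, g2), f in zip(pairs, freqs))


if __name__ == "__main__":
    # Hypothetical allele frequencies for one locus (illustration only).
    freq = {"12": 0.1, "13": 0.2, "14": 0.3, "15": 0.4}
    for rel in IBD:
        print(rel, locus_lr(("12", "13"), ("12", "14"), rel, freq))
```

For a pair sharing one allele of frequency p, such as 12/13 and 12/14 above, the parent-child LR reduces analytically to 1/(4p). When no allele is shared, the parent-child LR is 0 while the first-cousin LR stays at k0 = 0.75. That narrow range of per-locus LRs for distant relatives is consistent with the low discrimination power the abstract reports for uncle-nephew and cousin pairs.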

Keywords

Short tandem repeats; Kinship testing; Relationships; Likelihood ratio; Data mining

MeSH Terms

Classification*
Data Mining*
Discrimination (Psychology)
Forests
Humans
Logistic Models
Methods
Microsatellite Repeats*
Siblings
Support Vector Machine
Trees
