Genomics Inform. 2018 Sep;16(3):75-77. doi: 10.5808/GI.2018.16.3.75.

GNI Corpus Version 1.0: Annotated Full-Text Corpus of Genomics & Informatics to Support Biomedical Information Extraction

Affiliations
  • 1 Bioinformatics Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea. E-mail: neo@ewha.ac.kr
  • 2 Center for Convergence Research of Advanced Technologies, Ewha Womans University, Seoul 03760, Korea.

Abstract

Genomics & Informatics (NLM title abbreviation: Genomics Inform) is the official journal of the Korea Genome Organization. A text corpus of this journal, annotated with various levels of linguistic information, would be a valuable resource, because information extraction requires syntactic, semantic, and higher levels of natural language processing. In this study, we publish a new corpus, GNI Corpus version 1.0, extracted and annotated from the full texts of Genomics & Informatics using an NLTK (Natural Language Toolkit)-based text mining script. This preliminary version of the corpus could be used as a training and testing set for systems that serve a variety of functions in future biomedical text mining.
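
As a rough illustration of what annotating text at more than one linguistic level can look like, the following minimal sketch records a sentence layer and a token layer with character offsets. It is not the authors' script and does not reflect the actual GNI Corpus format, which the abstract does not describe. The `Token` and `Sentence` classes and the `annotate` function are hypothetical names, and simple regular expressions stand in for the NLTK sentence splitter and tokenizer so that the example runs with the Python standard library alone.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Token:
    text: str
    start: int  # character offset into the source text (inclusive)
    end: int    # character offset into the source text (exclusive)


@dataclass
class Sentence:
    text: str
    start: int
    end: int
    tokens: list = field(default_factory=list)


# Assumption: naive patterns used as stand-ins for a trained sentence
# splitter and word tokenizer. They will mis-split abbreviations such
# as "et al." and decimal numbers.
SENTENCE_RE = re.compile(r"[^.!?]+[.!?]?")
TOKEN_RE = re.compile(r"\w+(?:-\w+)*|[^\w\s]")


def annotate(text):
    """Return a list of Sentence objects, each carrying its Token layer."""
    sentences = []
    for match in SENTENCE_RE.finditer(text):
        raw = match.group()
        stripped = raw.strip()
        if not stripped:
            continue
        # Skip leading whitespace so offsets point at the first real character.
        start = match.start() + (len(raw) - len(raw.lstrip()))
        sentence = Sentence(stripped, start, start + len(stripped))
        for tok in TOKEN_RE.finditer(stripped):
            sentence.tokens.append(
                Token(tok.group(), start + tok.start(), start + tok.end())
            )
        sentences.append(sentence)
    return sentences
```

Calling `annotate("BRCA1 is a tumor-suppressor gene. It repairs DNA.")` yields two `Sentence` records whose token offsets index back into the original string. Keeping offsets rather than copies of the text is one common way to let further layers, such as part-of-speech tags or entity labels, be added later without altering the source.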

Keywords

biomedical text mining; corpus linguistics; text analytics

MeSH Terms

Data Mining
Genome
Genomics*
Informatics*
Information Storage and Retrieval*
Korea
Linguistics
Natural Language Processing
Semantics