Genomics Inform. 2019 Jun;17(2):e17. doi: 10.5808/GI.2019.17.2.e17.

OryzaGP: rice gene and protein dataset for named-entity recognition

Affiliations
  • 1. UMR DIADE, Institute of Research for Sustainable Development (IRD), F-34394 Montpellier, France. pierre.larmande@ird.fr
  • 2. ICT Lab, University of Science and Technology of Hanoi (USTH), 100000 Hanoi, Vietnam.
  • 3. Database Center for Life Science (DBCLS), Chiba 277-0871, Japan.

Abstract

Text mining has become an important research method in biology. Its original purpose was to extract biological entities, such as genes, proteins, and phenotypic traits, from scientific papers in order to extend knowledge. However, few thorough studies on text mining and application development have been performed for plant molecular biology data, especially for rice, resulting in a lack of datasets for named-entity recognition tasks in this species. Because few benchmarks are available for rice, we faced various difficulties in exploiting advanced machine learning methods for accurate analysis of the rice literature. To evaluate several approaches to automatically extracting information on gene/protein entities, we built a new dataset for rice as a benchmark. The dataset is composed of titles and abstracts downloaded from PubMed, taken from scientific papers focusing on rice. During the 5th Biomedical Linked Annotation Hackathon, a portion of the dataset was uploaded to PubAnnotation for sharing. Our ultimate goal is to offer a shared task on rice gene/protein name recognition through the BioNLP Open Shared Tasks framework using this dataset, to facilitate open comparison and evaluation of different approaches to the task.

Keywords

named-entity recognition; natural language processing; Oryza sativa; plant molecular biology; rice; text mining

MeSH Terms

Benchmarking
Biology
Data Mining
Dataset*
Machine Learning
Methods
Molecular Biology
Natural Language Processing
Oryza
Plants