Organizing an in-class hackathon to correct PDF-to-text conversion errors of Genomics & Informatics 1.0

Kim, Sunho; Kim, Royoung; Kim, Ryeo-Gyeong; Ko, Enjin; Kim, Han-Su; Shin, Jihye; Cho, Daeun; Jin, Yurhee; Bae, Soyeon; Jo, Ye Won; Jeong, San Ah; Kim, Yena; Ahn, Seoyeon; Jang, Bomi; Seong, Jiheyon; Lee, Yujin; Seo, Si Eun; Kim, Yujin; Kim, Ha-Jeong; Kim, Hyeji; Sung, Hye-Lynn; Lho, Hyoyoung; Koo, Jaywon; Chu, Jion; Lim, Juwon; Kim, Youngju; Lee, Kyungyeon; Lim, Yuri; Kim, Meongeun; Hwang, Seonjeong; Han, Shinhye; Bae, Sohyeun; Kim, Sua; Yoo, Suhyeon; Seo, Yeonjeong; Shin, Yerim; Kim, Yonsoo; Ko, You-Jung; Baek, Jihee; Hyun, Hyejin; Choi, Hyemin; Oh, Ji-Hye; Kim, Da-Young; Nam, Hee-Jo; Park, Hyun-Seok

Genomics Inform. 2020 Sep;18(3):e33. 10.5808/GI.2020.18.3.e33.

Kim S ¹
Kim R ¹
Kim RG ¹
Ko E ¹
Kim HS ¹
Shin J ¹
Cho D ¹
Jin Y ¹
Bae S ¹
Jo YW ¹
Jeong SA ¹
Kim Y ¹
Ahn S ¹
Jang B ¹
Seong J ¹
Lee Y ¹
Seo SE ¹
Kim Y ¹
Kim HJ ¹
Kim H ¹
Sung HL ¹
Lho H ¹
Koo J ¹
Chu J ¹
Lim J ¹
Kim Y ¹
Lee K ¹
Lim Y ¹
Kim M ¹
Hwang S ¹
Han S ¹
Bae S ¹
Kim S ¹
Yoo S ¹
Seo Y ¹
Shin Y ¹
Kim Y ¹
Ko YJ ¹
Baek J ¹
Hyun H ¹
Choi H ¹
Oh JH ¹
Kim DY ¹
Nam HJ ¹
Park HS ¹,²

Affiliations

¹Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
²Center for Convergence Research of Advanced Technologies, Ewha Womans University, Seoul 03760, Korea

KMID: 2506909
DOI: http://doi.org/10.5808/GI.2020.18.3.e33

Abstract

This paper describes a community effort to improve earlier versions of the full-text corpus of Genomics & Informatics by semi-automatically detecting and correcting PDF-to-text conversion errors and optical character recognition errors during the first Genomics & Informatics Annotation Hackathon (GIAH) event. Extracting text from multi-column biomedical documents such as Genomics & Informatics is notoriously difficult. The hackathon was piloted as part of a coding competition of the ELTEC College of Engineering at Ewha Womans University to enable researchers and students to create or annotate their own versions of the Genomics & Informatics corpus, to gain and create knowledge about corpus linguistics, and at the same time to acquire tangible and transferable skills. The projects proposed during the hackathon harness an internal database containing different versions of the corpus and annotations.

Keywords

biomedical text mining; corpus; text analytics
