Juan Pavez, Sebastián Rodríguez, and Eduardo N. Fuentes. 2020. Congreso Internacional de Lingüística Computacional y de Corpus.
One of the main challenges researchers face when writing scientific papers is structuring and organizing the content coherently, specifically at the rhetorical-discursive level. Modeling these types of texts is difficult, and new computational approaches are necessary. Language model pre-training, which learns word representations from large amounts of unannotated text, has proven effective for improving many natural language processing (NLP) tasks. Recent models have focused on learning context-dependent word representations, such as: 1) Embeddings from Language Models (ELMo) (Peters et al., 2018); 2) the Generative Pretrained Transformer (GPT) (Radford et al., 2018); and 3) Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019). In particular, BERT, which consists of a transformer architecture (Vaswani et al., 2017) that produces contextualized word representations, has achieved state-of-the-art performance on several NLP benchmarks. Despite these advantages, BERT has been trained and tested mainly on datasets containing general-domain texts (e.g. Wikipedia). Therefore, its performance on other text genres, such as biomedical scientific papers, is not optimal. Recently, BioBERT, the first domain-specific BERT-based model pre-trained on biomedical corpora (PubMed), has been shown to outperform previous models on biomedical NLP tasks (Lee et al., 2019). However, little research has applied these state-of-the-art language models at the rhetorical-discursive level, in particular to the challenging task of identifying rhetorical-discursive steps (i.e. the functional linguistic units that fulfill a communicative purpose in a sentence). Therefore, the aim of this study was to test the accuracy of BioBERT on the classification of rhetorical-discursive steps in biomedical scientific papers.
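The setup described above can be sketched as a standard sequence-classification head: the sentence's pooled [CLS] embedding from a BERT-style encoder is passed through a linear layer and a softmax over rhetorical-step labels. The sketch below uses only the standard library and a hypothetical, illustrative label set; the encoder itself, the specific labels, and the weight values are assumptions, not the authors' actual configuration.

```python
import math

# Hypothetical rhetorical-discursive step labels (illustrative only;
# the paper's actual label inventory is not specified here).
LABELS = ["background", "gap", "purpose", "method", "result", "conclusion"]

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(cls_embedding, weights, bias):
    """Linear classification head over a sentence's [CLS] embedding,
    as typically stacked on top of a BERT-style encoder during fine-tuning.

    cls_embedding: list of floats (the pooled sentence representation)
    weights: one row of floats per label, each the length of the embedding
    bias: one float per label
    """
    logits = [
        sum(w * x for w, x in zip(row, cls_embedding)) + b
        for row, b in zip(weights, bias)
    ]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs
```

In practice the encoder and head would be trained jointly (e.g. by fine-tuning a BioBERT checkpoint), so the weights here stand in for learned parameters.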