Artificial Intelligence for Academic Writing

  • How can we help writers communicate better? What is the difference between well and poorly written texts?
  • What makes an easy-to-read and logically written text? What are the underlying linguistic patterns of well-written texts?

Our R+D+i in Artificial Intelligence for Writing

To address our research questions on artificial intelligence for writing, we combine state-of-the-art machine learning techniques, applied linguistics research, and expert knowledge of scientific writing to develop new models, functions, and algorithms in the field of artificial intelligence.

We seek to comprehensively aid writers during the entire writing process. This goal will be achieved through our applied research, development, and innovation (R+D+i), merging the latest technological advances in artificial intelligence for writing with established writing guidelines.

Our R+D+i is embodied in WriteWise, a unique software platform that aims to modernize writing by reducing the time and effort writers need to finish their texts.

Research Lines in Artificial Intelligence for Academic Writing


Natural Language Processing Applied to Writing

We combine machine learning and computational linguistics within the framework of natural language processing, as applied to modelling and revising the writing process and scientific texts.

This line of research applies the following methodologies:

1. Novel approaches for representing textual data from scientific articles:

  • Word embeddings combined with deep/machine learning models for natural language processing tasks.
  • Graph-based representations

2. Novel computational approaches for analyzing scientific articles, with specific investigative focus on:

  • Discourse Segmentation
  • Automatic Punctuation Analysis
  • Rule-based Text Mining
  • Topic Modelling
  • Readability/Coherence Classification

Rhetoric-Discourse and Lexical-Grammar in Artificial Intelligence Applied to Writing

We use functional and applied discursive frameworks, combined with corpus analysis, computational linguistics, and natural language processing approaches, to empirically determine the discursive and linguistic norms and requirements of academic and scientific texts.

This line of research seeks to identify and comprehend the:

1. Communicative purposes and lexical-grammar features that constitute written texts in distinct scientific disciplines.

2. Textual and discursive foundations of academic and scientific texts.

Publications in Artificial Intelligence for Writing

▷ A novel machine learning model that guides graduate students to write more organized and structured texts

Javier Vera, Hector Allende-Cid, René Venegas, Sebastián Rodríguez, Wenceslao Palma, Sofía Zamora, Fernando Lillo, Humberto González, Ashley Van Cott, and Eduardo N. Fuentes. 2018. Molecular Biology of the Cell, 29:26

Academic writing is one of the most valuable skills a scientist can develop. A primary challenge for graduate students is to coherently and concisely organize and present ideas within a manuscript. Writing a quality research manuscript requires transmitting the most relevant information through precise sentences that fulfill diverse communicational roles, ultimately resulting in a coherent, understandable text connected by cohesive mechanisms (e.g. lexical relationships between pairs of terms). Despite technological advances, the execution and teaching of the writing process have not similarly advanced. Therefore, a top priority for graduate programs is to implement new methodologies and technologies that aid students in communicating research advances. Through our investigation, we developed a novel, unsupervised machine-learning model applied to cell biology and biomedical texts that guides students in writing better organized and more structured texts.

▷ Revealing the collaborative dynamics of large-scale arXiv text collection by means of k-shell decomposition

Javier Vera, Wenceslao Palma, Hector Allende, Sebastian Rodriguez, Juan Pavez, and Eduardo Fuentes. 2019. NetSci-X: International Conference on Network Science

This work showed how k-shell decomposition helps to understand the dynamics behind the formation of the decentralized and collaborative language community defined by the electronic repository arXiv. Our results suggest that several global patterns emerge from the microscopic activity of users sharing content. The growth of the collection of texts (and therefore of the associated networks) was almost completely governed by the outermost k-shells, which increased exponentially in size over time. By contrast, the size of the densest set of nodes (S_kmax) tends to increase linearly. This points to an exponential accumulation of words that forces changes in the main discipline (computer science, in our case), represented by S_kmax. These observations were confirmed by the behavior of the (normalized) critical index k* = arg max_k |S_k|, which shifts exponentially toward the outermost network layers. Further study should describe the relationship between the index k and the number of connected components of the k-shell S_k. Moreover, it is plausible that the decentralized features of arXiv appear precisely at those external layers.
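As an illustration of the technique named in this abstract, the following is a minimal sketch of k-shell decomposition on a word co-occurrence network. It is not the authors' implementation: the function names (`cooccurrence_graph`, `k_shells`, `critical_index`), the whitespace tokenization, and the adjacent-word linking rule are all assumptions made for the example, and the critical index is left unnormalized.

```python
from collections import defaultdict


def cooccurrence_graph(texts, window=2):
    """Build an undirected word graph (hypothetical helper).

    Two words are linked when they occur within `window` tokens of each
    other; the default window=2 links only adjacent words.
    """
    graph = defaultdict(set)
    for text in texts:
        tokens = text.lower().split()
        for i, word in enumerate(tokens):
            for other in tokens[i + 1 : i + window]:
                if other != word:
                    graph[word].add(other)
                    graph[other].add(word)
    return graph


def k_shells(graph):
    """Return {k: set of nodes in the k-shell S_k} by iterative pruning.

    Nodes whose remaining degree is <= k are removed repeatedly; the nodes
    removed at level k form the k-shell.
    """
    degree = {node: len(neighbours) for node, neighbours in graph.items()}
    remaining = set(graph)
    shells = defaultdict(set)
    k = 0
    while remaining:
        to_remove = [n for n in remaining if degree[n] <= k]
        if not to_remove:
            k += 1
            continue
        while to_remove:
            node = to_remove.pop()
            if node not in remaining:
                continue  # already pruned via another neighbour
            remaining.discard(node)
            shells[k].add(node)
            for neighbour in graph[node]:
                if neighbour in remaining:
                    degree[neighbour] -= 1
                    if degree[neighbour] <= k:
                        to_remove.append(neighbour)
    return dict(shells)


def critical_index(shells):
    """Unnormalized k* = arg max_k |S_k|: the index of the largest shell."""
    return max(shells, key=lambda k: len(shells[k]))


# Toy corpus: a triangle of words (a, b, c) plus one peripheral word (d).
graph = cooccurrence_graph(["a b c a", "a d"])
shells = k_shells(graph)
k_max = max(shells)  # index of the densest shell, S_kmax
```

On a real collection, one would rebuild the graph at successive time points and track how the shell sizes |S_k| and the critical index change as texts accumulate.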

Brayn Díaz, Juan Pavez, Sebastian Rodríguez, Wenceslao Palma, Hector Allende-Cid, Rene Venegas, and Eduardo N. Fuentes. 2019. 5th Workshop on Automatic Text and Corpus Processing.

We demonstrated the effectiveness of both the Universal Sentence Encoder (USE) and BioSentVec as methods for helping users identify and improve semantic similarity between sentences in biomedical texts. The shared tendencies between the models support sequential similarity as a metric for evaluating a text’s cohesion. With both methods, outliers can be easily spotted, and specific modifications to the sentences can then be made depending on the type of outlier.
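The idea of sequential similarity can be sketched as follows: embed each sentence, compute the cosine similarity between each pair of consecutive sentences, and flag pairs that fall below a cutoff. This is a simplified illustration only. The `embed` function below uses bag-of-words counts as a stand-in for USE or BioSentVec sentence vectors, and the `threshold` value and all function names are assumptions, not values or APIs from the study.

```python
import math
from collections import Counter


def embed(sentence):
    """Stand-in embedding (assumption): bag-of-words token counts.

    A real system would return a dense vector from a sentence encoder
    such as USE or BioSentVec instead.
    """
    return Counter(sentence.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def sequential_similarity(sentences):
    """Similarity between each sentence and the one that follows it."""
    vectors = [embed(s) for s in sentences]
    return [cosine(a, b) for a, b in zip(vectors, vectors[1:])]


def outliers(similarities, threshold=0.2):
    """Indices i where sentences i and i+1 fall below the (hypothetical) cutoff."""
    return [i for i, s in enumerate(similarities) if s < threshold]


sentences = [
    "the cell divides",
    "the cell grows",
    "markets fell sharply",
]
sims = sequential_similarity(sentences)
weak_links = outliers(sims)  # positions where cohesion appears to break
```

In this toy text, the first two sentences share vocabulary while the third does not, so the break between the second and third sentences is the one flagged for revision.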

Eduardo N. Fuentes, Hector Allende-Cid, Sebastián Rodríguez, Rene Venegas, Juan Pavez, Wenceslao Palma, Ismael Figueroa, Sofia Zamora, Brayn Diaz, and Ashley VanCott. 2020. Congreso Internacional de Lingüística Computacional y de Corpus.

WriteWise represents the first commercially available advanced platform that gives users help and feedback to improve the writing of scientific papers. This is made possible by advances in textual data representation at different linguistic levels (e.g. words, sentences), developed using cutting-edge machine-learning models and applied linguistics research.

Juan Pavez, Sebastián Rodríguez, and Eduardo N. Fuentes. 2020. Congreso Internacional de Lingüística Computacional y de Corpus.

One of the main challenges researchers face when writing scientific papers is structuring and organizing the content coherently, specifically at the rhetorical-discursive level. Modeling these types of text is difficult, and new computational approaches are necessary. Language model pre-training, which learns word representations from large amounts of unannotated text, has been shown to be effective for improving many natural language processing (NLP) tasks. Recent models have focused on learning context-dependent word representations, such as: 1) Embeddings from Language Models (ELMo) (Peters et al., 2018); 2) Generative Pretrained Transformer (GPT) (Radford et al., 2018); and 3) Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019). Specifically, BERT, which consists of a transformer architecture (Vaswani et al., 2017) that produces contextualized word representations, has shown state-of-the-art performance on several NLP benchmarks. Despite these advantages, BERT has been trained and tested mainly on datasets containing general-domain texts (e.g. Wikipedia). Therefore, its performance on other text genres, such as biomedical scientific papers, is not optimal. Recently, BioBERT, the first domain-specific BERT-based model pretrained on biomedical corpora (PubMed), has been shown to outperform previous models on biomedical NLP tasks (Lee et al., 2019). However, little research has applied these state-of-the-art language models at the rhetorical-discursive level, in particular to the challenging task of identifying rhetorical-discursive steps (i.e. functional linguistic units that fulfill a communicative purpose in a sentence). Therefore, the aim of this study was to test the accuracy of BioBERT on the classification of rhetorical-discursive steps in biomedical scientific papers.