Artificial Intelligence for Science
🔹 How can we help scientists write and communicate their research better? What is the difference between well and poorly written papers?
🔹 What makes an easy-to-read and logically written paper? What are the underlying linguistic patterns of well-written papers?
🔹 How can we apply automation to help academic publishers make the review process faster and more efficient?
🔹 How can we make the review process of scientific texts more objective? Can papers be evaluated based on quantified factors of writing quality?
Our R+D+i in AI for Science
To answer these research questions, we are applying state-of-the-art machine learning techniques, applied linguistic research, and expert knowledge of scientific writing to develop new models, functions, and algorithms. In short, we are applying AI for science.
We seek to comprehensively aid researchers during the entire writing process. This goal will be achieved through our applied research, development, and innovation (R+D+i), merging the latest technological advances with established writing guidelines. Our R+D+i is embodied in WriteWise, a unique software platform that will modernize scientific writing by reducing the time and effort required of researchers when writing and of journals and academic publishers when reviewing manuscript submissions.
To learn more about the team behind our research into AI for Science, visit Our Team.
for Natural Language Processing
Applied to Scientific Writing
We combine machine learning and computational linguistics within the framework of natural language processing, as applied to modelling and revising the writing process and scientific texts. This line of research applies the following methodologies:
1. Novel approaches for representing textual data from scientific articles:
- Word embeddings combined with deep/machine learning models for natural language processing tasks.
- Graph-based representations
2. Novel computational approaches for analyzing scientific articles, with specific investigative focus on:
- Discourse Segmentation
- Automatic Punctuation Analysis
- Rule-based Text Mining
- Topic Modelling
- Readability/Coherence Classification
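As an illustration of the graph-based representations listed above, the following minimal sketch builds a word co-occurrence graph from a handful of sentences. It is only one possible reading of the idea, not the method used in our research: the function name `cooccurrence_graph`, the naive tokenizer, and the choice of sentence-level co-occurrence are all assumptions made for the example.

```python
import re
from collections import defaultdict
from itertools import combinations


def cooccurrence_graph(sentences):
    """Build an undirected, weighted word co-occurrence graph.

    Nodes are lowercase word tokens. An edge links two distinct words that
    appear in the same sentence; its weight is the number of sentences in
    which both words occur. Edges are keyed by alphabetically sorted pairs.
    """
    weights = defaultdict(int)
    for sentence in sentences:
        # Naive tokenizer (an assumption): runs of ASCII letters only.
        tokens = sorted(set(re.findall(r"[a-z]+", sentence.lower())))
        for u, v in combinations(tokens, 2):
            weights[(u, v)] += 1
    return dict(weights)


# Toy input, invented for the example.
sentences = [
    "Cells divide rapidly.",
    "Cells divide under stress.",
]
graph = cooccurrence_graph(sentences)
for (u, v), weight in sorted(graph.items()):
    print(u, v, weight)
```

A graph built this way can then be analyzed with standard network measures, for example by decomposing it into shells of increasingly connected words.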
Applied to Scientific Articles
We use functional and applied discursive frameworks, combined with corpus analysis, computational linguistics, and natural language processing approaches, to empirically determine the discursive and linguistic norms and requirements of academic and scientific texts. This line of research seeks to identify and comprehend the:
1. Communicative purposes and lexico-grammatical features that constitute written texts in distinct scientific disciplines.
2. Textual and discursive foundations of academic and scientific texts.
Javier Vera, Hector Allende-Cid, René Venegas, Sebastián Rodríguez, Wenceslao Palma, Sofía Zamora, Fernando Lillo, Humberto González, Ashley Van Cott, and Eduardo N. Fuentes. 2018. Molecular Biology of the Cell, 29:26.
Academic writing is one of the most valuable skills a scientist can develop. A primary challenge for graduate students is to coherently and concisely organize and present ideas within a manuscript. Writing a quality research manuscript requires transmitting the most relevant information through precise sentences that fulfill diverse communicative roles, ultimately resulting in a coherent, understandable text connected by cohesive mechanisms (e.g. lexical relationships between pairs of terms). Despite technological advances, the practice and teaching of the writing process have not advanced at the same pace. Therefore, a top priority for graduate programs is to implement new methodologies and technologies that aid students in communicating research advances. Through our investigation, we developed a novel, unsupervised machine-learning model applied to cell biology and biomedical texts that guides students in writing better organized and more structured texts.
Javier Vera, Wenceslao Palma, Hector Allende, Sebastian Rodriguez, Juan Pavez, and Eduardo Fuentes. 2019. NetSci-X: International Conference on Network Science.
This work showed how k-shell decomposition helps to explain the dynamics behind the formation of the decentralized and collaborative language community defined by the electronic repository arXiv. Our results suggest that several global patterns emerge from the microscopic activity of users sharing content. The growth of the collection of texts (and therefore of the associated networks) was almost completely governed by the outermost k-shells, which grew exponentially in size over time. In contrast, the densest set of nodes ($S_{k_{\max}}$) tends to grow linearly. This points to an exponential accumulation of words that forces changes in the main discipline (computer science, in our case), represented by $S_{k_{\max}}$. These observations were confirmed by the behavior of the (normalized) critical index $k^* = \arg\max_k |S_k|$, which shifts exponentially toward the outermost network layers. Further study should describe the relationship between the index $k$ and the number of connected components of the k-shell $S_k$. Moreover, it is plausible that the decentralized features of arXiv appear precisely in those external layers.
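For readers unfamiliar with k-shell decomposition, the sketch below shows the standard peeling procedure on a small toy graph, together with the index of the largest shell, k* = arg max_k |S_k|. It is an illustrative sketch and not the implementation used in the study: the helper names (`build_adjacency`, `k_shell_decomposition`, `critical_index`) and the example edges are assumptions, and the normalization of the critical index is omitted.

```python
from collections import Counter, defaultdict


def build_adjacency(edges):
    """Turn a list of undirected edges into {node: set of neighbours}."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return dict(adjacency)


def k_shell_decomposition(adjacency):
    """Return {node: k}, where k is the shell the node belongs to.

    A node is in shell k if it is part of the k-core (every node has at
    least k neighbours inside the core) but not part of the (k+1)-core.
    """
    degree = {node: len(nbrs) for node, nbrs in adjacency.items()}
    remaining = set(adjacency)
    shell = {}
    k = 0
    while remaining:
        # Peel every node whose remaining degree is at most k. Removals
        # lower the degree of neighbours, so repeat before raising k.
        peel = [node for node in remaining if degree[node] <= k]
        if not peel:
            k += 1
            continue
        for node in peel:
            shell[node] = k
            remaining.discard(node)
            for nbr in adjacency[node]:
                if nbr in remaining:
                    degree[nbr] -= 1
    return shell


def critical_index(shell):
    """Index k of the largest shell; ties go to the smallest k."""
    sizes = Counter(shell.values())
    return max(sorted(sizes), key=sizes.get)


# Toy network (invented): a triangle a-b-c with a few peripheral nodes.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("a", "d"), ("d", "e"), ("c", "f"), ("c", "g")]
shell = k_shell_decomposition(build_adjacency(edges))
k_max = max(shell.values())
k_star = critical_index(shell)
print(shell, k_max, k_star)
```

In this toy graph the triangle forms the densest shell, while the peripheral nodes make up the largest shell, so the index of the largest shell differs from the index of the densest one.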
Brayn Díaz, Juan Pavez, Sebastian Rodríguez, Wenceslao Palma, Hector Allende-Cid, Rene Venegas, and Eduardo N. Fuentes. 2019. 5th Workshop on Automatic Text and Corpus Processing.
We demonstrated the effectiveness of both the Universal Sentence Encoder (USE) and BioSentVec as methods for helping users identify and improve semantic similarity between sentences in biomedical texts. The tendencies shared by the two models support sequential similarity as a metric for evaluating a text’s cohesion. With both methods, outliers can be easily spotted, and specific modifications to the sentences can then be made depending on the type of outlier.
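The idea of sequential similarity can be sketched as follows: each sentence is mapped to a vector, consecutive sentences are compared by cosine similarity, and unusually low scores are flagged for revision. This is a simplified sketch under stated assumptions. The `embed` function is a toy bag-of-words stand-in for a real sentence encoder such as USE or BioSentVec, and the function names, example sentences, and threshold of 0.2 are hypothetical.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Toy stand-in for a sentence encoder: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def sequential_similarity(sentences):
    """Similarity of each sentence to the one that follows it."""
    vectors = [embed(s) for s in sentences]
    return [cosine(a, b) for a, b in zip(vectors, vectors[1:])]


def low_similarity_pairs(scores, threshold):
    """Indices i where sentences i and i+1 fall below the threshold."""
    return [i for i, score in enumerate(scores) if score < threshold]


# Invented example text; the last sentence breaks the thread.
sentences = [
    "Muscle cells express the gene.",
    "The gene regulates muscle growth.",
    "Rainfall was heavy in April.",
]
scores = sequential_similarity(sentences)
outliers = low_similarity_pairs(scores, threshold=0.2)
print(scores, outliers)
```

Swapping `embed` for a trained sentence encoder would leave the rest of the sketch unchanged, since only the vectors differ.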
Eduardo N. Fuentes, Hector Allende-Cid, Sebastián Rodríguez, Rene Venegas, Juan Pavez, Wenceslao Palma, Ismael Figueroa, Sofia Zamora, Brayn Diaz, and Ashley VanCott. 2020. Congreso Internacional de Lingüística Computacional y de Corpus.
WriteWise is the first commercially available advanced platform that gives users help and feedback to improve the writing of scientific papers. This is made possible by the development of advanced textual data representations at different linguistic levels (e.g. words, sentences), using cutting-edge machine-learning models and applied linguistics research.
Juan Pavez, Sebastián Rodríguez, and Eduardo N. Fuentes. 2020. Congreso Internacional de Lingüística Computacional y de Corpus.
One of the main challenges for researchers when writing scientific papers is to coherently structure and organize the content, specifically at the rhetorical-discursive level. Modeling these types of text is difficult, and new computational approaches are necessary. Language-model pretraining, which learns word representations from large amounts of unannotated text, has been shown to be effective for improving many natural language processing (NLP) tasks. Recent models have focused on learning context-dependent word representations, such as: 1) Embeddings from Language Models (ELMo) (Peters et al., 2018); 2) Generative Pretrained Transformer (GPT) (Radford et al., 2018); and 3) Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019). Specifically, BERT, which uses a transformer architecture (Vaswani et al., 2017) to produce contextualized word representations, has shown state-of-the-art performance on several NLP benchmarks. Despite these advantages, BERT has been trained and tested mainly on datasets containing general-domain texts (e.g. Wikipedia). Therefore, its performance on other text genres, such as biomedical scientific papers, is not optimal. Recently, BioBERT, the first domain-specific BERT-based model pretrained on biomedical corpora (PubMed), has been shown to outperform previous models on biomedical NLP tasks (Lee et al., 2019). However, little research has applied these state-of-the-art language models at the rhetorical-discursive level, in particular to the challenging task of identifying rhetorical-discursive steps (i.e. functional linguistic units that fulfill a communicative purpose in a sentence). Therefore, the aim of this study was to test the accuracy of BioBERT in classifying rhetorical-discursive steps in biomedical scientific papers.