pretrAined BERT language model for European Portuguese based on Arquivo.pt

Co-financed by:
Acronym | AiBERTa
Project title | pretrAined BERT language model for European Portuguese based on Arquivo.pt
Project Code | 2022.03882.PTDC
Main objective | To strengthen research, technological development and innovation

Region of intervention | Portugal

Beneficiary entity | Universidade de Évora (lead)

Approval date | 27-07-2022
Start date | 01-03-2023
Conclusion date | 31-08-2024
Extended conclusion date | 28-02-2025

Total eligible cost | 49,347 €
European Union financial support |
National/regional public financial support | República Portuguesa - 49,347 €
Financial support awarded to Universidade de Évora | 49,347 €

Summary

The concept of transformers was responsible for a major breakthrough in using neural networks for Natural Language Processing (NLP). Building on this work, researchers presented BERT (Bidirectional Encoder Representations from Transformers).

Evaluation on several natural language understanding benchmarks (GLUE, MultiNLI, SQuAD v1.1 and SQuAD v2.0) showed that the BERT language representation model improved on the state-of-the-art results. In a very simplified way, we can say that the novelty of this architecture was the use of bidirectional context in all layers.

Although the pre-trained multilingual BERT model can be used for downstream NLP tasks like POS tagging, Named Entity Recognition or Natural Language Inference, there are numerous successful examples of replicating BERT and BERT-derived architectures in order to build monolingual models. As an illustration, such models exist for languages such as French, Spanish, Italian, and Brazilian Portuguese.
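For illustration only, the sketch below shows how the pre-trained multilingual BERT checkpoint (bert-base-multilingual-cased on the Hugging Face Hub) can be probed with a masked-word prediction in Portuguese before any fine-tuning; the example sentence and the use of the transformers library are illustrative choices, not project deliverables.

```python
# Illustrative sketch: probing the pre-trained multilingual BERT checkpoint
# (bert-base-multilingual-cased) with masked-word prediction in Portuguese.
# Requires the Hugging Face `transformers` library; the sentence is an example only.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-multilingual-cased")

# BERT tokenizers use the [MASK] token to mark the word to be predicted,
# which is filled in using both the left and the right context.
predictions = fill_mask("O Arquivo.pt preserva a informação publicada na [MASK] portuguesa.")

for p in predictions:
    print(f"{p['token_str']:>15}  score={p['score']:.3f}")
```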

The main objective of this project is to build a large pre-trained Language Model (LM) for European Portuguese. We consider that the main reasons such a model does not yet exist fall into two categories of obstacles: on the one hand, the need for huge amounts of unlabelled data; on the other, the required computational resources.

To overcome the first obstacle, we propose to rely on the Arquivo.pt infrastructure (whose main objective is the preservation of information published on the Portuguese web). To overcome the second, we intend to use the recently created computing lab of the University of Évora dedicated to Artificial Intelligence (AI) and Big Data, BigData@UÉ.
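Purely as an illustration of the data-collection step, the following sketch queries the public Arquivo.pt full-text search API. It assumes the q and maxItems query parameters and a JSON response containing a response_items list with a linkToExtractedText field per result; the authoritative parameter and field names should be checked against the API documentation at https://arquivo.pt/api.

```python
# Illustrative sketch of collecting raw Portuguese text from Arquivo.pt.
# Assumes the public TextSearch endpoint, the `q`/`maxItems` parameters and a
# `response_items` list with a `linkToExtractedText` field per item; consult
# https://arquivo.pt/api for the authoritative API description.
import requests

API = "https://arquivo.pt/textsearch"

def fetch_extracted_texts(query: str, max_items: int = 50) -> list[str]:
    """Query the archive and download the plain-text extraction of each hit."""
    resp = requests.get(API, params={"q": query, "maxItems": max_items}, timeout=30)
    resp.raise_for_status()
    texts = []
    for item in resp.json().get("response_items", []):
        text_url = item.get("linkToExtractedText")
        if text_url:
            texts.append(requests.get(text_url, timeout=30).text)
    return texts

if __name__ == "__main__":
    docs = fetch_extracted_texts("universidade", max_items=10)
    print(f"Retrieved {len(docs)} documents")
```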

The public release of such a pre-trained model, based on the BERT architecture, will be a relevant contribution to the Portuguese NLP community. As noted above, the novelty of this proposal lies in uniting a rich Portuguese web archive (Arquivo.pt) with the computational power of the BigData@UÉ Lab.

To validate the model and compare it with similar work, we also propose to build a new NER classifier. Although this classifier is a secondary outcome, we consider it a relevant project result in its own right.
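As a rough indication of what such a validation could look like, the following self-contained sketch fine-tunes a BERT checkpoint for token classification (NER) with the Hugging Face transformers and PyTorch libraries; the checkpoint name, the tag set and the toy training example are placeholders rather than the project's actual choices.

```python
# Condensed, self-contained sketch of fine-tuning a BERT checkpoint for NER
# (token classification). The checkpoint name, tag set and toy sentence are
# placeholders, not the project's actual choices.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-LOC", "I-LOC"]                      # toy tag set
checkpoint = "bert-base-multilingual-cased"           # placeholder checkpoint

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=len(labels))

# One toy training example: word-level tags propagated to every sub-token of each word.
words = ["A", "Universidade", "de", "Évora", "fica", "em", "Évora", "."]
word_tags = [0, 0, 0, 0, 0, 0, 1, 0]                  # second "Évora" (the city) tagged B-LOC

enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
aligned = [-100 if w is None else word_tags[w] for w in enc.word_ids(0)]
enc["labels"] = torch.tensor([aligned])

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for _ in range(3):                                    # a few illustrative steps
    loss = model(**enc).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
print(f"final toy loss: {loss.item():.3f}")
```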

The PI and Co-PI have solid experience in the area of Natural Language Processing, including the supervision of eleven PhDs in this field. Furthermore, they have participated in several research projects with a strong Machine Learning component and have supervised several M.Sc. theses in this area. Finally, both are deeply involved in the BigData@UÉ Lab, with the Co-PI having been responsible for securing its funding.