PROCEEDINGS OF THE 15TH INTERNATIONAL CONFERENCE

“LINGUISTIC RESOURCES AND TOOLS FOR NATURAL LANGUAGE PROCESSING”,

ONLINE, 14-15 DECEMBER 2020

Editors

Verginica Barbu Mititelu, Elena Irimia

Dan Tufiș, Dan Cristea

Organisers

“Mihai Drăgănescu” Research Institute for Artificial Intelligence, Romanian Academy, Bucharest

Faculty of Computer Science, “Alexandru Ioan Cuza” University of Iași

Institute for Computer Science, Romanian Academy, Iași

Romanian Association of Computational Linguistics

Under the auspices of the Academy of Technical Sciences

The publication of this volume was supported by the Faculty of Computer Science, “Alexandru Ioan Cuza” University of Iași

ISSN 1843-911X

PROGRAMME COMMITTEE

Verginica Barbu Mititelu, Research Institute for Artificial Intelligence “Mihai Drăgănescu”, Romanian Academy, Bucharest

Tiberiu Boroș, Adobe Bucharest

Alexandru Ceaușu, European Commission

Mihaela Colhon, Faculty of Mathematics and Natural Sciences, University of Craiova

Svetlana Cojocaru, Institute of Mathematics and Computer Science, Academy of Sciences of Moldova, Chișinău

Horia Cucu, Faculty of Electronics, Telecommunications and Information Technology, University Politehnica of Bucharest

Dan Cristea, Faculty of Computer Science, “Alexandru Ioan Cuza” University of Iași and Institute for Computer Science, Romanian Academy, Iași Branch

Nils Diewald, Leibniz-Institut für Deutsche Sprache, Mannheim, Germany

Tsvetana Dimitrova, Institute for Bulgarian Language, Bulgarian Academy of Sciences

Ștefan Daniel Dumitrescu, Adobe Bucharest

Daniela Gîfu, Faculty of Computer Science, “Alexandru Ioan Cuza” University of Iași and Institute for Computer Science, Romanian Academy, Iași Branch

Florentina Hristea, Faculty of Mathematics and Computer Science, University of Bucharest

Adrian Iftene, Faculty of Computer Science, “Alexandru Ioan Cuza” University of Iași

Diana Inkpen, School of Electrical Engineering and Computer Science, University of Ottawa, Canada

Radu Ion, Research Institute for Artificial Intelligence, Romanian Academy, Bucharest

Elena Irimia, Research Institute for Artificial Intelligence “Mihai Drăgănescu”, Romanian Academy, Bucharest

Svetlozara Leseva, Institute for Bulgarian Language, Bulgarian Academy of Sciences

Dana Lupșa, Faculty of Mathematics and Computer Science, Babeș-Bolyai University

Maria Mitrofan, Research Institute for Artificial Intelligence “Mihai Drăgănescu”, Romanian Academy, Bucharest

Alex Moruz, Faculty of Computer Science, “Alexandru Ioan Cuza” University of Iași and Institute for Computer Science, Romanian Academy, Iași Branch

Mihaela Onofrei, Institute for Computer Science, Romanian Academy, Iași Branch and Faculty of Computer Science, “Alexandru Ioan Cuza” University of Iași

Vasile Florian Păiș, Research Institute for Artificial Intelligence “Mihai Drăgănescu”, Romanian Academy, Bucharest

Traian Rebedea, Faculty of Automatic Control and Computers, University Politehnica of Bucharest

Adriana Stan, Faculty of Mathematics and Computer Science, “Babeș-Bolyai” University of Cluj-Napoca

Elena Isabelle Tamba, “A. Philippide” Institute of Romanian Philology, Romanian Academy, Iași Branch

Horia-Nicolai Teodorescu, Institute for Computer Science, Romanian Academy, Iași Branch and “Gheorghe Asachi” Technical University of Iași

Diana Trandabăț, Faculty of Computer Science, “Alexandru Ioan Cuza” University of Iași and Institute for Computer Science, Romanian Academy, Iași Branch

Ștefan Trăușan-Matu, Faculty of Automation, Control and Computer Engineering, University Politehnica of Bucharest and Research Institute for Artificial Intelligence “Mihai Drăgănescu”, Romanian Academy, Bucharest

Dan Tufiș, Research Institute for Artificial Intelligence “Mihai Drăgănescu”, Romanian Academy, Bucharest

ORGANIZING COMMITTEE

Verginica Barbu Mititelu, Research Institute for Artificial Intelligence “Mihai Drăgănescu”, Romanian Academy, Bucharest

Dan Cristea, Faculty of Computer Science, “Alexandru Ioan Cuza” University of Iași and Institute for Computer Science, Romanian Academy, Iași Branch

Lucian Gâdioi, Faculty of Computer Science, “Alexandru Ioan Cuza” University of Iași

Daniela Gîfu, Faculty of Computer Science, “Alexandru Ioan Cuza” University of Iași and Institute for Computer Science, Romanian Academy, Iași Branch

Adrian Iftene, Faculty of Computer Science, “Alexandru Ioan Cuza” University of Iași

Elena Irimia, Research Institute for Artificial Intelligence “Mihai Drăgănescu”, Romanian Academy, Bucharest

Mihaela Onofrei, Institute for Computer Science, Romanian Academy, Iași Branch and Faculty of Computer Science, “Alexandru Ioan Cuza” University of Iași

Ionuț Pistol, Faculty of Computer Science, “Alexandru Ioan Cuza” University of Iași

Andrei Scutelnicu, Faculty of Computer Science, “Alexandru Ioan Cuza” University of Iași and Institute for Computer Science, Romanian Academy, Iași Branch

Diana Trandabăț, Faculty of Computer Science, “Alexandru Ioan Cuza” University of Iași

Dan Tufiș, Research Institute for Artificial Intelligence “Mihai Drăgănescu”, Romanian Academy, Bucharest

TABLE OF CONTENTS

Foreword

Invited Speakers

Machine Learning – Universal Panacea?
Corneliu Burileanu

Developments on Text to Speech Synthesis
Mircea Giurgiu

Customer Obsessed Science
Daniel Marcu

Challenges (and Opportunities) in Multimodal Sensing of Human Behavior
Rada Mihalcea

Scaling Semantic Role Labeling and Semantic Parsing across Languages
Roberto Navigli

Chapter 1. Language resources development, standardization and exploitation

The Romanian Medical Treebank - SiMoNERo
Verginica Barbu Mititelu and Maria Mitrofan

Parsing Temporal and Spatial Information
Cătălina Mărănduc, Victoria Bobicev, and Cenel Augusto Perez

Romanian Resources in Linguistic Linked Open Data Format
Verginica Barbu Mititelu, Elena Irimia, Vasile Păiș, Andrei-Marius Avram, Maria Mitrofan and Eric Curea

The LECOR Project. A Presentation
Carmen Mîrzea Vasile

Beginning and End of Sentence Word Digrams for Printed Romanian Language
Alexandru Dinu, Adriana Vlad, Adrian Mitrea and Bogdan Hanu

Chapter 2. Tools for natural language processing

Multiple Annotation Pipelines inside the RELATE Platform
Vasile Păiș

Exploring Variational Autoencoders for Lemmatization
Petru Rebeja

A Word Sense Alignment Approach Based on the Romanian Wordnet and eDTRL Resources
Andrei Scutelnicu

Chapter 3. Speech recognition and synthesis

Exploring End-to-end Neural Text-to-speech Synthesis for Romanian
Marius Dumitrache, Traian Rebedea

Romanian Speech Recognition Experiments from the ROBIN Project
Andrei-Marius Avram, Vasile Păiș and Dan Tufiș

Improved Text Normalization and Language Models for SpeeD’s Automatic Speech Recognition System
Cristian Manolache, Alexandru-Lucian Georgescu, Horia Cucu, Verginica Barbu Mititelu and Corneliu Burileanu

Chapter 4. Applications

Author Confidence as a Predictor of the Acceptance of Scientific Papers
Mihaela Onofrei, Diana Trandabăț

Accessibility Solution for Poor Sighted People and Elderly as an End-to-end Service for Applications. Romanian Approach
Camelia-Maria Miluț, Adrian Iftene

Approaches in Assessing the Credibility of Online Information
Mircea Petic, Adela Gorea, Inga Țițchiev

Automatic Fake News Identification System
Ciprian-Gabriel Cușmuliuc, Ioan Sava, Diana-Isabela Crainic, Lucia-Georgiana Coca and Adrian Iftene

What Indicators Tell us About Making Accurate Rank of the Best Paper Predictions
Dan Alexandru, Adrian Iftene and Daniela Gîfu

Index of authors

FOREWORD

This volume includes the papers of the 15th edition of ConsILR, the International Conference on Linguistic Resources and Natural Language Processing Tools, held on 14-16 December 2020, together with the 4th (and last) Workshop of the project “ReTeRom – Resources and Technologies for the Development of Human-Machine Interfaces in Language”. The scientific events were organized by two institutes of the Romanian Academy, the Institute of Artificial Intelligence “Mihai Drăgănescu” in Bucharest and the Institute of Computer Science in Iași, together with the Faculty of Computer Science of the “Alexandru Ioan Cuza” University of Iași and the Romanian Association of Computational Linguistics. As in most of the previous editions, the event ran under the auspices of the Technical Sciences Academy of Romania.

Organizing the ConsILR conference under pandemic conditions was really challenging. More than once in the past, the presentations were broadcast live on the web during the events, but this edition, constrained by the COVID-19 pandemic, was the first in which communication was entirely online. This completely new format of the conference was, to our great satisfaction, well received by all attendees, as shown by one of the largest audiences in the whole series of ConsILR events. We are grateful to all the virtual attendees of this 15th ConsILR Conference, to the reviewers who ensured a quality selection of the papers and to the organizers who skilfully managed the online event.

From its first edition, in 2001, the ConsILR Conference (traditional acronym for Consorțiul de Informatizare pentru Limba Română, an initiative born in the Section for Information Science and Technology of the Romanian Academy) was meant as a meeting place for linguists and computational linguists, but also for researchers in the humanities, PhD students and master students in Computational Linguistics, all with an interest in the study of the Romanian language from a computational perspective. The series of events has run, with few exceptions, every year, first in the format of a workshop, and since 2010 as an international conference. Thus, ConsILR addresses not only researchers working on the Romanian language, but also other scientists, from any part of the world, who could find sources of inspiration in the models and techniques developed for our language and apply them to their own languages. By opening the gate for researchers working on languages other than Romanian to participate in the Conference and publish their work in the Proceedings, the conference also facilitates a reverse influence, namely that their work inspires scientists working on the Romanian language.

The conference program, mirrored in this volume, was dense, with 17 presentations of the latest results of researchers in this field. The contributed articles were organized in four chapters: 1. Language Resources Development, Standardization and Exploitation, 2. Tools for Natural Language Processing, 3. Speech Recognition and Synthesis, 4. Applications. The topics addressed are of real interest, ranging from corpora and treebanks, through models and algorithms for the most important phases of natural language processing, to standardized methods of representing linguistic resources and hardware and software infrastructures for speech and text processing. The fourth part of the volume includes papers presenting a wide range of applications of speech recognition and synthesis for the Romanian language, fake news detection, assistance for visually impaired people or the elderly, etc. Most of the results presented at the conference are public, open to anyone interested in taking them over and using them.

The organisers invited five well-known researchers in our field, who agreed to deliver keynote speeches:

• Dr. Corneliu Burileanu, Professor at the Faculty of Electronics, Telecommunications and Information Technology, Vice President of University “Politehnica” of Bucharest, founder in 1984 of the Research Group in Speech Technologies, currently known as SpeeD (Speech and Dialogue Laboratory), coordinator of the biannual international conference SpeD.

• Dr. Mircea Giurgiu, Professor in the Department of Communications, Faculty of Electronics, Telecommunications and Information Technology, Technical University of Cluj-Napoca, coordinator of the Speech Processing Laboratory, with prestigious results in automatic speech synthesis.

• Dr. Daniel Marcu, a great personality in the field of language technologies. Known for his exceptional contributions, both as a scientist and as an entrepreneur, he is currently the director of Applied Sciences at Amazon, the coordinator of the teams that develop the famous Alexa communication systems and the Amazon Translate machine translation service.

• Dr. Rada Mihalcea, Professor of Computer Science and Engineering at the University of Michigan and Director of the Artificial Intelligence Laboratory at the University of Michigan, Honorary Citizen of the city of Cluj-Napoca. She holds numerous awards and distinctions for scientific results and has coordinated, among others, projects for the detection of false statements and fake news, with systems whose accuracy has surpassed that of human evaluators.

• Dr. Roberto Navigli, Professor of Computer Science at Sapienza University in Rome and the leader of the University's research group in the field of Natural Language Processing. Winner of several international awards (Prof. Navigli is one of the few winners of two ERC grants), he is the coordinator of BabelNet's high-impact international projects, BabelScape.

The combination of the brilliance of this distinguished group of researchers and the quality of the papers accepted for regular presentation made this event one of the most remarkable in the whole series.

December 2020

The editors

INVITED SPEAKERS

ABSTRACTS

MACHINE LEARNING – UNIVERSAL PANACEA?

CORNELIU BURILEANU

Faculty of Electronics, Telecommunications and Information Technology, University Politehnica of Bucharest

In my previous presentation, at the ConsILR 2018 Conference, I pointed out some of the main research directions of the “SpeeD” team. Now I am able to give more details about some achievements in several areas of interest: emotion recognition from speech, a DNN approach to Romanian speech and speaker recognition, automatic music transcription, an EEG classifier based on deep neural networks, a real-time EMG-based gesture recognition system, a deep learning system for improved segmentation of lesions in COVID-19 chest CT scans, and analysis of seismic waves. What do these seemingly very different areas have in common?

I am trying to demonstrate that the methods offered by machine learning could provide viable solutions for the most diverse applications. But it is also an opportunity to share some of the achievements of the team I am working with.

DEVELOPMENTS ON TEXT TO SPEECH SYNTHESIS

MIRCEA GIURGIU

Faculty of Electronics and Telecommunications, Technical University of Cluj-Napoca

The presentation will focus on an important research topic developed in the last decade at the Speech Processing Research Group from the Technical University of Cluj-Napoca: text to speech synthesis for Romanian. While the earlier achievements in this field were related to speech synthesis using diphone concatenation or statistical methods, much effort has also been dedicated to speech synthesis using Deep Neural Networks (DNN).

Starting from a successful end-to-end approach, that is, training the network only with text-audio pairs, without any other text annotation, we show that there is still room for speech quality improvement from both perspectives: text processing modules, as well as acoustic modelling. First, this has been realised through several text annotations: phonetic transcription, syllabification, lexical stress positioning, POS tagging, or even higher-level representations, such as text style information. The methods and the performance of these text processing modules are presented. Second, a number of investigations have been carried out to experiment with various neural network architectures for acoustic modelling in order to enhance the speech quality, for example: Tacotron 2 for expressive speech synthesis, an improved DCTTS implementation for speaker adaptation, or Tacotron for speech synthesis trained with imperfect data. Further work will conclude the presentation.

CUSTOMER OBSESSED SCIENCE

DANIEL MARCU

Amazon

Advancing the state of the art in the context of products and services used by hundreds of millions of customers poses challenges that go beyond those associated with advancing the state of the art in customer-free settings. In this talk, I will highlight some of these challenges and discuss approaches to overcoming them in the context of two Amazon services: Amazon Translate and Alexa.

CHALLENGES (AND OPPORTUNITIES) IN MULTIMODAL SENSING OF HUMAN BEHAVIOR

RADA MIHALCEA

University of Michigan

Much of what we do today is centered around humans, whether it is creating the next generation of smartphones, understanding interactions with social media platforms, or developing new mobility strategies. A better understanding of people can not only answer fundamental questions about “us” as humans, but can also facilitate the development of enhanced, personalized technologies. In this talk, I will give an overview of the main challenges (and opportunities) faced by research on multimodal sensing of human behavior, and illustrate these challenges with projects conducted in the Language and Information Technologies lab at Michigan.

SCALING SEMANTIC ROLE LABELING AND SEMANTIC PARSING ACROSS LANGUAGES

ROBERTO NAVIGLI

Sapienza University of Rome

Sentence-level semantics is hampered by the lack of large-scale annotated data in non-English languages. In this talk I will focus on two key tasks aimed at enabling Natural Language Understanding, that is, Semantic Role Labeling and semantic parsing, and put forward innovative approaches which we developed to scale across several languages. I will show how new, language-independent techniques, as well as a brand-new, wide-coverage, multilingual verb frame resource, namely VerbAtlas, will help significantly close the gap between English and low-resource languages, and achieve the state of the art across the board.

CHAPTER 1.

LANGUAGE RESOURCES DEVELOPMENT, STANDARDIZATION AND EXPLOITATION

THE ROMANIAN MEDICAL TREEBANK - SIMONERO

VERGINICA BARBU MITITELU AND MARIA MITROFAN

Romanian Academy Research Institute for Artificial Intelligence

{vergi,maria}@racai.ro

Abstract

We present here the first Romanian medical treebank. It builds on a gold standard morphologically annotated corpus, also containing hand validated annotation with medical named entities. Enriched with a further linguistic level, namely syntax, it is a domain specific resource released within a multilingual context. We present quantitative data about it and the creation methodology. We also present and discuss here some comparative statistical data between this treebank and the general language treebank for Romanian.

Key words — corpus, medicine, named entities, Romanian, treebank, Universal Dependencies.

1. Introduction

Domain-specific language resources are valuable assets in natural language processing tools development and testing. We have been concerned with the medical domain for several years and have already reported two resources for it: a specialized corpus, BioRo (Mitrofan and Tufiș, 2018), created as part of the CoRoLa corpus (Tufiș et al., 2019), and a gold standard medical corpus (MoNERo) annotated morphologically (i.e., PoS tagged) and with domain specific named entities (Mitrofan et al., 2019). We continue here with the presentation of a new type of medical resource in Romanian, namely a newly released treebank, developed on top of MoNERo by adding a new annotation level, the syntactic one. It is thus called SiMoNERo.

A treebank is a corpus annotated at the syntactic level, with the tree as the representation of the sentence structure. Actually, the syntactic level of annotation is usually a further level of analysis in a treebank: on top of the morphological annotation, grammatical functions of words, dependencies between words, constituent boundaries become explicit.

Almost half a century ago, manual annotation was the way to create treebanks; see Sampson (2003), who nostalgically remembers working on the first trees, being photographed from an aeroplane, and then being sold the picture showing two disks close to each other in his yard: a pink one, his bald head, and a white one, the table covered with the papers on which trees were being drawn. Nowadays, automatic annotation is the solution, although Abeillé’s remark that “human post-checking is always necessary” (Abeillé, 2003) still holds, as proven by the results of the Universal Dependencies Shared Task session within CoNLL 2018 (Zeman et al., 2018), where the best ranked system had a Labeled Attachment Score of 75.84.
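To make the Labeled Attachment Score mentioned above concrete, here is a minimal sketch in Python. It assumes gold tokenization, so that gold and predicted tokens align one to one; the shared task's official metric also had to handle system tokenization, which this sketch does not attempt. The function name and the three-token example are hypothetical.

```python
from typing import List, Tuple

# Each token is (head_index, dependency_label); head index 0 stands for the root.
Token = Tuple[int, str]

def labeled_attachment_score(gold: List[Token], predicted: List[Token]) -> float:
    """Percentage of tokens whose predicted head AND label both match the gold annotation."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        return 0.0
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return 100.0 * correct / len(gold)

# Hypothetical three-token sentence: the parser gets one label wrong.
gold = [(3, "aux"), (3, "cop"), (0, "root")]
predicted = [(3, "aux"), (3, "aux"), (0, "root")]
las = labeled_attachment_score(gold, predicted)
```

Here two of the three tokens have both head and label right, so the score is two thirds, expressed as a percentage.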

When developing a treebank, developers rarely commit to a certain linguistic theory: Head-driven Phrase Structure Grammar (Simov et al., 2002; Oepen et al., 2002) and Tree Adjoining Grammar (Shen and Joshi, 2005) are two of the few linguistic theories reflected by existing treebanks. The decisions to make when taking up this endeavour are actually two: (i) choose between a dependency and a constituency annotation; (ii) choose between shallow and deep parsing, i.e. annotating only overt elements or also empty slots (Abeillé, 2003).

In this paper we present the creation of a new treebank for Romanian, a domain specific one, including medical texts, called SiMoNERo. Section 2 presents related work with respect to medical treebanks, on the one hand, and to Romanian treebanks, on the other hand. The preprocessing and annotation steps involved in the creation of this new treebank are presented in Section 3, while some statistics on it are given in Section 4, where we also draw a comparison between this treebank and a general language one, developed also within our group. We conclude the paper after we envisage potential uses of the treebank and offer information about how it can be accessed and queried in Section 5.

2. Related work

At present, there are treebanks available for tens of languages, most of them with an open license. Multiple treebanks for the same language are also available, developed by different authors, following different principles, adapted to domains, etc. This abundance of treebanks and their availability is also the result of the adoption of the Universal Dependencies1 (UD) project principles and objectives by many research groups, and of their dynamism and interest in resource quality, especially in a multilingual context: the UD May 2020 release alone contained 163 treebanks for 92 languages, and the UD November 2020 release added 20 new treebanks and 12 new languages.

2.1. Medical treebanks

As stated by Jiang et al. (2015), retraining existing parsers using medical treebanks is critical for improving their performance, while combining medical and general domain corpora can lead to achieving optimal performance for parsing clinical text. Therefore, several initiatives in the clinical NLP community have established the guidelines for annotating medical texts, as well as annotated corpora for parsing clinical text.

Fan et al. (2013) developed guidelines for parsing medical texts and annotated a corpus accordingly. They also created a treebank of 25 progress notes from University of Pittsburgh Medical Center. The annotated treebank contains 1,100 sentences, with a median length of 8 tokens per sentence, thus quite short ones.

Another annotated clinical corpus, named MiPACQ (Albright et al., 2013), was created using pathology and other clinical notes from the Mayo Clinic. MiPACQ contains multiple layers of annotations, including named entities, syntactically parsed trees, dependency parsed trees and semantic role labeling on 13,091 sentences.

Within the UD project, several other treebanks that also contain medical texts have been made public, but they are not specific to this field; the Romanian Reference Treebank (see below) is one of them. For the Romanian language, SiMoNERo is the first medical treebank.

1 universaldependencies.org

2.2. Romanian treebanks

To the best of our knowledge, there are several treebanks available for Romanian. The first one created (Hristea and Popescu, 2003) was a dependency treebank containing 4,042 short sentences selected from journalistic texts; its peculiarities are the analysis of clauses exclusively, not of sentences, and the exclusion of subordinating conjunctions from clauses; consequently, the picture of the language it offers is distorted.

Another treebank is UAIC-RoDepTb (Perez, 2014), again a dependency one, containing 4,500 sentences, quite long ones, with an average of 37 words/sentence, manually annotated according to “traditional grammar” principles, while the list of relations also reflects the syntactic functions used in the Romanian traditional syntactic approach. The corpus is heterogeneous in structure, with texts from literature, from Romanian Wikipedia, from law texts, journalistic ones, etc. They are both original and translations.

Bick and Greavu (2010) report on a 21-million-word journalistic treebank, automatically annotated within the Constraint Grammar formalism with a parser whose grammar is an adaptation of an Italian one.

Irimia and Barbu Mititelu (2015) created another dependency treebank (RACAI-RoTb) containing 5,000 sentences, semi-automatically annotated with a set of relations that was meant to be close to the UD one. The sentences were extracted from ROMBAC (Ion et al., 2012), a balanced corpus of Romanian (reflecting the journalistic, medical, imaginative, juridical and scientific genres), and feature the most frequent verbs therein.

A conversion of UAIC-RoDepTb and RACAI-RoTb to UD format resulted in a reference treebank for Romanian (RoRefTrees or RRT) (Barbu Mititelu et al., 2016) which was released in UD. It contains 9,523 sentences, covers a variety of genres, reflects the contemporary language and was manually validated at the syntactic level.

Another treebank for Romanian released within UD is Romanian Non-standard or UAIC-RoDia (Colhon et al., 2017). With its 572,436 tokens, it is the largest available here. It stands out due to the fact that it contains texts from older periods of the language (16th to 19th centuries), as well as from folklore.

3. Treebank description

In this section we present the corpus content from the perspective of text types, its processing and the levels of annotation available.

3.1. Types of texts in the corpus

SiMoNERo consists of texts extracted from three types of documents: medical scientific journal articles, scientific medical literature books and medical blog posts; most of them come from medical books. The main reasons for choosing these three sources were the good quality of the texts, the correct usage of medical terminology and the abundance of medical terms. All the sentences were extracted based on the metadata scheme associated with each document present in BioRo, the corpus from which SiMoNERo was extracted. All texts are I(ntellectual)P(roperty)R(ights)-cleared, which is a valuable asset for offering wide access to the resource.

3.2. Levels of annotation

The texts were sentence split, tokenized and lemmatized using the TTL tool (Ion, 2007).

The annotation scheme has three different levels:

i) the morphological level was developed in two steps: automatic annotation performed with the TTL tool (Ion, 2007) followed by the manual verification of the tags. During this phase, several types of errors were corrected (Mitrofan et al., 2018). The annotation scheme used was based on the MSD tag-set developed in the Multext-East project (Dimitrova et al., 1998), which contains 715 tags for Romanian and fourteen classes of words.

ii) the named entity level was manually developed by two annotators: one physician and one experienced annotator, both having Romanian as native language. The annotation scheme of the named entities (NE) was based on four UMLS2 semantic groups: anatomy (ANAT), chemicals and drugs (CHEM), disorders (DISO) and procedures (PROC). The main reason for choosing these four types of entities was a trade-off between the minimum number of entities of each type and the maximum relevance for our corpus. Since the corpus was tokenized and in CoNLL-U3 format, the IOB2 (Inside-Outside-Beginning) (Sang and Veenstra, 1999) format was chosen to represent the named entities. The B-tag is used for the first token (so the beginning) of every NE, the I-tag indicates a token that is inside an NE and the O-tag is used for surrounding tokens that do not belong to an NE.

iii) the syntactic level was automatically added using the NLP-Cube parser4 (Boroș et al., 2018), trained on RRT. A validation process was run to ensure the treebank’s conformance with the UD specifications: considerable manual intervention was necessary so that SiMoNERo now passes all the validation tests5 created in UD. One example is the removal of auxiliary chains from the annotation: in Figure 1, the auxiliary ar must be attached to the head of the clause, the adjective utilă, and not to the verb fi, because the latter’s part of speech is also AUX (given its copula reading here). This type of annotation goes against morphological knowledge: the auxiliary ar is used here to create the present conditional of the verb, that is, a certain mood, which is a grammatical category of the verb, not of the adjective. However, it is the copula reading of the verb fi that prevents attaching ar to it in examples of this kind: the adjective is considered the head of the clause and all functional categories related to the predicate (auxiliaries among them) depend on it.
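The IOB2 scheme described in item ii) can be illustrated with a short Python sketch. The sentence, the span offsets and the helper name `to_iob2` are hypothetical and chosen only for illustration; only the B-/I-/O convention and the entity type labels come from the text above.

```python
from typing import List, Tuple

def to_iob2(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Encode entity spans (start, end_exclusive, type) as one IOB2 tag per token."""
    tags = ["O"] * len(tokens)                # tokens outside any entity
    for start, end, ne_type in spans:
        tags[start] = f"B-{ne_type}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{ne_type}"          # remaining tokens inside the entity
    return tags

# Hypothetical sentence: "acute myocardial infarction treated with aspirin".
tokens = ["infarct", "miocardic", "acut", "tratat", "cu", "aspirină"]
spans = [(0, 3, "DISO"), (5, 6, "CHEM")]
tags = to_iob2(tokens, spans)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Because every entity starts with a B-tag, two adjacent entities of the same type remain distinguishable, which is the main reason for preferring IOB2 over plain inside/outside marking.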

2 https://semanticnetwork.nlm.nih.gov/

3 https://universaldependencies.org/format.html

4 https://opensource.adobe.com/NLP-Cube/index.html

5 A comprehensive list of these tests is available at https://universaldependencies.org/svalidation.html.
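The auxiliary-chain constraint discussed in item iii) can be sketched as a simple check over CoNLL-U rows. This is a simplified illustration, not the UD validator's own implementation; the three-token fragment for "ar fi utilă" is hypothetical, with unused columns left as "_".

```python
# Hypothetical CoNLL-U fragment for "ar fi utilă" ("would be useful").
# Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
CONLLU = """\
1\tar\t_\tAUX\t_\t_\t3\taux\t_\t_
2\tfi\t_\tAUX\t_\t_\t3\tcop\t_\t_
3\tutilă\t_\tADJ\t_\t_\t0\troot\t_\t_
"""

def parse(conllu: str):
    """Read the columns needed for the check from CoNLL-U text."""
    rows = []
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if not cols[0].isdigit():  # skip multiword-token ranges and empty nodes
            continue
        rows.append({"id": int(cols[0]), "form": cols[1], "upos": cols[3],
                     "head": int(cols[6]), "deprel": cols[7]})
    return rows

def auxiliary_chains(rows):
    """Return forms of tokens attached as aux to a token that is itself tagged AUX."""
    upos_by_id = {r["id"]: r["upos"] for r in rows}
    return [r["form"] for r in rows
            if r["deprel"].split(":")[0] == "aux"
            and upos_by_id.get(r["head"]) == "AUX"]

# Annotation as required: both "ar" and "fi" depend on the adjective.
good = auxiliary_chains(parse(CONLLU))

# Disallowed variant: "ar" attached to the copula "fi" instead.
bad = auxiliary_chains(parse(CONLLU.replace("\t3\taux", "\t2\taux")))
```

With the heads as in the first fragment nothing is flagged; once ar is re-attached to fi, the check reports it as an auxiliary chain.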

Figure 1: Avoiding chains of auxiliaries

4. Statistics of the treebank

4.1. SiMoNERo

In this section we present general statistics about the treebank. Table 1 shows the distribution of sentences within the medical domains. It can be seen that this distribution is not balanced because of the copyright restrictions which made it impossible to collect the same amount of texts for each medical domain.

Table 1: Distribution of texts from medical domains in the corpus

Domain          Tokens   Sentences
Cardiology      40.7%    40.6%
Diabetes        44.7%    43%
Endocrinology   14.6%    16.4%

The treebank contains 4,681 sentences split into three files, as shown in Table 2 (see also Section 5), where we included the average length of sentences in each file, which is always at least 30 tokens/sentence. The bottom line of the table includes similar information about the medical component of RRT: we notice that the sentences there are shorter than the ones in SiMoNERo.

Table 2: Number of sentences and their average length. Comparison with the medical subcorpus of RRT

File             Number of sentences   Average sentence length (tokens/sentence)
SiMoNERo train   3,747                 31
SiMoNERo dev     443                   33
SiMoNERo test    491                   30
medical RRT      1,210                 23
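As a quick consistency check on Table 2, the following Python sketch sums the three SiMoNERo files and estimates the overall average sentence length. The per-file averages in the table are rounded, so the token total computed here is only an approximation, not a figure reported by the authors.

```python
# Figures copied from Table 2: (number of sentences, average tokens per sentence).
files = {
    "train": (3747, 31),
    "dev": (443, 33),
    "test": (491, 30),
}

total_sentences = sum(n for n, _ in files.values())

# The averages are rounded in the table, so this token total is an estimate.
approx_tokens = sum(n * avg for n, avg in files.values())
approx_mean_length = approx_tokens / total_sentences

print(total_sentences)
print(round(approx_mean_length, 1))
```

The sentence total matches the 4,681 sentences stated in the text, and the estimated overall mean stays around 31 tokens per sentence, well above the 23 reported for the medical part of RRT.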

Analysing the content words, we found that the texts have a descriptive structure: 27.8% of content words are nouns, followed by adjectives (11.5%) and verbs (10.4%). The many cases in which nouns are followed by two or more adjectives contribute to the descriptive character of the texts.

VERGINICA BARBU MITITELU, MARIA MITROFAN

SiMoNERo also has 14,133 medical named entities marked among the tokens, distributed across the four types mentioned above as shown in Table 3. As expected, medical texts mainly describe diseases and medical conditions, which is why NEs belonging to the DISO semantic group are prevalent.

Table 3: Types of medical named entities and their number in SiMoNERo

Type    Number
DISO    6,611
CHEM    4,156
ANAT    1,964
PROC    1,402

4.2. SiMoNERo versus RRT

In this section we compare the two Romanian treebanks reflecting contemporary language that are available in UD, namely SiMoNERo, as a medical treebank, and RRT, as a general language treebank, with the aim of highlighting the characteristics of the former.

SiMoNERo contains much longer sentences, as well as a higher frequency of punctuation marks: there is a lot of scientific data in the form of results of measurements or analyses (e.g. 180 mg / dl or TA < 120 / 80 mmHg; there are 401 slashes in SiMoNERo), also rendered as percentages (e.g. 77% dintre pacienți “77% of the patients”) and as intervals of values (e.g. TA diasistolice 90 - 99 mmHg). Moreover, many statements are backed by references to papers, rendered as numbers between brackets that point to a position in the list of references (e.g. (3) or [3]), while further explanations are also given between brackets within the sentence (e.g. substituție C - T silențioasă ( care nu determină modificarea aminoacidului codificat ) “silent C - T substitution (which does not determine the modification of the coded amino acid)”); there are 2,540 pairs of brackets in SiMoNERo and only 1,671 pairs in RRT.

Besides their different sizes, Table 4 also shows the lexical diversity of each treebank: both are characterized by lexical repetitiveness, as the percentage of unique lemmas is below 10 in both treebanks. However, comparing the vocabulary specific to each of them, we notice that 58% of the unique lemmas in SiMoNERo do not occur in RRT. They are mainly medical terms: infarct (heart attack), reparatoriu (reparatory), compresiv (compressive), cefalalgic (cephalgic), neuroglicopenic (neuroglycopenic), toracotomie (thoracotomy), osteoblastic (osteoblastic) etc. There is a higher percentage of lemmas specific to RRT, namely 74%, which can be explained by the multiple domains that coexist in it: law, sciences, literature, journalism, medicine, Wikipedia etc. Some examples are: peștișor (little fish), împrejur (around), paralelipiped (parallelepiped), alocat (allocated), zdrențuit (ragged), înger (angel), turnesol (litmus), behăială (bleat), pneu (tyre), arbitru (referee), contribuabil (taxpayer), penumbră (penumbra), urologic (urologic), frescă (fresco), pricepere (know-how), neorânduială (disorder), interstelar (interstellar), comerciant (trader), sat (village), cutremur (earthquake), sfoară (rope), artă (art), tehnician (technician), necinste (dishonesty), zevzec (fool).

As expected, proper nouns are more numerous in RRT (17% of the unique lemmas) than in SiMoNERo (only 4% of the unique lemmas). Nouns (proper ones excluded) are rather equally represented in the two resources (47% in SiMoNERo and 45% in RRT).

However, SiMoNERo’s descriptive character mentioned above is supported by the higher frequency of adjectives, both when considered as unique lemmas and when considering their actual occurrence in the corpus.

Table 4: General data - a comparison between SiMoNERo and RRT

                                                                   SiMoNERo    RRT
sentences                                                          4,681       9,523
tokens                                                             146,020     218,511
tokens / sentence                                                  31.19       22.94
punctuation                                                        19,614      27,506
punctuation / sentence                                             4.2         2.9
adjectives                                                         17,053      15,229
unique lemmas                                                      10,711      17,458
unique lemmas minus proper nouns                                   10,282      14,409
unique lemmas minus proper nouns, numerals and punctuation         9,346       13,456
unique lemmas - only nouns                                         5,012       7,925
unique lemmas - only adjectives                                    2,891       3,484
% of adjectives                                                    12          7
% of unique lemmas                                                 7           8
% of proper nouns in unique lemmas                                 4           17
% of unique lemmas minus proper nouns, numerals and punctuation    87          77
% of numerals and punctuation                                      9           6
% of nouns in unique lemmas                                        47          45
% of adjectives in unique lemmas                                   27          20
lemmas only in SiMoNERo                                            6,210       -
lemmas only in RRT                                                 -           12,957
% lemmas only in one treebank                                      58          74
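Several of the derived rows in Table 4 follow directly from its raw counts. As a rough check, the Python sketch below recomputes a few of them; the dictionary keys and function names are our own illustrative choices and are not part of the treebank or of any tool used by the authors.

```python
# Raw counts copied from Table 4; the key names are illustrative assumptions.
COUNTS = {
    "SiMoNERo": {"sentences": 4681, "tokens": 146020, "punctuation": 19614,
                 "adjectives": 17053, "unique_lemmas": 10711,
                 "lemmas_only_here": 6210},
    "RRT": {"sentences": 9523, "tokens": 218511, "punctuation": 27506,
            "adjectives": 15229, "unique_lemmas": 17458,
            "lemmas_only_here": 12957},
}


def percent(part, whole):
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


def derived(counts):
    """Recompute some derived rows of Table 4 from the raw counts."""
    return {
        "tokens_per_sentence": counts["tokens"] / counts["sentences"],
        "punctuation_per_sentence": counts["punctuation"] / counts["sentences"],
        "pct_adjectives": percent(counts["adjectives"], counts["tokens"]),
        "pct_unique_lemmas": percent(counts["unique_lemmas"], counts["tokens"]),
        "pct_lemmas_only_here": percent(counts["lemmas_only_here"],
                                        counts["unique_lemmas"]),
    }


if __name__ == "__main__":
    for name, counts in COUNTS.items():
        for row, value in derived(counts).items():
            print(f"{name:10s} {row:26s} {value:6.2f}")
```

Rounded to the precision used in the table, the recomputed values agree with the reported ones up to small rounding differences.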

5. Format, Access, Query and Use of SiMoNERo

SiMoNERo is UTF-8 encoded, uses the LF character as line break, and is available in CoNLL-U format, with named entities marked in the last column.

It was split randomly into three files as follows: the test set (ro_simonero-ud-test.conllu) is 10% of the whole treebank, the development set (ro_simonero-ud-dev.conllu) is also 10%, while the rest of the treebank (80%) is the training set (ro_simonero-ud-train.conllu) (see also Table 2). The whole treebank is freely available for download6 under a CC BY-SA 4.0 license.
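As an illustration of the format described above, the following Python sketch reads CoNLL-U text and keeps, for each token, the word form, the UPOS tag and the content of the last (tenth) column. The two-token sample and the label strings B-DISO and I-DISO are invented for illustration; the exact way named entities are encoded in the last column of SiMoNERo should be checked against the released files.

```python
def read_conllu(text):
    """Yield sentences as lists of (form, upos, last_column) tuples."""
    sentence = []
    for line in text.split("\n"):
        line = line.rstrip("\r")
        if not line:                       # a blank line closes a sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):           # sentence-level comment line
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        if "-" in cols[0] or "." in cols[0]:
            continue                       # skip multiword ranges and empty nodes
        sentence.append((cols[1], cols[3], cols[9]))
    if sentence:                           # file without a final blank line
        yield sentence


# Invented sample; the labels in the last column are an assumption.
SAMPLE = (
    "# text = infarct miocardic\n"
    "1\tinfarct\tinfarct\tNOUN\t_\t_\t0\troot\t_\tB-DISO\n"
    "2\tmiocardic\tmiocardic\tADJ\t_\t_\t1\tamod\t_\tI-DISO\n"
    "\n"
)

if __name__ == "__main__":
    for sent in read_conllu(SAMPLE):
        for form, upos, last in sent:
            print(form, upos, last)
```

In practice the text would be read from one of the three files with `open(path, encoding="utf-8")`.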

6 https://github.com/UniversalDependencies/UD_Romanian-SiMoNERo/find/master

The treebank is available for querying online, alongside other treebanks, on two platforms: PML Tree Query7 and Grew-match8.

Being the first Romanian biomedical treebank annotated with both part-of-speech tags and named entities, SiMoNERo makes an important contribution to named entity recognition (Ion et al., 2019), machine translation (Neves et al., 2018) and other NLP tasks. To increase its usefulness, we intend to systematically improve the syntactic annotation for the next UD release.

6. Conclusions

The Romanian medical treebank SiMoNERo is the outcome of our interest in developing medical language resources. It was the next natural step after releasing MoNERo, the gold standard morphologically annotated corpus, with domain specific named entities (Mitrofan et al., 2019). In a larger context, it is in line with our concern with the development of language resources in general and, in an even larger context, with the community’s understanding of their importance for the development of language processing tools. Further, we intend to annotate the NEs with semantic information such as WordNet senses and also to annotate the events. The treebank will also be made available in a standardized format, namely linguistic linked open data.

References

Abeillé, A. (2003). Treebanks. Building and Using Parsed Corpora, Dordrecht, Boston, London, Kluwer Academic Publishers.

Albright, D., Lanfranchi, A., Fredriksen, A., Styler IV, W.F., Warner, C., Hwang, J.D., Choi, J.D., Dligach, D., Nielsen, R.D., Martin, J., Ward, W. (2013). Towards comprehensive syntactic and semantic annotations of the clinical narrative. Journal of the American Medical Informatics Association, 20(5), 922-930.

Barbu Mititelu, V., Ion, R., Simionescu, R., Irimia, E., and Perez, C.-A. (2016). The Romanian Treebank Annotated According to Universal Dependencies. In Proceedings of HrTAL2016, Dubrovnik, Croatia, 29 September - 1 October 2016.

Bick, E. and Greavu, A. (2010). A Grammatically Annotated Corpus of Romanian Business Texts. In Proceedings of Multilinguality and Interoperability in Language Processing with Emphasis on Romanian, Editura Academiei Romane, 169-183.

Boroș, T., Dumitrescu, S. D. and Burtica, R. (2018). NLP-Cube: End-to-End Raw Text Processing With Neural Networks. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Association for Computational Linguistics, 171-179.

Colhon, M., Mărănduc, C. and Mititelu, C. (2017). A Multiform Balanced Dependency Treebank for Romanian. In Proceedings of Knowledge Resources for the Socio-Economic Sciences and Humanities, (KnowRSH), Varna, Bulgaria, September 8, 2017 workshop at the Recent Advances in Natural Language Processing (RANLP), 9-19.

7 http://lindat.mff.cuni.cz/services/pmltq/#!/home

8 http://match.grew.fr/

Dimitrova, L., Ide, N., Petkevic, V., Erjavec, T., Kaalep, H.J. and Tufiș, D. (1998). Multext-East: Parallel and comparable corpora and lexicons for six Central and Eastern European languages. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics - Volume 1, ACL ’98/COLING ’98, 315-319, Stroudsburg, PA, USA. Association for Computational Linguistics.

Fan, J.W., Yang, E.W., Jiang, M., Prasad, R., Loomis, R.M., Zisook, D.S., Denny, J.C., Xu, H. and Huang, Y. (2013). Syntactic parsing of clinical text: guideline and corpus development with handling ill-formed sentences. Journal of the American Medical Informatics Association, 20(6), 1168-1177.

Hristea, F., Popescu, M. (2003). A Dependency Grammar Approach to Syntactic Analysis with Special Reference to Romanian. In F. Hristea, M. Popescu (eds), Building Awareness in Language Technology, București, Editura Universității din București, 9-16.

Ion, R. (2007). TTL: A portable framework for tokenization, tagging and lemmatization of large corpora, PhD dissertation, Romanian Academy, Bucharest (in Romanian).

Ion, R., Irimia, E., Ștefănescu, D. and Tufiș, D. (2012). ROMBAC: The Romanian Balanced Annotated Corpus. In Proceedings of LREC 2012, Istanbul, Turkey, 339-344.

Ion, R., Păiș, V.F. and Mitrofan, M. (2019). RACAI’s System at PharmaCoNER 2019. In Proceedings of The 5th Workshop on BioNLP Open Shared Tasks, 90-99.

Irimia, E., Barbu Mititelu, V. (2015). RACAI-RoTb: nucleu de corpus de limbă română adnotat sintactic cu relaţii de dependenţă, Revista Română de Interacţiune Om-Calculator 8 (2) 2015, 101-120.

Jiang, M., Huang, Y., Fan, J.W., Tang, B., Denny, J., Xu, H. (2015). Parsing clinical text: how good are the state-of-the-art parsers?. BMC medical informatics and decision making, 15(S1), p.S2.

Mitrofan, M., and Tufiş, D. (2018). BioRo: The biomedical corpus for the Romanian language. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 1192-1196.

Mitrofan, M., Barbu Mititelu, V., Mitrofan, G. (2018). Towards the Construction of a Gold Standard Biomedical Corpus for the Romanian Language. Data 3.4 (2018): 53; https://doi.org/10.3390/data3040053.

Mitrofan, M., Barbu Mititelu, V. and Mitrofan, G. (2019). MoNERo: A Biomedical Gold Standard Corpus for the Romanian Language. In Proceedings of the 18th BioNLP Workshop and Shared Task, ACL, 71-79.

Neves, M., Yepes, A.J., Névéol, A., Grozea, C., Siu, A., Kittner, M. and Verspoor, K. (2018). Findings of the WMT 2018 biomedical translation shared task: Evaluation on MEDLINE test sets. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, 324-339.

Oepen, S., Flickinger, D., Toutanova, K. and Manning, C. D. (2002). LinGO Redwoods. A rich and dynamic treebank for HPSG. In Proceedings of the 1st Workshop on Treebanks and Linguistic Theories, 139-149.

Perez, C. A. (2014). Linguistic Resources for Natural Language Processing, PhD dissertation, A.I. Cuza University of Iasi (in Romanian).

Sampson, G. (2003). Thoughts on Two Decades of Drawing Trees. In Abeillé, A. (ed.) Treebanks. Building and Using Parsed Corpora, Dordrecht, Boston, London, Kluwer Academic Publishers, 23-42.

Sang, E. F. and Veenstra, J. (1999). Representing text chunks. In Proceedings of the ninth conference on European chapter of the Association for Computational Linguistics, 173–179. Association for Computational Linguistics.

Simov, K., Osenova, P., Slavcheva, M., Kolkovska, S., Balabanova, E., Doikoff, D., Ivanova, K., Simov, E. and Kouylekov, M. (2002). Building a Linguistically Interpreted Corpus of Bulgarian: the BulTreeBank. In Proceedings of LREC 2002, Canary Islands, Spain, 1729-1736.

Shen, L. and Joshi, A. K. (2005). Building an LTAG treebank. Technical Report MS-CIS-05-15, CIS Dept., UPenn.

Tufiș, D., Barbu Mititelu, V., Irimia, E., Păiș, V., Ion, R., Diewald, N., Mitrofan, M. and Onofrei, M. (2019). Little strokes fell great oaks. Creating CoRoLa, the reference corpus of contemporary Romanian. RRL, LXIV, 3, 227-240.

Zeman, D., Hajic, J., Popel, M., Potthast, M., Straka, M., Ginter, F., Nivre, J. and Petrov, S. (2018). CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, October, Brussels, Belgium, ACL, 1-21.

PARSING TEMPORAL AND SPATIAL INFORMATION

CĂTĂLINA MĂRĂNDUC1, VICTORIA BOBICEV2, AND CENEL AUGUSTO PEREZ1

1 Faculty of Computer Science, Al. I. Cuza University, Iași

2 Technical University of Moldova, Chișinău

[email protected], [email protected], [email protected]

Abstract

In this paper we present a dependency treebank that is morphologically and syntactically annotated in a specific scheme. We managed to increase the accuracy of the POS-tagger and of the syntactic parser used, which led to an increase in the volume of annotated texts. First, we analysed the accuracy with which the syntactic parser recognizes the 14 types of circumstantial complements, especially the temporal and spatial ones. These are the most numerous circumstantial complements, and they are very important for the configuration of a textual world describing reality or proposing a fictitious world, providing information about the type of text. In December 2020 our treebank comprised 42,542 sentences (919,608 words and punctuation). We studied our documents containing fictional and non-fictional narrative. Using the MaltParser optimizer, we extracted dependency chains of temporal and spatial complements. The number of complements and the degree to which they are precise are related to the type of text, fictional or non-fictional. To construct a classifier of texts, one can count the spatial and temporal complements and observe whether they are determinations of exact landmarks (with geographical proper names and numbers), in which case the text is a real narrative, or imprecise determinations, in which case the narrative is fictional.

Key words — Local complements, narrative fiction, narrative reality, syntactic parser, temporal complement, treebank, type of text.

1. Introduction

The RoDia (Romanian Diachronic) Dependency Treebank was created in 2007 and grew to 4,600 sentences by 2014 (Perez, 2014). The basic syntactic format, created in 2007 in accordance with the principles of Dependency Grammar (Tesnière, 1959; Mel’čuk, 1987), has undergone only minor changes since 2014.

The list of complements includes 14 types of circumstances and the coordination is a chain starting from the first coordinate, which also includes the connecting words or punctuation. But the volume of the treebank has increased a lot with the improvement of automatic annotation tools. We have used a hybrid POS-tagger (Simionescu, 2011), which we adapted for Nonstandard Romanian, the list of morphological labels from the MULTEXT-East project (Erjavec, 2012), as well as diverse variants of the MaltParser (Hall et al., 2006), trained on the growing gold corpus that we created. The treebank is corrected manually, but this is getting easier as the number of errors decreases. In December 2020, it comprises 42,542 sentences, with 919,608 words and punctuation.

We have focused on old Romanian texts from the 16th-19th centuries, and we have annotated whole books, because we have noticed that in this way the parser is better trained on more and more specific structures, and the texts are also available for other types of research.

In November 2017, a treebank for Nonstandard Romanian was created on the Universal Dependencies (UD) portal (Mărănduc and Bobicev, 2017). After three years, in November 2020, the treebank is available with 26,221 sentences (572,259 tokens, punctuation included).

Regarding the semantic annotation convention, we failed to create a semantic parser for it, which is why the semantic treebank has only 5,566 sentences with 99,341 tokens (Mărănduc et al., 2018). As a first step towards creating a semantic parser, we have tried to train diverse MaltParser variants (Smith et al., 2018) available on the UD site on the basic format of our treebank, with the 14 types of circumstantial complements, attempting to improve their accuracy.

The transformation into the UD convention is done automatically using the Treeops program (Colhon et al., 2017) and the result depends on the correctness of the morphological and syntactic annotation in the basic format.

One method for obtaining better accuracy is to enlarge the training corpus with texts rich in the type of complement for which poor accuracy is recorded because it has too few attestations in the texts. Thus, in order to study the time-space confusions made by the parser, we annotated Neculce’s chronicle, which is very rich in spatial complements and quite rich in time complements, being a non-fiction narrative text.

The automatic parser confuses certain pairs of circumstantial complements:

• time and place;

• associative and instrumental;

• conditional and concessive;

• conditional and consecutive;

• cause and purpose.

We refer here only to the local and temporal ones, which are very frequent and very important for the configuration of the textual world, be it fictional or non-fictional, and of the text type. If we solve the annotation of one of the two categories, we also solve the one with which it is confused. If we manage to annotate time correctly, then all other information that refers to landmarks, directions or sequences is spatial.

Time annotation of circumstantial complements is preserved in the UD format by the existence of three sub-classifications specific for Romanian treebanks: nmod:tmod, advmod:tmod, advcl:tcl.

To increase the accuracy of complement recognition, we have used several methods, including re-correcting those types of complements where we found a large number of errors, based on inconsistencies in the training corpus.

For example, if the erroneous annotation of the constructions pe vremea aceea, pe cea vreme, în vremea ceea (En: at that time, at the time mentioned) as c.c.l. (space complement) was repeatedly found in the automatic parsing, we searched for the word vremea (En: time) in the training corpus to discover further uncorrected errors that had generated the current ones. In the first of the five sub-corpora, Neculce’s “Chronicle”, we found 4 errors. These generated 10 errors in the second sub-corpus and an even higher number of errors in the others. The issue would be solved if all of these errors were corrected.

Thus, the lexical elements that should induce the annotation of the dependency relationship as temporal are extracted: an, (archaic synonyms: leat, let, vleat), zi, lună, dimineața, seara, noapte, ianuarie, februarie, martie, aprilie, mai, iunie, iulie, august, septembrie, octombrie, noiembrie, decembrie, luni, marți, miercuri, joi, vineri, sâmbătă, duminică, iarnă, vară, primăvară, toamnă, timp, anotimp, vreme, săptămână, oră, zi, veac, veci, început, sfârșit, când, cândva, oricând, atunci, acum (acu, acuș), mereu, totdeauna, apoi (păi), mâine, poimâine, ieri, alaltăieri, odată, diseară, târziu, devreme, astăzi, (azi), imediat, deocamdată, curând, pururea (En: year, day, month, morning, evening, night, January, February, March, April, May, June, July, August, September, October, November, December, Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday, winter, summer, spring, autumn, time, season, weather, week, hour, day, century, forever, beginning, end, when, sometime, anytime, then, now, always, all the times, then, tomorrow, the day after tomorrow, yesterday, the day before yesterday, once, tonight, late, early, today, immediately, for now, soon, forever).

In order for the parser to memorize them as inducers of the c.c.t. relationship, each of them must be annotated with that relationship at all occurrences with this temporal meaning in the treebank, without any error disturbing the induction process. This is the way for the parser to memorize the words related to the notion of time, following the training with correctly annotated texts.
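The re-correction procedure described above can be sketched as a simple search for tokens whose lemma belongs to the temporal lexicon but which carry the spatial label c.c.l. The Python sketch below is only an illustration under our own assumptions: tokens are given as (lemma, relation) pairs, the lexicon is a short excerpt of the list above, and the relation label is assumed to sit on the temporal word itself. It is not the procedure actually implemented by the authors.

```python
# Short excerpt of the temporal lexicon listed above (illustrative subset).
TEMPORAL_LEMMAS = {"an", "zi", "lună", "vreme", "timp", "săptămână", "oră", "veac"}


def flag_suspect_labels(tokens, lexicon=TEMPORAL_LEMMAS, suspect="c.c.l."):
    """Return the positions of tokens with a temporal lemma labelled as spatial.

    tokens: list of (lemma, relation) pairs; this data layout is an assumption.
    """
    return [i for i, (lemma, relation) in enumerate(tokens)
            if lemma in lexicon and relation == suspect]


if __name__ == "__main__":
    # Toy input; "_" stands for relations that are irrelevant here.
    toy = [("pe", "_"), ("vreme", "c.c.l."), ("vreme", "c.c.t."), ("sat", "c.c.l.")]
    print(flag_suspect_labels(toy))
```

The flagged positions are only candidates for manual inspection, since some of the listed words (e.g. vreme, also "weather") do not always have a temporal meaning.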

Another solution could be to link the words in the treebank to the information in the Romanian WordNet, as was done recently in a parsing experiment that obtained an increase in accuracy of 0.5 percent (Barbu Mititelu et al., 2016).

2. Parsing Experiments

We parsed our documents with MaltParser (Smith et al., 2018), a data-driven parser. This parser has demonstrated the ability to obtain good results for multiple languages and has been widely used in Universal Dependencies projects (Nivre et al., 2016). The first set of experiments was performed with the whole corpus. The whole corpus, except for one document, was used for training, and testing was performed on the document that was excluded from the training set, thus obtaining the data presented in Table 2.
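The train/test scheme of this first set of experiments amounts to a leave-one-document-out split. A minimal Python sketch of the split is given below; the function name and the representation of the corpus as a dictionary from document names to sentence lists are our assumptions, and the training and evaluation of the parser itself are left out.

```python
def leave_one_document_out(documents):
    """Yield (held_out_name, train_sentences, test_sentences) for each document.

    documents: dict mapping a document name to its list of sentences.
    """
    for held_out, test in documents.items():
        train = [sentence
                 for name, sentences in documents.items()
                 if name != held_out
                 for sentence in sentences]
        yield held_out, train, test


if __name__ == "__main__":
    # Toy corpus with placeholder sentence identifiers.
    corpus = {"A": ["a1", "a2"], "B": ["b1"], "C": ["c1", "c2", "c3"]}
    for name, train, test in leave_one_document_out(corpus):
        print(name, len(train), len(test))
```

Each document is used exactly once as the test set, and the parser is retrained on the remaining documents every time.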

However, the documents in our corpus are quite different, and training on one of them and testing on another results in poor accuracy. Thus, we decided to experiment with every document separately. Some of them are relatively small, too small to be used for training, so we selected the three largest documents in our corpus, namely the New Testament Gospels and Acts, and Neculce’s Chronicle.

The MaltParser offers a wide range of parameters for optimization, including nine different parsing algorithms, two different machine learning libraries (each with a number of different learners), and an expressive specification language that can be used to define arbitrarily rich feature models. In our case we are especially interested in the feature set optimization.

First of all, we used the MaltOptimizer (Ballesteros and Nivre, 2012) to detect the best algorithm and feature set for our documents. The MaltOptimizer processed the documents in three steps. The first step gathered information about the various properties of the training set. During the second step, the MaltOptimizer explored a subset of the parsing algorithms implemented in the MaltParser, based on the results of the data analysis, to detect the best one for this particular training set. The goal of the third step was to optimize the feature model given the parsing algorithm chosen. It tested potentially useful features one by one and in combination to ensure that all features in the model actually make a contribution. The results of using the MaltOptimizer are presented in Table 1.

Table 1: The best algorithms and the best Labelled Attachment Score (LAS) for the three largest documents of our corpus.

Document                          Best Algorithm    Best Performance (LAS)
New Testament Gospels             Nivreeager        83.8
New Testament Acts of Apostles    Nivrestandard     78.9
Neculce Chronicle                 Nivreeager        84.59
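For readers unfamiliar with the metric reported in Table 1, the following Python sketch shows how the labelled and unlabelled attachment scores are commonly computed: UAS is the percentage of tokens with the correct head, and LAS the percentage with both the correct head and the correct relation label. The data layout and the toy labels subj and root are our assumptions; evaluation scripts may differ in details such as the treatment of punctuation.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) in percent for two aligned lists of (head, label) pairs."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of tokens")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    total = len(gold)
    # UAS counts tokens whose head index matches.
    heads_ok = sum(1 for g, p in zip(gold, predicted) if g[0] == p[0])
    # LAS counts tokens whose head index and relation label both match.
    both_ok = sum(1 for g, p in zip(gold, predicted) if g == p)
    return 100.0 * heads_ok / total, 100.0 * both_ok / total


if __name__ == "__main__":
    gold = [(2, "subj"), (0, "root"), (2, "c.c.t."), (2, "c.c.l.")]
    pred = [(2, "subj"), (0, "root"), (2, "c.c.l."), (3, "c.c.l.")]
    uas, las = attachment_scores(gold, pred)
    print(f"UAS = {uas:.1f}, LAS = {las:.1f}")
```

In the toy example, the third token has the right head but a time/place label confusion, so it counts towards UAS but not towards LAS.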

Most of the effort when optimizing MaltParser usually goes into feature selection, that is, into tuning the feature representation that constitutes the input to the classifier. A feature model in MaltParser is defined by a feature specification file in XML. It states that the parsing algorithm uses 32 features, including: POSTAG values of the tokens neighbouring the current token; 4 FORM features representing the words around the current token; the LEMMA of the current token; 3 DEPREL features (dependency relation labels); and 8 complex features that merge two or three features, for example the morphological label of the word and its dependency relation to the left and to the right.

3. Related Work

The annotation of space and time, as a means of configuring textual worlds or communication situations, is receiving increasing attention from linguists and computer scientists. It is also the basis for time information retrieval (TIR) and geographic information retrieval (GIR). Strötgen (2010) shows how co-occurrences of spatial and temporal information determine the spatio-temporal profiles of documents.

Llorens et al. (2009) deal only with the annotation of temporal semantic roles, in accordance with the internationally accepted TimeML scheme, and evaluate a set of time-related MWEs (TIMEX3) in English and Spanish, with an accuracy of 76%, which leads the authors to consider that such expressions are likely to be identifiable in other languages as well. Three years later, Llorens et al. (2012) propose an automatic system for identifying time relationships in natural language. The experiments were made on an available English data set annotated with temporal information (TimeBank) in a 10-fold cross-validated evaluation, with an accuracy of 46%.

Lefeuvre et al. (2016) describe a syntactic rather than lexical annotation of time in a French treebank and make proposals to extend the TimeML scheme. Kolomiyets et al. (2012) annotate temporal dependency structure on a corpus of children’s narratives. The inter-annotator agreement is 0.856 on the event words, 0.822 on the links between events, and 0.700 on the ordering relation labels.

For Romanian, the English TimeBank corpus was ported into Romanian (by translation) with all its temporal annotations (Forăscu and Tufiș, 2012), yielding 4,715 sentences (65,375 tokens). A conference held the same year on semantic web data annotation focuses, among other things, on the recognition of TimeML noun events, i.e. on a scheme for processing event and temporal expressions in natural language processing (Jeong and Myaeng, 2012).

A chapter in a Springer book is also interested in a database which can manage events that are evolving with time, i.e., the information of spatial objects whose shape and position evolve with time (Xiaoping et al., 2011).

Our corpus consists of texts in Old Romanian and is not annotated with the TimeBank categories, but it could be, because all the information about moods and tenses is in X-Postag (the morphological annotation specific to our treebank).

Therefore, we did not give up the annotation of the verbal circumstantial (time) modifiers in our basic syntactic convention; we have tried to see what information we can extract from the syntactically annotated corpus we hold. In the semantic format, we managed to annotate the space and time when they are verbal or nominal determinants, but in this paper we discuss only verbal modifiers.

4. Narrative Corpus Content

Table 2 below presents the documents included in this study, annotated morphologically and syntactically, with the number of sentences and tokens, the accuracy of the automatic parser (labelled and unlabelled attachment score, i.e. LAS and UAS) and the type of the texts. It is a balanced corpus, i.e. contemporary and old texts, regional texts, and social media communication are all represented. The first word in the title, in capitals, marks these categories. For the old texts, the century is also added. We excluded from this study legal texts, Wikipedia, and lyrical poetry, both folk and religious (Psalms). For the classification of these types of texts, we need some other criteria (Mărănduc, 2005).

The information on the framing in time and space appears in narrative texts. We can study the number of such complements and the form they take when the narrative is fictional, compared with when it is mystical or a narrative of real events. These complements are also found in dialogues, where they circumscribe the communication situation.
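One possible way to quantify how precise the complements of a text are is to compute the share of temporal and spatial complements that contain a proper noun or a numeral. The Python sketch below assumes that each complement is available as a list of (form, MSD) pairs and that, as in MULTEXT-East style tags, proper nouns start with Np and numerals with M; the truncated tags in the example are illustrative, and no decision threshold is proposed here.

```python
def is_precise(complement):
    """True if the complement contains a proper noun (Np...) or a numeral (M...).

    complement: list of (form, msd) pairs; this layout is an assumption.
    """
    return any(msd.startswith("Np") or msd.startswith("M")
               for _form, msd in complement)


def precise_share(complements):
    """Return the fraction of complements that are precise (0.0 if there are none)."""
    if not complements:
        return 0.0
    return sum(1 for c in complements if is_precise(c)) / len(complements)


# Toy complements with truncated, illustrative MSD tags.
EXAMPLE = [
    [("la", "Sp"), ("Iași", "Np")],                # exact place
    [("în", "Sp"), ("an", "Nc"), ("1700", "Mc")],  # exact time
    [("pe", "Sp"), ("vreme", "Nc")],               # imprecise time
]

if __name__ == "__main__":
    print(f"{precise_share(EXAMPLE):.2f}")
```

A higher share would point towards a narrative of real events and a lower one towards fiction, in line with the distinction drawn in this paper.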
