“LINGUISTIC RESOURCES AND TOOLS FOR NATURAL LANGUAGE PROCESSING”


CLUJ-NAPOCA, 18-20 NOVEMBER 2019

Editors: Mihaela Onofrei, Anca-Diana Bibiri, Constantin Dragoș Nicolae, Dan Tufiș, Dan Cristea

Organisers

Faculty of Computer Science

“Alexandru Ioan Cuza” University of Iași

Research Institute for Artificial Intelligence “Mihai Drăgănescu”

Romanian Academy, Bucharest

Institute for Computer Science, Romanian Academy, Iași Branch

Faculty of Mathematics and Computer Science

“Babes-Bolyai” University of Cluj-Napoca

Romanian Association of Computational Linguistics


Under the auspices of the Academy of Technical Sciences

The publication of this volume was supported by the Faculty of Computer Science, “Alexandru Ioan Cuza” University of Iași

ISSN 1843-911X


Anca-Diana Bibiri, Social Sciences and Humanities Research Department, Institute for Interdisciplinary Research, “Alexandru Ioan Cuza” University of Iaşi

Corneliu Burileanu, Faculty of Electronics, Telecommunications and Information Technology, University Politehnica of Bucharest and Research Institute for Artificial Intelligence “Mihai Drăgănescu”, Romanian Academy, Bucharest

Camelia Chira, Faculty of Mathematics and Computer Science, “Babes-Bolyai” University of Cluj-Napoca

Mihaela Colhon, Faculty of Mathematics and Natural Sciences, University of Craiova

Svetlana Cojocaru, Institute of Mathematics and Computer Science, Academy of Sciences of Moldova, Chișinău

Horia Cucu, Faculty of Electronics, Telecommunications and Information Technology, University Politehnica of Bucharest

Dan Cristea, Faculty of Computer Science, “Alexandru Ioan Cuza” University of Iaşi and Institute for Computer Science, Romanian Academy, Iași Branch

Nils Diewald, Leibniz-Institut für Deutsche Sprache, Mannheim, Germany

Liviu Dinu, Faculty of Mathematics and Computer Science, University of Bucharest

Corina Forăscu, Faculty of Computer Science, “Alexandru Ioan Cuza” University of Iași

Mircea Giurgiu, Faculty of Mathematics and Computer Science, “Babes-Bolyai” University of Cluj-Napoca

Daniela Gîfu, Faculty of Computer Science, “Alexandru Ioan Cuza” University of Iaşi and Institute for Computer Science, Romanian Academy, Iași Branch

Gabriela Haja, “A. Philippide” Institute of Romanian Philology, Romanian Academy, Iași Branch

Adrian Iftene, Faculty of Computer Science, “Alexandru Ioan Cuza” University of Iași

Diana Zaiu Inkpen, School of Electrical Engineering and Computer Science, University of Ottawa, Canada

Radu Ion, Research Institute for Artificial Intelligence “Mihai Drăgănescu”, Romanian Academy, Bucharest

Elena Irimia, Research Institute for Artificial Intelligence “Mihai Drăgănescu”, Romanian Academy, Bucharest

Dana Lupșa, Faculty of Mathematics and Computer Science, “Babes-Bolyai” University of Cluj-Napoca

Alex Moruz, Faculty of Computer Science, “Alexandru Ioan Cuza” University of Iași

Vivi Năstase, Institut für Computerlinguistik, Universität Heidelberg, Germany

Constantin Dragoș Nicolae, Research Institute for Artificial Intelligence “Mihai Drăgănescu”, Romanian Academy, Bucharest

Mihaela Onofrei, Institute for Computer Science, Romanian Academy, Iași Branch and Faculty of Computer Science, “Alexandru Ioan Cuza” University of Iaşi

Ionuț Pistol, Faculty of Computer Science, “Alexandru Ioan Cuza” University of Iași

Adriana Stan, Faculty of Mathematics and Computer Science, “Babes-Bolyai” University of Cluj-Napoca

Elena Isabelle Tamba, “A. Philippide” Institute of Romanian Philology, Romanian Academy, Iași Branch

Horia-Nicolai Teodorescu, Institute for Computer Science, Romanian Academy, Iași Branch and “Gheorghe Asachi” Technical University of Iași

Diana Trandabăț, Faculty of Computer Science, “Alexandru Ioan Cuza” University of Iași

Ștefan Trăușan-Matu, Faculty of Automation, Control and Computer Engineering, University Politehnica of Bucharest and Research Institute for Artificial Intelligence “Mihai Drăgănescu”, Romanian Academy, Bucharest

Marius Zbancioc, “Gheorghe Asachi” Technical University of Iași and Institute for Computer Science, Romanian Academy, Iași Branch

ORGANIZING COMMITTEE

Anca Andreica, Faculty of Mathematics and Computer Science, “Babes-Bolyai” University of Cluj-Napoca

Anca-Diana Bibiri, Social Sciences and Humanities Research Department, Institute for Interdisciplinary Research, “Alexandru Ioan Cuza” University of Iași

Camelia Chira, Faculty of Mathematics and Computer Science, “Babes-Bolyai” University of Cluj-Napoca

Dan Cristea, Faculty of Computer Science, “Alexandru Ioan Cuza” University of Iași and Institute for Computer Science, Romanian Academy, Iași Branch

Corina Forăscu, Faculty of Computer Science, “Alexandru Ioan Cuza” University of Iași

Lucian Gâdioi, Faculty of Computer Science, “Alexandru Ioan Cuza” University of Iași

Daniela Gîfu, Faculty of Computer Science, “Alexandru Ioan Cuza” University of Iași and Institute for Computer Science, Romanian Academy, Iași Branch

Adrian Iftene, Faculty of Computer Science, “Alexandru Ioan Cuza” University of Iași

Simona Motogna, Faculty of Mathematics and Computer Science, “Babes-Bolyai” University of Cluj-Napoca

Constantin Dragoș Nicolae, Research Institute for Artificial Intelligence “Mihai Drăgănescu”, Romanian Academy, Bucharest

Mihaela Onofrei, Institute for Computer Science, Romanian Academy, Iași Branch and Faculty of Computer Science, “Alexandru Ioan Cuza” University of Iaşi

Ionuț Pistol, Faculty of Computer Science, “Alexandru Ioan Cuza” University of Iași

Horia Pop, Faculty of Mathematics and Computer Science, “Babes-Bolyai” University of Cluj-Napoca

Cristian Pădurariu, Faculty of Computer Science, “Alexandru Ioan Cuza” University of Iași and Institute for Computer Science, Romanian Academy, Iași Branch

Andrei Scutelnicu, Faculty of Computer Science, “Alexandru Ioan Cuza” University of Iași and Institute for Computer Science, Romanian Academy, Iași Branch

Diana Trandabăț, Faculty of Computer Science, “Alexandru Ioan Cuza” University of Iași

Dan Tufiș, Research Institute for Artificial Intelligence “Mihai Drăgănescu”, Romanian Academy, Bucharest

TABLE OF CONTENTS

FOREWORD

CHAPTER 1 CORPUS APPROACHES

POLISHING MONERO, THE MORPHOLOGICALLY AND MEDICAL NAMED ENTITIES ANNOTATED CORPUS OF ROMANIAN

Ioana Marinescu, Verginica Barbu Mititelu, Maria Mitrofan

THE COBILIRO PROJECT: BUILDING AND DISTRIBUTING A BIMODAL CORPUS FOR ROMANIAN LANGUAGE

Dan Cristea, Cristian Pădurariu, Șerban Boghiu, Daniela Gîfu, Mihaela Onofrei, Diana Trandabăț, Ionuț Cristian Pistol, Anca-Diana Bibiri, Andrei Scutelnicu

SPANISH LEARNER CORPUS CAES: A RESEARCH ON HOW RUSSIAN LEARNERS USE IMPERFECT TENSE IN SPANISH

Maria Adelaida Gil Martinez

AUTOMATIC IDENTIFICATION AND CLASSIFICATION OF LEGAL TERMS IN ROMANIAN LAW TEXT

Andrei Coman, Maria Mitrofan, Dan Tufiș

LOOKING ALONG ZIPF’S LAW FOR THE DISTRIBUTION OF WORDS BEGINNING AND ENDING SENTENCES IN LITERARY PRINTED ROMANIAN CORPORA

Bogdan Hanu, Adriana Vlad, Alexandru Dinu, Adrian Mitrea

CHAPTER 2 NEURAL APPROACHES

ROMANIAN AUTOMATIC DIACRITICS RESTAURATION CHALLENGE

Florin Iordache, Lucian Georgescu, Dan Oneață, Horia Cucu

TRIPOD: LEARNING LATENT REPRESENTATIONS FOR SEQUENCES

Tiberiu Boroș, Andrei Cotaie, Alexandru Meterez, Paul Ilioaica

CHAPTER 3 SYMBOLICAL APPROACHES VS. STATISTICS

SOME LOGICAL AND COMPUTATIONAL METHODS FOR THE ANALYSIS OF THE INEFFABLE

Cătălina Mărănduc, Anca-Diana Bibiri

REVISITING THE STATISTICAL INDEPENDENCE FOR THE PRINTED ROMANIAN LANGUAGE

Alexandru Dinu, Adriana Vlad, Bogdan Hanu, Adrian Mitrea

CHAPTER 4 DIACHRONIC STUDIES

SEMI-AUTOMATIC WORD LEVEL ALIGNMENT FOR RHOTACISING PSALTERS IN ROMANIAN

Ana-Maria Gînsac, Maria Moruz, Mihai Alex Moruz, Mădălina Ungureanu

SOLUTIONS FOR SCANNED DOCUMENTS SEGMENTATION AND LETTER RECOGNITION

Constantin Cristian Pădurariu, Dan Cristea

MORPHO-SYNTACTIC REGULARITIES IN UD_ROMANIAN NONSTANDARD PARSING

Cătălina Mărănduc, Victoria Bobicev, Roman Untilov

CHAPTER 5 MULTIWORD EXPRESSIONS AND FRAME SEMANTICS

WRAPPING OUR HEADS AROUND VMWES AND THEIR DERIVATIVES

Ivelina Stoyanova, Svetlozara Leseva, Verginica Barbu-Mititelu, Maria Todorova, Mihaela Cristescu

FRAME SPECIALISATION MOTIVATED BY INTER-FRAME RELATIONS IN FRAMENET

Svetlozara Leseva, Ivelina Stoyanova, Maria Todorova, Hristina Kukova

CHAPTER 6 NLP TOOLS AND LANGUAGE-BASED APPLICATIONS

INTEGRATION OF ROMANIAN NLP TOOLS INTO THE RELATE PLATFORM

Vasile Păiș, Dan Tufiș, Radu Ion

READ-ME – MORPHO-SYNTACTIC ANALYSES FOR ROMANIAN LANGUAGE

Anda-Mădălina Florea, Irina Toma, Robert Botarleanu, Mihai Dascălu, Ștefan Trăușan-Matu

QUERY EXPANSION – AUTOMATIC GENERATION OF SEMANTIC SIMILAR PHRASES USING WORDNET

Diana Lucaci, Adrian Iftene

INDEX OF AUTHORS


Today’s communication is increasingly mediated by information technology and, by analogy with the established term international communication language, one could speak of e-communication languages, with direct reference to the languages used in cyberspace. This concept, beyond its direct significance, has profound cultural, social, economic and ethical implications, asserting the right of any citizen to access the knowledge and services of cyberspace in her/his own language. It is therefore not surprising that a huge volume of research has been carried out in the area of human language engineering, consistently supported through international and national research programs, which have produced, in the last 20 years or so, practical results that paved the way to the multilinguality industry. The unprecedented increase in computer performance and data storage capacity, together with the success of neural networks and deep-learning approaches, has generated in the last decade not only a change of research paradigm but also more consistent and convincing results than ever.

Romanian is one of the emerging languages in the virtual space and, as such, it is present in international research programs through the joint efforts of Romanian experts and their colleagues from important institutions around the world. Anticipating this trend, the Section for Information Science and Technology of the Romanian Academy initiated in 2001 the Consortium for Informatization of the Romanian Language, which afterwards organised in Iași the first workshop in what would become a series of events named ConsILR (the acronym of Consorțiul pentru Informatizarea Limbii Române). They were meant to bring together linguists and computational linguists, researchers in the humanities, and PhD and master's students in Computational Linguistics, all with a major interest in the study of the Romanian language from a computational perspective. The events have run, with few exceptions, every year, and since 2010 the series has been an annual conference with an international scope. To reflect this opening towards general methods which, although initially addressing one language, could easily be generalised to cope with other languages and multilingual issues, the ConsILR conference is now called Linguistic Resources and Tools for Natural Language Processing.

The three traditional organisers of the conference, the Faculty of Computer Science of the “Alexandru Ioan Cuza” University of Iași, the Research Institute for Artificial Intelligence “Mihai Drăgănescu” of the Romanian Academy, and the Institute for Computer Science of the Iași Branch of the Romanian Academy, were joined this year by the “Alexandru Philippide” Institute of Romanian Philology of the Iași Branch of the Romanian Academy, the Faculty of Mathematics and Computer Science of the “Babeș-Bolyai” University of Cluj-Napoca, as the local organiser, and the Romanian Association of Computational Linguistics. We, the organisers, thank again the Academy of Technical Sciences, which agreed to place our conference under its high auspices.


universities and including one of the most vibrant IT communities in Romania. The 17 contributions, quite heterogeneous, were organized in 6 chapters: Corpus Approaches, Neural Approaches, Symbolic Approaches vs. Statistics, Diachronic Studies, Multiword Expressions and Frame Semantics, and NLP Tools and Language-Based Applications.

The current selection of papers is inherently incomplete but the multitude of topics approached reflects the current wide coverage of the domain.

To conclude, this volume offers a valuable collection of contributions presenting research and development results. Several issues discussed throughout the book are present-day concerns of the scientific community irrespective of the investigated language. Therefore, we believe that this volume will be a very useful instrument for NLP researchers and for students preparing their MSc dissertations or PhD theses in the area of computational studies of natural languages.

ACKNOWLEDGMENTS

Part of the work published in this volume was supported by a grant of the Romanian Ministry of Research and Innovation, CCCDI – UEFISCDI, project number PN-III-P1- 1.2-PCCDI-2017-0818 / 73PCCDI (ReTeRom), within PNCDI III.

November 2019 The editors


CORPUS APPROACHES

POLISHING MONERO, THE MORPHOLOGICALLY AND MEDICAL NAMED ENTITIES ANNOTATED CORPUS OF ROMANIAN

IOANA MARINESCU¹, VERGINICA BARBU MITITELU², MARIA MITROFAN²

¹ Mircea cel Bătrân National College, Râmnicu Vâlcea

² Research Institute for Artificial Intelligence “Mihai Drăgănescu”, Romanian Academy

[email protected]

{vergi, maria}@racai.ro

Abstract

MoNERo is a Romanian specialized corpus, reflecting the medical language from three fields (cardiology, endocrinology and diabetes). It contains 4,987 sentences and 154,825 tokens with an unbalanced distribution among the fields. The corpus is one of the few gold standard ones for Romanian. It was automatically processed (tokenized, part-of-speech tagged and lemmatized) and then manually checked. It is also annotated with medical terms, using the IOB2 format, and contains labels belonging to four entity semantic groups: anatomy (ANAT), chemicals and drugs (CHEM), disorders (DISO) and medical procedures (PROC). The annotation with medical entities was manually done, but only a small part of the corpus was doubly annotated.

Given the inherent rate of errors characterizing manual work, even that of specialists, we present here: (i) the steps taken for the automatic identification of the errors left in the corpus; (ii) the types of errors found; and (iii) the way in which they were corrected. MoNERo has been publicly released for the community and can be used as a reliable resource for training NER and other NLP systems.

Key words — medical corpus, part of speech tagging, medical terms, automatic identification of errors, gold standard.

1. Introduction

Language resources in the form of written corpora are a great asset, useful for language specialists and language engineers alike: they mirror the linguistic (lexical, morphological, syntactic, discourse) phenomena characterizing a language at the time when the texts they contain were written. When the corpus is specialized, i.e. it contains texts from a certain domain of activity, it is also interesting for terminologists (who are concerned with the correctness of term forms with respect to the characteristics of the language and with standardizing the terms specific to the domain) and for the specialists of that domain, who need to mine the text for specific information.

In this paper we present the characteristics of the first Romanian medical corpus, MoNERo, linguistically annotated with lexical and morphological information and with medical information in the form of several types of terms; the main interest here lies in applying an automatic procedure for the identification of the errors left after manual correction of the annotations. After a brief general presentation of the corpus (section 2), we present related work in the field of automatic identification of annotation errors in corpora (section 3), then we describe our methodology for spotting such errors in MoNERo (section 4). The types of errors found are discussed in section 5 and conclusions are drawn in section 6.

2. MoNERo – general presentation of the corpus

Romanian benefits from the existence of a contemporary reference corpus, CoRoLa (Barbu Mititelu et al., 2018), designed to ensure a large coverage of language styles and of as many domains and subdomains of activity as possible. As a consequence, the medical scientific domain is represented among many others.

Mitrofan and Tufiș (2018) reported on BioRo, made up of 9,864,707 tokens from texts belonging to the medical domain. They are unequally distributed into 9 subdomains: oncology, cardiology, diabetes, surgery, endocrinology, neurology, nephrology, psychiatry, and alternative medicine. The texts come from several types of sources: scientific books, medical journals, medical blogs and medical school lecture notes. The corpus is automatically split into sentences, tokenized, part-of-speech tagged and lemmatized with TTL (Ion, 2007).

A part of BioRo that raised no copyright issues has been turned into a freestanding corpus, MoNERo (Mitrofan et al., 2019). The corpus was extracted from sources that guarantee users unlimited access to the site and give them the right to modify or to reproduce the texts in whole or in part. Since it is difficult to obtain texts with copyright-free status, MoNERo contains texts from only three out of the nine medical subdomains in BioRo: cardiology, diabetes, and endocrinology, distributed as in Table 1 below, where one can notice their unbalanced representation.

Table 1: Subdomain distribution of the texts in MoNERo

Domain           Number of tokens
Cardiology                 63,043
Diabetes                   69,085
Endocrinology              22,697
TOTAL                     154,825

Although not large as a corpus, MoNERo is large enough to be used as a resource for training systems dealing with named entity recognition (NER) in the medical domain (Mitrofan et al., 2019). Some quantitative data about it are provided in Table 2:

Table 2: Quantitative data about MoNERo

Measure                                       Value
Sentences                                     4,987
Tokens                                      154,825
Words                                       134,084
Average sentence length (words/sentence)       26.9
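As a quick cross-check of the figures in Tables 1 and 2, the following minimal Python sketch recomputes the token total, the share of each subdomain and the average sentence length. The variable names are ours, introduced only for illustration; the numbers are those reported in the tables.

```python
# Token counts per subdomain, as reported in Table 1.
tokens_by_domain = {
    "Cardiology": 63_043,
    "Diabetes": 69_085,
    "Endocrinology": 22_697,
}

# Totals reported in Table 2.
sentences = 4_987
tokens = 154_825
words = 134_084

# The subdomain counts should add up to the reported token total.
total_tokens = sum(tokens_by_domain.values())

# Percentage of the corpus covered by each subdomain (one decimal).
shares = {name: round(100 * n / total_tokens, 1)
          for name, n in tokens_by_domain.items()}

# Average sentence length in words, rounded as in Table 2.
avg_sentence_length = round(words / sentences, 1)

print(total_tokens)
print(shares)
print(avg_sentence_length)
```

The shares make the unbalanced representation explicit: endocrinology accounts for well under a fifth of the tokens.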

MoNERo is comparable in size with other annotated medical corpora available within the scientific community: CLEF corpus (Roberts et al., 2009), i2b2 corpus (Uzuner et al., 2010), NCBI corpus (Doğan et al., 2014), Quaero corpus (Névéol et al., 2014), CHEMDNER corpus (Krallinger et al., 2015), IxaMedGS corpus (Oronoz et al., 2015), DrugSemantics corpus (Moreno et al., 2017), etc.

Being part of BioRo, MoNERo is sentence split, tokenized, part-of-speech (POS) tagged and lemmatized. What distinguishes it from BioRo is the fact that all these levels of processing and annotation have been manually checked: sentence limits, tokens, lemmas and the morphosyntactic descriptions have been corrected when necessary. Another distinguishing aspect of MoNERo is its manual annotation with medical named entities.

These are medical terms of varying length, i.e. made up of a variable number of words, either continuous or discontinuous in their occurrences in the corpus. For example, in the sentence in Ex. 1 the entity peretelui aortic is continuous, of length 2, the entities vasului and lumenul are of length 1, while anevrismele fusiforme, anevrismele sacciforme, pseudoanevrismele sacciforme are discontinuous ones, of length 2.

Example 1:

Din punct de vedere morfologic, anevrismele adevărate pot fi fusiforme (dilatație simetrică, aspect cilindric al vasului) sau sacciforme (afectează doar o porțiune a peretelui aortic și comunică printr-un colet cu lumenul), iar pseudoanevrismele, doar sacciforme.

‘From point of view morphologic, aneurysms-the real can be fusiform (dilatation symmetric, aspect cylindrical of vessel-the) or saccular (affects only a portion of wall aortic and communicates through-a colet with lumen-the), and pseudoaneurysms-the, only saccular.’

“From a morphological point of view, real aneurysms can be fusiform (symmetric dilatation, cylindrical aspect of the vessel) or saccular (affecting only a portion of the aortic wall and communicating with the lumen through a colet), while pseudoaneurysms can be only saccular.”

The manual annotation with named entities (NEs) followed guidelines that were drafted beforehand and adjusted whenever new cases were encountered. It was carried out by two annotators, who discussed the problematic cases together during the annotation process (Mitrofan et al., 2019).

In order to ensure a high-quality annotation of this gold corpus, besides the manual correction of all annotations, we also used some automatic strategies for spotting inconsistencies left in the annotation after the manual correction, as described in section 4 below.

3. Related work

The focus of automatic POS error detection is identifying errors in POS tags assigned by human annotators. Variation in POS tags in the corpus can be caused either by ambiguous word forms which, depending on the context, can belong to different word classes, or by incorrect judgments made by the annotators (Eskin, 2000; van Halteren, 2000; Kveton and Oliva, 2002; Dickinson and Meurers, 2003; Loftsson, 2009). Such incorrect judgments result from several factors, among which we mention: a complex annotation scheme; rare occurrences of words, even hapax legomena, which are difficult to analyze, especially if the words belong to a specialized vocabulary; and factors affecting the annotators’ work, e.g. tiredness, tight deadlines, etc.

The variation n-gram algorithm (Dickinson and Meurers, 2003) allows users to identify potentially incorrect tagger predictions by looking at the variation in the assignment of POS tags to a particular word n-gram. The algorithm produces a list of varying tagger decisions which have to be processed by a human annotator.

Results obtained from work with another focus could also be combined with our consistency checking approach. Deriving and searching for bigrams of tags which should never be allowed (Kveton and Oliva, 2002) shows that inconsistencies are mostly bigrams. Sparse Markov transducers used to detect anomalies, i.e., rare local tag patterns (Eskin, 2000) show that inconsistencies are mostly recurrent, not rare. Parsing failures are used to detect ill-formed annotation serving as parser input (Hirakawa et al., 2000; Muller and Ule, 2002). Searching and correcting with hand-written rules is also proposed (Oliva, 2001; Blaheta, 2002).

4. Work description

The solution for detecting errors in the part-of-speech annotation is based on the variation n-gram method. It was implemented in C++, using the standard library components vector, string, and map. The program was developed in three parts, by coding the functions for reading (section 4.1), generating the n-grams (section 4.2), and identifying the errors (section 4.3).

4.1. Reading

The reading function stores each word from all lines in a vector. Then for every word, it stores a list of all occurrences of the word in the corpus using the number of the line (e.g. vascular: 28567, 60245, 123160).
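The reading step can be illustrated with a minimal Python sketch (the implementation reported here was written in C++; this is not that code). We assume, purely for illustration, a one-token-per-line file with tab-separated fields in which the word form comes first and the POS tag last; the function name read_corpus, the sample lines and the noun tag in them are hypothetical.

```python
from collections import defaultdict


def read_corpus(lines):
    """Read a one-token-per-line corpus.

    Assumed format (hypothetical): tab-separated fields, word form first,
    POS tag last. Returns the word list, the tag list and, for every word,
    the 1-based numbers of the lines on which it occurs.
    """
    words, tags = [], []
    occurrences = defaultdict(list)
    for line_no, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        word, tag = fields[0], fields[-1]
        words.append(word)
        tags.append(tag)
        occurrences[word].append(line_no)
    return words, tags, occurrences


# Made-up sample lines: word, lemma, tag.
sample = [
    "boli\tboală\tNcfp-n",
    "pulmonare\tpulmonar\tAfpfp-n",
    "cronice\tcronic\tAfpfprn",
    "boli\tboală\tNcfp-n",
]
words, tags, occurrences = read_corpus(sample)
print(occurrences["boli"])
```

Keeping the words in a list and the occurrences in a dictionary mirrors the vector and map containers mentioned above.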

4.2. Generating N-grams

For each word, each occurrence is analyzed separately. For each occurrence, there are N possibilities for the position of the word in a context of N words, from first to last. Thus, each group of N words acts like a unit (gram). For each gram, a list of its occurrences in the corpus is stored. The list contains the indexes of the lines where the word analyzed in the context of the current gram occurs. Having stored information about the n-grams, the POS tags are then analyzed as in Section 4.3.

While any relatively small value of N (N <= 100) would have preserved the linear complexity of the algorithm, the results presented were obtained using N = 3. For example, when analyzing the word perioadă, the context pe o perioadă (“over a period”) occurs eight times and the context perioadă de urmărire (“period of observation”) occurs three times.
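A minimal Python sketch of this generation step is given below, under the assumption that the corpus is available as a plain list of words (one per line). The function name ngram_contexts, the choice of keying each context by the n-gram together with the position of the analyzed word inside it, and the sample word sequence are ours, for illustration only.

```python
from collections import defaultdict


def ngram_contexts(words, n=3):
    """Map (n-gram, offset) to the 1-based lines of the analyzed word.

    The analyzed word is gram[offset]; every occurrence of a word that is
    not too close to the corpus edges is thus seen in n different n-grams.
    """
    contexts = defaultdict(list)
    for start in range(len(words) - n + 1):
        gram = tuple(words[start:start + n])
        for offset in range(n):
            contexts[(gram, offset)].append(start + offset + 1)
    return contexts


# Made-up word sequence containing the word "perioadă" twice.
words = "pe o perioadă de urmărire pe o perioadă lungă".split()
contexts = ngram_contexts(words)

# All trigram contexts in which "perioadă" is the analyzed word.
focus = [(" ".join(gram), lines)
         for (gram, offset), lines in contexts.items()
         if gram[offset] == "perioadă"]
for gram, lines in focus:
    print(gram, lines)
```

Since every line contributes to at most N n-grams, the amount of work grows linearly with the corpus size for a fixed N, which is consistent with the complexity remark above.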

4.3. Identifying errors of POS tags

Considering each N-gram separately, for every occurrence of it, the annotation of the word under analysis is checked and, for each different annotation, the tag and its position are stored. Supposing that the most frequent tag is the right one, all the rest are errors and must be replaced. For example, the adjective cronice (“chronic”) in the context boli pulmonare cronice (“chronic lung disease”) occurs on lines 40851, 50127 and 62431, with Afpfprn as the most frequent tag assigned to it, while Afpfp-n is the tag for the occurrence on line 62431 (in the file this is represented as: cronice: boli pulmonare cronice – 40851, 50127, 62431 Afpfprn Afpfp-n 62431). The tag Afpfprn is to be read as adjective (A) of the qualificative type (f), positive degree (p), feminine gender (f), plural number (p), direct case (r), non-definite (n), while the tag Afpfp-n expresses the same information, except that the case position is left underspecified (-) because the same form can be used for all cases of the plural number, feminine gender and non-definite use.
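The check described above can be sketched in Python as follows. This is an illustrative reimplementation under our own assumptions, not the C++ program itself: the corpus is given as two parallel lists (words and tags), the function name find_variations is hypothetical, and the punctuation tags PERIOD and COMMA as well as the noun tag in the sample are placeholders.

```python
from collections import Counter, defaultdict


def find_variations(words, tags, n=3):
    """Report words that receive more than one tag in the same n-gram.

    Returns tuples (word, n-gram, majority tag, suspects), where suspects
    lists the (line, tag) pairs that deviate from the majority tag. Ties
    are broken by first occurrence, so the output is a list of candidates
    for inspection rather than a list of certain errors.
    """
    by_context = defaultdict(list)  # (gram, offset) -> [(line, tag), ...]
    for start in range(len(words) - n + 1):
        gram = tuple(words[start:start + n])
        for offset in range(n):
            pos = start + offset
            by_context[(gram, offset)].append((pos + 1, tags[pos]))

    report = []
    for (gram, offset), hits in by_context.items():
        counts = Counter(tag for _, tag in hits)
        if len(counts) < 2:
            continue  # the word is tagged consistently in this context
        majority, _ = counts.most_common(1)[0]
        suspects = [(line, tag) for line, tag in hits if tag != majority]
        report.append((gram[offset], " ".join(gram), majority, suspects))
    return report


words = ["boli", "pulmonare", "cronice", ".",
         "boli", "pulmonare", "cronice", ",",
         "boli", "pulmonare", "cronice"]
tags = ["Ncfp-n", "Afpfp-n", "Afpfprn", "PERIOD",
        "Ncfp-n", "Afpfp-n", "Afpfprn", "COMMA",
        "Ncfp-n", "Afpfp-n", "Afpfp-n"]
report = find_variations(words, tags)
for entry in report:
    print(entry)
```

On this toy input only the third occurrence of cronice is reported, with Afpfprn as the majority tag, which parallels the boli pulmonare cronice example given in the text.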

5. Results

5.1. Improved morphological annotation

The TTL tool is reported to achieve good annotation results on news texts: 98.23% accuracy. When run on medical data, the accuracy decreased, as expected, though not by much: it was 97.83% (Mitrofan and Ion, 2017). This implies a low workload of manual annotation corrections. However, as no pointers to the wrong tags or lemmas are provided, the human annotator had to go through all tokens, one by one, and make sure they were correctly tokenized, POS-tagged and lemmatized. This is similar to finding a needle in a haystack. The complex tagset used for Romanian adds to the difficulty of the task: it contains 714 complex tags.

The types of errors that were corrected are:

- Errors of tokenization:

  - most of the time they are connected to the different uses of the hyphen: between numbers (years: 1960-1990, periods: 15.03.1997-15.09.1997, etc.) or between words (especially clitics: s-a ‘Refl.Cl.3.sg.Acc.-has’). The automatic tokenization of such cases was not consistent: 1960-1990, for example, is sometimes considered a single token and at other times split into three tokens;

  - sometimes there are spaces inside a word, so that it is interpreted as two tokens (e.g., fi cat instead of ficat “liver”); these are typos, but they require deleting the blank and recovering the word;

  - inconsistent treatment of several symbols (%, +, etc.): sometimes they are treated as a separate token, while at other times they make up one token together with the previous token: for example, 50% is sometimes analyzed as a single token and at other times split into 50 and %;

- Errors of lemmatization: their causes are multiple:

  - words unknown to the annotation tool: given the fact that MoNERo is a specialized corpus, the number of words specific to the medical domain and even to the three subdomains represented in the corpus is quite high: stenoză (“stenosis”), aterom (“atheroma”), etc. Moreover, some of the words are foreign (mostly English): biomarker, strain, etc. Another linguistic phenomenon encountered is the calque: cost-eficientă (“cost-effective”) is a strange compound in Romanian, copying the inner structure of its English equivalent;

  - morphologically ambiguous words: one of the frequent wrong lemmatizations is that of the word copii, ambiguous between the indefinite plural of the masculine noun copil (“child”) and the indefinite plural of the feminine noun copie (“copy”); its most frequent occurrence is with the first value, but the latter also occurs and the TTL tool is not able to correctly distinguish between them;

- Tagging errors: they affect either the part of speech or the morphological classes:

  - POS errors: they occur in the case of homonyms of different parts of speech: nouns - adjectives (prezent “present”), nouns - adverbs (seara “the evening”), adjectives - adverbs (e.g.: in the string Un studiu recent a arătat că… “A recent study has shown that…” the word recent is annotated as an adverb, although it is an adjective), verbs in the participle - participial adjectives (e.g.: in the context ... vor trebui upgradate ulterior sau chiar înlocuite “... will have to be eventually updated or even replaced” the words upgradate and înlocuite are annotated as adjectives, but they are verbs);

  - morphological classes are misinterpreted as a result of the homonymy in the inflectional paradigm of the respective word: folositoare (“useful”) is an adjective either in the oblique (dative or genitive) case of the singular number or in any of the cases of the plural number; many morphological categories are affected: number, person, case, gender, verbal mood, tense, etc.
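The tokenization inconsistencies listed above lend themselves to a simple automatic check. The Python sketch below is only one possible way of finding candidates and is not the procedure used for MoNERo: it looks for runs of adjacent tokens whose concatenation also occurs in the corpus as a single token. The function name inconsistent_splits, the max_parts parameter and the sample token list are hypothetical.

```python
def inconsistent_splits(tokens, max_parts=3):
    """Find tokens that also occur split into 2..max_parts adjacent tokens.

    Returns a dictionary mapping the single-token form to the set of split
    variants found. The result is a list of candidates for a human to
    review: some concatenations may match an existing token by accident.
    """
    vocabulary = set(tokens)
    found = {}
    for size in range(2, max_parts + 1):
        for start in range(len(tokens) - size + 1):
            parts = tuple(tokens[start:start + size])
            joined = "".join(parts)
            if joined in vocabulary:
                found.setdefault(joined, set()).add(parts)
    return found


# Made-up token sequence mixing both treatments of "50%" and "1960-1990".
tokens = ["50%", "din", "pacienți", ",", "50", "%", "în",
          "1960-1990", "și", "1960", "-", "1990"]
found = inconsistent_splits(tokens)
print(found)
```

The same check would also flag a word written with an internal space, such as fi cat, whenever the correct form ficat occurs elsewhere in the corpus.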

Running the script presented above to check the consistency of the morphological annotation, we found 565 tokens whose annotation is suspected to be inconsistent. An automatic replacement of the less frequent annotation of a token in a certain context with the most frequent one cannot be made. Consider the trigram care este apoi (“which is then”): the competing POS tags are Vmip3s (Verb main indicative present 3rd person singular) and Vaip3s (Verb auxiliary indicative present 3rd person singular), which means that the token is ambiguous between a main and an auxiliary verb, but the context is not large enough for deciding upon the correct value. Thus, a human annotator must check each and every occurrence and decide for each case separately. Such ambiguous examples are not rare in the corpus.
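When two competing tags are presented to the annotator, it can help to show which attribute they disagree on. The Python sketch below does this for the two tag types spelled out in this paper; the attribute names are taken from the glosses of Afpfprn and Vmip3s given in the text, while the ATTRIBUTES table, the function name tag_differences and the padding of shorter tags with "-" are our own assumptions (the full tagset covers many more categories).

```python
# Attribute names by position, only for the two parts of speech whose
# tags are glossed in the text (adjectives and verbs).
ATTRIBUTES = {
    "A": ["type", "degree", "gender", "number", "case", "definiteness"],
    "V": ["type", "mood", "tense", "person", "number"],
}


def tag_differences(tag_a, tag_b):
    """List the (attribute, value in tag_a, value in tag_b) mismatches."""
    if tag_a[0] != tag_b[0]:
        return [("part of speech", tag_a[0], tag_b[0])]
    names = ATTRIBUTES.get(tag_a[0], [])
    values_a, values_b = tag_a[1:], tag_b[1:]
    diffs = []
    for i in range(max(len(values_a), len(values_b))):
        # Assumption: a missing trailing position means "unspecified".
        a = values_a[i] if i < len(values_a) else "-"
        b = values_b[i] if i < len(values_b) else "-"
        if a != b:
            name = names[i] if i < len(names) else f"position {i + 2}"
            diffs.append((name, a, b))
    return diffs


print(tag_differences("Vmip3s", "Vaip3s"))
print(tag_differences("Afpfprn", "Afpfp-n"))
```

For the trigram care este apoi the only difference is the verb type (main versus auxiliary), which is exactly the decision that the trigram context cannot settle.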

After the manual intervention following this automatic identification of possibly inconsistent annotations, we counted 5279 lines that were modified: 923 lemmas and 4530 POS tags were corrected. There were 174 cases in which both the lemma and the POS tag were corrected. The interventions on the POS tags meant changing the part of speech in 243 cases and/or changing the values of their attributes in 4287 cases.
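The counts above are mutually consistent, as the following short Python check shows. The variable names are ours; the figures are those reported in the text.

```python
lemmas_corrected = 923
pos_tags_corrected = 4530
both_corrected = 174

# Lines on which both the lemma and the tag changed are counted in both
# totals, so they are subtracted once (inclusion-exclusion).
lines_modified = lemmas_corrected + pos_tags_corrected - both_corrected

# Breakdown of the POS tag corrections.
part_of_speech_changed = 243
attribute_values_changed = 4287

print(lines_modified)
print(part_of_speech_changed + attribute_values_changed)
```

The first result equals the 5279 modified lines and the second equals the 4530 corrected POS tags, so each tag correction falls into exactly one of the two groups of the breakdown.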

The most frequent change of the part of speech is the result of the homonymy between participles and participial adjectives:

• 88 adjectives (A) were changed into participles (V). For example, when the trigram asociată sau nu (“associated or not”) occurs in a syntactic position in which asociată is a modifier of a noun, it should be annotated as a participle (V): there is an ellipsis of the auxiliary verb, so the underlying structure would be care este asociată sau nu este asociată (“that is associated or is not associated”);

• 28 participles (V) were changed into adjectives (A). For example, in the context în stadii avansate (“in advanced stages”) the word avansate is definitely an adjective, as it modifies the noun.

The homonymy between adjectives and nouns also affects 31 tokens, which had to have their part of speech changed from noun (N) into adjective (A): in the string limita superioară a normalului (“the upper limit of normal”) the word normalului (“normal”) is a noun, not an adjective (which would be the most frequent case of this word in the language).

The word a was found 17 times wrongly annotated as a(n auxiliary) verb (V) instead of an (infinitive) particle (Q).

Many abbreviations were annotated as nouns by TTL, although there is a specialized tag for them, namely Y. The initial manual correction of the corpus involved a large number of corrections of such cases. However, after checking the consistency we found 13 abbreviations containing also numbers that we decided to treat as residual (thus, annotating them with the tag X): examples: Ca2, T1DM (type 1 diabetes mellitus), etc.

The other types of modifications involving the part of speech are not numerous, each type having at most 5 occurrences. A few examples are: particles (Q) that should have been annotated as prepositions (S) (4 cases), nouns (N) that should have been annotated as adverbs (R) (3 cases), etc.

5.2. Improved NE annotation

The manual annotation with NEs was done by two annotators, one a physician, the other one a computer scientist, using the IOB2 format (Sang and Veenstra, 1999). The labels attached to the NEs are: ANAT (for anatomy-related entities), CHEM (for chemicals and drugs), DISO (for different types of disorders), and PROC (for all medical procedures).

The annotators’ different backgrounds, the linguistic characteristics of the medical texts (with a high frequency of abbreviations), and the lack of terminological resources in Romanian (which could have helped the annotators understand the meaning of various terms) are the main explanations for the errors in the NE annotation of MoNERo. As the task is so complex, discussions were allowed between the annotators during their work. This explains the high agreement between them on a small part of the corpus (1,628 tokens) that was doubly annotated at the end of the task, that is, when both had gathered experience in this work.

The types of errors found are:

•different lengths of the same NE: this happens especially when coordinating conjunctions occur between two NEs: ocluzia/B-DISO arterelor/I-DISO mici/I-DISO și/I-DISO mijlocii/I-DISO (“occlusion of small and medium sized arteries”) is interpreted by the physician as only one NE, while the computer scientist analyzed it as two NEs connected by the conjunction și;

•different labels for the same NE: this is mainly the result of the wrong interpretation of abbreviations: e.g. ACE can be either CHEM or PROC, depending on the context of occurrence and telling them apart is not always trivial for the computer scientist.
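The two error types above can be detected mechanically by turning each annotator's IOB2 tags into spans and comparing them. The following is a minimal sketch, not the procedure used by the authors; the function names are hypothetical, and the second tag sequence is only an assumed rendering of the computer scientist's analysis, since the source gives the physician's tags alone.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (first token index, end index exclusive, label)

def iob2_spans(tags: List[str]) -> List[Span]:
    """Collect (start, end, label) spans from a sequence of IOB2 tags."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel closes a trailing span
        inside = tag.startswith("I-") and label == tag[2:]
        if start is not None and not inside:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans

def disagreements(tags_a: List[str], tags_b: List[str]):
    """Return the spans found only by annotator A and only by annotator B."""
    a, b = set(iob2_spans(tags_a)), set(iob2_spans(tags_b))
    return sorted(a - b), sorted(b - a)

# ocluzia arterelor mici și mijlocii
physician = ["B-DISO", "I-DISO", "I-DISO", "I-DISO", "I-DISO"]
# Assumed tagging for the two-NE reading (not given in the source).
computer_scientist = ["B-DISO", "I-DISO", "I-DISO", "O", "B-DISO"]

only_a, only_b = disagreements(physician, computer_scientist)
```

Spans that differ only in their boundaries correspond to the first error type, while spans with identical boundaries but different labels correspond to the second.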

The errors found by extracting all NEs from the corpus and comparing their annotations are only a handful, which again shows the high quality of this manual annotation. Table 3 shows the number of errors found in each NE category. There was a total of 131 annotation errors. Most of them, almost 40%, were confusions between disorders and other categories (50), followed by chemicals tagged wrongly (30).

Table 3: Corrections of NEs (rows: initial annotations; columns: corrected annotations)

Initial   ANAT  CHEM  DISO  PROC    O  TOTAL
ANAT         0     7    11     2    1     21
CHEM        15     0    12     0    3     30
DISO         7     5     0    36    2     50
PROC         0     4    11     0    2     17
O            0     4     5     4    0     13
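The totals and percentages quoted for Table 3 can be re-derived from the cell values. The short sketch below does so, assuming that rows hold the initial labels and columns the corrected ones; the variable names are illustrative only.

```python
LABELS = ["ANAT", "CHEM", "DISO", "PROC", "O"]

# Cell values copied from Table 3; each list follows the column order in LABELS.
corrections = {
    "ANAT": [0, 7, 11, 2, 1],
    "CHEM": [15, 0, 12, 0, 3],
    "DISO": [7, 5, 0, 36, 2],
    "PROC": [0, 4, 11, 0, 2],
    "O": [0, 4, 5, 4, 0],
}

# Number of corrected tokens per initial label (the TOTAL column).
row_totals = {label: sum(row) for label, row in corrections.items()}
grand_total = sum(row_totals.values())

# Share of all corrections accounted for by each initial label, in percent.
share = {label: round(100 * n / grand_total, 1) for label, n in row_totals.items()}

# Number of tokens that ended up with each corrected label.
column_totals = {
    label: sum(row[i] for row in corrections.values())
    for i, label in enumerate(LABELS)
}
```

Under this reading, `share["DISO"]` is the figure behind the "almost 40%" statement, and the row totals add up to the 131 errors reported in the text.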

6. Conclusions

MoNERo is a valuable lexical resource for Romanian, reflecting the medical domain, with three subdomains: cardiology, diabetes, and endocrinology. Its size and the quality of its annotation with morphological information and NE types are important assets. It is freely available to the community¹ for those interested in using it for different types of analysis (of the linguistic or terminological characteristics of such texts, etc.), or for training, improving and testing systems for automatic medical NE recognition and classification.

A new version of MoNERo, with a new level of annotation, namely syntactic, will be made available soon, as part of the Romanian contribution to the Universal Dependencies initiative².

1 http://www.racai.ro/tools/text/

2 universaldependencies.org

Acknowledgements

This work was supported in part by a grant of the Romanian Ministry of Research and Innovation, PCCDI - UEFISCDI, project number PN-III-P1-1.2-PCCDI-2017-0818/72, within PNCDI III.

References

Barbu Mititelu, V., Tufiș, D., Irimia, E. (2018). The Reference Corpus of the Contemporary Romanian Language (CoRoLa). In Proceedings of LREC 2018, May, Japan, 1178-1185.

Blaheta, D. (2002). Handling noisy training and testing data. In Proceedings of the 7th conference on Empirical Methods in Natural Language Processing, 111–116.

Dickinson, M. and Meurers, D. W. (2003). Detecting errors in part-of-speech annotation. In 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL-03), April, Budapest, Hungary, 107-114.

Dogan, R.I., Leaman, R., Lu, Z. (2014). NCBI disease corpus: a resource for disease name recognition and concept normalization. Journal of biomedical informatics, 47, 1-10.

Eskin, E. (2000). Automatic corpus correction with anomaly detection. In 1st Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Seattle, Washington, 148-153.

Hirakawa, H., Ono, K., Yoshimura, Y. (2000). Automatic Refinement of a POS Tagger Using a Reliable Parser and Plain Text Corpora. In Proceedings of the 18th International Conference on Computational Linguistics (COLING), Saarbrucken, Germany: ICCL.

Ion, R. (2007). Word Sense Disambiguation Methods Applied to English and Romanian (in Romanian). Ph.D. thesis, Romanian Academy.

Krallinger, M., Rabal, O., Leitner, F., Vazquez, M., Salgado, D., Lu, Z., Leaman, R., Lu, Y., Ji, D., Lowe, D.M. et al. (2015). The CHEMDNER corpus of chemicals and drugs and its annotation principles. Journal of Cheminformatics, 7(1):S2.

Kveton, P. and Oliva, K. (2002). (Semi-)Automatic detection of errors in PoS-tagged corpora. In Proceedings of 19th International Conference on Computational Linguistics (COLING-02).

Loftsson, H. (2009). Correcting a POS-tagged corpus using three complementary methods. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), March, Athens, Greece, 523-531.

Meurers, D. (2010). Detecting Errors in Corpus Annotation. LingLunch, UFR de Linguistique, Université Paris Diderot, Paris 7, March 10, 2010.

Mitrofan, M., Barbu Mititelu, V., Mitrofan, G. (2019). MoNERo: a Biomedical Gold Standard Corpus for the Romanian Language. In Proceedings of the BioNLP workshop, 1 August, Florence, Italy, 71-79.

Mitrofan, M. and Ion, R. (2017). Adapting the TTL Romanian POS tagger to the biomedical domain. In Proceedings of BiomedicalNLP@RANLP, 8-14.

Moreno, I., Boldrini, E., Moreda, P., Roma-Ferri, M. T. (2017). DrugSemantics: a corpus for named entity recognition in Spanish summaries of product characteristics. Journal of biomedical informatics, 72:8–22.

Muller, F. H. and Ule, T. (2002). Annotating topological fields and chunks – and revising POS tags at the same time. In Proceedings of COLING, 1-7.

Neveol, A., Grouin, C., Leixa, J., Rosset, S., Zweigenbaum, P. (2014). The Quaero French medical corpus: A resource for medical entity recognition and normalization. In Proc BioTextM, Reykjavik, 24-30.

Oliva, K. (2001). The Possibilities of Automatic Detection/Correction of Errors in Tagged Corpora: A Pilot Study on a German Corpus. In V. Matousek, P. Mautner, R. Moucek and K. Tauser (eds.), Text, Speech and Dialogue. 4th International Conference, TSD 2001, September 11-13, Zelezna Ruda, Czech Republic, 39-46.

Oronoz, M., Gojenola, K., Perez, A., de Ilarraza, A.D., Casillas, A. (2015). On the creation of a clinical gold standard corpus in Spanish: Mining adverse drug reactions. Journal of biomedical informatics, 56: 318–332.

Rehbein, I. (2014). POS error detection in automatically annotated corpora. In Levin, Lori/Stede, Manfred (eds.), Proceedings of the 8th Linguistic Annotation Workshop in conjunction with COLING 2014 (LAW-VIII), August 23-24, Dublin, Ireland, 20-28.

Roberts, A., Gaizauskas, R., Hepple, M., Demetriou, G., Guo, Y., Roberts, I., Setzer, A. (2009). Building a semantically annotated corpus of clinical texts. Journal of biomedical informatics, 42:5, 950-966.

Sang, E. F. and Veenstra, J. (1999). Representing text chunks. In Proceedings of the ninth conference on European chapter of the Association for Computational Linguistics, 173-179.

Uzuner, O., Solti, I., Xia, F., Cadag, E. (2010). Community annotation experiment for ground truth generation for the i2b2 medication challenge. Journal of the American Medical Informatics Association, 17:5, 519-523.

Van Halteren, H. (2000). The detection of inconsistency in manually tagged text. In Proceedings of the COLING2000 Workshop on Linguistically Interpreted Corpora, August, Centre Universitaire, Luxembourg, 48-55.

THE COBILIRO PROJECT: BUILDING AND DISTRIBUTING A BIMODAL CORPUS FOR ROMANIAN LANGUAGE

DAN CRISTEA¹,², CRISTIAN PĂDURARIU¹,², ȘERBAN BOGHIU¹, DANIELA GÎFU¹,², MIHAELA ONOFREI¹,², DIANA TRANDABĂȚ¹, IONUȚ CRISTIAN PISTOL¹, ANCA-DIANA BIBIRI³, ANDREI SCUTELNICU¹,²

1 “Alexandru Ioan Cuza” University of Iași, Faculty of Computer Science

2 Institute of Computer Science, Romanian Academy, Iaşi Branch

3 “Alexandru Ioan Cuza” University of Iași, Institute for Interdisciplinary Research, Social Sciences and Humanities Research Department

{dcristea, cristian.padurariu, serban.boghiu, daniela.gifu, mihaela.onofrei, dtrandabat, ionut.pistol, andrei.scutelnicu}@info.uaic.ro

[email protected]

Abstract

CoBiLiRo (Corpus Bimodal pentru Limba Română - Bimodal Corpus for Romanian Language) is an ongoing research project that aims to collect, standardise and make available a collection of Romanian language files containing both text and audio recordings, aligned at the boundaries of sentences, words, phones, and/or other linguistic levels. This paper describes the current efforts carried out as part of this project. We present the design of the format intended to serve as an annotation standard for bimodal resources, the main operations of the web Platform which hosts the corpus, and the automatic conversion flow that brings submitted files to the format accepted by the Platform.

Key words — bimodal corpus, annotation standard, web platform, speech and text processing, metadata of linguistic resources.

1. Introduction

CoBiLiRo is a component of ReTeRom¹, a project aiming to push forward the state of the art in Romanian language technology, grouping researchers from four natural language processing laboratories² that work on speech understanding, speech synthesis, text processing, alignment of speech-text resources and organisation of big repositories of language data for research and public use. CoBiLiRo aims to create a thesaurus of audio and textual resources annotated at different levels of acoustic and linguistic achievement, which will stand as the most important reference of this type for the Romanian language, addressing future developments of human-machine interfacing technologies. As such, the project makes a careful inventory of the bimodal resources existing at the partners, finds ways to harmonize the representation, annotation and metadata formats, designs and implements an infrastructure that will ultimately house the resources, and disseminates the bimodal corpus widely, for research valorisation and use in applications.

1 www.racai.ro

2 RACAI – Research Institute for Artificial Intelligence “Mihai Drăgănescu”, Romanian Academy; the Speech Processing Laboratory at the Faculty of Electronics, Telecommunications and Information Technology of Politehnica University of Bucharest; the Speech Processing Lab at the Department of Communications, Faculty of Electronics, Telecommunications and Information Technology of the Technical University of Cluj-Napoca, and the Natural Language Processing Group at the Faculty of Computer Science of the “Alexandru Ioan Cuza” University of Iași.

In ReTeRom, but also sharing other views (Mihăilă and Mekhaldi, 2009), by a bimodal corpus we understand a collection of oral records accompanied by their transcripts and their corresponding metadata. A bimodal corpus is hosted on a specialized platform, together with its web access, maintenance services, processing technology and possible applications.

Stages in the evolution of linguistic corpora included: the first generation, which contained only texts (e.g. British National Corpus³), the second generation, which contained both oral and written texts (e.g. Michigan Corpus of Academic Spoken English⁴), and the third generation, which included also the alignment between written and oral components. Recent stages in the evolution of corpora are characterized by the inclusion of the video dimension (Rasso and Mello, 2014: 29). The object of our interest is the third generation of corpora.

In order to advance Romanian technologies integrated with natural language processing, our goal is to construct bimodal corpora to be used in learning. Part of the bimodal corpus generated within CoBiLiRo is annotated at different linguistic levels according to a set of conventions; these annotations will then be taken as a model for an automatic process, for the proposal of an inventory of unstructured data, and for the specifications of user interfaces.

The next sections briefly describe existing approaches to bimodal resources, the CoBiLiRo format, the design and implementation of converters, and the CoBiLiRo web platform.

2. Bimodal resources

Research on automatic speech recognition has grown dramatically since the 1960s (Halle and Stevens, 1962; Denes and Mathews, 1960; Denes, 1960), although the use of oral corpora as a language storage device that should be interpreted or generated by the machine is much more recent. In order to make these corpora available, specialized interfaces have begun to be created. A bimodal corpus should be hosted on a specialized platform, together with its web access, development and maintenance services and applications, where the researcher can find methods and algorithms for corpus use and from where, in some cases, examples of applications of the corpus in training and evaluation of the technology can be downloaded.

3 http://www.natcorp.ox.ac.uk/

4 https://www.lib.umich.edu/database/link/11887

Even if the terms speech corpus and oral corpus are sometimes differentiated (Llisterri, 1996), they are often used interchangeably, without a clear distinction between them. For simplicity, we will not differentiate between these two categories of corpora, mixing what some researchers consider to be specific characteristics, and will use the generic term speech corpus.

A speech corpus, in general, means a database of audio files and their textual transcripts, in a format that can be used to create acoustic models that support both speech recognition and speech synthesis research. An example is the Switchboard transcripts reviewed at the Institute for Signal and Information Processing (Godfrey and Hollman, 1997). Audio files and their transcripts can be aligned at phoneme, syllable, word and sentence levels, sometimes also marking prosody elements. Within speech recognition systems, prosody models are mainly used to predict prosodic events (syntactic and semantic accents) associated with a text. Research in which the emphasis is on the audible signal, on its acoustic properties, and on the articulatory properties of the vocal tract makes heavy use of speech corpora. The symbolic representation in this case is the phonetic alphabet. The transcript may include marks, in an enriched spelling system, of various auditory phenomena that accompany the pronunciation: murmurs, pauses, coughs, laughter, etc.

On the other hand, a speech corpus can be of particular interest for researchers dedicated to the use of a language and the characteristics of the various linguistic levels: lexical, morphological, syntactic, semantic, discourse, conversation, for pragmatic studies of communication, in sociolinguistics, dialectology, etc. Ideally, both language researchers and natural language technology researchers should use the same data set, regarding data collection, transcription, coding and annotation.

At the time of writing, the ReTeRom collection of bimodal corpora contains the following items: the audio files and their transcripts from the CoRoLa⁵ corpus (contributed by ARFI-IIT and RACAI); the Read Speech Corpus (RSC) (contributed by UPB); the Spontaneous Speech Corpus (SSC-train), the Spontaneous Speech Corpus (SSC-eval) and the Spontaneous Speech Corpus 2 (SSC-eval2) (all contributed by UPB); the SWARA⁶ Speech Corpus (contributed by UTCN); the text corpus Adevărul.ro; Mara⁷, an audiobook; Ro-GRID (contributed by UTCN); the IIT⁸ corpus (contributed by ARFI-IIT); and the SoRoEs⁹ corpus (contributed by UAIC). Most of the 11 linguistic resources are bimodal, therefore including both audio files and transcripts, together with their alignments. In total, more than 450 hours of recordings have been reported.

5 http://corola.racai.ro

6 https://speech.utcluj.ro/swarasc/

7 https://speech.utcluj.ro/corpora/mara.html

8 89.38.230.23la home/corola/corpusIasi/

9 http://soroes.ro/

As will be presented in Section 4, the inventoried corpora exhibit a large diversity of formats, containing audio files and their text transcriptions, but in some cases also simplified text transcriptions (without punctuation and segmentation) and TextGrid files, which mark the alignment of boundaries between segments of speech and text. To unify this diversity, the consortium members have proposed a standard format, and a set of conversion modules operating on the CoBiLiRo platform brings any of the input files to this standard.

3. Design criteria in building the hosting Platform and the adoption of the corpus format

To fulfill CoBiLiRo’s purposes, we designed a platform capable of hosting linguistic resources of the Romanian language, intended to be used for the development of automatic speech recognition and synthesis systems. Many of these resources, used to train acoustic models, are speech corpora containing recordings of different speakers, paired with their corresponding textual transcripts.

In order to ensure naturalness in automatic speech systems, the recordings on which learning experiments are based should generally be acquired in spontaneous interactions between speakers; readings in lab conditions and eBooks containing actors’ voices are therefore less recommended. On the other hand, producing linguistic resources of this type is expensive in terms of time, cost and the necessary involvement of experts, who are difficult to find. For these reasons, the ReTeRom consortium focused on sources of real-life speech, namely audio corpora available online and in the media: radio and television shows, recordings of public meetings of certain institutions, and ad-hoc interviews with people on the street, intended to highlight local accents and dialectal voices. As not all of these resources come with textual transcriptions, we had to prepare the ground for transcribing parts of the corpus, which is by no means a trivial task. Transcriptions can be added manually, by listening to the recordings and simultaneously writing down the corresponding text, or by using existing automatic speech recognition systems. The apparent vicious circle (using automatic systems to transcribe naturally produced speech, and then training recognition systems on the parallel corpus produced this way) is broken by using several architecturally different speech recognition systems, which are expected to make non-identical errors: spans transcribed identically are kept as correct, and the regions that display discrepancies are corrected manually.
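The idea of keeping identically transcribed spans and flagging the rest for manual correction can be illustrated with a small sketch. It compares only two hypotheses, tokenizes on whitespace, and uses Python's standard difflib as a stand-in alignment method; none of these choices is taken from the project, and the function name is hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

def compare_transcripts(hyp_a: str, hyp_b: str) -> List[Tuple[str, str, str]]:
    """Split two ASR hypotheses into spans marked 'agree' or 'review'.

    Each result is (status, words_from_a, words_from_b); 'review' spans are
    the regions a human transcriber would need to check.
    """
    a, b = hyp_a.split(), hyp_b.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    spans = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        status = "agree" if op == "equal" else "review"
        spans.append((status, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return spans

# Two invented outputs for the same audio segment (illustrative only).
system_1 = "purtătorul de cuvânt al biroului electoral central"
system_2 = "purtătorul de cuvânt a biroului electoral central"

spans = compare_transcripts(system_1, system_2)
to_review = [s for s in spans if s[0] == "review"]
```

With more than two systems, the same comparison could be applied pairwise or replaced by a voting scheme; the source does not say which variant is used.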

Recently, the CoBiLiRo Platform frontend has been upgraded to allow the two components of a bimodal corpus file (speech and text) to be uploaded at different moments in time, as each textual transcript can be decoupled from its speech component. Properly annotated, these two components can be paired later, on the Platform, once both have been uploaded. The alignment is ensured through segmentation clues, which can be placed at sentence, word or even letter/phoneme boundaries, as will be explained in the following section. Moreover, the frontend includes functionalities that allow online editing of the two components in view of creating the speech-text alignment, more precisely the inclusion and synchronisation of boundary markers.

The CoBiLiRo annotation format is inspired by the TEI-P5¹⁰ standard (Sperberg-McQueen and Burnard, 2018), the well-known scheme for representing a diversity of document types, but also includes elements from other proposals to best fulfil our goals (Li and Yin, 2007). This standard has been simplified in some aspects and augmented in others to best accommodate the requirements of our bimodal corpora of speech and text data.

The CoBiLiRo format includes a header, which encapsulates metadata related to the resource. This section holds information about: the source of the object stored, the identity of speakers (in conditions of respecting confidentiality terms), the type of voice (spontaneous or voice-in-reading), aspects regarding technical conditions of the recording, its duration, the type of file stored (mp3 or wav), the segmentation level, etc.

The most common level of segmentation is the sentence, but voice can also be segmented into morphological units (words), phonological units (phonemes), prosodic units (pitch, rise and fall of the fundamental frequency), or syntactic units (nominal group, clause, etc.). These pieces of information are stored in appropriate XML tags and attributes, within the teiHeader tag. Figure 1 shows an example of such a header.

<teiHeader Description="..." Collection="..." Keywords="bimodal, text, speech"
           Language="ro" Contribuitor="cristian.padurariu" Distribution="...">
  <speechSection SpeechCreator="..." AcousticMedia="wav" Duration="10:01"
                 SamplingFrequency="10" Resolution="4" RecordDate="2010-02-01"
                 RecordTime="11:24" Equipment="microphone" SpeechFile="speech.wav"
                 SpeechSegmentation="Start-stop">
    <speaker SpeakerName="..." SpeakerAccent="moldavian" Gender="Male" Age="20-30" />
  </speechSection>
  <dataSection MetadataCreator="a" AnnotationCreator="a" AnnotationLevel="SentenceAlign" />
</teiHeader>

Figure 1: Extract from a teiHeader tag
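Because the header is plain XML, its metadata can be read with any XML library. The sketch below parses the header of Figure 1 with Python's standard xml.etree.ElementTree; the function name and the keys of the returned dictionary are illustrative choices, not part of the CoBiLiRo specification.

```python
import xml.etree.ElementTree as ET

# The header of Figure 1, with the elided values ("...") kept as they appear.
HEADER = """
<teiHeader Description="..." Collection="..." Keywords="bimodal, text, speech"
           Language="ro" Contribuitor="cristian.padurariu" Distribution="...">
  <speechSection SpeechCreator="..." AcousticMedia="wav" Duration="10:01"
                 SamplingFrequency="10" Resolution="4" RecordDate="2010-02-01"
                 RecordTime="11:24" Equipment="microphone" SpeechFile="speech.wav"
                 SpeechSegmentation="Start-stop">
    <speaker SpeakerName="..." SpeakerAccent="moldavian" Gender="Male" Age="20-30" />
  </speechSection>
  <dataSection MetadataCreator="a" AnnotationCreator="a"
               AnnotationLevel="SentenceAlign" />
</teiHeader>
"""

def summarize_header(xml_text: str) -> dict:
    """Pull a few metadata fields out of a CoBiLiRo-style teiHeader."""
    root = ET.fromstring(xml_text.strip())
    speech = root.find("speechSection")
    speaker = speech.find("speaker")
    data = root.find("dataSection")
    return {
        "language": root.get("Language"),
        "keywords": [k.strip() for k in root.get("Keywords", "").split(",")],
        "speech_file": speech.get("SpeechFile"),
        "media": speech.get("AcousticMedia"),
        "segmentation": speech.get("SpeechSegmentation"),
        "accent": speaker.get("SpeakerAccent"),
        "annotation_level": data.get("AnnotationLevel"),
    }

summary = summarize_header(HEADER)
```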

The segmentation and alignment of the resource is available in unit tags, which can be of three types, depending on the manner the resource is represented. The first type (“file”) is for resources that are stored in multiple files. So, for this case, each unit tag will have a child node called speech, which indicates the name of the file containing the speech component and a child node called text containing the textual transcription associated with the audio file specified.

10 http://www.tei-c.org/

The second type of segmentation is called “start-stop” and is used for those resources that present just one speech file, segmented and aligned at temporal boundaries, the text being reproduced between each two such consecutive markers. So, the unit tag will contain a speech subelement with two attributes start and stop (in seconds). Along with the speech tag, the text tag contains a reproduction of the text being spoken. Other components can be added in each unit group under specific tags (see an example in the next section).
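As an illustration of the “start-stop” layout, the sketch below turns a list of consecutive temporal markers and the texts spoken between them into unit elements, each with a speech child carrying start and stop and a text child. The element and attribute names follow the description above; the function name, the input layout and the sample values are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from typing import List

def units_from_markers(markers: List[float], texts: List[str]) -> List[ET.Element]:
    """Create one unit per pair of consecutive markers (times in seconds)."""
    if len(texts) != len(markers) - 1:
        raise ValueError("expected exactly one text per pair of consecutive markers")
    if any(later <= earlier for earlier, later in zip(markers, markers[1:])):
        raise ValueError("markers must be strictly increasing")
    units = []
    for start, stop, text in zip(markers, markers[1:], texts):
        unit = ET.Element("unit")
        ET.SubElement(unit, "speech", start=str(start), stop=str(stop))
        ET.SubElement(unit, "text").text = text
        units.append(unit)
    return units

# Invented boundaries and sentences, for illustration only.
units = units_from_markers(
    [0.0, 2.5, 6.0],
    ["Prima propoziție.", "A doua propoziție."],
)
serialized = [ET.tostring(u, encoding="unicode") for u in units]
```

The two checks reflect the fact that the text is reproduced between each two consecutive markers, so a missing text or an out-of-order marker would break the alignment.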

The third type of segmentation is “file-start-stop”, which is a combination of the two types presented above. It is meant to accommodate those resources that contain multiple audio files and a “start-stop” segmentation for each of them. So, for each unit tag pointing to a speech file, a series of child nodes called subunits are also created. Each subunit will hold the “start-stop” segmentation, similar to the one described above. An example of such an annotation is shown in figure 2.

<unit>

</subunit>

...

</subunit>

</unit>

...

<unit>

</subunit>

…

</subunit>

</unit>

Figure 2: Extract of a file-start-stop type of file

4. Designing and implementing converters

Among the resources contributed by the different partners of the ReTeRom project we have identified three specific types of formats, for which we have designed the first set of converters. Each converter is meant to “understand” the corresponding files and to transform them into the CoBiLiRo standard.

The first format is composed of groups of four files: a wav file containing the audio recording, a txt file containing the text associated with the recorded speech, a lab file containing the same text as the txt file, but with the punctuation removed and all letters reduced to lowercase, and a phs file containing a list of all letters present in the recording along with their start and stop moments. The conversion of this format to the CoBiLiRo standard starts with the creation of the header containing the metadata. Part of the information that fills in the header should be provided by the contributor through the form imposed by the interface. This type of resource is converted to the file-start-stop standard representation described above. All files belonging to the same group share a name that is unique to that group. After grouping the files, four subunits are created: speech, text, lab and phs, whose contents are extracted from the wav, txt, lab and phs files, respectively. Figure 3 shows an example of such a file group.

PHS

0 3700000 pau pau
3700000 4400000 p purt\304\203torul
4400000 4900000 u
4900000 5200000 r
5200000 5600000 t
5600000 6200000 @
6200000 6700000 t
6700000 7200000 o
7200000 7500000 r
7500000 8100000 u
8100000 8400000 l
8400000 8800000 d de
8800000 9200000 e

LAB

purtătorul de cuvânt al biroului electoral central marian muhuleț a adăugat că au mai rămas de centralizat procese verbale cu voturile exprimate în județele bacău bihor constanța olt vrancea și tulcea

TXT

Purtătorul de cuvânt al Biroului Electoral Central, Marian Muhuleț, a adăugat că au mai rămas de centralizat procese verbale cu voturile exprimate în județele Bacău, Bihor, Constanța, Olt, Vrancea și Tulcea.

Figure 3: First type of aligned corpora
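The relation between the files in Figure 3 can be sketched in a few lines: the lab text is the txt text in lowercase without punctuation, and each phs line holds a start mark, a stop mark, a symbol and, on the first letter of a word, the word itself. The helper names below are hypothetical, the handling of hyphenated words is an assumption, and the time unit of the marks is not stated in the source, so the values are kept as plain integers.

```python
import re
from typing import Iterable, List, Optional, TypedDict

class PhsEntry(TypedDict):
    start: int
    stop: int
    symbol: str
    word: Optional[str]

def txt_to_lab(txt: str) -> str:
    """Lowercase the text and drop punctuation, as in the lab file.

    Punctuation is replaced by a space (an assumption), so a hyphenated
    word would be split in two.
    """
    without_punctuation = re.sub(r"[^\w\s]", " ", txt)
    return " ".join(without_punctuation.lower().split())

def parse_phs(lines: Iterable[str]) -> List[PhsEntry]:
    """Parse 'start stop symbol [word]' lines; escapes are kept verbatim."""
    entries: List[PhsEntry] = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        entries.append({
            "start": int(fields[0]),
            "stop": int(fields[1]),
            "symbol": fields[2],
            "word": fields[3] if len(fields) > 3 else None,
        })
    return entries

lab = txt_to_lab("Purtătorul de cuvânt al Biroului Electoral Central, Marian Muhuleț, a adăugat că")
phs = parse_phs([
    r"0 3700000 pau pau",
    r"3700000 4400000 p purt\304\203torul",
    r"4400000 4900000 u",
])
```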

The second format is called MULTEXT/TEI and is composed of a number of audio files and an XML file containing metadata (not relevant to the scope of our platform) and a series of div tags mapping text to the audio files. The first step of the conversion, as in the previous case, is the creation of the CoBiLiRo header, which is done in the same manner as for the first format. Since resources of this type contain multiple audio files, the “file” representation is used. The next step is to identify, in the original XML file, the div tags that contain the mapping of the text to the audio files. Then the texts in between consecutive div tags are extracted and inserted into text tags belonging to different units. A div tag also contains an xref tag with a url attribute, where the name of the audio file associated with the corresponding text can be found. This information is inserted into the speech tag of the output format belonging to the appropriate unit element. As such, the expected pairs of XML elements <speech/> - <text/> are formed. A fragment from such a file, including the reference to the corresponding speech resource, is given in figure 4.

<head>*BLOCK: O1</head>
<p id="sro.2.2"><s id="sro.2.2.1">Am o problemă cu aparatul de dedurizare a apei.</s>
...</p>
<ab>[<xref url="../spc/spch01-ro.wav">speech file</xref>]</ab></div>

Figure 4: A fragment of a MULTEXT type alignment
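The mapping from a MULTEXT div to a “file” unit can be sketched as follows. The sample wraps the fragment of Figure 4 in an opening div tag and drops the elided part so that it parses; writing the file name as the text content of speech, and keeping only the last path component of the url, are assumptions made for the example, as the source does not specify either detail.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath

# Fragment of Figure 4, completed with an opening <div> so that it is well-formed.
SAMPLE_DIV = """<div>
<head>*BLOCK: O1</head>
<p id="sro.2.2"><s id="sro.2.2.1">Am o problemă cu aparatul de dedurizare a apei.</s></p>
<ab>[<xref url="../spc/spch01-ro.wav">speech file</xref>]</ab></div>"""

def div_to_unit(div: ET.Element) -> ET.Element:
    """Build a <unit> with <speech> and <text> children from one div."""
    xref = div.find(".//xref")
    if xref is None or not xref.get("url"):
        raise ValueError("div has no xref element with a url attribute")
    # Join the sentences (<s> elements) of the div into one transcript.
    sentences = ["".join(s.itertext()).strip() for s in div.iter("s")]
    unit = ET.Element("unit")
    ET.SubElement(unit, "speech").text = PurePosixPath(xref.get("url")).name
    ET.SubElement(unit, "text").text = " ".join(sentences)
    return unit

unit = div_to_unit(ET.fromstring(SAMPLE_DIV))
```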

The third format discussed here is called TEXTGRID and contains groups of three files. The first file is an audio recording that contains the speech part of the resource. The second, a textgrid file, contains tuples of values (letter, xmin, xmax) referring to letters extracted from the text and the time interval in which each letter was spoken. The third file (txt) contains information about the energy of the enunciation of each letter, expressed in decibels, and the speech frequency. After copying the header information, a unit tag is created for each of the audio files, with the attribute speechFile containing the name of the audio file. Next, a series of child nodes (subunits) is created, each containing a sub-element called speech that receives the attributes start and stop. The values of these attributes are the xmin and xmax values from the tuples in the textgrid file. The letter values from the tuples are placed in each subunit under the tag text.
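A minimal sketch of this last conversion step is given below. It assumes that the (letter, xmin, xmax) tuples have already been read from the textgrid file, so the parsing of that file is not shown; the function name, the sample file name and the sample values are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Tuple

def textgrid_to_unit(speech_file: str,
                     intervals: Iterable[Tuple[str, float, float]]) -> ET.Element:
    """Build a file-start-stop unit from (letter, xmin, xmax) tuples."""
    unit = ET.Element("unit", speechFile=speech_file)
    for letter, xmin, xmax in intervals:
        subunit = ET.SubElement(unit, "subunit")
        # xmin and xmax become the start and stop attributes of <speech>.
        ET.SubElement(subunit, "speech", start=str(xmin), stop=str(xmax))
        ET.SubElement(subunit, "text").text = letter
    return unit

# Invented intervals for illustration only.
unit = textgrid_to_unit("rec01.wav", [("p", 0.37, 0.44), ("u", 0.44, 0.49)])
subunits = unit.findall("subunit")
```

One unit is produced per audio file, so a resource with several recordings would call the function once for each of them.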

5. The CoBiLiRo web platform

In order to provide a unified working space where all users can upload, store and find resources, we have created a web platform which facilitates collaboration. The platform is available for all CoBiLiRo users that have an account and a password¹¹. It integrates the roles of Admin, Contributor and Trustee.

The Admin controls the list of users and their credentials and can get information, through logs, on the flow of data on the Platform. This person also manages the creation of accounts from requests addressed by unregistered users.

A Contributor may upload their own resources. A user gains this status when she/he makes the first request to upload a resource to the Platform¹². After a resource has been uploaded, the platform processes the content and, provided its format is compatible with one it knows already (as explained in the previous section), it creates one or more xml

11 http://85.122.23.18:81/

12 For security reasons, this status can presently be granted only to members of the ReTeRom consortium.