
IOS Press, 2009

© 2009 IOS Press. All rights reserved. doi: 10.3233/978-1-58603-954-7-3

Algorithms and Data Design Issues for Basic NLP Tools

Dan TUFIŞ

Research Institute for Artificial Intelligence of the Romanian Academy

Abstract. This chapter presents some of the basic language engineering pre-processing steps (tokenization, part-of-speech tagging, lemmatization, and sentence and word alignment). Tagging is among the most important processing steps and its accuracy significantly influences any further processing. Therefore, tagset design, validation and correction of training data and the various techniques for improving the tagging quality are discussed in detail. Since sentence and word alignment are prerequisite operations for exploiting parallel corpora for a multitude of purposes such as machine translation, bilingual lexicography, annotation import etc., these issues are also explored in detail.

Keywords. BLARK, training data, tokenization, tagging, lemmatization, aligning

Introduction

The global growth of internet use among various categories of users populated the cyberspace with multilingual data which the current technology is not quite prepared to deal with. Although it is relatively easy to select, for whatever processing purposes, only documents written in specific languages, this is by no means the modern approach to the multilingual nature of the ever more widespread e-content. On the contrary, there have been several international initiatives such as [1], [2], [3], [4] among many others, all over the world, towards an integrative vision, aiming at giving all language communities the opportunity to use their native language over electronic communication media. For the last two decades or so, multilingual research has been the prevalent preoccupation for all major actors in the multilingual and multicultural knowledge community. One of the fundamental principles of software engineering design, separating the data from the processes, has been broadly adhered to in language technology research and development, as a result of which numerous language processing techniques are, to a large extent, applicable to a large class of languages.

The success of data-driven and machine learning approaches to language modeling and processing as well as the availability of unprecedented volumes of data for more and more languages gave an impetus to multilingual research. It was soon noticed that, for a number of useful applications for a new language, raw data was sufficient, but the quality of the results was significantly lower than for languages with a longer NLP research history and better language resources. While it was clear from the very beginning that the quality and quantity of language-specific resources were of crucial importance, with the launching of international multilingual projects, the issues of interchange and interoperability became research problems in themselves. Standards and recommendations for the development of language resources and associated processing tools have been published. These best practice recommendations (e.g. the Text Encoding Initiative (http://www.tei-c.org/index.xml), or some more restricted specifications, such as the XML Corpus Encoding Standard (http://www.xml-ces.org/), the Lexical Markup Framework (http://www.lexicalmarkupframework.org/), etc.) are language-independent.


They abstract away from the specifics of particular languages, while offering means to make explicit any language-specific idiosyncrasy of interest.

It is worth mentioning that the standardization movement is not new in the Language Technology community, but only in recent years have the recommendations produced by various expert bodies taken into account a truly global view, trying to accommodate most (ideally, all) natural languages and as many varieties of language data as possible. Each new language covered can in principle introduce previously overlooked phenomena, requiring revisions, extensions or even reformulations of the standards.

While there is an undisputed agreement about the role of language resources and the necessity to develop them according to international best practices in order to be able to reuse a wealth of publicly available methodologies and linguistic software, there is much less agreement on what would be the basic set of language resources and associated tools that is “necessary to do any pre-competitive research and education at all.” [5]. A minimal set of such tools, known as BLARK (Basic LAnguage Resource Kit), has been investigated for several languages including Dutch [6], Swedish [7], Arabic [8], Welsh (and other Celtic languages) [9].

Although the BLARK concept does not make any commitment with respect to the symbolic-statistical processing dichotomy, in this chapter, when not specified otherwise, we will assume a corpus-based (data-driven) development approach towards rapid prototyping of essential processing requirements for a new language.

In this chapter we will discuss the use of the following components of BLARK for a new language:

• (for monolingual processing) tokenization, morpho-lexical tagging and lemmatization; we will dwell on designing tagsets and building and cleaning up the training data required by machine learning algorithms;

• (for multilingual processing) sentence alignment and word alignment of a parallel corpus.

1. Tokenization

The first task in processing written natural language texts is breaking the texts into processing units called tokens. The program that performs this task is called a segmenter or tokenizer. Tokenization can be done at various granularity levels: a text can be split into paragraphs, sentences, words, syllables or morphemes, and there are already various tools available for the job. A sentence tokenizer must be able to recognize sentence boundaries, words, dates, numbers and various fixed phrases, to split clitics or contractions, etc. The complexity of this task varies among the different language families. For instance, in Asian languages, where there is no explicit word delimiter (such as the white space in the Indo-European languages), automatically solving this problem has been and continues to be the focus of considerable research efforts.
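To make the kind of token classes involved concrete, the following is a minimal, illustrative regular-expression tokenizer in Python. The token classes and patterns are simplifying assumptions made for this sketch only; they do not reproduce the MtSeg or TTL resources discussed below.

import re

# Illustrative token classes: dates, numbers, abbreviations, words, punctuation.
# The patterns are simplified assumptions, not actual tokenizer resources.
TOKEN_PATTERNS = [
    ("DATE",   r"\d{1,2}/\d{1,2}/\d{2,4}"),
    ("NUMBER", r"\d+(?:[.,]\d+)*"),
    ("ABBREV", r"(?:etc|e\.g|i\.e|Dr|Mr|Mrs)\."),
    ("WORD",   r"\w+(?:-\w+)*"),
    ("PUNCT",  r"[.,;:!?()\"']"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_PATTERNS),
                    re.UNICODE)

def tokenize(text):
    """Return (token, class) pairs; whitespace is discarded."""
    return [(m.group(), m.lastgroup) for m in MASTER.finditer(text)]

print(tokenize("Dr. Smith arrived on 12/05/2009, e.g. by train."))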

According to [10], for Chinese “sentence tokenization is still an unsolved problem”.

For most of the languages using the space as a word delimiter, the tokenization process was wrongly considered, for a long time, a very simple task. Even if in these languages a string of characters delimited by spaces and/or punctuation marks is most of the time a proper lexical item, this is not always true. The examples at hand come from the agglutinative languages or languages with a frequent and productive compounding morphology (consider the most-cited Lebensversicherungsgesellschaftsangestellter, the German compound which stands for “life insurance company employee”). The non-agglutinative languages with a limited compounding morphology frequently rely on analytical means (multiword expressions) to construct a lexical item. For translation purposes, considering multiword expressions as single lexical units is a frequent processing option.


This is because of the differences that might appear in the cross-lingual realization of common concepts. One language might use concatenation (with or without a hyphen at the joint point), agglutination, derivational constructions or a simple word. Another language might use a multiword expression (with compositional or non-compositional meaning). For instance the English in spite of, machine gun, chestnut tree, take off etc. or the Romanian de la (from), gaura cheii (keyhole), sta în picioare (to stand), (a)-şi aminti (to remember), etc. could be arguably considered as single meaningful lexical units even if one is not concerned with translation. Moreover, cliticized word forms such as the Italian damelo or the Romanian dă-mi-le (both meaning “give them to me”) need to be recognized and treated as multiple lexical tokens (in the examples, the lexical items have distinct syntactic functions: predicate (da/dă), indirect object (me/mi) and direct object (lo/le)).

The simplest method for multiword expression (MWE) recognition during text segmentation is based on (monolingual) lists of most frequent compound expressions (collocations, compound nouns, phrasal verbs, idioms, etc) and some regular expression patterns for dealing with multiple instantiations of similar constructions (numbers, dates, abbreviations, etc). This linguistic knowledge (which could be compiled as a finite state transducer) is referred to as tokenizer’s MWE resources. In this approach the tokenizer would check if the input text contains string sequences that match any of the stored patterns and, in such a case, the matching input sequences are replaced as prescribed by the tokenizer’s resources. The main criticism of this simple text segmentation method is that the tokenizer’s resources are never exhaustive.
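As a minimal illustration of this list-based strategy (not the actual MtSeg resources), the sketch below greedily merges the longest known multiword expression starting at each position into a single token; the tiny MWE inventory is an assumption made up for the example.

# Greedy longest-match MWE grouping over an already tokenized sentence.
# The MWE list is a toy illustration of a tokenizer's MWE resources.
MWES = {("in", "spite", "of"), ("machine", "gun"), ("chestnut", "tree")}
MAX_LEN = max(len(m) for m in MWES)

def merge_mwes(tokens):
    out, i = [], 0
    while i < len(tokens):
        for span in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            if tuple(t.lower() for t in tokens[i:i + span]) in MWES:
                out.append("_".join(tokens[i:i + span]))  # one lexical unit
                i += span
                break
        else:  # no MWE starts at this position
            out.append(tokens[i])
            i += 1
    return out

print(merge_mwes("He hid in spite of the machine gun fire".split()))
# ['He', 'hid', 'in_spite_of', 'the', 'machine_gun', 'fire']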

Against this drawback one can use special programs for automatic updating of the tokenizer’s resources, using collocation extractors. A statistical collocation extraction program is based on the insight that words that appear together more often than would be expected under an independence assumption and conform to some prescribed syntactic patterns are likely to be collocations. For checking the independence assumption, one can use various statistical tests such as mutual information, DICE, log-likelihood, chi-square or the left-sided Fisher exact test (see, for instance, http://www.d.umn.edu/~tpederse/code.html). As these tests consider only pairs of tokens, in order to identify collocations longer than two words, bigram analysis must be recursively applied until no new collocations are discovered. The final list of extracted collocations must be filtered, as it might include many spurious associations.
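For concreteness, the sketch below computes Dunning's log-likelihood ratio for one bigram from its counts. This is one standard formulation of the log-likelihood test mentioned above, not necessarily the exact implementation used in the cited packages, and the counts are invented.

import math

def llr(c12, c1, c2, n):
    """Dunning's log-likelihood ratio for a bigram (w1, w2):
    c12 = joint count, c1 = count of w1, c2 = count of w2, n = total bigrams."""
    def ll(k, m, p):                    # binomial log-likelihood
        p = min(max(p, 1e-12), 1 - 1e-12)
        return k * math.log(p) + (m - k) * math.log(1 - p)
    p  = c2 / n
    p1 = c12 / c1
    p2 = (c2 - c12) / (n - c1)
    return 2 * (ll(c12, c1, p1) + ll(c2 - c12, n - c1, p2)
                - ll(c12, c1, p) - ll(c2 - c12, n - c1, p))

# invented counts: 'machine gun' 30 times, 'machine' 50, 'gun' 60, 100000 bigrams
print(round(llr(30, 50, 60, 100_000), 1))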

For our research we initially used Philippe di Cristo’s multilingual segmenter MtSeg (http://www.lpl.univ-aix.fr/projects/multext/MtSeg/) built in the MULTEXT project. The segmenter comes with tokenization resources for many Western European languages, further enhanced, as a result of the MULTEXT-EAST project, with corresponding resources for Bulgarian, Czech, Estonian, Hungarian, Romanian and Slovene. MtSeg is a regular expression interpreter whose performance depends on the coverage of the available tokenization resources. Its main advantage is that for tokens with the same cross-lingual interpretation (numbers, dates, clitics, compounds, abbreviations etc.) the same label will be assigned, irrespective of the language. We re-implemented MtSeg in an integrated tokenization, tagging and lemmatization web service called TTL [11], available at http://nlp.racai.ro, for processing Romanian and English texts.

For updating the multiword expressions resource file of the tokenizer, we developed a statistical collocation extractor [12] which is not constrained by token adjacency and thus can detect token combinations which are not contiguous. The criteria for considering a pair of tokens as a possible interesting combination are the following (a small code sketch is given after the list):

• stability of the distance between the two lexical tokens within texts (estimated by a low standard deviation of these distances)

• statistical significance of co-occurrence for the two tokens (estimated by a log-likelihood test).
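The sketch announced above illustrates the first criterion: pairs are kept only if they co-occur often enough and at a nearly constant distance. The thresholds (min_freq, max_std) are illustrative assumptions; the significance criterion can then be checked with the log-likelihood ratio sketched earlier.

import statistics

def stable_pairs(offsets_by_pair, min_freq=5, max_std=1.0):
    """Keep token pairs that co-occur at least min_freq times and whose
    signed token distances have a standard deviation of at most max_std.
    Thresholds are illustrative; the log-likelihood test would be applied
    to the surviving pairs."""
    kept = {}
    for pair, offsets in offsets_by_pair.items():
        if len(offsets) >= min_freq and statistics.pstdev(offsets) <= max_std:
            kept[pair] = (len(offsets), statistics.mean(offsets))
    return kept

# invented observations: a stable, almost always adjacent pair vs a scattered one
observed = {("gaura", "cheii"): [1, 1, 1, 1, 1, 2],
            ("de", "masa"):     [1, 4, 7, 2, 9, 3]}
print(stable_pairs(observed))   # only ('gaura', 'cheii') survives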


The set of automatically extracted collocations is hand-validated and added to the multiword expressions resource file of the tokenizer.

2. Morpho-lexical Disambiguation

Morpho-lexical ambiguity resolution is a key task in natural language processing [13].

It can be regarded as a classification problem: an ambiguous lexical item is one that in different contexts can be classified differently; given a specified context, the disambiguation/classification engine decides on the appropriate class.

Any classification process requires a set of distinguishing features of the objects to be classified, based on which a classifier could make informed decisions. If the values of these features are known, then the classification process is simply an assignment problem. However, when one or more values of the classification criteria are unknown, the classifier has to resort to other information sources or to make guesses. In a well-defined classification problem each relevant feature of an entity subject to classification (here, lexical tokens) has a limited range of values.

The decisions such as what is a lexical token, what are the relevant features and values in describing the tokens of a given language, and so on, depend on the circumstances of an instance of linguistic modeling (what the modeling is meant for, available resources, level of knowledge and many others). Modeling language is not a straightforward process and any choices made are a corollary of a particular view of the language. Under different circumstances, the same language will be more often than not modeled differently. Therefore, when speaking of a natural language from a theoretical-linguistics or computational point of view, one has to bear in mind this distinction between language and its modeling. Obviously this is the case here, but for the sake of brevity we will use the term language even when an accurate reference would be (X’s) model of the language.

The features that are used for the classification task are encoded in tags. We should observe that not all lexical features are equally good predictors for the correct contextual morpho-lexical classification of the words. It is part of the corpus linguistics lore that in order to reach a high accuracy level in statistical part-of-speech disambiguation, one needs small tagsets and reasonably large training data.

Earlier, we mentioned several initiatives towards the standardization of morpho-lexical descriptions. They refer to a neutral, context-independent and maximally informative description of the available lexical data. Such descriptions in the context of the Multext-East specifications are represented by what has been called lexical tags.

Lexical tagsets are large, ranging from several hundreds to several thousands of tags.

Depending on specific applications, one can define subsets of tagsets, retaining in these reduced tagsets only features and values of interest for intended applications. Yet, given that the statistical part of speech (POS) tagging is a distributional method, it is very important that the features and values preserved in a tagset be sensitive to the context and to the distributional analysis methods. Such reduced tagsets are usually called corpus tagsets.

The effect of tagset size on tagger performance has been discussed in [14] and in several papers in [13] (the reference tagging monograph). If the underlying language model uses only a few linguistic features and each of them has a small number of values, then the cardinality of the necessary tagset will be small. In contrast, if a language model uses a large number of linguistic features and they are described in terms of a larger set of values, the necessary tagset will be larger than in the previous case. POS-tagging with a large tagset is harder because the granularity of the language model is finer-grained. Harder here means slower, usually less accurate and requiring more computational resources. However, as we will show, the main reason for errors in tagging is not the number of feature-values used in the tagset but the adequacy of the selected features and of their respective values. We will argue that a carefully designed tagset can ensure an acceptable accuracy even with a simple-minded tagging engine.


Conversely, a badly designed tagset can hamper the performance of any tagging program.

It is generally believed that the state of the art in POS tagging still leaves room for significant improvements as far as correctness is concerned. In statistics-based tagging, besides the adequacy of the tagset, there is another crucial factor¹: the quantity and quality of the training data (the evidence to be generalized into a language model). A training corpus of anywhere from 100,000 up to over a million words is typically considered adequate. Although some taggers are advertised as being able to learn a language model from raw texts and a word-form lexicon, they require post-validation of the output and a bootstrapping procedure that would take several iterations to bring the tagger’s error rate to an acceptable level.

Most of the work in POS-tagging relies on the availability of high-quality training data and concentrates on the engineering issues to improve the performance of learners and taggers [13-25]. Building a high-quality training corpus is a huge enterprise because it is typically hand-made and therefore extremely expensive and slow to produce. A frequent claim justifying poor performance or incomplete evaluation for POS taggers is the dearth of training data. In spite of this, it is surprising how little effort has been made towards automating the tedious and very expensive hand-annotation procedures underlying the construction or extension of a training corpus.

The utility of a training corpus is a function not only of its correctness, but also of its size and diversity. Splitting a large training corpus into register-specific components can be an effective strategy towards building a highly accurate combined language model, as we will show in Section 2.5.

2.1. Tagsets Encoding

For computational reasons, it is useful to adopt an encoding convention for both lexical and corpus tagsets. We briefly present the encoding conventions used in the Multext-East lexical specifications (for a detailed presentation, the interested reader should consult the documentation available at http://nl.ijs.si/ME/V3/msd/).

The morpho-lexical descriptions, referred to as MSDs, are provided as strings, using a linear encoding. In this notation, the position in a string of characters corresponds to an attribute, and specific characters in each position indicate the value for the corresponding attribute. That is, the positions in a string of characters are numbered 0, 1, 2, etc., and are used in the following way (see Table 1):

• the character at position 0 encodes part-of-speech;

• each character at position 1, 2,...,n, encodes the value of one attribute (person, gender, number, etc.), using the one-character code;

• if an attribute does not apply, the corresponding position in the string contains the special marker ‘-’ (hyphen).

Table 1. The Multilingual Multext-East Description Table for the Verb (values marked “l.s.” in the original table are language-specific)

Position  Attribute  Values (code)
0         POS        verb (V)
1         Type       main (m), auxiliary (a), modal (o), copula (c), base (b)
2         VForm      indicative (i), subjunctive (s), imperative (m), conditional (c), infinitive (n), participle (p), gerund (g), supine (u), transgress (t), quotative (q)
3         Tense      present (p), imperfect (i), future (f), past (s), pluperfect (l), aorist (a)
4         Person     first (1), second (2), third (3)
5         Number     singular (s), plural (p), dual (d)
6         Gender     masculine (m), feminine (f), neuter (n)
7         Voice      active (a), passive (p)
8         Negative   no (n), yes (y)
9         Definite   no (n), yes (y), short_art (s), ful_art (f), 1s2s (2)
10        Clitic     no (n), yes (y)
11        Case       nominative (n), genitive (g), dative (d), accusative (a), locative (l), instrumental (i), illative (x), inessive (2), elative (e), translative (4), abessive (5)
12        Animate    no (n), yes (y)
13        Clitic_s   no (n), yes (y)

¹ We don’t discuss here the training and the tagging engines, which are language-independent and obviously play a fundamental role in the process.


The “does not apply” marker (‘-’) in the MSD encoding must be explained. Besides the basic meaning that the attribute is not valid for the language in question, it also indicates that a certain combination of other morpho-lexical attributes makes the current one irrelevant. For instance, non-finite verbal forms are not specified for Person.

The EAGLES recommendations (http://www.ilc.cnr.it/EAGLES96/morphsyn/morphsyn.html) provide another special attribute value, the dot (“.”), for cases where an attribute can take any value in its domain. The ‘any’ value is especially relevant in situations where word-forms are underspecified for certain attributes but these can be recovered from the immediate context (by grammatical rules such as agreement). By convention, trailing hyphens are not included in the MSDs. Such specifications provide a simple and relatively compact encoding, and are in intention similar to the feature-structure encoding used in unification-based grammar formalisms.

As can be seen from Table 1, the MSD Vmmp2s will be unambiguously interpreted as Verb+Main+Imperative+Present+Second Person+Singular for any language.
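A minimal sketch of this positional decoding, using a subset of the value codes from Table 1; the dictionaries below are partial and for illustration only.

# Positional decoding of a verb MSD; value codes copied from Table 1
# (only a subset of attributes/values is listed, for brevity).
VERB_ATTRS = [
    ("Type",   {"m": "main", "a": "auxiliary", "o": "modal", "c": "copula"}),
    ("VForm",  {"i": "indicative", "s": "subjunctive", "m": "imperative",
                "n": "infinitive", "p": "participle", "g": "gerund"}),
    ("Tense",  {"p": "present", "i": "imperfect", "f": "future", "s": "past",
                "l": "pluperfect"}),
    ("Person", {"1": "first", "2": "second", "3": "third"}),
    ("Number", {"s": "singular", "p": "plural", "d": "dual"}),
]

def decode_verb_msd(msd):
    assert msd[0] == "V", "this sketch only handles verb MSDs"
    feats = {"POS": "verb"}
    for (attr, values), code in zip(VERB_ATTRS, msd[1:]):
        if code != "-":            # '-' = the attribute does not apply
            feats[attr] = values.get(code, code)
    return feats

print(decode_verb_msd("Vmmp2s"))
# -> verb, main, imperative, present, second person, singular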

In many languages, especially those with a productive inflectional morphology, the word-form is strongly marked for various feature-values, so one may take advantage of this observation in designing the reduced corpus tagset. We will call the tags in a reduced corpus tagset c-tags. For instance, in Romanian, the suffix of a finite verb together with the information on person almost always determines all the other feature values relevant for describing an occurrence of a main verb form. When this dependency is taken into account, almost all of the large number of Romanian verbal MSDs will be filtered out, leaving us with just three tags: Vm--1, Vm--2 and Vm--3, each of them subsuming several MSDs, as in the example below:

Vm--2 ⇒ {Vmii2s----y Vmip2p Vmip2s Vmsp2s----y Vmip2p----y Vmm-2p Vmm-2s Vmil2p----y Vmis2s----y Vmis2p Vmis2s Vmm-2p----y Vmii2p----y Vmip2s----y Vmsp2p----y Vmii2p Vmii2s Vmil2s----y Vmis2p----y Vmil2p Vmil2s Vmm-2s----y Vmsp2p Vmsp2s}

The set of MSDs subsumed by a c-tag is called its MSD-coverage, denoted by msd_cov(c-tag). Similar correspondences can be defined for any c-tag in the reduced corpus tagset. The set of these correspondences defines the mapping M between a corpus tagset and a lexical tagset.


For reasons that will be discussed in the next section, a proper mapping between a lexical tagset and a corpus tagset should have the following properties:

• the set of MSD-coverages for all c-tags represents a partition of MSD tagset

• for any MSD in the lexical tagset there exists a unique c-tag in the corpus tagset.

By definition, for any MSD there exists a unique c-tag that observes the properties above and for any c-tag there exists a unique MSD-coverage. The mapping M represents the essence of our tiered-tagging methodology.
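A small sketch of how the mapping M and the two properties above can be represented and checked; the MSD-to-c-tag assignment used here is a toy example, not the actual Romanian mapping.

from collections import defaultdict

# Toy MSD -> c-tag assignment (illustrative only).
msd_to_ctag = {"Vmip2s": "Vm--2", "Vmm-2s": "Vm--2",
               "Vmip3s": "Vm--3", "Vmii3p": "Vm--3"}

# The msd-coverage of each c-tag is the inverse image of the assignment.
msd_cov = defaultdict(set)
for msd, ctag in msd_to_ctag.items():
    msd_cov[ctag].add(msd)

# Property 1: the msd-coverages partition the MSD tagset.
all_msds = set(msd_to_ctag)
assert set().union(*msd_cov.values()) == all_msds
assert sum(len(s) for s in msd_cov.values()) == len(all_msds)   # pairwise disjoint

# Property 2: every MSD maps to exactly one c-tag (guaranteed by the dict).
print(dict(msd_cov))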

As we will show, given a lexical tagset one could automatically build a corpus tagset and a mapping M between the two tagsets. If a training corpus is available and disambiguated in terms of lexical tags, the tiered tagging design methodology may generate various corpus tagsets, optimized according to different criteria. The discussion that follows concentrates on Romanian but similar issues arise and must be resolved when dealing with other languages.

2.2. The Lexical Tagset Design: A Case Study on Romanian

An EAGLES-compliant MSD word-form lexicon was built in the MULTEXT-EAST joint project within the Copernicus Program. A lexicon entry has the following structure:

word-form <TAB> lemma <TAB> MSD

where word-form represents an inflected form of the lemma, characterized by a combination of feature values encoded by MSD code. According to this representation, a word-form may appear in several entries, but with different MSDs or different lemmas. The set of MSDs with which a word-form occurs in the lexicon represents its ambiguity class. As an ambiguity class is common to many word-forms, another way of saying that the ambiguity class of word wk is Am, is to say that (from the ambiguity resolution point of view) the word wk belongs to the ambiguity class Am.

When the word-form is identical to the lemma, an equal sign ('=') is written in the lemma field of the entry. The attributes and most of the values of the attributes were chosen considering only word-level encoding. As a result, values involving compounding, such as compound tenses, though familiar from grammar textbooks, were not chosen for the MULTEXT-EAST encoding.
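A minimal sketch of reading such a lexicon into ambiguity classes; the entries below are invented and the code only assumes the tab-separated format described above.

from collections import defaultdict
import io

def read_lexicon(stream):
    """Read 'word-form<TAB>lemma<TAB>MSD' lines into two maps:
    word-form -> ambiguity class (set of MSDs) and word-form -> lemmas.
    A '=' in the lemma field means the lemma equals the word-form."""
    amb_class, lemmas = defaultdict(set), defaultdict(set)
    for line in stream:
        form, lemma, msd = line.rstrip("\n").split("\t")
        lemmas[form].add(form if lemma == "=" else lemma)
        amb_class[form].add(msd)
    return amb_class, lemmas

# invented entries, only to show the format
sample = io.StringIO("copii\tcopil\tNcmp-y\n"
                     "copii\tcopie\tNcfp-y\n"
                     "copil\t=\tNcms-n\n")
amb, lem = read_lexicon(sample)
print(amb["copii"])   # the ambiguity class of 'copii': two MSDs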

The initial specifications of the Romanian lexical tagset [26] took into account all the morpho-lexical features used by the traditional lexicography. However, during the development phase, we decided to exploit some regular syncretic features (gender and case) which eliminated a lot of representation redundancy and proved to be highly beneficial for the statistics-based tagging. We decided to use two special cases (direct and oblique) to deal with the nominative-accusative and genitive-dative syncretism, and to eliminate neuter gender from the lexicon encoding. Another feature which we discarded was animacy which is required for the vocative case. However, as vocative case has a distinctive inflectional suffix (also, in normative writing, an exclamation point is required after a vocative), and given that metaphoric vocatives are very frequent (not only in poetic or literary texts), we found the animacy feature a source of statistical noise (there are no distributional differences between animate and inanimate noun phrases) and, therefore, we ignored it.

With redundancy eliminated, the word-form lexicon size decreased more than fourfold. Similarly, the size of the lexical tagset decreased by more than a half. While any shallow parser can usually make the finer-grained case distinction and needs no further comment, eliminating neuter gender from the lexicon encoding requires explanation. Romanian grammar books traditionally distinguish three genders: masculine, feminine and neuter.


However, there are few reasons, if any, to retain the neuter gender and not use a simpler dual gender system. From the inflectional point of view, neuter nouns/adjectives behave in the singular as masculine nouns/adjectives and in the plural as feminine ones. Since there is no intrinsic semantic feature specific to neuter nouns (inanimacy is by no means specific to neuter nouns; plenty of feminine and masculine nouns denote inanimate things), preserving the three-valued gender distinction creates more problems than it solves. At the lookup level, considering only gender, any adjective would be two-way ambiguous (masculine/neuter in the singular and feminine/neuter in the plural). However, it is worth mentioning that, if needed, the neuter nouns or adjectives can be easily identified: those nouns/adjectives that are tagged with masculine gender in the singular and with feminine gender in the plural are what traditional Romanian linguistics calls neuter nouns/adjectives. This position has recently found adherents among theoretical linguists as well. For instance, in [27] neuter nouns are considered to be underspecified for gender in their lexical entries, with default rules assigning masculine gender for occurrences in the singular and feminine gender for occurrences in the plural.

For the description of the current Romanian word-form lexicon (more than one million word-forms, distributed among 869 ambiguity classes) the lexical tagset uses 614 MSD codes. This tagset is still too large because it requires very large training corpora for overcoming data sparseness. The need to overcome data sparseness stems from the necessity to ensure that all the relevant sequences of tags are seen a reasonable number of times, thus allowing the learning algorithms to estimate (as reliably as possible) word distributions and build robust language models. Fallback solutions for dealing with unseen events are approximations that significantly weaken the robustness of a language model and affect prediction accuracy. For instance, in a trigram-based language model, an upper limit of the search space for the language model would be proportional to N³, with N denoting the cardinality of the tagset. Manually annotating a corpus containing (at least several occurrences of) all the legal trigrams using a tagset larger than a few hundred tags is practically impossible.

In order to cope with the inherent problems raised by large tagsets one possible solution is to apply a tiered tagging methodology.

2.3. Corpus Tagset Design and Tiered Tagging

Tiered tagging (TT) is a very effective technique [28] which allows accurate morpho-lexical tagging with large lexicon tagsets and requires reasonable amounts of training data. The basic idea is to use a hidden tagset, for which training data is sufficient, for the tagging proper, and to include a post-processing phase that transforms the tags from the hidden tagset into the more informative tags of the lexicon tagset. As a result, for a small price in tagging accuracy (as compared to the direct reduced-tagset approach), and with practically no changes to computational resources, it is possible to tag a text with a large tagset by using language models built for reduced tagsets. Consequently, for building high quality language models, training corpora of moderate size would suffice.

In most cases, the word-form and the associated MSD taken together contain redundant information. This means that the word-form and several attribute-value pairs from the corresponding MSD (called the determinant in our approach) uniquely determine the rest of the attribute-value pairs (the dependent). By dropping the dependent attributes, provided this does not reduce the cardinality of ambiguity classes (see [28]), several initial tags are merged into fewer and more general tags. This way the cardinality of the tagset is reduced. As a result, the tagging accuracy improves even with limited training data. Since the attributes and their values depend on the grammar category of the word-forms, we will have different determinants and dependents for each part of speech. Attributes such as part of speech (the attribute at position 0 in the MSD encoding) and orth, whose value is the given word-form, are included in every determinant.


Unfortunately, there is no unique solution for finding the rest of the attributes in the determinants of an MSD encoding. One can identify the smallest set of determinant attributes for each part of speech, but using the smallest determinant (and implicitly the smallest corpus tagset) does not necessarily ensure the best tagging accuracy.

A corpus tagset (Ctag-set) whose c-tags contain only determinant feature values is called a baseline Ctag-set. Any further elimination of attributes of the baseline Ctag-set will cause information loss. Further reduction of the baseline tagset can be beneficial if information from eliminated attributes can be recovered by post-tagging processing.

The tagset resulting from such further reduction is called a proper Ctag-set.

The abovementioned relation M between the MSD-set and the Ctag-set is encoded in a mapping table that for each MSD specifies the corresponding c-tag and for each c-tag the set of MSDs (its msd-coverage) that are mapped onto it. The post-processor that deterministically replaces a c-tag with one or more MSDs is essentially a database look-up procedure. The operation can be formally represented as an intersection of the ambiguity class of the word w, referred to as AMB(w), and the msd-coverage of the c-tag assigned to the word w. If the hidden tagset used is a baseline Ctag-set, this intersection always results in a single MSD. In other words, full recovery of the information is strictly deterministic. For the general case of a proper Ctag-set, the intersection leaves a few tokens ambiguous between 2 (seldom 3) MSDs. These tokens are typically the difficult cases for statistical disambiguation.

The core algorithm is based on the property of Ctag-set recoverability described by equation Eq. (1). We use the following notation: Wi represents a word, Ti represents a c-tag assigned to Wi, MSDk represents a tag from the lexical tagset, AMB(Wk) represents the ambiguity class of the word Wk in terms of MSDs (as encoded in the lexicon Lex) and |X| represents the cardinality of the set X.

∀ Ti ∈ Ctag-set, msd-coverage(Ti) = {MSD1 … MSDk} ⊂ MSD-tagset, ∀ Wk ∈ Lex & AMB(Wk) = {MSDk1 … MSDkn} ⊂ MSD-tagset ⇒

|msd-coverage(Ti) ∩ AMB(Wk)| = 1 for more than 90% of the cases, and > 1 for less than 10% of the cases    (1)
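A minimal sketch of the recovery operation in Eq. (1): the MSD(s) of a word tagged with a c-tag are obtained by intersecting the word's ambiguity class with the msd-coverage of that c-tag. The coverage set and lexicon fragment below are toy data.

# Toy msd-coverage and lexicon fragment; real data comes from the mapping M
# and the word-form lexicon discussed above.
msd_coverage = {"Vm--2": {"Vmip2s", "Vmip2p", "Vmm-2s", "Vmm-2p"}}
amb = {"cânți":   {"Vmip2s"},                # fully recoverable
       "cântați": {"Vmip2p", "Vmm-2p"}}      # stays ambiguous (2 MSDs)

def recover(word, ctag):
    """Eq. (1): intersect the word's ambiguity class with msd-coverage(ctag)."""
    return amb[word] & msd_coverage[ctag]

print(recover("cânți", "Vm--2"))     # one MSD: the deterministic case
print(recover("cântați", "Vm--2"))   # two MSDs: left for the recovery rules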

Once the Ctag-set has been selected, the designer accounts for the few remaining ambiguities left after the c-tags are replaced with the corresponding MSDs. In the original implementation of the TT framework, the remaining ambiguities were dealt with by a set of simple hand-written contextual rules. For Romanian, we used 18 regular expression rules. Depending on the specific case of ambiguity, these rules inspect the left, right or both contexts within a limited distance for a disambiguating tag or word-form (in our experiment the maximum span is 4). The success rate of this second phase is almost 99%. The rule below resolves the residual ambiguity between possessive pronouns and possessive determiners; it takes care of the gender, number and case agreement between a determiner and the element it modifies:

Ps|Ds
{ Ds.αβδ: (-1 Ncαβδy) | (-1 Af.αβδy) | (-1 Mo.αβδy) | (-2 Af.αβδn and -1 Ts) |
          (-2 Ncαβδn and -1 Ts) | (-2 Np and -1 Ts) | (-2 D..αβδ and -1 Ts)
  Ps.αβδ: true }

In English, the rule can be glossed as:

Choose the determiner interpretation if any of the conditions a) to g) is true:

a) the previous word is tagged definite common Noun
b) the previous word is tagged definite Adjective
c) the previous word is tagged definite ordinal Numeral
d) the previous two words are tagged indefinite Adjective and possessive Article
e) the previous two words are tagged indefinite Noun and possessive Article
f) the previous two words are tagged proper Noun and possessive Article
g) the previous two words are tagged Determiner and possessive Article.

Otherwise, choose the pronoun interpretation.

In the above rule αβδ denotes values for gender, number and case respectively. In Romanian, these values are usually realized using a single affix.
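For illustration, here is a much simplified rendering of this rule as a function over the c-tags of the preceding tokens. It checks only category prefixes and a definiteness flag, assumes (for the sketch) that the definiteness flag is the last character of noun/adjective/numeral tags, and deliberately omits the αβδ agreement constraints, so it shows the shape of such rules rather than the actual implementation.

def ds_or_ps(ctags, i):
    """Simplified Ps|Ds recovery rule: decide possessive determiner (Ds) vs
    possessive pronoun (Ps) at position i from the preceding c-tags only."""
    prev1 = ctags[i - 1] if i >= 1 else ""
    prev2 = ctags[i - 2] if i >= 2 else ""
    # conditions a)-c): previous word is a definite noun/adjective/ordinal numeral
    definite_left = prev1.startswith(("Nc", "Af", "Mo")) and prev1.endswith("y")
    # conditions d)-g): possessive article preceded by Af/Nc/Np/D
    poss_article = prev1.startswith("Ts") and prev2.startswith(("Af", "Nc", "Np", "D"))
    return "Ds" if definite_left or poss_article else "Ps"

# "cartea mea" (my book): definite common noun followed by the ambiguous form
print(ds_or_ps(["Ncfsry", "Ps|Ds"], 1))   # -> Ds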

In [29] we discuss our experimentation with TT and its evaluation for Romanian, where the initial lexicon tagset contained over 1,000 tags while the hidden tagset contained only 92 (plus 10 punctuation tags). Even more spectacular results were obtained for Hungarian, a very different language [30], [31], [32]. Hinrichs and Trushkina [33] report very promising results for the use of TT for German.

The hand-written recovery rules for the proper Ctag-set are the single language-dependent component in the tiered-tagging engine. Another inconvenience was related to the words not included in the tagger's lexicon. Although our tagger assigns a c-tag to any unknown word, the transformation of this c-tag into an appropriate MSD is impossible because, as can be seen from equation Eq. (1), this process is based on lexicon look-up. These limitations have been recently eliminated in a new implementation of the tiered tagger, called METT [34]. METT is a tiered tagging system that uses a maximum entropy (ME) approach to automatically induce the mappings between the Ctag-set and the MSD-set. This method requires a training corpus tagged twice: the first time with MSDs and the second time with c-tags. As we mentioned before, transforming an MSD-annotated corpus into its proper Ctag-set variant can be carried out deterministically. Once this precondition is fulfilled, METT learns non-lexicalized probabilistic mappings from the Ctag-set to the MSD-set. Therefore it is able to assign a contextually adequate MSD to a c-tag labeling an out-of-lexicon word.

2.3.1. Automatic Construction of an Optimal Baseline Ctag-set

Eliminating redundancy from a tagset encoding may dramatically reduce its cardinality without information loss (in the sense that if some information is left out it could be deterministically restored when or if needed). This problem has been previously addressed in [17] but in that approach a greedy algorithm is proposed as the solution. In this section we present a significantly improved algorithm for automatic construction of an optimal Ctag-set, originally proposed in [35], which outperforms our initial tagset designing system and is fully automatic. In the previous approach, the decision as to which ambiguities are allowed to remain in the Ctag-set relies exclusively on the MSD lexicon and does not take into account the occurrence frequency of the words that might remain ambiguous after the computation described in Eq. (1). In the present algorithm the frequency of words in the corpus is a significant design parameter. More precisely, instead of counting how many words in the dictionary will be partially disambiguated using a hidden tagset, we compute a score for the ambiguity classes based on their frequency in the corpus. If further reducing a baseline tagset creates ambiguity in the recovery process for a number of ambiguity classes and these classes correspond to very rare words, then the reduction should be considered practically harmless even without recovery rules.

The best strategy in using the algorithm is to first build an optimal baseline Ctag-set, with the designer determining the criteria for optimality. From the baseline tagset, a corpus linguist may further reduce the tagset, taking into account the distributional properties of the language in question. As any further reduction of the baseline tagsets leads to information loss, adequate recovery rules should be designed for ensuring the final tagging in terms of the lexicon encoding.

For our experiments we used the 1984 Multext-East parallel corpus and the associated word-form lexicons [36]. These resources were produced in the Multext-East and Concede European projects. The tagset design algorithm takes as input a word-form lexicon and a corpus encoded according to the XCES specifications used by the Multext-East consortium.


Since generating the baseline Ctag-sets requires no expert language knowledge, we ran the algorithm with the ambiguity threshold set to 0 (see below) and generated the baseline Ctag-sets for English and five East-European languages – Czech, Estonian, Hungarian, Romanian and Slovene. In order to find the best baseline tagset (the one ensuring the best tagging results), each generated tagset is used for building a language model and tagging unseen data (see the next section for details). We used a ten-fold validation procedure (using for training 9/10 of the corpus and the remaining 1/10 of the corpus for evaluation and averaging the accuracy results).

2.3.2. The Algorithm

The following definitions are used in describing the algorithm:

Ti = A c-tag

SAC(AMBi) = Σ{RF(w) | w ∈ AMBi} ≤ threshold: the frequency score of an ambiguity class AMBi, where:

RF(w) is the relative frequency in a training corpus of the word w characterized by the ambiguity class AMBi and threshold is a designer parameter (a null value corresponds to the baseline tagset); we compute these scores only for AMBs characterizing the words whose c-tags might not be fully recoverable by the procedure described in Eq.(1);

fAC(Ti) = {(AMBik, SAC(AMBik)) | AMBik ∩ msd-coverage(Ti) ≠ ∅} is the set of pairs of ambiguity classes and their scores such that each AMBik contains at least one MSD in msd-coverage(Ti);

pen(Ti, AMBj) = SAC(AMBj) if |AMBj ∩ msd-coverage(Ti)| > 1, and 0 otherwise; this is a penalty for a c-tag labeling any word characterized by AMBj which cannot be deterministically converted into a unique MSD. We should note that the same c-tag labeling a word characterized by a different AMBj might be deterministically recoverable to the appropriate MSD.

PEN(Ti) = Σ pen(Ti, AMBj), summed over the ambiguity classes AMBj appearing in fAC(Ti)

DTR = {APi} = a determinant set of attributes: P is a part of speech; the index i represents the attribute at position i in the MULTEXT-East encoding of P; for instance, AV4 represents the PERSON attribute of the verb. The attributes in DTR are not subject to elimination in the baseline tagset generation. Because the search space of the algorithm is structured according to the determinant attributes for each part of speech, the running time significantly decreases as DTRs become larger.

POS(code) = the part of speech in an MSD or a c-tag code.
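A small sketch of how SAC and PEN can be computed from these definitions, using toy relative frequencies and ambiguity classes; the real algorithm derives them from the full lexicon and training corpus.

# Toy relative frequencies and ambiguity classes (invented words and MSDs).
rel_freq  = {"w1": 0.0004, "w2": 0.0002, "w3": 0.0001}
amb_class = {"w1": frozenset({"Ncfsry", "Vmip3s"}),
             "w2": frozenset({"Ncmpry", "Ncfpry"}),
             "w3": frozenset({"Vmip2s"})}

def SAC(ac):
    """Frequency score of an ambiguity class: summed relative frequency of
    the words carrying that ambiguity class."""
    return sum(f for w, f in rel_freq.items() if amb_class[w] == ac)

def pen(coverage, ac):
    """SAC(ac) if a c-tag with this msd-coverage cannot be deterministically
    mapped back to a single MSD for words with ambiguity class ac, else 0."""
    return SAC(ac) if len(ac & coverage) > 1 else 0.0

def PEN(coverage):
    # summed over all ambiguity classes; non-overlapping ones contribute 0,
    # so this equals the sum over fAC as in the definition above
    return sum(pen(coverage, ac) for ac in set(amb_class.values()))

print(PEN(frozenset({"Ncfsry", "Vmip3s"})))   # only w1 stays ambiguous -> 0.0004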

The input data for the algorithm is the word-form lexicon (MSD encoded) and the corpus (disambiguated in terms of MSDs). The output is a baseline Ctag-set. The CTAGSET-DESIGN algorithm is a trial and error procedure that generates all possible baseline tagsets and with each of them constructs language models which are used in the tagging of unseen texts. The central part of the algorithm is the procedure CORE, briefly commented in the description below.

procedure CTAGSET-DESIGN (Lex, corpus; Ctag-set) is:
  MSD-set = GET-MSD-SET(Lex)
  AMB = GET-AMB-CLASSES(Lex)
  DTR = {POS(MSDi)}, i = 1..|MSD-set|
  MATR = GET-ALL-ATTRIBUTES(MSD-set)
  T = {}                                   ; a temporary set of candidate Ctag-sets
  for each AMBi in AMB
    execute COMPUTE-SAC(corpus, AMBi)
  end for
  while DTR ≠ MATR
    for each attribute Ai in MATR \ DTR
      D = DTR ∪ {Ai}                       ; temporary DTR
      T = T ∪ execute CORE({(AMBi, SAC(AMBi))+}, D)
    end for
    Ak = execute FIND-THE-BEST(T)
    DTR = DTR ∪ {Ak} & T = {}
  end while
  Ctag-set = KEEP-ONLY-ATT-IN-DTR(MSD-set, DTR)
  ; attribute values not in DTR are converted into '+' (redundant) in all MSDs & duplicates are removed
end procedure

procedure FIND-THE-BEST ({(ctagseti, DTRi)+}; Attr) is:
  rez = {}
  for each ctagseti in {(ctagseti, DTRi)+}
    tmp-corpus = execute MSD2CTAG(corpus, ctagseti)
    train = 9/10 * tmp-corpus & test = tmp-corpus \ train
    LM = execute BUILD-LANGUAGE-MODEL(train)
    Preci = execute EVAL(tagger, LM, test)
    rez = rez ∪ {(|ctagseti|, Preci, DTRi)}
  end for
  Attr = LAST-ATTRIB-OF-DTRI-WITH-MAX-PRECI-IN(rez)
end procedure

procedure CORE ({(AMBi, SAC(AMBi))+}, DTR; ({(Ti, msd-coverage(Ti))+}, DTR)) is:
  Ti = MSDi, i = 1..|MSD-set|
  msd-coverage(Ti) = {MSDi} & AMB(Ti) = fAC(Ti)
  TH = threshold & Ctag-set = {Ti}
  repeat until no attribute can be eliminated
    for each Ti in Ctag-set
      START:
      for each attribute Ajk of Ti so that Ajk ∉ DTR
        newTi is obtained from Ti by deleting Ajk
        1) if newTi ∉ Ctag-set then
             Ctag-set = (Ctag-set \ {Ti}) ∪ {newTi}
             continue from START
        2) else if newTi = Tn ∈ Ctag-set then
             msd-coverage(newTi) = msd-coverage(Tn) ∪ msd-coverage(Ti)
             AMB(newTi) = AMB(Tn) ∪ AMB(Ti)
             if PEN(newTi) = 0 then
               Ctag-set = (Ctag-set \ {Tn, Ti}) ∪ {newTi}
               continue from START
             else
        3)     if PEN(newTi) ≤ TH then
                 mctag = Ti & matrib = Ajk & TH = PEN(newTi)
                 continue from START
      end for
    end for
    4) eliminate matrib from mctag and obtain newTi
       for each Tn in Ctag-set so that Tn = newTi
         msd-coverage(newTi) = msd-coverage(Tn) ∪ msd-coverage(mctag)
         AMB(newTi) = AMB(Tn) ∪ AMB(mctag)
         Ctag-set = (Ctag-set \ {mctag, Tn}) ∪ {newTi}
       end for
       TH = threshold                      ; closing 4)
  end repeat
end procedure

The procedures BUILD-LANGUAGE-MODEL and EVAL were not presented in detail, as they are standard procedures present in any tagging platform. All the other procedures not shown (COMPUTE-SAC, KEEP-ONLY-ATT-IN-DTR, MSD2CTAG and LAST-ATTRIB-OF-DTRI-WITH-MAX-PRECI-IN) are simple transformation scripts.

The computation of the msd-coverage and AMB sets in step 2) of the procedure CORE can lead to non-determinism in the MSD recovery process (i.e. PEN(newTi) ≠ 0).

Step 3) recognizes the potential non-determinism and, if the generated ambiguity is acceptable, stores the dispensable attribute and the current c-tag, which are eliminated in step 4).

In order to derive the optimal Ctag-set one should be able to use a large training corpus (where all the MSDs defined in the lexicon are present) and to run the algorithm on all the possible DTRs. Unfortunately this was not the case for our multilingual data.

The MSDs used in the 1984 corpus represent only a fraction of the MSDs present in the word-form lexicons of each language. Most of the ambiguous words in the corpus occur only with a subset of their ambiguity classes. It is not clear whether some of the morpho-lexical codes would occur in a larger corpus or whether they are theoretically possible interpretations that might not be found in a reasonably large corpus. We made a heuristic assumption that the unseen MSDs of an ambiguity class were rare events, so they were given a hapax legomenon status in the computation of the scores SAC(AMBj). Various other heuristics were used to make this algorithm more efficient.


This was needed because generating the baseline tagset takes a long time (for Slovene or Czech it required more than 80 hours).

2.3.3. Evaluation Results

We performed experiments with six languages represented in the 1984 parallel corpus:

Romanian (RO), Slovene (SI), Hungarian (HU), English (EN), Czech (CZ) and Estonian (ET). For each language we computed three baseline tagsets: the minimal one (smallest-sized DTR), the best performing one (the one which yielded the best tagging precision) and the Ctag-set with precision comparable to that obtained with the MSD tagset.

We considered two scenarios, sc1 and sc2, differing in whether the tagger had to deal with unknown words; in both scenarios, the ambiguity classes were computed from the large word-form lexicons created during the Multext-East project.

In sc1 the tagger lexicon was generated from the training corpus; words that appeared only in the test part of the corpus were unknown to the tagger;

In sc2 the unigram lexicon was computed from the entire corpus AND the word-form lexicon (with the entries not appearing in the corpus being given a lexical probability corresponding to a single occurrence); in this scenario, the tagger faced no unknown words.

The results are summarized in Table 2. In accordance with [37], we agree that “it is not unreasonable to assume that a larger dictionary exists, which can help to obtain a list of possible tags for each word-form in the text data”. Therefore we consider sc2 to be more relevant than sc1.

Table 2. Optimal baseline tagsets for 6 languages

Lang.    MSD-set        Minimal Ctag-set   Best prec. Ctag-set   Ctag-set with prec. close to MSD
         No.   Prec.    No.    Prec.       No.    Prec.          No.    Prec.
RO sc1   615   95.8     56     95.1        174    96.0           81     95.8
RO sc2   615   97.5     56     96.9        205    97.8           78     97.6
SI sc1   2083  90.3     385    89.7        691    90.9           585    90.4
SI sc2   2083  92.3     404    91.6        774    93.0           688    92.5
HU sc1   618   94.4     44     94.7        84     95.0           44     94.7
HU sc2   618   96.6     128    96.6        428    96.7           112    96.6
EN sc1   133   95.5     45     95.5        95     95.8           52     95.6
EN sc2   133   95.9     45     95.9        61     96.3           45     95.9
CZ sc1   1428  89.0     291    88.9        735    90.2           319    89.2
CZ sc2   1428  91.8     301    91.0        761    92.5           333    91.8
ET sc1   639   93.0     208    92.8        355    93.5           246    93.1
ET sc2   639   93.4     111    92.8        467    93.8           276    93.5

The algorithm is implemented in Perl. Brants’ TnT trigram HMM tagger [25] was the model for our tagger included in the TTL platform [11], which was used for the evaluation of the generated baseline tagsets. However, the algorithm is tagger- and method-independent (it can be used in HMM, ME, rule-based and other approaches), given the compatibility of the input/output format. The programs and the baseline tagsets can be freely obtained from https://nlp.racai.ro/resources, under a free research license.

The following observations can be made concerning the results in Table 2:

• the tagging accuracy with the “Best precision Ctag-set” for Romanian was only 0.65% lower than the tagging precision reported in [29], where the hidden tagset (92 c-tags) was complemented by 18 recovery rules;

• for all languages the “Best precision Ctag-set” (scenario 2) is much smaller than the MSD tagset, it is fully recoverable to the MSD annotation and it always outperforms the MSD tagset; it seems unreasonable to use the MSD-set when significantly smaller tagsets in a tiered tagging approach would ensure the same information content in the final results;

• using the baseline Ctag-sets instead of MSD-sets in language modeling should result in more reliable language models since the data sparseness effect is significantly diminished.


The small differences in precision shown in Table 2 between tagging with the MSD-set and any baseline Ctag-set should not be misleading: it is very likely that the difference in performance will be much larger on texts from other registers (with the Ctag-sets always performing better);

• remember that the tagsets produced by the algorithm represent a baseline; to take full advantage of the power of the tiered tagging approach, one should proceed further with the reduction of the baseline tagset towards the hidden tagset. The way our algorithm is implemented suggests that the best approach in designing the hidden tagset is to use as DTRs the attributes retained in the “Minimal Ctag-set”.

The threshold parameter (procedure CORE), which controls the frequency of words that are not fully disambiguated in the tagged text, should be empirically determined. To obtain the hidden tagset mentioned in [29] we used a threshold of 0.027.

There are several applications for which knowing just the part of speech of a token (without any other attribute value) is sufficient. For such applications the desired tagset would contain about a dozen tags (most standardized morpho-lexical specifications distinguish 13-14 grammar categories). This situation is the opposite of the one we discussed (very large tagsets). Is the Ctag-set optimality issue relevant for such a shallow tagset? In [29] we described the following experiment: in our reference training corpora all the MSDs were replaced by their corresponding grammar category (position 0 in the Multext-East linear encoding, see Table 1). Thus, the tagset in the training corpora was reduced to 14 tags. We built language models from these new “training corpora” and used them in tagging a variety of texts. The average tagging accuracy was never higher than 93%. When the same texts were tagged with the language models built from the reference training corpora annotated with the optimal Ctag-set, and when all the c-tag attributes in the final tagging were removed (that is, the texts were effectively tagged with only 14 tags), the tagging accuracy was never below 99% (with an average accuracy of 99.35%). So, the answer to the last question is a definite yes!

2.4. Tagset Mapping and Improvement of Statistical Training Data

In this section we address another important issue concerning training data for statistical tagging, namely deriving mapping systems for unrelated tagsets used in existing training corpora (gold standards) for a specific language. There are many reasons one should address this problem, some of which are given below:

• training corpora are extremely valuable resources and, whenever possible, should be reused; however, usually, hand-annotated data is limited both in coverage and in size and therefore, merging various available resources could improve both the coverage and the robustness of the language models derived from the resulting training corpus;

• since gold standards are, in most cases, developed by different groups, with different aims, it is very likely that data annotation schemata or interpretations are not compatible, which creates a serious problem for any data merging initiative;

• for tagging unseen data, the features and their values used in one tagset could be better predictors than those used in another tagset;

• tagset mappings might reveal some unsystematic errors still present in the gold standards.

The method discussed in the previous section was designed for minimizing the tagsets by eliminating feature-value redundancy and finding a mapping between the lexical tagset and the corpus tagset, with the latter subsuming the former. In this section, we are instead dealing with completely unrelated tagsets [38].


Although the experiments were focused on morpho-lexical (POS) tagging, the method is applicable to other types of tagging as well.

For the experiments reported herein, we used the English component of the 1984 MULTEXT-EAST reference multilingual corpus and a comparable-size subset of the SemCor2.0 corpus (http://www.cs.unt.edu/~rada/downloads.html#semcor).

Let us introduce some definitions which will be used in the discussion that follows:

We denote by AGS(X) the gold standard corpus A, tagged in terms of the X tagset, and by BGS(Y) the gold standard corpus B, tagged in terms of the Y tagset.

• The direct tagging (DT) is the usual process of tagging, where a language model learned from a gold standard corpus AGS(X) is used in POS-tagging of a different corpus B: AGS(X) + B → BDT(X)

• The biased tagging (BT) is the tagging of the same corpus AGS(X) used for language model learning: AGS(X) + A → ABT(X). This process is useful for validating hand-annotated data. With a consistently tagged gold standard, the biased tagging is expected to be almost identical to the one in the gold standard [39]. We will use this observation to evaluate the gold standard improvements after applying our method.

• The cross-tagging (CT) is a method that, given two reference corpora, AGS(X) and BGS(Y), each tagged with a different tagset, produces the two corpora tagged with the other one’s tagset, using a mapping system between the two tagsets: AGS(X) + ADT(Y) + BGS(Y) + BDT(X) → ACT(Y) + BCT(X).

Cross-tagging is a stochastic process which uses both language models learned from the reference corpora involved. We claim that the cross-tagged versions ACT(Y), BCT(X) will be more accurate than the ones obtained by direct tagging, ADT(Y), BDT(X).

The cross-tagging works with both the gold standard and the direct-tagged versions of the two corpora and involves two main steps: a) building a mapping system between the two tagsets and b) improving the direct-tagged versions using this mapping system.

The overall system architecture is shown in Figure 1.

Figure 1. System Architecture

From the two versions of each corpus, <AGS(X), ADT(Y)> and <BGS(Y), BDT(X)>, tagged with the two tagsets (X and Y), we extract two corpus-specific mappings, MA(X, Y) and MB(X, Y). Merging the two corpus-specific mappings results in a corpus-neutral, global mapping between the two tagsets, M(X, Y).

2.4.1. Corpus-Specific Mappings

Let X = {x1, x2, …, xn} and Y = {y1, y2, …, ym} be the two tagsets. For a corpus tagged with both X and Y tagsets, we can build a contingency table (Table 3). For each tag x∈X, we define a subset of Y, Yx⊆Y, that has the property that for any yj∈Yx and for any yk∈Y–Yx, the probability of x conditioned by yj is significantly higher than the probability of x conditioned by yk. We say that x is preferred by tags in Yx, or conversely, that tags in Yx prefer x.
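A minimal sketch of deriving the preference sets Yx from such a contingency table; a plain conditional-probability threshold stands in here for the significance criterion, and the counts are invented.

from collections import Counter, defaultdict

# invented contingency counts: counts[(x, y)] = tokens tagged x (tagset X)
# and y (tagset Y) in the doubly tagged corpus
counts = Counter({("NN", "Ncms"): 950, ("NN", "Vmn"): 30,
                  ("VB", "Vmn"): 800, ("VB", "Ncms"): 20})

def preference_sets(counts, ratio=0.5):
    """Y_x = {y : P(x|y) > ratio}; the threshold is a crude stand-in for
    'P(x|y_j) significantly higher than P(x|y_k) for y_k outside Y_x'."""
    y_totals = defaultdict(int)
    for (_, y), c in counts.items():
        y_totals[y] += c
    yx = defaultdict(set)
    for (x, y), c in counts.items():
        if c / y_totals[y] > ratio:
            yx[x].add(y)
    return dict(yx)

print(preference_sets(counts))   # {'NN': {'Ncms'}, 'VB': {'Vmn'}}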

