
Summarizing Short Texts Through a Discourse-Centered Approach in a Multilingual Context

Academic year: 2022



Daniel Anechitei1, Dan Cristea1,2, Ioannidis Dimosthenis3, Eugen Ignat1, Diman Karagiozov4, Svetla Koeva5, Mateusz Kopeć6, Cristina Vertan7

1 “Alexandru Ioan Cuza” University of Iași, Department of Computer Science

2 Institute for Computer Science, Romanian Academy, Iași branch

3 ATLANTIS, Athens

4 Tetracom Interactive Solutions Ltd., Sofia

5 Institute for Bulgarian Language, Bulgarian Academy of Sciences, Sofia

6 Institute of Computer Science, Polish Academy of Sciences, Warsaw

7 Department of Linguistics, University of Hamburg

Abstract

The chapter presents the architecture of a system targeting summaries of short texts in six languages. At the core of a summary, which is composed of clauses and sentences extracted from the original text, lies the structure of the discourse and its relationship with coreferential links. The approach shows a uniform design for all languages, while language specificity is attributed to the resources that fuel the component modules. The design described here includes a number of feedback loops used to fine-tune the parameters by comparing the output of the modules against annotated corpora. "Average" summaries computed over several human-produced ones are used to evaluate the accuracy of each of the monolingual systems. The study also presents some quantitative data on the corpora used, showing a comparison among languages and results that mostly prove to be above the state of the art.

1 INTRODUCTION

The purpose of this chapter is to describe a multilingual summary production line which applies techniques centered on discourse structure. A number of features distinguish this approach from other state-of-the-art summarization techniques. The first feature relates to the quality of the summaries produced by the system, which places issues of coherence at its core. That is, a summary is a text by itself and, as such, just like the original, it should preserve the qualities of being cohesive and coherent, even if it is made up of elementary pieces extracted from the original text and re-assembled. The second feature is important from an engineering point of view: we show that a generic architecture can describe summarization systems for more than one language. Part of this architecture includes text processing modules commonly used in many NLP applications, but some are built for the explicit purpose of being integrated into the summarization system and are adapted for one language or another. In all cases, the component modules are designed as standardized input-output black boxes and added to the summarization system. While the concept of language independence is much desired and discussed in modern NLP, few systems can truly operate as language-independent. Finally, what distinguishes our system from others is that our summarization architecture, which can be generically applied to more than one language, opens up some interesting possibilities for comparisons among languages. As a result, we have been able to identify some very interesting correlations between quantitative data characterizing the training corpora and the evaluation results obtained in our experiments.

In this chapter, we focus on presenting the individual natural language processing (NLP) modules belonging to the proper summarization chain, the linguistic resources necessary for localization of these modules in the specific languages, along with our evaluation and results.

The chapter is broken into the following sections: Section 2 gives a brief background of the approach, Section 3 presents the anaphora resolution module, Section 4 – the clause segmentation module, Section 5 – the discourse parser, Section 6 – the summarization module, Section 7 – the corpora used and the results, and Section 8 includes a discussion and some concluding remarks.

2 BACKGROUND

In our experiments, the languages under consideration are Bulgarian, German, Greek, English, Polish and Romanian, but, as mentioned earlier, our approach is general enough not to be limited to these specific languages. We target only short texts (less than 6 pages) and produce extract-type summaries out of the discourse structure. We describe in this chapter an extract-type summarizer, one in which summaries are made out of sequences of text spans (extracts, bodies of text) that are copied and pasted from the original input. As will become obvious later on, the elementary pieces of text out of which we assemble summaries are discourse clauses. Our approach to short text summarization follows the one presented in (Cristea et al., 2005), where the summary is generated from a tree-like discourse structure of the original text. The discourse structures obtained resemble Rhetorical Structure Theory (RST) trees (Mann and Thompson, 1988): the constituent nodes evidence rhetorical relations between text spans that are either nuclei or satellites, and the terminal nodes are elementary discourse units (edus). However, for our summarization goal in particular, we ignore the relation names and retain only the nuclearity markings from the discourse structure. The final output of the system consists of general summaries, but it is also possible to produce summaries focused on entities, characters or events. For evaluation, we compare summaries extracted automatically against those indicated by human subjects.

The overall summarization system is truly multilingual in the sense that it first detects the language of the text and subsequently switches to the specific language processing chain.

Apart from small variations, all language versions have a similar design, as displayed in Figure 1, Figure 2 and Figure 3.

Figure 1: The block architecture of the summarization process

The Prerequisite part is a basic language processing chain (LPC), which includes the steps usually needed in many applications. In Figure 2, this is indicated by placing the modules in a pipeline, although slight variations of this chain could be effective in different languages, depending on the proper realization of the envisioned functionalities.

Figure 2: Details of the summarization Prerequisites (text → SEN → TOK → POS → LEM → NP → NER → xml)

Figure 3: Details of the proper Summarizer chain (AR → CS → DP → SUM → SMO → summary)

The abbreviations in Figures 2 and 3 have the following meanings: SEN = sentence splitter, TOK = tokenizer, POS = part-of-speech tagger, LEM = lemmatizer, NP = noun phrase chunker, NER = named entity recognizer, AR = anaphora resolver, CS = clause splitter, DP = discourse parser, SUM = summarizer, SMO = smoothing module1.
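The chaining of these black-box modules can be sketched as a simple function pipeline. This is a minimal illustration rather than the project's actual implementation: module names follow Figure 2, and the bodies of `sen` and `tok` are naive stand-ins for the real, resource-driven components.

```python
def sen(doc):
    # SEN, sentence splitter: naive split on final punctuation (a stand-in).
    doc["sentences"] = [s.strip() for s in doc["text"].split(".") if s.strip()]
    return doc

def tok(doc):
    # TOK, tokenizer: whitespace tokenization per sentence (a stand-in).
    doc["tokens"] = [s.split() for s in doc["sentences"]]
    return doc

def run_chain(text, modules):
    # Each module is a standardized input-output black box enriching a shared
    # annotation object, mirroring the SEN -> TOK -> POS -> ... pipeline.
    doc = {"text": text}
    for module in modules:
        doc = module(doc)
    return doc

doc = run_chain("It was raining. John took his umbrella.", [sen, tok])
```

Localizing such a chain to a new language then amounts to swapping the resources behind each module, not the chain itself.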

The intermediate format between the modules belonging to the Prerequisite chain and the proper Summarization chain is depicted here as xml, although, in order to cope with the standardization requirements of an international project2, each module has also been encapsulated into a UIMA CAS3 objects' interface. The UIMA modules of the resulting version will be referred to in this chapter as the integrated components. With a few exceptions, all modules implement a language-independent vision, in which processing reflects transformations applied to the input in order to obtain an enriched output, and a similar type of processing is performed in all languages. To obtain the specific behaviour in one language or another, the modules are fuelled with language-specific resources. We will not dwell in this chapter on the Prerequisite processors, which are attentively described elsewhere4.

3 ANAPHORA RESOLUTION

3.1 The model and the engine

The resolution of pronominal anaphors is important in a summarization task for at least two reasons: 1) we want the pronouns appearing in a summary to have their antecedents included in the summary; and 2) we want the position of the anaphor and the antecedent to be in correlation with the discourse-tree structure (Fox, 1987; Cristea et al., 1999; Serețan and Cristea, 2002). For these reasons, it is clear that building the discourse structure is a process that works in tandem with one that discovers the antecedents of referential expressions. We describe here an approach in which the discovery of discourse structure comes after the resolution of anaphors and greatly benefits from it.

1 Not described in this chapter: it performs cosmetic adjustments on the produced summaries, such as capitalizing the beginnings of sentences, introducing commas between clauses and attaching punctuation signs to the preceding words, but especially replacing pronouns with co-referent proper nouns when all more informative antecedents happen to be left outside the summary.

2 The ICT-PSP ATLAS project, see Acknowledgements.

3 http://uima.apache.org/

4 http://ec.europa.eu/information_society/apps/projects/logos//7/250467/080/deliverables/001_ATLASD41LanguageProcessingChains1012609.pdf

Anaphora is the phenomenon of reiteration of an entity (called the "antecedent") by a reference (called the "anaphor") that points back to that entity. For practical reasons, we call both participants in an anaphoric relation referential expressions (REs). As such, anaphora resolution (AR) is understood as the process of identifying the antecedent of an anaphor. For a proper understanding of a text, it is extremely important that pronouns, common nouns and even proper nouns correctly recover their antecedents. In fact, during reading, it is very likely that an anaphor becomes, in its turn, an antecedent for another co-referential anaphor that appears later in the text.

We anchor our AR mechanism on a cognitive model that describes the reading of a text as a mental process of developing abstract descriptions of the entities mentioned in the text5. We will call discourse entity (DE) a semantic representation (placed on a cognitive layer) of a referential expression (residing on the text layer). In Figure 4, the coreference relation between different participants of a coreferential chain is shown as a series of propose-evoke-evoke relations, linking the different textual realizations (i.e. REs), and their unique semantic (cognitive) representation (i.e. a DE).

Figure 4: Two-layer representation of a co-referential anaphoric relation

In our approach the referential expressions are noun phrases (NPs), which include different surface forms of pronouns, common nouns and proper nouns, with their modifiers (except for relative clauses). The notation for an NP should contain an indication for the head noun. NPs could have recursive structures, but in this case the corresponding heads should be distinct.

5 In fact, the development of these mental structures attached to discourse entities is only one of the many processes that are performed during reading: recognition of events and filling their respective roles, interpretation of metaphors, correlation of time mentions, etc.


Examples (NPs in brackets, heads – underlined): <John Smith>, <him>, <<her> hat>, <<the University> building>, <two cats>, <a wonderful brunette in <a blue car>>.

As the reading progresses, a semantic representation is first born when RE1 is encountered. Then, at a later moment, when RE2 is read, it evokes the DE already built by RE1, and any subsequent co-referring REs will in turn evoke the same DE. One way of representing discourse entities in NLP systems is as feature structures, which consist of lists of attribute-value pairs. The exact configuration of these attributes, as well as their types (ranges of accepted values), is determined by the anaphora resolution model. The DE thus becomes a repository of features contributed by the different REs it connects, and can either be stable during reading or evolve from a knowledge-poor to a knowledge-rich representation6. For instance, the coreferential chain the professor… she implies the proposal of an initial DE, configured as [sem={person, professor}] during the reading of the RE the professor, which is then enriched to [sem={person, professor, female}] at the moment of the reading of she. Note that the reference she may include the feature [sem={person, ship}], and thus a partial match helps the resolution process.
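The propose/evoke mechanism with feature enrichment can be sketched on a toy representation. This is a minimal illustration of the two-layer model, with hypothetical feature names, not the actual RARE data structure:

```python
class DiscourseEntity:
    """A DE: the repository of features contributed by the REs it connects."""

    def __init__(self, re_text, sem):
        # propose: the first RE creates the DE and seeds its features
        self.chain = [re_text]
        self.sem = set(sem)

    def evoke(self, re_text, sem):
        # evoke: a later RE joins the chain and may enrich the representation
        self.chain.append(re_text)
        self.sem |= set(sem)

de = DiscourseEntity("the professor", {"person", "professor"})
de.evoke("she", {"person", "female"})   # knowledge-poor -> knowledge-rich
```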

The text is processed left to right, and a decision is taken each time a new referential expression is met. The engine leaves behind chains of co-referential expressions. Each chain is characterized by a data structure recording all the features of the REs in the chain. This is what we call a discourse entity (DE) – see above. To give an example, suppose John Smith, an ex-professor of computer science, 70 years of age, has been referred to in a text as: a professor of computer science, John Smith, a 70 year old man, Mr. Smith, he, John, him, he, the old man, the professor, John Smith, and he again. At the end of the text, RARE (our anaphora resolution engine, see below), ideally, leaves behind a DE which approximately includes the following feature structure: [ID = DE009; SEM = {person, professor of computer science, 70 years old man}; GEN = male; NAME = {John, Smith}], as well as links to all the corresponding REs in the surface string.

The process runs as follows, while the text unfolds left to right. When a new RE, say REx, is met, its set of morphological, syntactic and semantic features is tested against the recently proposed/evoked DEs which have been left behind by the engine. If, among them, there is one, say DEy, for which the matching score of the pair (REx, DEy) stands out significantly well, then the actual REx is added to the already existing chain of referential expressions attached to DEy; otherwise a new DE is generated (proposed – in Figure 4), REx's features are copied onto it and REx becomes the first referential expression of a new chain.

6 Sometimes, the representation can change dramatically during the unfolding of the text, as in the case of coreferential chains of the form: the child… the young lady… the woman. A solution would be to keep more DEs on the cognitive layer, representing distinct instances of the same entity at different moments of time (Cristea and Dima, 2001).
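This left-to-right decision loop can be sketched as follows; it is a minimal sketch assuming a simple feature-overlap score and an illustrative threshold, whereas RARE's actual rules and optimized weights are considerably richer:

```python
def overlap_score(re_feats, de):
    # Illustrative matching score: proportion of RE features already on the DE.
    return len(set(re_feats) & de["sem"]) / max(len(re_feats), 1)

def resolve(res, threshold=0.5):
    """Process REs left to right; attach each one to the best-matching DE
    or propose a new DE, mirroring the propose/evoke mechanism."""
    des = []
    for text, feats in res:
        best = max(des, key=lambda d: overlap_score(feats, d), default=None)
        if best is not None and overlap_score(feats, best) >= threshold:
            best["chain"].append(text)       # evoke: extend the chain
            best["sem"] |= set(feats)        # enrich the DE
        else:
            des.append({"sem": set(feats), "chain": [text]})  # propose a new DE
    return des

des = resolve([("John Smith", {"person", "male"}),
               ("a cat", {"animal"}),
               ("he", {"person", "male"})])
```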

It is interesting to see that in this model the distinction between anaphora and cataphora is given by the order of the distinct surface realizations: proper noun or common noun before the pronoun, or vice versa. In fact, cataphora represents only an instance of a whole class of references in which a knowledge-rich reference enriches the semantic representation of an existing knowledge-poor DE by bringing in new features. Another instance of the same class are coreferential sequences such as an animal… the elephant, which can hardly be accepted as coreferential (compare to the sequence an elephant… the animal).

Resolution of anaphora in the multilingual summarization enterprise that we describe here has been performed with RARE (Robust Anaphora Resolution Engine) (Cristea and Postolache, 2005) – a framework for building rule-based anaphora resolution tools7. Its collection of symbolic rules uses weights which are optimized with genetic algorithms. The core of the system is language independent, and its localization to one language or another was assured by specific resources (see section 3.2).

3.2 Localization of RARE

The adaptation of the general RARE machinery to different languages was done by localizing a number of resources, the most prominent being the collection of rules incorporating matching conditions between the anaphor (seen as an RE) and the antecedent (seen as a DE). These rules are responsible for deciding whether a referential expression refers to (evokes) a discourse entity already mentioned or introduces a new one.

There are three types of rules put to work on a pair (REx, DEy):

- certifying rules: if such a rule is evaluated to TRUE on a pair (REx, DEy), it certifies without ambiguity the DEy as a referent for the REx. For instance, identical proper names usually denote the same person. In the example above, the second RE John Smith is deciphered to refer to DE009 with the help of such a rule (the DE having already included this name among its features);

7 proprietary of UAIC-FII: http://nlptools.info.uaic.ro

- demolishing rules: if such a rule fires on a pair (REx, DEy), it filters out DEy as a referent candidate for REx. RARE includes a hard-wired demolishing rule, invalidating any attempt to establish a co-referential link between nested referential expressions. In the example above, this rule invalidates a co-referential link between computer science and a professor of computer science;

- promoting rules: if such a rule is evaluated to TRUE on a pair (REx, DEy), it increases a resolution score associated with the pair: a match of the condition expressed in such a rule adds a positive value to the overall resolution score of the pair. If no certifying rule has yet fired for REx with any recorded DE, then the DEy with the best overall score, among those for which no demolishing rule was triggered against REx, will be chosen. Supposing the text includes a sentence like John Smith is a 70 year old man., such a rule could yield a coreferential link between the nominal predicate and the subject (more exactly, the DE the subject John Smith refers to). Supposing this DE is DE009, the rule will add to the DE's set of features {professor of computer science, John Smith} a new one: 70 year old man.
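The interplay of the three rule types can be sketched as a scoring function over a (RE, DE) pair: demolishing rules filter the candidate out, certifying rules short-circuit the decision, and promoting rules accumulate weights. The concrete rules below are illustrative stand-ins, not RARE's actual rule set:

```python
def score_pair(re, de, certifying, demolishing, promoting):
    """Combine the three rule types on a (RE, DE) pair.
    Returns None if a demolishing rule fires, float('inf') if a certifying
    rule fires, otherwise the sum of the matched promoting-rule weights."""
    if any(rule(re, de) for rule in demolishing):
        return None
    if any(rule(re, de) for rule in certifying):
        return float("inf")
    return sum(w for rule, w in promoting if rule(re, de))

# Illustrative rules over hypothetical feature dictionaries:
same_name = lambda re, de: re.get("name") and re["name"] == de.get("name")
nested = lambda re, de: re.get("parent") == de.get("id")   # nested REs never corefer
gender_match = lambda re, de: re.get("gen") == de.get("gen")

s = score_pair({"name": "John Smith", "gen": "m"},
               {"id": 1, "name": "John Smith", "gen": "m"},
               certifying=[same_name], demolishing=[nested],
               promoting=[(gender_match, 0.4)])
```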

4 CLAUSE SEGMENTATION

4.1 The model

A clause is a grammatical unit of a sentence that includes, minimally, a predicate and an explicit or implied subject, and expresses a proposition (Nguyen et al., 2009), a statement or an event. Clauses can be continuous or interrupted text spans. The identification of clause boundaries is important for a number of NLP applications, such as machine translation, text-to-speech systems, parallel text alignment, building the discourse structure and automatic summarization. In a rule-based approach, such as (Leffa, 1988), the clauses are reduced to a noun, an adjective or an adverb. Parveen et al. (2011) and Orăsan (2000) describe hybrid methods, in which the results of a machine learning algorithm, trained on an annotated corpus, are post-processed by a shallow rule-based module intended to improve the accuracy. Pușcașu (2004) transfers the technique that Orăsan describes for English to Romanian sentences, with good results. In Șoricuț and Marcu (2003), the discourse segmentation task is formulated as a binary classification problem: deciding whether or not to insert a segment boundary after each word in the sentence. Subba and Di Eugenio (2007) use artificial neural networks to segment sentences into clauses, which are then used as edus by a discourse parser. In Hilbert et al. (2006), the list of discourse markers, which indicate possible rhetorical relations, is manually developed.

Many discourse parsing and summarization techniques make use of clauses as the elementary discourse units of the discourse structure and as the building blocks of summaries. Our approach to discourse segmentation starts from the assumption that a clause should be headed by a main verb or a verbal compound. As such, the delimitation of clauses starts from the identification of verbs and verb compounds, and then the clause boundaries are looked for in-between these pivots. Verb compounds are lexical sequences in which one element is the main verb and the others are auxiliaries, infinitives or conjunctives that complement the main verb, such that the semantics of the main verb in the current context requires taking the whole verbal construction together. An example is "like to swim" (Ex. 1): placing a clause boundary between "like" and "swim" would separate the verb from one of its compulsory arguments.

Ex. 1

<When I go to the river,> <I like to swim with friends.>
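The pivot-based search for boundary sites can be sketched as follows. This is a minimal illustration: the POS tag set and the compound list are hypothetical, and the real segmenter decides the exact boundary inside each candidate span with the trained model:

```python
def candidate_spans(tagged, compounds=()):
    """Return index pairs (i, j) between consecutive verb pivots where a
    clause boundary may be placed. `tagged` is a list of (word, POS) pairs;
    verb compounds (e.g. "like to swim") count as a single pivot."""
    pivots = []
    i = 0
    while i < len(tagged):
        word, pos = tagged[i]
        if pos == "V":
            j = i
            # Absorb a compound such as "like to swim" into one pivot.
            for comp in compounds:
                if [w for w, _ in tagged[i:i + len(comp)]] == list(comp):
                    j = i + len(comp) - 1
            pivots.append((i, j))
            i = j + 1
        else:
            i += 1
    # Boundary candidates lie strictly between the end of one pivot
    # and the start of the next.
    return [(a[1] + 1, b[0]) for a, b in zip(pivots, pivots[1:])]

tagged = [("When", "C"), ("I", "P"), ("go", "V"), ("to", "T"), ("the", "D"),
          ("river", "N"), ("I", "P"), ("like", "V"), ("to", "T"), ("swim", "V"),
          ("with", "T"), ("friends", "N")]
spans = candidate_spans(tagged, compounds=[("like", "to", "swim")])
```

On Ex. 1, the compound "like to swim" is kept as one pivot, so no boundary can split it.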

The exact place of a clause boundary between verbal phrases is, in many cases, indicated by discourse markers (key words or expressions) like in Ex. 2.

Ex. 2

<Markers are good> <because they can give information on the discourse structure.>

Often, a discourse marker signals a rhetorical relation that glues together two text spans. When markers are missing, as between the 1st and the 2nd clause in Ex. 3, boundaries can still be indicated by punctuation marks or other clues which, presumably, may be identified by statistical methods.

Ex. 3

<Although the snow was falling uninterruptedly,> <the slope was still in pretty good condition.>

The clause segmenter is trained on explicit annotations given in manually built files for all the languages under scrutiny. During the training of the segmenter, a window of n POS tags to the left of the candidate marker and m POS tags to the right defines the context. For the cases in which there is no marker at the boundary between clauses, a symmetrical window of l POS tags is used. The values of the three parameters m, n and l are set at calibration time.

Discourse markers may have one or several rhetorical functions. We have already mentioned that our discourse trees evidence only the nuclearity of arguments, while the names of relations are ignored. As such, to characterize markers, only the features relevant to their nuclearity patterns have to be retained and collected from the annotated corpus. The notations N_N, N_S and S_N represent the nuclearity (N = nucleus, S = satellite) of the two arguments around a marker. For instance, the marker "and" occurs in the English corpus 205 times with the N_N pattern, and the marker "which" occurs 35 times with the N_S pattern and in only one case with the N_N pattern. The model is built using the MaxEnt8 library.
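The POS-window context used during training can be sketched as a small feature extractor; the padding scheme and tag names here are illustrative assumptions, not the exact feature encoding of the trained model:

```python
def marker_context(pos_tags, idx, n, m):
    """Context features for a candidate marker at position idx:
    n POS tags to the left and m to the right (padded at the edges),
    in the spirit of the windows used to train the MaxEnt model."""
    left = ["PAD"] * max(0, n - idx) + pos_tags[max(0, idx - n):idx]
    right = pos_tags[idx + 1:idx + 1 + m]
    right += ["PAD"] * (m - len(right))
    return left + right

pos = ["N", "V", "C", "P", "V", "N"]   # "C" at index 2 is the candidate marker
ctx = marker_context(pos, 2, n=2, m=2)
```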

The Training module generates the markers' model: for each marker, an attribute (TYPE) may take one of two values (FRONT or BACK), representing the position of the marker relative to the boundaries found in the corpus. If a marker is annotated both ways, the value of the attribute TYPE is decided based on the highest frequency. For example, if the manually annotated corpus displays more cases of clause boundaries found in front of the "and" marker than after it, then the value of the attribute TYPE will be FRONT. An example of such a segmentation is given in Ex. 4:

Ex. 4

<Verbs and verb compounds are considered pivots> <and clause boundaries are looked for in-between them.>
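The majority-frequency decision for the TYPE attribute can be sketched directly:

```python
from collections import Counter

def marker_type(annotations):
    """Decide a marker's TYPE attribute (FRONT or BACK) by the most
    frequent boundary position observed in the annotated corpus."""
    return Counter(annotations).most_common(1)[0][0]

t = marker_type(["FRONT", "FRONT", "BACK", "FRONT"])   # majority wins
```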

The Segmenter module consists of two steps: first, it applies a machine learning algorithm to recognize whether pairs of verbs can be taken as compound verbs and, second, it applies rules and heuristics based on pattern matching and machine learning to identify clause boundaries.

The training of the segmentation model aims to put in evidence patterns of the markers' uses, upon which segmentation boundaries are decoded. Negative examples, ideally equal in number to the positive examples, are also searched for in the corpus, in all cases of literals which can function as markers in some contexts and as non-markers in others. Positive and negative examples are also collected for clause boundaries which are not explicitly announced by markers. Ex. 5 gives a couple of negative examples for the cue words and and that – marked with strikethrough. These are cue phrases that can play the role of discourse markers only in some cases. It also shows a case of a clause boundary where a marker is missing (between units [2] and [3]).

8 The Maximum Entropy Framework: http://maxent.sourceforge.net/about.html

Ex. 5

<In times past there lived a king and queen,[1]><who said to each other every day of their lives,[2]> <“Would that we had a child”![3]>

If, in the sequence of tokens between two verbs, the system detects neither a marker nor corpus examples of clause boundaries without markers, the text is not segmented.

Finally, an Evaluation module (E in Figure 5) is used to compare a test file (the output from the Segmenter module) against a gold file (manually annotated at clause boundaries).

Two metrics have been considered in making this comparison. The first one calculates Precision, Recall and F-measure by comparing the boundaries in the test and gold files. The second metric, called Accuracy (A in the formula below), is less restrictive and computes the inclusion of words in the proper segments:

A = S / N, where, for each word w:

– l_t(w) represents the length of the clause the word belongs to, in the test file;

– l_g(w) represents the length of the clause the word belongs to, in the gold file;

– s(w) represents a score attached to the word w, the same for all words belonging to the same test clause;

– N represents the total number of words in the test file;

– S represents the sum of the scores of all words.
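The first, boundary-based metric can be sketched directly; the boundary positions below are illustrative token indices:

```python
def boundary_prf(test_bounds, gold_bounds):
    """Precision, Recall and F-measure over clause-boundary positions,
    comparing the Segmenter output against the gold annotation."""
    test, gold = set(test_bounds), set(gold_bounds)
    tp = len(test & gold)                      # boundaries found in both files
    p = tp / len(test) if test else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

p, r, f = boundary_prf(test_bounds=[5, 12, 20], gold_bounds=[5, 12, 18])
```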

In a multilingual system, an important issue is the calibration of the system for each of the targeted languages. The multitude of parameters makes manual calibration a procedure that is very delicate, time consuming and prone to errors. In order to avoid this, we have developed an automatic calibration procedure intended to find the configuration of the system's parameters that achieves the best segmentation results for each language. This makes the quality of a specific clause segmenter rely only on the quantity and quality of the manually annotated corpus and less on contextual data as reflected in the set of parameters. The Calibration module needs a configuration file, an input file, a corresponding gold file and a training corpus. It iterates the sequence of modules <Training, Segmenter, Evaluation> over the whole scope of preference parameters until the best possible results are obtained. The specific values of the parameters associated with the best run are frozen and used in the lifelong routine of the system, as shown in Figure 5.
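The calibration loop amounts to a grid search over the parameter space. A minimal sketch, with a stand-in evaluation function in place of the real <Training, Segmenter, Evaluation> run (the target values are purely illustrative):

```python
from itertools import product

def calibrate(param_grid, evaluate):
    """Iterate over all parameter combinations and freeze the best one,
    mirroring the <Training, Segmenter, Evaluation> calibration loop.
    `evaluate` maps a parameter dict to a quality score (higher is better)."""
    best_params, best_score = None, float("-inf")
    names = sorted(param_grid)
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Stand-in evaluation: pretend quality peaks at n=2, m=3, l=2.
target = {"n": 2, "m": 3, "l": 2}
fake_eval = lambda p: -sum(abs(p[k] - target[k]) for k in p)
best, score = calibrate({"n": [1, 2, 3], "m": [1, 2, 3], "l": [1, 2, 3]}, fake_eval)
```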

Figure 5: The Calibration chain for Clause Segmentation

5 DISCOURSE PARSING

5.1 Incremental parsing at the discourse, paragraph and sentence level


Discourse parsing is the process of inferring the structure of a discourse from its basic elements (sentences or clauses), in the same way one would build the parse of a sentence from its words (Bangalore and Stent, 2009).

Rhetorical Structure Theory (Mann and Thompson, 1988) is one of the most popular discourse theories. In RST, the discourse segments (edus) are plain text units; their aggregation into larger segments configures an understanding of the meaning of their combination. The theory evidences a whole class of relationships between segments of text that details the coherence of a text. A text segment assumes one of two roles in a relationship: nucleus or satellite. If a nuclear unit is deleted, the discourse may become incoherent, while if a satellite is lost, the discourse only loses some details. As with many things at the level of discourse interpretation, where we talk about comprehensibility and degree of coherence, the distinction between nuclei and satellites is often subjective. The size of a text unit is arbitrary, but each should include a self-contained predication (De-Silva and Henderson, 2005). As in many other approaches (Taboada and Mann, 2006), in our model the edus are clauses. Rhetorical relations (binary, for simplification), holding between non-overlapping text spans, are of two kinds: hypotactic and paratactic. Hypotactic relations connect satellites to nuclei, while paratactic relations hold between text segments of equal importance, both considered nuclear.

Discourse structures have a central role in several computational tasks, such as question-answering, dialogue generation, summarization, information extraction, etc. The HILDA discourse parser (Hernault et al., 2010) is a text-level discourse parser with state-of-the-art performance. The system was trained on a variety of lexical and syntactic features extracted from a manually annotated corpus. Some of HILDA's features are borrowed from (Șoricuț and Marcu, 2003), where the discourse tree is built with the help of two classifiers in cascade – a binary structure classifier that determines whether two adjacent text units should be merged to form a new sub-tree, and a multi-class classifier that determines which discourse relation label should be assigned to the new sub-tree (Feng and Hirst, 2012).

Our Discourse Parser produces discourse trees that include nuclearity markings but lack rhetorical relation names. The terminal nodes of the discourse tree represent clauses (edus), while the intermediate nodes represent spans of text larger than an edu. The parser adopts an incremental policy in developing the trees, on three levels: the sentence level, the paragraph level and the discourse level (representing the whole text). At each level, the parser maintains a forest of developing trees in parallel, ranked by a global score that takes into consideration a number of heuristics (detailed in section 5.3). At each step in the process, the system retains only the best-scored trees of the previous step. The aim of this pruning is to master the exponential explosion of the developing structure.
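This keep-only-the-best policy is, in effect, a beam search over developing trees. A minimal sketch, with toy "trees" and a dummy scoring heuristic standing in for the parser's real structures and heuristics:

```python
def beam_step(trees, expansions, score, beam_width):
    """One incremental parsing step: expand every tree in the current forest,
    then keep only the best-scored trees to curb the exponential explosion."""
    candidates = [t2 for t in trees for t2 in expansions(t)]
    candidates.sort(key=score, reverse=True)
    return candidates[:beam_width]

# Toy example: a "tree" is a list of attachment choices; each step attaches
# the next edu either Left or Right, scored by a dummy heuristic.
expand = lambda t: [t + ["L"], t + ["R"]]
score = lambda t: t.count("R")            # dummy global score
forest = [["e1"]]
for _ in range(3):
    forest = beam_step(forest, expand, score, beam_width=2)
```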

This section gives a description of the basic incremental parsing approach. The input to the parser is the text augmented with information about: SEN (sentence boundaries), TOK + POS + LEMMA (tokens with their parts of speech and lemmas), NP (noun phrases, acting as referential expressions), DE (discourse entities, acting as coreference chains) and CLAUSE (clause boundaries, acting as edus). As already seen, this complex annotation is the result of the Prerequisite modules, the RARE module and the clause splitter module (see Figure 2 and Figure 3).

All generated trees observe the principle of sequentiality (Marcu, 2000): a left-to-right reading of the terminal frontier of the tree associated with a discourse must correspond to the span of text it analyses, in the same left-to-right order.

Our incremental discourse parsing approach borrows the two operations used in (L)TAG (lexicalized tree-adjoining grammar) (Joshi and Schabes, 1997): adjunction and substitution.

The adjunction operation takes an initial tree or a developing tree (D-tree) and creates a new developing tree by combining it with an auxiliary tree (A-tree). The auxiliary tree includes a special node, called the foot node (denoted by the * sign), which is placed on its terminal frontier. The adjunction operation temporarily dismounts the sub-tree headed by an adjunction node placed on the right frontier9 of a D-tree, attaches it to the A-tree by replacing the foot node, and finally replaces the adjunction node of the D-tree with the augmented A-tree. Figure 6 depicts this operation.

9 The right frontier represents the path from the root of the tree to its rightmost leaf node.

Figure 6: The adjunction operation

An auxiliary tree whose foot node is the left child of its parent node is called a left-footed auxiliary tree. As proved in Cristea (2005), only left-footed auxiliary trees participating in adjunction operations on the right frontiers of D-trees maintain the correctness of the intermediary D-trees at each step.
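The adjunction of a left-footed auxiliary tree can be sketched on a toy tree representation. This is a simplified illustration: the adjunction node is fixed at the root of the D-tree (in the parser it can be any node on the right frontier), and node labels record nuclearity only:

```python
class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

def leaves(t):
    # Terminal frontier, read left to right.
    return [t.label] if not t.children else [x for c in t.children for x in leaves(c)]

def find_foot_parent(node):
    # In a left-footed A-tree, the foot node '*' is the left child of its parent.
    if node.children and node.children[0].label == "*":
        return node
    for child in node.children:
        found = find_foot_parent(child)
        if found:
            return found
    return None

def adjoin(d_subtree, a_tree):
    # The dismounted D-sub-tree replaces the foot node of the A-tree;
    # the augmented A-tree then takes the sub-tree's former place.
    find_foot_parent(a_tree).children[0] = d_subtree
    return a_tree

d_tree = Node("N_N", [Node("e1"), Node("e2")])    # developing tree over e1, e2
a_tree = Node("N_S", [Node("*"), Node("e3")])     # left-footed auxiliary tree
new_tree = adjoin(d_tree, a_tree)                 # adjunction at the root
```

Note that the terminal frontier of the result still reads e1, e2, e3 left to right, so the principle of sequentiality is preserved.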

We start from the assumption that a discourse consists of several paragraphs, each paragraph has one or more sentences, and each sentence, in turn, has one or more clauses. At the paragraph and sentence levels, parsing goes on incrementally by consuming, recursively, one entire structure of the inferior level. Treating these spans separately is possible based on the assumption that for each span there is a corresponding sub-tree in the overall discourse tree. What this means, for instance, is that a clause belonging to a sentence S_i cannot, by itself, complement the sentence S_(i-1) or a part of it (Șoricuț and Marcu, 2003).

Figure 7 displays all possible types of auxiliary trees. As can be seen, the alpha and beta trees are left-footed A-trees and are therefore appropriate for adjunction, while the gamma and delta trees, lacking foot nodes, are appropriate for substitution. On the one hand, when there are discourse clues suggesting an expectation, only the beta and gamma types can be used, because only these types include substitution nodes. On the other hand, the root nodes of any of these trees can have any of the nuclearity types: N_N, N_S and S_N.

(16)

Figure 7: Types of auxiliary trees

5.2 Combinatorics at the sentence level

For short sentences we apply a different approach (Cristea et al., 2003) that exploits the markers’ patterns of arguments and uses combinatorics to explore the search space of solutions.

A corpus was used to extract possible patterns of arguments of discourse markers. Following the identification of markers in the sentence, a list of all possible combinations of arguments is computed, by taking into account all patterns of all markers. The patterns are sensitive to the position of markers within the clauses in the corpus. For example, the marker because may have both arguments to the right (in which case the first is satellite and the second nuclear, as in Because it was raining, John took his umbrella.), or one to the left and one to the right, in which case the first is nuclear and the second satellite (John took his umbrella, because it was raining.).

After the lists are computed, the combination of arguments that gives rise to a consistent, well-formed tree structure must be determined. Well-formedness is checked against a set of rules (Cristea et al., 2003), for instance: it is impossible to have two distinct markers which cover the same sequence of edus, and it is impossible to have nested arguments on both sides of two markers.

The rules filter out the majority of combinations and what remains should be a list of valid trees. Even in the best case, when a sentence containing ten clauses and nine markers has only one unique pattern per marker, the system would generate a minimum of 9^9 = 387,420,489 combinations (because each marker has at least 9 candidate lists), all having to be validated for well-formedness. The number of possibilities in a sentence with n clauses is 3^(n-1) (because a sequence of two edus can give rise to 3 structures: N_N, N_S and S_N).

Constraints dictated by the necessity to have a response in real time have obliged us to apply the combinatorics method only on sentences shorter than 8 clauses. All longer sentences are parsed using the incremental parsing approach described in Section 5.1.
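The search-space sizes quoted above are easy to verify with a short back-of-the-envelope script (ours, for illustration only):

```python
def marker_combinations(n_markers, lists_per_marker):
    """Candidate argument assignments when every marker contributes the
    same number of candidate argument lists."""
    return lists_per_marker ** n_markers

def nuclearity_labelings(n_clauses):
    """3^(n-1): each adjacent pair of edus can yield N_N, N_S or S_N."""
    return 3 ** (n_clauses - 1)

worst = marker_combinations(9, 9)   # the 387,420,489 figure in the text
cutoff = nuclearity_labelings(8)    # why 8 clauses is a workable limit
```

The jump from 3^7 = 2,187 labelings at the 8-clause cutoff to hundreds of millions of combinations for ten-clause sentences motivates switching to the incremental parser for long sentences.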

5.3 Heuristics

As mentioned already, the exponential explosion of partial trees is kept under control by a ranking and pruning policy: only the best-ranked trees are retained at each step. We describe in this section a number of heuristics used to assign scores to the developing trees, as ways to guide the elaboration of the final shape of the tree. For each tree t, a global score (GS_t) is computed by summing up the weighted scores of the individual heuristics:

GS_t = w_1·s_1 + w_2·s_2 + … + w_N·s_N

where s_i and w_i are the score of heuristic i and its corresponding weight, and N is the total number of heuristics applied. The score of each heuristic is normalized to the range 0 to 1. The weights themselves are established during a calibration process that resembles the one presented in Section 4.1. The only difference is that, given the lack of discourse gold files (extremely costly to produce), we have calibrated the discourse parser by directly comparing summaries.
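As a sketch (our code, with illustrative values), the global score is simply a weighted sum over normalized heuristic scores:

```python
def global_score(scores, weights):
    """GS_t for one developing tree: scores are normalized to [0, 1];
    weights come from the calibration step described in the text."""
    assert len(scores) == len(weights)
    assert all(0.0 <= s <= 1.0 for s in scores)
    return sum(w * s for w, s in zip(weights, scores))

gs = global_score([1.0, 0.5, 0.0], [0.5, 0.3, 0.2])
```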

Centering on veins. Centering Theory (CT) (Grosz et al., 1995; Brennan et al., 1987) is known as a theory of local discourse structure which models the interaction of cohesion and salience in the internal organization of a text. The four Centering transitions between subsequent utterances of a discourse segment (continuation, retaining, smooth shifting, abrupt shifting, to which the no-Cb transition can also be added) can be taken as measures of discourse coherence, from the smoothest and easiest to interpret to the most discontinuous and difficult to decipher.

There are known approaches suggesting that the granularity of Centering utterances can go down to clauses (Kameyama, 1997). By equating an utterance with an edu, a still fragile bridge is opened towards considering Centering transitions as a criterion for assessing the coherence of a given discourse. Still, one big barrier remains: the limitation to local contexts. However, Veins Theory (VT) (Cristea et al., 1998) puts in evidence a relationship between referentiality and


discourse structure that helps to identify coherent sub-sequences in the original discourse, called veins. These are exactly the segments looked for by CT. VT thus offers a way to extend the local conclusions of CT to the global discourse and, in so doing, a way to associate a measure of global coherence. However, our goal is to discover the best discourse structure characterizing a text. It seems natural to suppose that, among all possible tree structures that can be associated with a discourse, the true one displays the best global coherence score10. We have expressed this in the form of a heuristic that guides the elaboration of the structure: the parser favors adjunction positions that maximize the scores of CT transitions on veins. Applied persistently, this heuristic is expected to produce the tree structure that reflects the overall smoothest understanding of the discourse.
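This heuristic can be sketched as follows (our code; the numeric weights are placeholders, not the calibrated values, and only their ordering, from continuation down to no-Cb, follows Centering Theory):

```python
# Placeholder weights; only the ordering is grounded in Centering Theory.
TRANSITION_SCORE = {"continuation": 4, "retaining": 3,
                    "smooth_shift": 2, "abrupt_shift": 1, "no_cb": 0}

def vein_coherence(transitions):
    """Average CT transition score along a vein; higher means smoother.
    The parser favors adjunction sites that maximize this quantity."""
    return sum(TRANSITION_SCORE[t] for t in transitions) / len(transitions)

c = vein_coherence(["continuation", "retaining", "no_cb"])
```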

Lower adjunction levels. The heuristic favors adjunctions operated on the lower part of the right frontier (or inner-most right frontier). The trees developed when this heuristic is persistently applied will be predominantly balanced to the right and downward. This corresponds to a discourse which most of the time adds details on the lastly mentioned issue.

On the contrary, a tree developing to the right and upward corresponds to a discourse that always comes back to the original idea, completing it with new details.

Opening minimum referentiality domains. If the material node m contains a reference that can be satisfied by antecedents belonging to the VT domains D1, ..., Dk, give better scores to domains having fewer referents. The heuristic favors adjunctions on the upper levels of the right frontier. Indeed, supposing a predominantly left-balanced D-tree (in which most of the hypotactic relations have the nuclear daughters on the left side, actually very common), an A-tree that is also left-balanced opens, for the new material node, a domain of referentiality which is longer if the adjunction node is lower on the right frontier. In other words, if I go on adding details to the most recent topic, I have access to the largest part of what has been said until now. On the contrary, if I go on adding details to an old topic, I have access only to the old discourse11. Now look at this property the other way round: if an entity belonging to the new material node is bound to refer to a mention (which will become its antecedent) belonging to the old discourse, the material node can be attached anywhere on the RF, but if the reference

10 This supposition approximates empirical results on measuring the coherence of human produced discourses by Centering scores, as of Cristea and Iftene (2010): On average, human discourses have a degree of coherence which is slightly less than the highest possible.

11 This is also conformant with the stack referentiality of the Attentional State Theory (Grosz and Sidner, 1986).


link is directed towards a new mention (antecedent), the adjunction cannot be made except on the lower part of the RF. So, by favoring tree structures having minimal domains of referentiality, we force adjunctions to the upper levels of the RF. This heuristic will therefore counter-balance the tendency incurred by the previous one.

Maximum referentiality score. The heuristic favors adjunction positions where most referents of the material node find antecedents in the referentiality domains given by veins. In relation to the occurrence of the antecedent on the vein, the syntactic category of the anaphor also counts, because not all referential expressions are equal in evoking power. We have started from the original experimental findings of Ide and Cristea (2000), where, if the anaphor is:

α) a zero pronoun, then it is compulsory that the vein contain an antecedent;
β) a clitic, then it is extremely desirable that an antecedent be on the vein;
γ) an overt pronoun, then it is desirable that an antecedent be on the vein;
δ) a common noun, then it is good if an antecedent is on the vein;
ε) a proper noun, then it is not necessary to have an antecedent on the vein.

Subsequently, we have defined three cases for placing an antecedent:
a) the antecedent belongs to the previous unit (clause), where previous is considered with respect to the vein of the current unit;
b) the antecedent belongs to a unit of the vein of the current unit which is not the previous one;
c) the antecedent does not belong to any unit of the vein of the current unit.

Conforming to a combination of these criteria, a score is computed for each anaphor of the current unit, and these scores are summed up. The heuristic favors adjunction positions that maximize this score.
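One possible encoding of the α–ε and a–c criteria is sketched below (our code; the weights are illustrative placeholders, not the calibrated values):

```python
# Placeholder weights reflecting the α–ε ordering of evoking power and
# the a–c ordering of antecedent placement from the text.
ANAPHOR_WEIGHT = {"zero_pronoun": 1.0, "clitic": 0.8, "overt_pronoun": 0.6,
                  "common_noun": 0.4, "proper_noun": 0.2}
PLACEMENT_WEIGHT = {"previous_unit_on_vein": 1.0,
                    "elsewhere_on_vein": 0.6,
                    "not_on_vein": 0.0}

def referentiality_score(anaphors):
    """Sum, over the anaphors of the current unit, of a weight combining
    the anaphor's evoking power with its antecedent's vein placement."""
    return sum(ANAPHOR_WEIGHT[cat] * PLACEMENT_WEIGHT[place]
               for cat, place in anaphors)

s = referentiality_score([("zero_pronoun", "previous_unit_on_vein"),
                          ("proper_noun", "not_on_vein")])
```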

Consume substitution nodes first! If the D-tree includes an open substitution node, the heuristic instructs the parser to consume this substitution node first (by using a gamma or delta A-tree) before proposing an alpha or beta A-tree (see Figure 7).

Invalidate unclosed expectations! The heuristic strongly discourages development directions that leave unfinished trees once the whole text is consumed: it gives extremely low scores to trees which still have open expectation nodes.


6 THE SUMMARIZER

In this chapter we will call a short text a text spanning between half a page and 6 pages.

Our summaries belong to the category usually known as excerpt type summaries12, which are summaries that copy contiguous sequences of tokens from the original text. Actually, in our case, such a summary should contain elementary discourse units that are copied and pasted from the original text.

In truth, the structure of a discourse as a complete tree gives more information than strictly needed for summarization purposes. However, by exploiting the discourse structure, we expect to add cohesion and coherence to our summaries. Also, three types of summaries can be extracted from the discourse structure:

1. a general summary, which tells, in short, what the whole text is about;

2. an entity-focused summary, showing what the text says about a certain entity; and

3. an edu-focused summary, the minimum text required to understand an elementary discourse unit in the context of the whole discourse.

The simplest way to obtain a general summary is to take the vein expression of the root node13. Similarly, an edu-focused summary is given by the vein expression of that edu. In short, because both the general and the edu-focused summaries are by themselves vein expressions, they inherit the coherence properties of veins.
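A simplified head/vein computation, in the spirit of Veins Theory, is sketched below (our code; the full VT definition also marks units that are accessible only parenthetically, which this sketch omits):

```python
def head(node):
    """Head expression: the unit itself for a leaf, otherwise the union
    of the heads of the nuclear children."""
    if not node["children"]:
        return [node["id"]]
    return [u for c in node["children"] if c["nuclear"] for u in head(c)]

def veins(node, inherited=None):
    """Map every leaf (edu) to its vein expression.  The root's vein is
    its head; nuclear children inherit the parent's vein; satellites add
    their own head to it."""
    v = head(node) if inherited is None else inherited
    out = {} if node["children"] else {node["id"]: v}
    for c in node["children"]:
        cv = v if c["nuclear"] else sorted(set(head(c)) | set(v))
        out.update(veins(c, cv))
    return out

# a two-unit discourse: unit 1 is the nucleus, unit 2 its satellite
tree = {"id": 0, "nuclear": True, "children": [
    {"id": 1, "nuclear": True, "children": []},
    {"id": 2, "nuclear": False, "children": []}]}
general_summary = head(tree)   # the vein expression of the root
v = veins(tree)
```

Under this sketch the general summary is the head/vein expression of the root, and the edu-focused summary of unit 2 is its own vein expression.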

The summaries focused on entities need some reflection. Suppose one discourse entity is traced and a summary focused on that entity is wanted. If there is only one edu in which the entity is mentioned, the vein expression of that edu gives a well-focused summary of the entity.

A problem appears if the entity is mentioned in more than one edu. Because there is no a priori reason to prefer one clause over the others among those in which the entity is mentioned, it is clear that a combination of the vein expressions of each edu mentioning the entity should be considered. We propose several methods for building a final summary in this case.

12 Contrary to an excerpt type summary is a rephrase type summary, which contains a reduced, freely produced, verbalization of the original text.

13 This is identical to Marcu’s (1997) method for text summarization based on nuclearity and selective retention of hierarchical fragments, because his salient units correspond to heads in VT, and the vein expression of the root is its head expression.


The first method takes the vein expression of the lowest node of the tree that covers all units in which the entity is mentioned14. Since the length of a vein expression of a node depends on the depth of the node in the tree structure15, this method results in shorter summaries. The second method considers the particular summary (vein expression) which includes most of the mentions of the entity. The third method simply takes the union of all vein expressions of the units that mention the central entity. Finally, the fourth method builds a histogram out of all vein expressions of the units mentioning the central entity and selects all units above a certain threshold. The last two methods do not produce vein expressions and are therefore more prone to incoherent summaries than the first two, the last one being the most exposed.
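The last two methods can be sketched as follows (our code; `veins_of_mentions` is assumed to hold the vein expression of each edu mentioning the entity):

```python
from collections import Counter

def union_summary(veins_of_mentions):
    """Third method: union of all vein expressions."""
    return sorted(set().union(*veins_of_mentions))

def histogram_summary(veins_of_mentions, threshold):
    """Fourth method: keep the units appearing in at least `threshold`
    of the vein expressions."""
    counts = Counter(u for vein in veins_of_mentions for u in vein)
    return sorted(u for u, c in counts.items() if c >= threshold)

vs = [[1, 2, 4], [1, 3, 4], [1, 4, 5]]
u = union_summary(vs)          # every unit from every vein
h = histogram_summary(vs, 3)   # only units shared by all three veins
```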

In general, the person requesting a summary also specifies a desired length (in terms of a percentage of the initial length of the short text16). But, as is evident from the above, the lengths of our summaries are dictated by the veins or the combination of veins they include and, as such, there is no obvious way in which they could be controlled. Moreover, we can make the observation that by pruning all satellite nodes of a tree and collapsing parent nodes with daughter nodes having only one descendant, a tree is obtained whose head/vein expression of the root is equal to that of the original tree. This tree obviously contains only nuclear nodes.

This shows that a general summary (such as the one given by the vein expression of the root) cannot itself be further summarized using veins. To cope with the necessity of shortening a summary below the length that results from vein expressions, heuristics could be applied. However, in all such cases we enter an arena in which we are no longer protected by the coherence properties of veins. Such heuristics could include: the elimination of clauses that do not contain referential expressions participating in coreference chains, or of clauses whose deletion is not harmful (although they may contain REs that are part of coreferential chains) simply because other coreferential REs with good evoking power still remain in the summary.

6.1 The Summary Evaluation system

14 Let’s note that this method could still produce a summary that ignores mentions of the traced entity. In this case one of the other methods should be used.

15 Consistent with the discussions in Section 5.3, but formal proof to this is not our concern here.

16 Let’s note that indication of the length of the summary as a percentage of the original length is an option only in the case of short texts. A summary of a book, for instance, should be drafted in totally different terms.


It is notoriously true that a gold corpus to be used in the evaluation of a discourse parser is very difficult to obtain, due to at least three factors. First, the determination of the discourse structure of a text involves choices which do not always have only one solution, because of subjectivity factors; that is, even very well trained human annotators could arrive at totally different structures for the same text. Secondly, the annotation process is extremely time-consuming. And, thirdly, the cost of such a process is generally high because of the complexity of the task and the high skills that are needed.

Considering all these factors, we have decided to use an indirect method for evaluating the DP module, which skips a direct confrontation of a discourse parse tree against a discourse gold tree and instead ties the evaluation of a discourse structure to the evaluation of a summary. The idea is that a good summary cannot be anything but the result of a good discourse structure. Conversely, a poor summary reflects defects in the discourse structure.

But the evaluation of summaries is in itself a tricky thing, and not surprisingly there is quite a bit of literature dealing with this topic. Since the summarization process is subjective, when it comes to building the summarization gold corpora we propose having more than one annotator for the same text. However, because our human annotators were instructed to produce only extract-type summaries, each represented as a sequence of clause IDs, a very good automatically produced summary is perhaps much closer to one produced by a human than it would be if no such constraints had been imposed. This is because the automatic summary will include entire clauses, the same as the gold summary does. Thus, if a clause is decided by both the program and the human to belong to the summary, then both summaries will include the whole sequence of tokens belonging to that clause. Unfortunately, this good news should be tempered to a certain extent by the possible errors of the segmentation module.

In order to also cope with the errors introduced by the Clause Segmenter, the Summary Evaluation module computes Precision, Recall and F-measure by comparing tokens (words) in the test against the ones in the gold summarization files.
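That computation can be sketched as follows (our code; tokens are identified by their IDs, and duplicates are ignored by working with sets):

```python
def token_prf(test_tokens, gold_tokens):
    """Precision, recall and F-measure over the token (word) sets of the
    test summary and the gold summary."""
    test, gold = set(test_tokens), set(gold_tokens)
    common = len(test & gold)
    p = common / len(test) if test else 0.0
    r = common / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

p, r, f = token_prf([1, 2, 3, 4], [3, 4, 5, 6])
```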

To evaluate an artificial summary when there is more than one human-produced summary that counts as gold, we adopted a measure based on a histogram, as proposed in Cristea et al. (2005). The human subjects received texts in which edus were already marked and numbered, and they were asked to indicate a 20% summary (a 20% word reduction rate). Then, a histogram was drawn by counting the number of times each edu from the original text was


mentioned by the subjects as belonging to their summaries. In these histograms the sequence of edu numbers is placed on the x-axis and the frequency of mention on the y-axis. A sliding horizontal line (the threshold) is fixed in this histogram at a position such that the number of units above the line approximates the 20% reduction rate. The respective gold summary is given by all units whose corresponding frequencies are above the threshold (see Figure 8).

Figure 8: Approximating a gold summary out of a number of human-produced summaries
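The histogram method can be sketched as follows (our code; the function name and the threshold search are ours):

```python
from collections import Counter

def approx_gold_summary(selections, n_units, rate=0.2):
    """Count how many annotators put each edu in their summary, then
    slide the threshold down until about `rate` of the units are kept."""
    counts = Counter(u for sel in selections for u in sel)
    target = max(1, round(rate * n_units))
    for threshold in range(len(selections), 0, -1):
        kept = sorted(u for u, c in counts.items() if c >= threshold)
        if len(kept) >= target:
            return kept
    return sorted(counts)

# 4 annotators over a 10-edu text, 20% target -> about 2 units kept
gold = approx_gold_summary([{1, 2, 5}, {1, 3, 5}, {1, 5, 7}, {2, 5, 9}], 10)
```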

7 CORPORA AND RESULTS

Two different types of corpora have been used in our experiments: one containing clause boundaries and marker annotations, and one containing summaries.

The corpora of all languages included short texts of 2 to 4 pages each, from different domains: fairy tales, financial news, political articles, geographical descriptions, etc. The pre-processing chain was launched on each of these texts, producing XML markup, added to the original text, to put in evidence sentence, clause and token boundaries (the tokens including POS and LEMMA) and markers. The markers’ attributes are: NUC, with one of the values “N_N”, “N_S”, “S_N”, “_NN”, “_NS”, “_SN”, “NN_”, “NS_”, “SN_”; TYPE, with one of the values “int” or “ext”, where TYPE=”int” means that the marker is internal to a sentence and TYPE=”ext” means that the marker relates to a sentence from another span of text; and, for TYPE=”ext”, a CONNECT attribute, filled in with one of the values “expect”, “fulfill” or “relate”. CONNECT=”expect” would have the meaning


that the span of text the marker belongs to introduces an expectation, like the marker on the one hand in Ex. 6.

Ex. 6 from Cristea and Webber (1997)

<On the one hand, John is very generous.>[1] | <For example, suppose you needed some money.>[2] | <Then, you would just have to ask him for it.>[3] | <On the other hand, he is very difficult to find.>[4]

CONNECT=”fulfill” means that the span of text the marker belongs to fulfills (satisfies) an open expectation. This is the case with the on the other hand marker of Ex. 6, which indicates span [4] as fulfilling the expectation opened in [1]. CONNECT=”relate” means that the span of text the marker belongs to relates in some way to the previous discourse, but neither raising nor fulfilling an expectation.

At the end of this process a program collected all markings from the corpus and abstracted the information related to them into a file. This file could subsequently be edited by the annotator, who could add manually defined markers. In fact, it was this kind of file that was used by the discourse parser in the process of building the discourse trees of the texts.

Summaries produced manually were used in two ways in our experiments: to calibrate the parameters of the discourse parser and to evaluate the whole summarization chain. Each text in the summarization corpus was manually annotated by a minimum of 4 subjects and all texts used in the evaluation had a compression rate of 20%. As mentioned above, the summaries consisted of lists of clause IDs, indicating the clauses considered by the human subjects to be part of the summary. Table 1 shows the dimensions of these corpora and the clause segmentation evaluation results (achieved by comparing the number of boundaries) for each of the languages under experiment17. In the last column, the evaluation data represent averages over all languages.

17 The results are comparable with the state-of-the-art. For instance, Pușcașu (2004) reports P=93.37, R=91.43, F=92.38 for English, and P=95.59, R=95.03, F=95.30 for Romanian. Her method also uses rules to correct some boundaries.

Language        BG      DE      EN      GR      PL      RO      TOTAL/AVG

# sentences     2,749   1,375   2,246   1,055   1,096   1,571   10,092
# tokens        51,116  31,839  53,504  30,207  21,377  47,016  235,059
# clauses       6,468   2,726   4,880   2,778   2,574   3,720   23,146
# markers       2,507   396     1,832   1,493   698     947     7,873
# int markers   2,507   264     1,383   1,320   643     745     6,862
# ext markers:
  ”expect”      0       16      117     1       1       183     318
  ”fulfill”     0       83      61      30      2       204     380
  ”relate”      0       16      270     133     0       544     963
Evaluation: P   0.97    0.93    0.98    0.90    0.89    0.91    0.93
            R   0.77    0.66    0.94    0.84    0.97    0.88    0.84
            F   0.86    0.77    0.96    0.87    0.82    0.89    0.86

Table 1: Segmentation corpora and evaluation

When comparing the quantitative data with the evaluation results, there seems to be evidence of a number of correlations. For instance, it is clear that the dimensions of the corpus (# tokens, # clauses, # markers, etc.) influence the quality of the segmenter. If we plot the F-measures of all languages against the number of markers in their corresponding corpora, a diagram like the one in Figure 9 results, in which a certain monotonic tendency can be observed. However, it can also be noticed that languages like GR and BG (whose F-measures are lower than the interpolation over all languages, marked by a thin line in the figure) seem to need more data for equivalent segmentation quality. We consider that the amount of data we have acquired is as yet insufficient to risk any general, language-independent statement regarding a strict correlation between the dimension of the corpus and the performance of the clause segmentation module for a new language. But such laws, if revealed, could guide the design of the corpora when a certain quality is envisioned.

Figure 9: The correlation between the # markers and the F-measure

Table 2 shows the dimensions of the main parameters characterizing the summary corpora in the 6 languages.

Language        BG      DE      EN      GR      PL      RO      Total

# of sentences  1,168   781     489     692     541     526     4,197
# of clauses    2,955   1,815   1,499   1,742   1,303   1,317   10,631

Table 2: The summary corpora

Language                        BG      DE      EN      GR      PL      RO      AVG

UAIC Veins Theory    P (H)      0.19    0.23    0.27    0.23    0.17    0.22    0.22
approach             R (H)      0.29    0.44    0.41    0.41    0.36    0.32    0.37
                     F (H)      0.23    0.30    0.32    0.29    0.23    0.25    0.27

Open Text            P (H)      0.16    0.19    0.24    0.27    0.19    0.29    0.22
Summarizer           R (H)      0.25    0.20    0.22    0.33    0.21    0.06    0.21
approach             F (H)      0.19    0.20    0.23    0.27    0.20    0.10    0.20

LexRank              P (H)      0.15    0.23    0.27    0.24    0.24    0.21    0.21
approach             R (H)      0.18    0.25    0.25    0.22    0.24    0.22    0.18
                     F (H)      0.16    0.24    0.26    0.23    0.22    0.21    0.19

Table 3: Summaries evaluation and baselines

Finally, Table 3 contains a comparison of the summarizer results against two other well-known approaches. OTS18 is often used as a benchmark for other summarization systems.

LexRank (Erkan and Radev, 2004) computes the relative importance of textual units and sentences based on the concept of eigenvector centrality in a graph representation of sentences.

Instead of passing words to the summarizer, we passed sequences of numbers (token IDs, NP IDs, NE IDs). In this way we made the input to the LexRank summarizer language independent. The figures in Table 3 are computed by comparing occurrences of word IDs in the test summaries against those in the gold summaries. As such, this metric is similar to the unigram-based ROUGE score (ROUGE-1), which is known to display the best agreement with human judgments (Lin and Hovy, 2003) among all higher-gram ROUGE scores. The H’s appearing in parentheses after the three evaluation measures (precision, recall and F-measure)

18 http://libots.sourceforge.net/bench.html


signify that the gold data used for comparison have been approximated out of the ones indicated by humans, by using the histogram method described in Section 6.1. As can be noticed (the best values are marked in bold), our summarizer behaves better globally (in terms of F-scores) than both of the other methods.

8 CONCLUSIONS

Some of the most well-known summarization systems today work on the assumption that similarity of sentences also indicates their importance. As such, to get ranked highly and placed in a summary, a sentence must be similar to other sentences that are in turn also similar to many other sentences. But a text which has a low degree of repetition may mislead the summarization system, which finds few elements to hinge on. Also, frequency-based summarizers practically disregard any concern about the coherence properties of the obtained summaries.

Our method of extracting the summary places coherence criteria at its core. By putting coreferential links in evidence and aligning them with the discourse structure, the most plausible discourse tree can be built. The summary is then a direct product of this tree. We expect, therefore, that the resulting summaries show a higher coherence than those produced by frequency methods. One of the most important properties of these summaries is the low occurrence of dangling pronouns. In the original text, a coreference chain is given by the list of REs attached to a DE. In principle, an antecedent of an RE could be any element of this list that is positioned to the left of the RE in the text. If the RE is a pronoun (a referential expression with a weak evoking power), most of the time the list of antecedents should also include elements with a higher evoking power (proper nouns, for instance). In the process of text interpretation, the reader is able to recover the proper antecedents, while also linking the referents in the proper chain. On the other hand, an excerpt-type summary includes only part of the clauses of the original text and, as such, the DE lists are shorter, some of them actually disappearing completely. If the process of summarization chaotically deletes units or is not driven by coherence principles, it could evidently trigger the disappearance of all high evoking power referential expressions in the antecedents’ chain of a pronoun. These are the dangling pronouns often mentioned in the literature as negative side effects of summarization systems. The


reading of our summaries reveals that our system is much more robust with respect to this danger19. This is due to the high scoring of discourse trees whose units include properly chosen antecedents in their referentiality domains. When this is not possible and a dangling pronoun escapes into the summary, it could be replaced, in a post-processing phase, by a high evoking power expression picked from the pronoun’s original coreference chain.

Still, the architecture that we describe has some drawbacks, because the rather complex processing chain may induce errors. We have identified different causes of these errors: defects in the prerequisite chain, technical defects in a component module of the proper summarization chain, the quantity and quality of the corpora used for training the different modules, and the improper fixing of the parameters of different modules. For instance, a malfunctioning of the POS-tagger, which may tag a token as an adjective instead of a verb, might induce a clause segmentation error (because a verb is a pivot in the segmentation; see Section 4.1); this triggers a discourse parsing error, which, in turn, rebounds during the summarization phase. As another example, an anaphora resolution error left behind by RARE may trigger a low score for a discourse tree which, although correct, would then be mistakenly rejected. So, aiming at a high quality for all component modules is compulsory. In most cases, the quality of a module is a direct result of two elements: the corpora used to train the attached model and the set of parameters. In principle, the larger and more accurately annotated the corpus and the richer its set of parameters, the finer the calibration process can be designed and, consequently, the better the module’s accuracy will be.

Our tests, operating in a practical setting20 (Karagiozov et al., 2012), have shown that the system produces rather useful summaries. A number of enhancements can also be readily foreseen. For instance, the Clause Segmentation module could be placed before the Anaphora Resolution module in the processing chain. This way, it would become possible for RARE to also exploit the clause boundary information (for example, pronouns in different persons cannot co-refer if they occur in the same clause). The Discourse Parser module evaluates thousands of trees before establishing which one is the best candidate for the discourse structure of a text. This process is extremely time-consuming and the DP module uses a multi-threading launch of the evaluation procedure in order to speed up the

19 A thorough evaluation on these grounds will constitute the basis of a future study.

20 The ATLAS system i-Librarian (http://www.ATLASproject.eu) and the ATLAS service EUDocLib (http://eudoclib.ATLASproject.eu/).
