“Al. I. Cuza” University of Iasi Faculty of Computer Science
Predication Driven Textual Entailment
PhD Student Mihai-Alex Moruz Supervisor Prof. Dr. Dan Cristea
Thesis version updated and modified according to the recommendations of the reviewers.
First of all, I would like to thank my supervisor, Prof. Dan Cristea for guiding my steps as a researcher, for helping me expand my horizons and for always pushing me to better myself. I would also like to thank him for the confidence he had in my abilities, which helped me through the highs and the lows of these years, and for the insights he provided with regards to my research.
Special thanks go out to my colleagues from the Natural Language Processing group within the Faculty of Computer Science, with whom I have shared these past years of researching, attending conferences, talking and laughter, and who made these years such a joy: Adi Iftene (whom I would like to thank again for all the talks and collaboration), Maria Husarciuc (who is now Maria Moruz, my loving wife), Diana Trandabăţ, Ionuţ Pistol, Lucian Gâdioi, Iustin Dornescu and Marius Răschip. Thank you all for your help, for your advice and for all the things that I managed to learn from you all.
I am also very grateful to my colleagues at the Institute for Computer Science within the Romanian Academy, especially to Neculai Curteanu, who helped me become a better researcher by teaching me how to read and write scientific papers, how to be thorough and how to follow through with my ideas. Thank you for the years of patience and the wonderful discussions.
Many thanks to prof. Dan Tufiş, for giving me such valuable advice throughout my PhD and for always taking the time to answer all of my questions, to prof. Bernardo Magnini, for helping me shape some of the fundamental ideas of my thesis, and to prof. Dorel Lucanu and prof. Cornelius Croitoru, for their insightful comments and advice.
I would also like to offer special thanks to my family, for their patience and understanding, for putting up with the long days and nights of work, and for the help they so graciously provided me, both scientifically and otherwise.
I would also like to acknowledge the financial support received from the research grants „Dezvoltarea oportunităţilor oferite doctoranzilor pentru traiectorii flexibile în cariera de cercetare” (Development of opportunities offered to PhD students for flexible trajectories in the research career) POSDRU-6/1.5/S/25, PNCDI II: “SIR – RESDEC – Sistem de Întrebare-Răspuns în limbile Română şi Engleză cu Spaţii Deschise de Căutare” (Question Answering System for Romanian and English with Open Search Domains), PNCDI II: “eDTLR – Dicţionarul Tezaur al Limbii Române în format electronic” (The Thesaurus Dictionary of the Romanian Language in electronic format) and PNCDI II: “INTELCHIM – Modelare şi conducere automată utilizând instrumente ale inteligenţei artificiale pentru aplicaţii în chimie şi inginerie de proces” (Modelling and automatic control using artificial intelligence tools for applications in chemistry and process engineering).
One of the most relevant phenomena in natural language is that of variability, i.e., expressing the same meaning using different surface representations. To address this issue, the notion of textual entailment was introduced: the question of whether the meaning of one text can be deduced from that of another. This thesis describes our contribution to the field of textual entailment.
Our approach is based on the notion of predicational semantics, as we believe that deep semantic understanding of natural language utterances can be more effectively deduced from the analysis of predicates and their arguments, and that deep semantic understanding is one of the best solutions for the problem of textual entailment. We have given a novel interpretation of the definition of textual entailment that builds upon this intuition. We have also described an algorithm for solving textual entailment on the basis of predicational semantics and argument structure unification. The approach described in this thesis is both novel (there are, to our knowledge, no entailment systems based on predicational semantics) and effective: it solves a large majority of entailment pairs (based on manual analysis of corpora), and the implementation built on it achieved good results in evaluation campaigns.
The approach to solving textual entailment proposed in this thesis is language independent, given the appropriate lexical semantic resources for that language. In order to extend existing resources and to create new ones, we make use of our results in dictionary entry parsing (mainly eDTLR) to propose ways of extending existing lexical semantic resources for Romanian, and of creating new ones. The creation of these resources allows for the adaptation of our system to the Romanian language, without changing the TE algorithm we proposed.
The thesis is structured as follows: chapter 1 presents the general framework of textual entailment and describes the RTE challenges; chapter 2 gives the current state of the art for textual entailment; chapter 3 describes the theoretical basis of our approach; chapter 4 describes the implementation and results of our entailment system; chapter 5 discusses possible extensions to Romanian lexical-semantic resources based on eDTLR, together with the description of the parsing of the DLR, and chapter 6 discusses conclusions and future work, and provides an outline of the personal contributions of the thesis.
Table of Contents
Acknowledgements
Abstract
Table of Contents
1. Introduction
1.1. Operational Definitions for Textual Entailment
1.2. Recognising Textual Entailment Challenge (RTE)
1.2.1. First Recognising Textual Entailment Challenge
1.2.2. The Second PASCAL Recognising Textual Entailment Challenge
1.2.3. The Third PASCAL Recognising Textual Entailment Challenge
1.2.4. TAC 2008 Recognizing Textual Entailment (RTE) Track
1.2.5. PASCAL Recognizing Textual Entailment Challenge (RTE-5) at TAC 2009
1.2.6. PASCAL Recognizing Textual Entailment Challenge (RTE-6) at TAC 2010
1.3. Conclusions
2. State of the Art in Textual Entailment
2.1. Lexical Representation
2.2. Syntactic Graph Distance
2.3. Tree Edit Distance Algorithms
2.4. Logical Inference
2.5. Atomic Propositions
2.6. Machine Learning
2.7. Entailment Rules
2.8. Cross-Lingual Textual Entailment
2.9. Discourse Knowledge
2.10. Uses of Textual Entailment
3. Solving Textual Entailment by Semantic Means
3.1. Levin's Classes and VerbNet
3.1.1. Syntactic Frames in VN
3.1.2. Semantic Predicates
3.1.3. Statistics about VerbNet
3.2. A Predication-Based Algorithm for Solving Textual Entailment
3.3. Corpus Analysis
3.3.1. Pairs Solved Using VN
3.3.2. Pairs Solved Using NE and NA
3.3.3. Pairs Solved Using OS
3.4. Conclusions
3.5. Comparison with Current Entailment Methods
4. System Description, Experimental Results and Analysis
4.1. General System Architecture
4.1.1. Pre-Processing
4.1.2. Main Module
4.1.3. Mapping Rules for Entailment Cases
4.1.4. Mapping Rules for Negative Entailment Cases
4.2. Results in RTE-5
4.2.1. Ablation Tests
4.2.2. Pilot Task
4.3. Results in RTE-6
4.3.1. Ablation Tests
4.4. Error Analysis
4.5. Conclusions
5. Romanian Lexical-Semantic Resources for Entailment
5.1. The Romanian WordNet
5.2. The Electronic Format of the Romanian Thesaurus Dictionary (eDTLR)
5.2.1. Dictionary Sense Segmentation and Dependency (DSSD)
5.2.2. Advantages of DSSD over Standard Dictionary Entry Parsing (DEP)
5.2.3. DTLR Marker Classes, their Dependency Structure, and the DSSD Parsing Algorithm
5.2.4. DSSD Parser Applied on DTLR Entries
5.2.5. Result Analysis
5.3. Extending the Romanian WordNet with eDTLR
5.4. Aligning DEX and eDTLR
5.5. Conclusions
6. Conclusions
7. Bibliography
8. Annexes
8.1. Annex 1: Example of a Parsed DLR Entry
8.2. Annex 2: VerbNet Class "eat-39.1"
8.3. Annex 3: RoWN Entries for Noun "casă"
8.4. Annex 4: VerbNet – FrameNet Mapping
One of the most relevant phenomena in natural language is that of variability, which can be loosely defined as stating the same ideas in different ways. The issue of natural language variability needs to be addressed by various robust natural language processing applications, such as machine translation, summarization, question answering (QA), information retrieval (IR) or information extraction (IE) systems, as they need to recognize the various forms in which the information in their input and output texts is expressed. Most of the applications given above solve natural language variability, to a limited extent, by incorporating some form of shallow semantic parsing, but they do not employ a general framework for modelling variability in an independent manner (Dagan and Glickman, 2004).
The general solution for the issue of natural language variability is deep semantic understanding of natural language utterances, which would be used as an independent module that brings the input and output of the tools mentioned above to a common form (either as surface representations or as semantic descriptions). Obtaining deep semantic understanding of natural language utterances is one of the most difficult tasks in natural language processing (its difficulty has been compared to that of the Turing test (Bos and Markert, 2005)), and, as such, the common approach of extracting semantic representations for texts within each application is unfeasible. (Dagan and Glickman, 2004) describe a novel approach to removing the issue of natural language variability by defining the notion of textual entailment. According to (Dagan and Glickman, 2004), “textual entailment (entailment, in short) is defined as a relationship between a coherent text T and a language expression, which is considered as a hypothesis, H. We say that T entails H (H is a consequent of T), denoted by T⇒H, if the meaning of H, as interpreted in the context of T, can be inferred from the meaning of T.” The entailment relation defined above is directional, since even if the meaning of one text may imply the meaning of another, the opposite does not always hold true.
The motivation of our thesis is to give a novel solution for the problem of textual entailment, based on the notion of predicational semantics and argument unification, which would lead to a deeper semantic understanding of natural language texts, and to provide a generalized, language independent framework for solving textual entailment, thus providing a robust entailment solver that can be attached to various natural language applications.
Since we consider that one of the goals of textual entailment is to provide machine understanding for natural language at a level as close as possible to human understanding, we have also briefly examined (as an extension of the approach described in this thesis) a novel way of representing the semantic knowledge extracted from the text and the hypothesis, which would lead to deep semantic understanding of text, and is based on the idea of semantic triples represented in a graph format.
For a better understanding of the notion of textual entailment, we will refer to the Recognizing Textual Entailment Challenges (RTE), which will be described in detail in this section. Table 1 gives a sample of text – hypothesis pairs, taken from the Fifth Recognizing Textual Entailment (RTE-5) Challenge of 2009:
Pair 1
Text: LOSAIL, Qatar (AFP) - Torrential rain caused the season-opening Qatar MotoGP to be cancelled on Sunday, leaving officials and teams in a frenzy before deciding to race on Monday instead at this floodlit desert venue. Monsoon-like conditions, accompanied by swirling winds, arrived just moments before Australia's Casey Stoner, on pole position, was due to lead defending world champion Valentino Rossi and the other riders away on the warm-up lap. "It's just unlucky with the weather," said Australian Ducati rider Stoner, the 2007 world champion, who was bidding for a third successive win here.
Hypothesis: Valentino Rossi won the season-opening Qatar MotoGP.

Pair 2
Text: Euro MPs have voted overwhelmingly to cut the cost of texting and using the internet on mobiles abroad. The cap for a "roaming" text will fall to 11 euro cents (10p; 14 US cents), from about 29 cents on average today. The EU-wide caps, excluding VAT, will take effect in July. They cover text messages and data roaming services, such as checking e-mails while abroad. The current price cap of 46 euro cents per minute for an outgoing voice call will also fall to 43 cents in July.
Hypothesis: A roaming text cost 46 euro cents.

Pair 82
Text: Currently, there is no specific treatment available against dengue fever, which is the most widespread tropical disease after malaria. Sanofi Pasteur is collaborating with the Communicable Disease Center in Singapore and the Pasteur Institute in Vietnam to conduct these clinical studies in children and adults. "Controlling the mosquitoes that transmit dengue is necessary but not sufficient to fight against the disease. A safe and effective vaccine has been long awaited to prevent dengue epidemics," said Professor Leo Yee Sin, director of the Communicable Disease Center in Singapore. "Clinical studies in Singapore are critical steps to advance the development of a vaccine for the prevention of dengue in Asia. We are happy to contribute to scientific research that would benefit the entire region."
Hypothesis: Malaria is the most widespread disease transmitted by mosquitoes.

Pair 207
Text: The explosion happened at about 9:45pm local time, at the 'Rizk Plaza' shopping mall. It is understood from local media that a bomb was placed in the underground parking area below the complex, although there has been no official confirmation from security services. Sources on the ground say that the actual commercial center occupied the ground floor, with upper stories occupied by inhabited apartments. Located in the Metn range of mountains east of Beirut, Broummana is about half-hour's drive from the capital and is popular with tourists, particularly those from the Gulf states.
Hypothesis: Beirut is near Broummana.

Pair 403
Text: It is possible that it could help McCain earn the support of conservatives who have not always viewed him as aligning with the party on certain issues. At the same time, it could help to align McCain with former President Ronald Reagan, who attracted Republican and Democratic voters.
Hypothesis: Ronald Regan [remainder of the cell not recoverable in this copy]
Task: IR; Solution: Unknown
Table 1: Examples of text – hypothesis pairs from RTE-5
The table above shows text – hypothesis pairs taken from the test set of RTE-5 (http://www.nist.gov/tac/2009/RTE/). For each pair, the corpus records an ID (from the original corpus), the text and the hypothesis themselves, the type of natural language task from which the pair was derived, and the solution of the entailment problem for the given pair. There are three possible solutions for entailment pairs: ENTAILMENT, which means that the text entails the hypothesis; CONTRADICTION, denoting that the hypothesis contradicts some part of the text; and UNKNOWN, which means that no pertinent judgment can be made about the truth value of the hypothesis on the basis of the information in the text.
The following section gives a series of operational definitions for textual entailment, which are needed for formalizing the original definition given above to the point where it can be used in computational approaches, and section 1.2 gives a detailed description of the RTE challenges, which we consider relevant because of the significant impact they had on the textual entailment problem.
1.1. Operational Definitions for Textual Entailment
Even though the definition for textual entailment given above is complete and correct, it is too abstract to be directly used in practical applications of Natural Language Processing.
Therefore, there have been a number of adaptations of the initial definition of textual entailment, intended to make it more practical for real-life implementations, as can be seen in the list of definition variants given below:

- We say that a text T entails a hypothesis H if, typically, a human reading T would infer that H is most likely true (Dagan et al., 2005). This version of the TE definition is aimed at human annotators, and is used for creating gold corpora for the task of textual entailment.
- T entails H if there exists a subgraph of XDG_T (the syntactic graph of T) that is in an isomorphism relation with XDG_H (the syntactic graph of H).
- T entails H if there exists a sequence of transformations applied to T such that H can be obtained with an overall cost below a certain threshold, empirically estimated on the training data.
- T entails H if H does not introduce new information compared to T, i.e., the combined model T+H is not more informative than T.
- T entails H if all the atomic propositions in H are supported by at least one atomic proposition in T.
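For illustration, the last of these definitions (atomic propositions) can be sketched as follows. This is a simplifying sketch, not any particular RTE system: the tuple representation of propositions and the exact-match `supports` placeholder are our own assumptions, and a real system would plug in lexical-semantic matching instead.

```python
from typing import Iterable, Tuple

# An atomic proposition is modelled here as a flat tuple,
# e.g. ("win", "Rossi", "MotoGP") for "Rossi won the MotoGP".
Proposition = Tuple[str, ...]

def supports(t_prop: Proposition, h_prop: Proposition) -> bool:
    """Placeholder support check: exact match. A real system would use
    synonymy, hypernymy or paraphrase resources here."""
    return t_prop == h_prop

def entails(t_props: Iterable[Proposition],
            h_props: Iterable[Proposition]) -> bool:
    """T entails H iff every atomic proposition in H is supported by
    at least one atomic proposition in T."""
    t_props = list(t_props)
    return all(any(supports(t, h) for t in t_props) for h in h_props)
```

Note that the definition is inherently directional: every proposition of H must be covered by T, but T may contain arbitrarily many propositions absent from H.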
Our intuition for solving textual entailment was also modelled as an operational definition. Given a text T and a hypothesis H, we say that T entails H, denoted by T→H, if and only if:

- Predicate matching: each of the predicates in H is entailed by at least one predicate in T. We say that a predicate p ∈ T entails a predicate q ∈ H, denoted p → q, if q is a consequence of p, or p and q are synonyms, or q is a subevent of p;
- Argument matching: given alignments in T for all of the predicates in H (we consider a predicate p in T aligned to a predicate q in H if there is an entailment relation between p and q), entailment holds if and only if the arguments of the aligned predicates are in an entailment relation. Two arguments are in an entailment relation if they both refer to similar entities (e.g., the heads of the arguments are synonyms, or the arguments are in a part-of relation, etc.), and if the unification of their feature sets is successful and is equal to the feature set of that argument in the text. Formally, if p ∈ T and q ∈ H and p → q, and arg(p) = (a, b), arg(q) = (a′, b′), then entailment holds if and only if a → a′ and b → b′.

Argument entailment is defined as the result of the unification between the argument feature structures in H and T (as described in greater detail in chapter 3).
The advantage of our operational definition of textual entailment is that it is more semantically focused than previous versions, which leads to a deeper semantic understanding of the text and the hypothesis. For this reason, we consider our operational definition closer to the original definition than many previous variants.
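The two steps of the definition can be sketched in code. The following is a minimal illustration, not the system implementation described in chapter 4: the flat-dictionary representation of feature structures, the precomputed relation set, and all function names are our own simplifying assumptions.

```python
# Predicates are (name, feature_dict) pairs; a feature dict is a flat
# attribute-value map standing in for a full feature structure.

def unify(f1: dict, f2: dict):
    """Unify two flat feature sets; return None on a value clash."""
    merged = dict(f1)
    for key, value in f2.items():
        if key in merged and merged[key] != value:
            return None  # conflicting values: unification fails
        merged[key] = value
    return merged

def predicate_entails(p: str, q: str, relations: set) -> bool:
    """p -> q if q is a consequence, synonym or subevent of p, modelled
    here as identity plus a precomputed set of (p, q) pairs."""
    return p == q or (p, q) in relations

def pair_entails(t_preds, h_preds, relations) -> bool:
    """T -> H iff every predicate of H aligns to an entailing predicate
    of T whose arguments unify, the result equalling the T feature set
    (i.e., H introduces nothing beyond T)."""
    for q, q_args in h_preds:
        aligned = False
        for p, p_args in t_preds:
            if not predicate_entails(p, q, relations):
                continue
            unified = unify(p_args, q_args)
            if unified is not None and unified == p_args:
                aligned = True
                break
        if not aligned:
            return False
    return True
```

In this sketch, requiring the unification result to equal the text-side feature set captures the directionality of entailment: the hypothesis may underspecify an argument, but may not add or contradict features.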
1.2. Recognising Textual Entailment Challenge (RTE)
The main proving ground for the various systems and methods of determining textual entailment is the Recognising Textual Entailment Challenge (RTE), which was initially put forth in 2005 by PASCAL (Pattern Analysis, Statistical Modelling and Computational Learning - http://www.pascal-network.org/) - the European Commission's IST-funded Network of Excellence for Multimodal Interfaces. Starting with 2008 (RTE-4), the RTE challenges were held within the Text Analysis Conference (TAC - http://www.nist.gov/tac/). TAC is a series of evaluation workshops which encourage research in Natural Language Processing by providing large test collections, common evaluation procedures and tools, and opportunities for researchers to compare and share results.
The RTE Challenge has grown in time, increasing in complexity with each edition.
Firstly, the size of the text in the entailment pairs grew steadily, from just one or two sentences in RTE-1, to an entire paragraph in RTE-4; RTE-5 went one step further, by using text which was not guaranteed to be grammatically correct.
With regards to the type of entailment pairs, RTE-1 had datasets created using such applications as Information Retrieval (IR), Comparable Documents (CD), Reading Comprehension (RC), Question Answering (QA), Information Extraction (IE), Machine Translation (MT), and Paraphrase Acquisition (PP). Starting with RTE-2, datasets consisted only of pairs extracted using Information Retrieval (IR), Information Extraction (IE), Question Answering (QA), and multi-document summarization (SUM), so as to create more realistic test data. By RTE-5, only Information Retrieval (IR), Information Extraction (IE) and Question Answering (QA) techniques were being used to create test pairs for the textual entailment engines.
Another innovation of the RTE Challenge was the introduction of pilot tasks, in RTE-3, RTE-5 and RTE-6. The RTE-3 pilot task consisted of having to classify the test pairs in three clusters: positive entailments, contradictions and unknown entailment. The main task of RTE-4
and RTE-5 became twofold: the initial 2-way task, first defined in RTE-1 and the new 3-way task, as described in the RTE-3 pilot task. For RTE-5, organizers created a new pilot task, focusing on extracting, from a collection of texts, those sentences that entail a given hypothesis.
In RTE-6, the pilot task of the previous challenge became the new main task; in addition to this, a new pilot task was defined, the knowledge base population task.
1.2.1. First Recognising Textual Entailment Challenge
According to (Dagan et al., 2005), “the RTE task is defined as recognizing, given two text fragments, whether the meaning of one text can be inferred from (entailed by) the other. This application independent task is suggested as capturing major inferences about the variability of semantic expression which are commonly needed across multiple applications.” Based on the work of (Dagan and Glickman, 2004), the task of recognizing textual entailment is thus defined as deciding whether, given two fragments of text, the meaning of one can be deduced from the meaning of the other.
In order to create a framework for solving textual entailment, the organizers created a data set consisting of text-hypothesis pairs using snippets of text extracted from news corpora.
The entailment value of these pairs was manually annotated, and the pairs were then separated into training and test data sets. The goal of the participating entailment systems was to automatically decide whether entailment holds for the given pairs, and the results were then compared to the manually obtained gold standard. The data set was populated with text-hypothesis pairs corresponding to different textual processing applications, such as information retrieval (IR), comparable documents (CD – sentence alignment over similar documents), reading comprehension (RC), question answering (QA), information extraction (IE), machine translation (MT) and paraphrase acquisition (PP). The text-hypothesis pairs were selected in such a way as to highlight possible cases of success or failure for each of these applications, in order to better judge the effect an entailment system might have on their output. Also, the phenomena modelled by the entailment pairs range from simple lexical overlap to syntactic variability, logical deduction and world knowledge. The pairs have also been specifically chosen so that the distribution of the True and False examples is balanced (50%-50%), which does not match the distributions found in real-world applications.
Since this was the first instance of the RTE challenge, the task definition, the evaluation methods and the corpus generation procedures were not yet mature, and much of the organizers' effort went into defining the methodologies that underlie the challenge. Because of this, the first challenge was aimed at a more exploratory setting, rather than a competitive one, and attempted to set a series of baselines in terms of system performance and corpus generation. The table below gives the results of the top scoring systems in RTE-1 (based on the results reported in (Dagan, Glickman and Magnini, 2005)).
Submission Accuracy Method
(Pérez and Alfonseca, 2005) 0.7 Word overlap
(Delmonte et al., 2005) 0.606 WordNet, syntactic matching, logical inference
(Bayer et al., 2005) 0.586 Statistical lexical relations
(Glickman et al., 2005) 0.586 Statistical lexical relations
Table 2: Top scoring systems from RTE-1
The most basic method used for solving textual entailment in this setting was lexical overlap over words, lemmas, stems or parts of speech, using some sort of weighting (most commonly, inverse document frequency). More complex approaches used various relations between words, such as statistically derived relations or the WordNet taxonomy. Some systems used even more advanced methods, such as the degree of syntactic match, world knowledge or logical provers over semantically enriched representations. The decision mechanisms are also varied, and include probabilistic models, probabilistic machine translation models, supervised learning methods or specific scoring mechanisms, such as rules. Interestingly, system complexity does not fully correlate to performance, as some of the best results are obtained by rather naïve lexical-based systems. The RTE-1 challenge was quite successful, with 17 teams taking part.
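The simplest of these methods can be illustrated with a short sketch: score a pair by the IDF-weighted proportion of hypothesis words covered by the text, then compare the score to a threshold. The threshold value and the toy document-frequency counts below are illustrative assumptions, not parameters of any RTE-1 system.

```python
import math

def idf(word: str, doc_freq: dict, n_docs: int) -> float:
    """Inverse document frequency; rare words get higher weight."""
    return math.log(n_docs / (1 + doc_freq.get(word, 0)))

def overlap_score(text: str, hyp: str, doc_freq: dict, n_docs: int) -> float:
    """IDF-weighted fraction of hypothesis words that appear in the text."""
    t_words = set(text.lower().split())
    h_words = hyp.lower().split()
    total = sum(idf(w, doc_freq, n_docs) for w in h_words)
    covered = sum(idf(w, doc_freq, n_docs) for w in h_words if w in t_words)
    return covered / total if total else 0.0

def entails(text: str, hyp: str, doc_freq: dict, n_docs: int,
            threshold: float = 0.6) -> bool:
    """Entailment decision: coverage above an empirically set threshold."""
    return overlap_score(text, hyp, doc_freq, n_docs) >= threshold
```

A real system would of course work over lemmas or stems rather than raw tokens, and would estimate the threshold on training data; even so, as noted above, such naïve lexical baselines proved surprisingly competitive in RTE-1.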
1.2.2. The Second PASCAL Recognising Textual Entailment Challenge
“The main goal of the second challenge was to support the continuation of research on textual entailment” (Bar-Haim et al., 2006). The general outline of the RTE-2 task was similar to that of RTE-1, in the sense that the purpose of the participating systems was to determine whether a hypothesis H is entailed by a text T. The data set was very different, as the organizers wanted to provide more realistic T-H pairs, based on outputs extracted from existing systems.
The data collection and annotation process was also refined: the guidelines for collection and annotation were improved, and the data was cross-annotated by three different annotators.
The RTE-2 data set consists of 1600 entailment pairs, built using the basic setting of RTE-1. The data set was split into two 800-pair sets, for training and testing. In terms of application setting, the organizers chose only four of the original seven applications from RTE-1: IR, IE, QA and multi-document summarization (CD in RTE-1). Within each application setting, annotators selected both positive and negative examples, in equal number; the various application settings are equally represented in the corpus.
The main task of RTE-2 was classification of entailment pairs as either positive or negative cases; the evaluation measure used was accuracy (percentage of correctly classified pairs). A secondary task consisted of ranking pairs according to their entailment confidence; this means that the first ranked pair would be the one for which entailment is most certain, while the last pair is the one for which the “no entailment” judgement was most certain. The top scoring systems are given below (in the table below, and in subsequent tables, BK denotes Background Knowledge):
Submission Accuracy Method
(Hickl et al., 2006) 0.7538 Lexical relations, subsequence overlap, syntactic matching, SRL, web statistics, ML classification
(Tatu et al., 2006) 0.7375 Lexical relations, logical inference, BK
(Zanzotto et al., 2006) 0.6388 Lexical relations, syntactic matching, web statistics, ML classification
(Adams, 2006) 0.586 Lexical relations, ML classification
Table 3: Top scoring systems from RTE-2
The results for RTE-2 are considerably higher than those for RTE-1, with top accuracies for RTE-2 more than 10% above those for RTE-1. The important thing about the results of RTE-2 is that they correlate with the complexity of the method, which is to say that more complex systems perform better on this data set. However, simple lexical systems can still achieve an accuracy of over 60% (Bar-Haim et al., 2006). Also, the average system complexity grew, as did the number of participants, proving the appeal of the task.
1.2.3. The Third PASCAL Recognising Textual Entailment Challenge
“RTE-3 followed the same basic structure of the previous campaigns, in order to facilitate the participation of newcomers and to allow "veterans" to assess the improvements of their systems in a comparable test exercise. Nevertheless, some innovations were introduced, on the one hand to make the challenge more stimulating and, on the other, to encourage collaboration between system developers” (Giampiccolo et al., 2007). Among the innovations are a limited number of entailment pairs for which the text was longer (up to a paragraph long), in order to make the scenario more similar to real life applications; it is worth noting, however, that the majority of the entailment pairs still had relatively short texts. Another innovation was the introduction of a resource pool, where participants could share the resources used. This was motivated by the conclusion drawn at the end of RTE-2, that modelling entailment requires large knowledge sources corresponding to different types of entailment reasoning. Also, entailment systems use various NLP tools, and thus the portal serves as a place for publicising and tracking resources, and for reporting on their usefulness.
As in the RTE-2 challenge, the RTE-3 corpus consisted of 1600 entailment pairs, equally divided into training and test sets. The pairs were drawn from the same application settings (IE, IR, QA, SUM) and the numbers of positive and negative examples were similar.
RTE-3 also saw the introduction of a pilot task set up by the US National Institute of Standards and Technology (NIST). The purpose of the pilot task was to explore two notions that are closely related to textual entailment: separating unknown cases from contradictions, and providing justifications for system decisions. The first subtask requires that the entailment systems make more precise distinctions by making a three-way decision between YES, NO and UNKNOWN, thus distinguishing between a hypothesis over which no significant judgement can be made and a hypothesis that explicitly contradicts the text. The second subtask was aimed at exploring ways in which users of entailment systems can receive justification for the entailment decision, as end users are unlikely to trust a result they cannot verify. The pilot task used the existing infrastructure of the RTE-3 challenge, by expanding the data sets with annotation for the three-way task. The results for both the main and pilot tasks are given below (BK is the acronym for Background Knowledge).
Submission Accuracy Method
(Hickl and Bensley, 2007) 0.8 Lexical relations, subsequence overlap, logical inference, ML classification, anaphora resolution, BK
(Tatu and Moldovan, 2007) 0.7225 Lexical relations, logical inference, anaphora resolution, BK
(Iftene and Balahur-Dobrescu, 2007) 0.6913 Lexical relations, syntactic matching, BK
(Adams et al., 2007) 0.67 Lexical relations, subsequence overlap, web-based statistics, ML classification
Table 4: Top scoring systems from RTE-3 main task
Submission Accuracy Method
(Hickl and Bensley, 2007) 0.73 Lexical relations, subsequence overlap, logical inference, ML classification, anaphora resolution, BK
(Tatu and Moldovan, 2007) 0.71 Lexical relations, logical inference, anaphora resolution
(Chambers et al., 2007) 0.59 Lexical relations, syntactic matching, logical inference, ML classification, anaphora resolution
(Iftene and Balahur-Dobrescu, 2007) 0.57 Lexical relations, syntactic matching, BK
Table 5: Top scoring systems from RTE-3 pilot task

1.2.4. TAC 2008 Recognizing Textual Entailment (RTE) Track
On the basis of the positive feedback received from researchers interested in the RTE challenge, the organizers of the RTE challenges introduced a series of new elements, in order to keep the task affordable and stimulating (Giampiccolo et al. 2008).
The first major change in the RTE challenge was the decision to join efforts with the National Institute of Standards and Technology, which proposed the pilot task in RTE-3, and with CELCT, which took part in all the previous RTE challenges; thus, the RTE challenge became a track of the Text Analysis Conference. The other major innovation was the introduction of the RTE-3 pilot task as a main task for RTE-4.
For the RTE-4 challenge, participants were given 1000 text pairs, called Text and Hypothesis, for which they were required to determine whether T entailed H. Textual entailment is defined as a directional relation between two text fragments – T, the entailing text and H, the entailed text – so that a human being, with common understanding of language and common background knowledge, can infer that H is most likely true on the basis of the content of T (Giampiccolo et al. 2008). Unlike previous challenges, the entailment systems had to make a three-way decision, distinguishing, among the pairs for which there is no entailment, whether the truth of H was contradicted by T or remained unknown on the basis of T. In short, the three-way task requires the system to decide whether:
T entailed H - in which case the pair was marked as ENTAILMENT
T contradicted H - in which case the pair was marked as CONTRADICTION
The truth of H could not be determined on the basis of T - in which case the pair was marked as UNKNOWN
The test set contained 1000 entailment pairs (300 each for IE and IR, as they were considered more difficult on the basis of previous evaluations, and 200 each for SUM and QA).
50% of the pairs were in an entailment relation, 35% were unknown and 15% were contradictions. On the basis of the three-way task, results for the two-way task can also be extracted. The results for the RTE-4 challenge are given below.
Submission | Accuracy (2-way task) | Accuracy (3-way task)
(Bensley and Hickl, 2008) | 0.746 | -
(Iftene, 2008) | 0.721 | 0.685
(Wang and Neumann, 2008) | 0.706 | 0.614
(Li et al., 2008) | 0.659 | 0.588
Table 6: Top scoring systems from RTE-4
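The extraction of two-way results from three-way judgements can be sketched as follows: ENTAILMENT is kept, while CONTRADICTION and UNKNOWN both collapse to NO. The label sets and the sample predictions below are illustrative only, not actual RTE-4 data.

```python
def to_two_way(label):
    """Collapse a three-way RTE label into the two-way scheme:
    ENTAILMENT maps to YES; CONTRADICTION and UNKNOWN both map to NO."""
    return "YES" if label == "ENTAILMENT" else "NO"

def accuracy(gold, predicted):
    """Fraction of pairs whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

# Illustrative labels only (not actual RTE-4 data).
gold3 = ["ENTAILMENT", "CONTRADICTION", "UNKNOWN", "ENTAILMENT"]
pred3 = ["ENTAILMENT", "UNKNOWN", "UNKNOWN", "CONTRADICTION"]

acc3 = accuracy(gold3, pred3)                    # three-way accuracy: 0.5
acc2 = accuracy([to_two_way(g) for g in gold3],  # two-way accuracy: 0.75
                [to_two_way(p) for p in pred3])
```

Note that a system can score higher on the two-way task than on the three-way task, since confusing CONTRADICTION with UNKNOWN is not penalized after the collapse.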
Compared to the RTE-3 results, the top systems in RTE-4 had lower results. This may be due to the fact that the training and test sets were not developed concurrently.
1.2.5. PASCAL Recognizing Textual Entailment Challenge (RTE-5) at TAC 2009
The RTE-58 track at TAC 2009 (Bentivogli et al. 2010) continues the previous RTE challenges, which aimed to focus research and evaluation on the underlying semantic inference task. The structure of the RTE-5 challenge is largely the same as that of the RTE-4 campaign, but the data sets used had two significant changes: 1) the average length of the Texts was greater, and 2) the texts came from a variety of sources and were used without any additional corrections or simplifications as compared to the source documents. These changes significantly increased the difficulty of the entailment task, but the reasoning behind the decision was that such texts are closer to real life applications of entailment.
The RTE-5 data set consists of 1200 pairs, of which half are used for training and the other half for testing. The pairs were drawn from the same application settings as those for previous challenges, with the exclusion of the summarization setting (entailment in the summarization setting was defined as a new pilot task). Also, the participants were required to submit ablation tests (runs of the entailment systems with components deliberately excluded) in order to judge the relevance of the knowledge sources employed by various systems.
The RTE-5 challenge also brought a new pilot task, defined by the organizers as solving textual entailment in a summarization setting and concerning the extraction of texts from a series of newspaper articles that yielded positive entailment for a given set of hypotheses. The difficulty of the task is twofold: first, the texts are not modified in any way as compared to the original source, so they may contain spelling errors, sentences with grammar errors, abbreviations and contractions, etc. The second problem is that there is a large number of candidate pairs: for every one of the nine topics there are approximately ten hypotheses, and for every hypothesis in a topic the number of candidate pairs is equal to the number of sentences. This leads to a very large search space, and reducing it becomes an important problem.
Submission | Accuracy (2-way task) | Accuracy (3-way task)
(Iftene and Moruz, 2010) | 0.6833 | 0.735
(Wang et al., 2010) | 0.6367 | 0.685
(Ferrandez et al., 2010) | 0.6 | -
(Malakasiotis, 2010) | 0.575 | -
(Mehdad et al., 2010b) | - | 0.6617
Table 7: Results for the RTE-5 main task
Submission | Precision | Recall | F-measure
(Mirkin et al., 2010, RTE) | 0.4098 | 0.5138 | 0.4559
(MacKinlay and Baldwin, 2010) | 0.4294 | 0.38 | 0.4032
(Mehdad et al., 2010a) | 0.2254 | 0.6475 | 0.3344
(Iftene and Moruz, 2010) | 0.5112 | 0.2288 | 0.3161
Table 8: Results for the RTE-5 pilot task
Even though the data sets for the RTE-5 challenge are significantly more complex than those of the previous challenges, the average score of the systems improved, and the top scores are similar, which suggests that entailment systems had reached maturity. Also, the introduction of the new pilot task offered a new perspective on the idea of textual entailment in a real life application, and highlighted the difficulties that arise when adapting systems built for entailment in an artificial setting.
1.2.6. PASCAL Recognizing Textual Entailment Challenge (RTE-6) at TAC 2010
The feedback provided through the RTE challenges by the NLP community has given the opportunity for the application of RTE systems to various settings and has pointed the RTE community towards more realistic scenarios. In particular, the RTE-5 Pilot Search Task represented a step forward, as for the first time textual entailment recognition was performed on a real text corpus. Furthermore, it was set up in the Summarization setting, attempting to analyze the potential impact of textual entailment on a real NLP application.
According to the RTE-6 guidelines9 the goals of the challenge are:
to advance the state of the art in RTE, by proposing a data set which reflects the natural distribution of entailment in a corpus and presents all the problems that can arise while detecting textual entailment in a natural setting, such as the interpretation of sentences in their discourse context;
to further explore the contribution that RTE engines can make to Summarization applications. In a general summarization setting, correctly extracting all the sentences entailing a given candidate statement for the summary (similar to Hypotheses in RTE) corresponds to identifying all its mentions in the text, which is useful to assess the importance of that candidate statement for the summary and, at the same time, to detect those sentences which contain redundant information and should probably not be included in the summary. Furthermore, if
automatic summarization is performed in the Update scenario (where systems are required to write a short summary of a set of newswire articles, under the assumption that the user has already read a given set of earlier articles), it is important to distinguish between novel and non-novel information. In such a setting, RTE engines which are able to detect the novelty of Hs can help Summarization systems filter out non-novel sentences from their summaries.
In the new setting of RTE-6, the challenge of identifying textual entailment changes in the following way: given a corpus, a hypothesis H and a set of candidate entailing sentences retrieved from the corpus using Lucene, the systems need to identify all the sentences among the candidates that entail the hypothesis. This setting for the entailment problem is very different from the one originally proposed in the previous RTE challenges; it is concerned with entailment within a corpus, with both the text and the hypothesis needing to be interpreted in the larger setting of the available knowledge, and it reflects the natural distribution of entailment (which is to say there are a lot more negative examples than positive ones).
In the RTE-6 main task scenario (Bentivogli et al., 2011), the test set is defined as a set of topics, each containing document clusters. For each of the topics, a set of up to 30 standalone sentences is extracted, and those sentences will serve as hypotheses for the topic. For each H in the topic a Lucene query is run (all the words in the hypothesis connected by the OR logical operator), and the top 100 ranked sentences are selected as candidates (it has been empirically determined that 80% of all the entailing sentences in the corpus are found in this manner). Based on the Main Task, a subtask is also defined. It is focused on Novelty Detection, which means that RTE systems are required to judge whether the information contained in each H is novel with respect to (i.e., not entailed by) the information contained in the corpus. If entailing sentences are found for a given H, it means that the content of the H is not new; in contrast, if no entailing sentences are detected, it means that information contained in the H is novel. Although the Novelty Detection Task has the same structure as the Main Task, it is separated out as a subtask to allow participants to optimize their RTE engines differently (i.e., for novelty detection).
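The candidate retrieval step described above (an OR query over all the words in H, keeping the top-ranked sentences) can be approximated without Lucene by scoring every sentence on its word overlap with the hypothesis. This is a simplified stand-in: a real Lucene ranking additionally weights terms, e.g. by tf-idf.

```python
def retrieve_candidates(hypothesis, sentences, top_k=100):
    """Approximate RTE-6 candidate selection: score every corpus sentence
    by how many distinct hypothesis words it contains (the effect of an
    OR query over all the words in H) and keep the top_k sentences.
    A real Lucene ranking would additionally weight terms by tf-idf."""
    h_words = set(hypothesis.lower().split())
    scored = []
    for position, sentence in enumerate(sentences):
        overlap = len(h_words & set(sentence.lower().split()))
        if overlap > 0:
            scored.append((overlap, position, sentence))
    # Highest overlap first; ties broken by corpus position.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [sentence for _, _, sentence in scored[:top_k]]
```

With top_k = 100 this mirrors the candidate pool size used in the RTE-6 setup.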
Submission | Precision | Recall | F-measure
(Jia et al., 2011) | 0.6857 | 0.3693 | 0.4801
(Majumdar and Bhattacharyya, 2011) | 0.5343 | 0.4286 | 0.4756
(Tateishi and Ishikawa, 2011, IKOMA) | 0.3971 | 0.5143 | 0.4481
(Kouylekov et al., 2011) | 0.4346 | 0.4603 | 0.4471
Table 9: Results for the RTE-6 main task
Submission | Precision | Recall | F-measure
(Jia et al., 2011) | 0.7239 | 0.97 | 0.8291
(Tateishi and Ishikawa, 2011) | 0.7944 | 0.85 | 0.8213
(Pakray et al., 2011) | 0.8058 | 0.83 | 0.8177
(Volokh et al., 2011) | 0.735 | 0.86 | 0.7926
(Iftene and Moruz, 2011) | 0.7328 | 0.85 | 0.787
Table 10: Results for the RTE-6 novelty detection subtask

1.3. Conclusions
In this section we have provided an overview of the RTE challenges, starting with the first, introduced in 2005, up to the latest, in 2010. Through the evolution of the data sets, the tasks and the results from year to year, the evolution of textual entailment systems can be followed. Because of this, and because the RTE challenge provides a unified testing ground for textual entailment engines, its importance cannot be overstated.
As can be seen from the large number of participants (the number of competitors has grown steadily until RTE-5), the interest of the NLP community for this problem is significant, which leads us to believe that improved textual entailment engines are still needed.
The rest of the thesis is structured as follows: chapter 2 describes the state of the art in textual entailment, chapter 3 describes the method proposed by us for solving textual entailment, chapter 4 gives evaluation results for the system we have built for solving textual entailment,
along with a brief description, chapter 5 discusses the adaptation of the entailment system we have built to the Romanian language, and conclusions are given in chapter 6.
2. State of the Art in Textual Entailment
The very first approaches to solving textual entailment were based on word overlap (Herrera, 2005), statistical lexical relations (Bayer et al., 2005), WordNet similarities (Herrera, 2005), syntactic matching (Delmonte et al., 2005), world knowledge (Bayer et al., 2005), logical inference (Akhmatova, 2005), inversion transduction grammars (Wu, 2005), (Jijkoun and de Rijke, 2005) or edit distance between parsing trees (Kouylekov and Magnini, 2005). Systems then evolved in a series of new directions, moving away from the initial shallow approaches, with systems making use of semantic role labelling (Hickl et al., 2006), Machine Learning classification (Inkpen et al., 2006 and Kozareva, 2006), use of background knowledge (Tatu et al., 2006), acquisition of entailment corpora (Hickl et al., 2006), and rule based entailment (Iftene, 2009).
2.1. Lexical Representation
One of the first successful systems for recognizing textual entailment was an application of the BLEU algorithm, described in (Pérez and Alfonseca, 2005). The BLEU (BiLingual Evaluation Understudy) algorithm was first described in (Papineni et al., 2001), and was initially used for ranking machine translation systems. It provides a score on the validity of a given translation candidate (usually obtained automatically, by means of machine translation) on the basis of its similarity to a manually obtained translation of the same source text. In order to cope with variability, the candidate is compared to a set of manually obtained translations, and the top score is chosen; BLEU boosts the similarity score for a given candidate in the case of matching n-grams. It is worth noting that after computing the score of the co-occurring n-grams, high scores of very short candidates are penalized on the basis of their length, as it is considered that a translation variant should not be significantly different in length from the reference text. This drawback, coupled with the fact that the BLEU algorithm is inherently symmetrical, makes the adaptation to the task of recognizing textual entailment non-trivial.
The application of the BLEU algorithm for textual entailment consists of comparing the entailing text (T) and the hypothesis (H); according to the score provided by BLEU, the entailment is then judged to be true or false. According to (Pérez and Alfonseca, 2005), using T as the reference and H as the candidate is best, as T is usually longer and therefore contains more
information, which helps the BLEU processing, as the quality of the references is very important.
This adaptation, which essentially consists of only comparing part of T against H, circumvents both problems outlined above with the usage of the BLEU algorithm.
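A stripped-down version of the adapted algorithm can be sketched as below, with T as the reference and H as the candidate. The restriction to unigram and bigram precisions and the entailment threshold of 0.3 are illustrative simplifications, not the settings of (Pérez and Alfonseca, 2005).

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(reference, candidate, max_n=2):
    """Stripped-down BLEU: geometric mean of clipped n-gram precisions
    (n = 1..max_n) times a brevity penalty. A single reference is used,
    matching the entailment adaptation where T is the reference and H
    the candidate."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        precisions.append(clipped / max(sum(cand_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty: short candidates are penalized relative to the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

def entails(text, hypothesis, threshold=0.3):
    """Judge entailment by thresholding the BLEU score of H against T;
    the threshold value here is illustrative, not the tuned one."""
    return bleu(text, hypothesis) >= threshold
```

Since H is typically shorter than T, the brevity penalty always applies to some degree, which is one reason the choice of T as reference matters.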
The very good result obtained by the system (it was ranked first in RTE-1, with an accuracy of over 70%) proves the potential of the approach. However, this method is not without drawbacks: it does not make any use of syntactic information, and in the case of hypotheses that have many words in common with the texts, the answer will always be positive entailment.
Another significant drawback is that this approach does not work well on long texts (such as those used in the RTE-5 challenge), as it becomes very difficult to extract significant references that are both relevant and valid for determining entailment (not extracted from different sentences for example).
With the growing complexity of the task of recognizing textual entailment, lexical based approaches have grown significantly more complex, by making use of wider and wider lexical- semantic resources. A brief overview of some of the most significant approaches in recent years is given below.
BLUE-Lite (Clark and Harrison, 2011) is a derivative of the RTE-5 system BLUE (Clark and Harrison, 2010), and is characterized by the following features:
1. To compare T and H, both are first parsed with SAPIR, a broad coverage chart parser (Harrison and Maxwell, 1986), and transformed into bags of words which also include multiwords present in WordNet; all functional words are ignored, as well as some light semantic verbs listed by the authors.
2. The approach could thus be described as “knowledge-based lexical entailment”; it makes use of a series of relations from WordNet, such as the WN morphosemantic database, which relates nouns to verbs (build#v1 -agent→ maker#n1), the “pertains-to” relation (rapidly#r1 ←pertains-to→ quick#a1), the “similar-to” relation for adjectives (nice#a1 ←similar-to→ pleasant#s2), the “part-of” relation, and the “substance-of” relation. The approach also uses the DIRT paraphrase database.
3. To account for coreference and discourse context, the system attempts to match those words in H that cannot be matched to words in T against words in the previous sentence.
4. Some topics were intrinsically harder to find entailments for than others. To account for this, the system gradually relaxes the restrictions on the matching of the words in H, by allowing at most 2 words to be matched with sentences preceding T.
The most important thing to note is that, unlike BLUE, BLUE-Lite does not make use of any structural (parse-based) information in entailment decisions, but still obtains reasonably good results (in relative terms), with an f-score of more than 44% on the RTE-6 test set10.
(Majumdar and Bhattacharyya, 2011) describes a lexical based system for solving entailment that was used for the RTE-6 competition, which is similar to the one proposed by (Clark and Harrison 2011), yielding good results (47%). The system draws on lexical knowledge from WordNet (for measuring the length of the path between words) and VerbOcean (for determining entailment between verbs), and uses the Stanford named entity recognition system, enhanced with acronym detection. The system is built to take into account various types of coreference, but the reported tests do not include the coreference recognition module.
(Perini, 2011) is a lexical based entailment system used in RTE-6, and it is an evolution of the system described in (Perini, 2010) that uses directional text relatedness conditions for detecting textual entailment between sentences. The system is centred on computing the text relatedness score between T and H, with respect to both T and H. In general, if T entails H, the directional relation of entailment can be described as sim_H(T, H) > sim_T(T, H), i.e. the similarity between T and H computed with respect to H is greater than the similarity between T and H computed with respect to T, where sim is the measure of similarity between H and T. The approach is based on the method proposed by (Tătar et al., 2009), which describes a directional approach to solving textual entailment, based on the notion of lexical-semantic similarity. The system described in (Perini, 2011) uses WordNet to compute similarity between words, by assigning scores on the basis of the relation detected using WN (if two words are in the same synset, the score is 0.7).
This approach yielded middle-range results in RTE-5, and comparatively high results in RTE-6 (40% f-measure).
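The directional condition sim_H(T, H) > sim_T(T, H) can be illustrated with plain word overlap standing in for the WordNet-based word similarity used by the actual system:

```python
def directional_similarity(text, hypothesis):
    """Word-overlap between T and H normalized two ways: with respect to H
    (how much of H is covered) and with respect to T. Under the directional
    condition sim_H(T, H) > sim_T(T, H), entailment is suggested when H is
    covered better than T is. Plain set overlap stands in for the
    WordNet-based word similarity of the actual system."""
    t_words = set(text.lower().split())
    h_words = set(hypothesis.lower().split())
    overlap = len(t_words & h_words)
    sim_wrt_h = overlap / len(h_words) if h_words else 0.0
    sim_wrt_t = overlap / len(t_words) if t_words else 0.0
    return sim_wrt_h, sim_wrt_t

sim_h, sim_t = directional_similarity("john is walking fast", "john is walking")
# sim_h = 1.0 (H fully covered by T), sim_t = 0.75, so sim_h > sim_t.
```

The asymmetry of the two normalizations is what makes the measure directional: swapping T and H swaps the two scores.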
(Tateishi and Ishikawa, 2011) proposes a new method for identifying textual entailment by making use of local-novelty detection. The method is based on determining whether a hypothesis H is local-novel: “H being local-novel” means that the information described by H first appeared in T_H, where T_H denotes the text (T) that entails H.
10 It is important to note that results in different RTE competitions cannot be compared due to differences in datasets.
Based on this description, it follows that if H is local-novel, the first possible entailing text has to be T_H, and therefore any T that appeared before it has to be non-entailing. The authors define six rules for determining the local-novelty feature; on the basis of this feature, they provide a similarity threshold, which describes the minimum similarity of T and H required to obtain entailment.
(Blake et al., 2011) explores the impact that semantics alone has on the accurate detection of entailment, in order to better judge the impact of this type of component on an entailment system. The authors considered various lexical-semantic knowledge bases and resources, and eventually used WordNet and YAGO. The reason for using YAGO, developed as part of the YAGO-NAGA system, is that it seems to be more comprehensive in terms of names than WordNet. YAGO is a bootstrapped ontology, which uses information given in WordNet to infer further information from the web. The disadvantage of this approach as compared to manually constructed resources such as WordNet is the fact that automatic bootstrapping can lead to errors. Names of persons and organizations, locations and numbers are compared on the basis of the results obtained by the lexical knowledge bases, and a similarity score is computed. The rest of the words (those not recognized as NEs or co-reference, for example) are matched lexically, on the basis of the lemma form, or semantically, by means of synonymy in WordNet.
(Pakray et al., 2010) describes a system based on the composition of six lexical matching methods: WordNet based unigram match, bigram match, longest common subsequence, skip-gram (any combination of n words in the order in which they appear in the text, allowing for arbitrary gaps), stemming and named entity matching. The authors then trained a number of classifiers on the features described above, which they then used as a voting system for determining entailment. Since the classifiers had to be in complete agreement in order for the system to make a positive decision, the best results are obtained when only two classification methods are used.
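Three of the six matching methods can be sketched as simple feature functions; plain whitespace tokenization stands in for the WordNet-backed matching of the original system:

```python
def unigram_match(text, hypothesis):
    """Fraction of the hypothesis's word types that appear in the text."""
    tw, hw = set(text.split()), set(hypothesis.split())
    return len(tw & hw) / len(hw) if hw else 0.0

def bigram_match(text, hypothesis):
    """Fraction of the hypothesis's bigrams that appear in the text."""
    t, h = text.split(), hypothesis.split()
    tb, hb = set(zip(t, t[1:])), set(zip(h, h[1:]))
    return len(tb & hb) / len(hb) if hb else 0.0

def lcs_length(text, hypothesis):
    """Length of the longest common subsequence of words, computed with
    the classic dynamic programming recurrence."""
    a, b = text.split(), hypothesis.split()
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]
```

Each function's output would become one feature fed to the voting classifiers described above.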
(Breck, 2010) proposes a novel approach to detecting entailment, on the basis of the observation that if T entails H, it follows that all of the information in H must be contained within T. Therefore, the author proposes a system that can determine whether all the units of information in H can be found in T; if there exists at least one unit of information in H that cannot be found in T, the result for that entailment pair is Unknown. If all the units of
information in H can be found in T, it is also possible that the result is Contradiction, and so a second, simpler system is used for determining contradictions. The units of information taken into account are words, but the author only takes into account open-class words such as common nouns, proper nouns, verbs, adjectives, adverbs, and numbers. Based on experiments carried out by the author, the system gains most from using common nouns, proper nouns, and numbers. The matching methods employed are exact string match, edit distance, acronyms, and lexicon based matching (WordNet and Dekang Lin's automatically derived thesaurus, which experiments showed did not have high enough accuracy11).
(MacKinlay and Baldwin, 2010) proposes a system specifically designed for solving the task of textual entailment in a summarization setting, on the basis of variants of standard document comparison techniques. The system computes the lexical overlap of candidate texts for a given hypothesis by means of cosine similarity over bag of words features, weighted by inverse document frequency. The system also uses WordNet for determining synonymy and derivationally related forms, and the results obtained are above the median for the RTE-5 pilot task.
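The IDF-weighted cosine comparison can be sketched as below; the whitespace tokenization and the exact weighting scheme are simplifications of the actual system:

```python
import math
from collections import Counter

def idf_weights(documents):
    """Inverse document frequency over a collection: idf(w) = log(N / df(w))."""
    n = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(doc.split()))
    return {w: math.log(n / count) for w, count in df.items()}

def idf_cosine(a, b, idf):
    """Cosine similarity between the IDF-weighted bag-of-words vectors
    of two sentences; words unseen at IDF-estimation time get weight 0."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] * idf.get(w, 0.0) ** 2 for w in set(va) | set(vb))
    norm_a = math.sqrt(sum((va[w] * idf.get(w, 0.0)) ** 2 for w in va))
    norm_b = math.sqrt(sum((vb[w] * idf.get(w, 0.0)) ** 2 for w in vb))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

The IDF weighting downplays frequent function words, so the similarity is driven by content words, which matters when comparing a short hypothesis against many candidate texts.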
This subsection described the most successful approaches to solving textual entailment by using lexical and lexical-semantic features. Even though this type of approach is hindered by the lack of more complex analyses, such as syntactic parsing for example, the results described are quite good, especially for the summarization setting of textual entailment. This proves that, in a significant number of entailment cases, the measure of lexical similarity is still significant.
2.2. Syntactic Graph Distance
Graph distance is a method that is widely recognized as a powerful tool for solving computer vision problems and for performing pattern matching (Bunke and Shearer, 1998), and was the method first used by (Pazienza et al., 2005) for creating a textual entailment system.
Syntactic graph distance based approaches work by transforming the objects that need comparing (in the case of textual entailment, the text and the hypothesis) into graphs, thus transforming the entailment problem into a graph matching problem.
Since any natural language sentence can be represented by a syntactic graph, textual entailment can be reduced to a graph similarity problem, albeit one with some particular properties (Pazienza et al., 2005): the textual entailment problem is not symmetric, node similarity cannot be reduced to similarity between node labels (we cannot compare pre-terminal nodes in the syntactic trees), and linguistic transformations need to be taken into account. Therefore, on the basis of graph similarity, entailment can be defined as a non-symmetric transitive relation which holds in the following cases (Pazienza et al., 2005):
T is syntactically subsumed by H (e.g., in H: [John is walking.] and T: [John is walking fast.], T contains an adverb, making the text more specific than the hypothesis).
T is semantically subsumed by H (e.g., in H: [John runs home.] and T: [John sprints home.], where the verb “run” is more general than the verb “sprint”).
The information in H is directly implied by the information in T (e.g., T: [John is snoring.] and H: [John is sleeping.]; the fact that John is snoring cannot take place unless John is sleeping).
The system described in (Pazienza et al., 2005) attempts to find the maximal subgraph within the syntactic representation of T that is isomorphic with the syntactic representation of H.
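As an illustration, the degree to which H's syntactic graph is subsumed by T's can be approximated by the fraction of H's dependency edges that also appear in T. The hand-built edges below are hypothetical, and the real system performs proper subgraph matching with node similarity and linguistic transformations rather than exact label comparison:

```python
def edge_overlap(t_edges, h_edges):
    """Proxy for the maximal common subgraph search: the fraction of the
    hypothesis's dependency edges (head, relation, dependent) that also
    occur in the text's graph. Exact label matching is used here; the
    actual system also allows semantic node similarity and linguistic
    transformations."""
    if not h_edges:
        return 0.0
    return len(set(h_edges) & set(t_edges)) / len(h_edges)

# Hand-built (hypothetical) dependency edges for
# T: "John is walking fast" and H: "John is walking".
t_edges = [("walking", "nsubj", "John"), ("walking", "aux", "is"),
           ("walking", "advmod", "fast")]
h_edges = [("walking", "nsubj", "John"), ("walking", "aux", "is")]
# edge_overlap(t_edges, h_edges) == 1.0: H's graph is fully contained in T's.
```

Note the asymmetry: normalizing by H's edges rather than T's reflects the directional nature of entailment discussed above.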
This method is more advanced than the lexical similarity based approaches, and has the advantage of taking into account both the syntactic and the semantic levels. Graph similarity has been used by many systems, including those of (Katrenko and Adriaans, 2006), (Zanzotto et al., 2006), (Burchardt et al., 2007), (Ferrés and Rodríguez, 2007) and (Padó et al., 2008). More recent graph distance based approaches are given below.
(Mehdad et al., 2010b) proposes a textual entailment engine based on the notion of syntactic/semantic kernels. Syntactic kernels were first explored by the authors in (Zanzotto and Moschitti, 2006), where the basis of the approach to solving entailment was the alignment of syntactic subtrees from T and H by means of cross-pair syntactic kernels, which were then used to train an SVM classifier. Standard tree kernel functions measure the similarity of two trees in terms of the number of tree fragments (subtrees or substructures) that they have in common, with the addition of so-called “placeholders”, which are the means used to track the movement of aligned words in the text and hypothesis. However, this first approach, even though it performed quite well on the RTE-1 dataset, is limited by the fact that two syntactically identical subtrees which have different leaves do not match even if the leaves, i.e. the words, are synonyms.
(Mehdad et al., 2010b) proposes the augmentation of the previous syntactic kernels with semantic knowledge based on Wikipedia, thus creating an entailment system based on syntactic/semantic kernels. More specifically, Wikipedia was used to enrich the similarity measure between pairs of text and hypothesis (i.e. the tree kernel for text and hypothesis pairs) with a lexical similarity (i.e. the similarity between the leaves of the trees). The lexical similarity is defined as a proximity matrix, which is filled in by considering that words are semantically related if they co-occur frequently in Wikipedia articles. The Syntactic Semantic Tree Kernel (SSTK) is the augmentation of the Syntactic Tree Kernel defined in (Zanzotto and Moschitti, 2006) with the similarity measure derived from Wikipedia. In order to reduce the search space, only the 200,000 most visited Wikipedia articles were used for extracting similarity between terms, and similarity was computed only for those words that actually appeared in the text-hypothesis pairs. The kernel used for the system is the Max Similarity Kernel, though the authors also propose the Placeholder Kernel as an alternative. The MSK-SSTK combination was used in the system for RTE-5, which achieved 66% accuracy in the 2-way recognition task, well above the median and 6% below the top scoring system in the competition. A more detailed description of the tree kernels employed is given in (Mehdad et al., 2010c). The coverage and usefulness of WordNet and of the British National Corpus (BNC, a balanced synchronic text corpus containing 100 million words with morpho-syntactic annotation) are also explored, and shown to be much less than those of Wikipedia, by quite a large margin (Wikipedia coverage is almost double that of both WordNet and the British National Corpus).
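The construction of such a proximity matrix can be sketched from raw co-occurrence counts over a document collection. Normalizing the pair count by the smaller of the two word frequencies is an assumption made here for illustration, not necessarily the measure used in (Mehdad et al., 2010b):

```python
from collections import Counter
from itertools import combinations

def cooccurrence_similarity(articles):
    """Build a word-pair proximity table from co-occurrence counts: two
    words are taken to be semantically related if they often occur in
    the same article. Normalizing the pair count by the smaller of the
    two word frequencies is one simple choice, made here for illustration."""
    word_freq = Counter()
    pair_freq = Counter()
    for article in articles:
        words = set(article.split())
        word_freq.update(words)
        pair_freq.update(frozenset(pair) for pair in combinations(sorted(words), 2))
    return {pair: count / min(word_freq[w] for w in pair)
            for pair, count in pair_freq.items()}
```

Restricting the computation to words that actually occur in the text-hypothesis pairs, as the authors do, keeps the matrix small enough to be practical.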
2.3. Tree Edit Distance Algorithms
One of the first uses of tree edit distance algorithms for textual entailment is the one proposed by (Kouylekov and Magnini, 2005), which describes a system based on a tree edit algorithm applied on the dependency trees of both the text and the hypothesis. According to (Kouylekov and Magnini, 2005), “T entails H if there exists a sequence of transformations applied to T such that we can obtain H with an overall cost below a certain threshold”. This is based on the intuition that pairs where the entailment relation holds have low transformation costs, in a similar fashion to the approach proposed in subsection 2.2. The difference in this type