
(1)

Evaluation metrics and Question-Answering systems
Slides offered by Adrian Iftene

(2)

}  NLP Systems Evaluation

}  Information Retrieval

}  Information Extraction

}  Question Answering

◦  Introduction

◦  System components

– Background knowledge indexing

–  Index creation and information retrieval

–  Answer extraction

◦  Results

◦  Error analysis

}  Conclusions

(3)

}  “An important recent development in NLP has been the use of much more rigorous standards for the evaluation of NLP systems” (Manning and Schütze)

}  To be published, all research must:

◦  establish a baseline, and

◦  quantitatively show that it improves on the baseline and the state-of-the-art

(4)

}  “How well does the system work?”

}  Possible domains for evaluation

◦  Processing time of the system

◦  Space usage of the system

◦  Human satisfaction

◦  Correctness of results

}  Measures: (Accuracy, Error), (Precision, Recall, F-measure)

(5)

}  By comparing the output of the system with a gold standard, we can verify what is correct

}  The results of a system are marked as:

◦  Correct: matches the gold standard

◦  Incorrect: otherwise

(6)

}  Accuracy = 66.66 %

}  Error = 33.33 %
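The worked example on this slide was an image; its percentages imply three system outputs of which two match the gold standard (the counts 2 and 3 are an assumption read off the figures):

$$\text{Accuracy} = \frac{\#\,\text{correct}}{\#\,\text{total}} = \frac{2}{3} \approx 66.66\,\% \qquad \text{Error} = 1 - \text{Accuracy} = \frac{1}{3} \approx 33.33\,\%$$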

(7)

}  Precision and Recall are set-based measures

}  They evaluate the quality of a predicted set membership against a reference set membership

}  Precision: what proportion of the retrieved documents is relevant?

}  Recall: what proportion of the relevant documents is retrieved?
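In set notation, writing Rel for the set of relevant documents and Ret for the set of retrieved documents, the two definitions read:

$$P = \frac{|Rel \cap Ret|}{|Ret|} \qquad R = \frac{|Rel \cap Ret|}{|Rel|}$$

The three examples on the following slides instantiate exactly these formulas.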

(8)

[Venn diagram: relevant documents vs. retrieved documents]

Precision = 4 / 10 = 40 % Recall = 4 / 14 = 28.57 %

(9)

[Venn diagram: relevant documents vs. retrieved documents]

Precision = 14 / 20 = 70 % Recall = 14 / 14 = 100 %

(10)

[Venn diagram: relevant documents vs. retrieved documents]

Precision = 0 / 6 = 0 % Recall = 0 / 14 = 0 %

(11)

}  F-measure is a measure of a test's accuracy; it considers both the precision p and the recall r

}  General formula: Fβ = (1 + β²) · p · r / (β² · p + r)

}  F1-measure: F1 = 2 · p · r / (p + r)

}  F2-measure = ?
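As a small illustration, a helper along these lines computes the general F-measure from the formula above (a sketch, not code from the original slides; answering the F2 exercise is left to the reader):

```java
/** F-measure helpers; a sketch based on the standard definitions above. */
public class FMeasure {

    /** General F-beta: (1 + beta^2) * p * r / (beta^2 * p + r). */
    public static double fBeta(double precision, double recall, double beta) {
        double b2 = beta * beta;
        double denom = b2 * precision + recall;
        return denom == 0 ? 0 : (1 + b2) * precision * recall / denom;
    }

    public static void main(String[] args) {
        // Example figures from slide (8): P = 0.40, R = 0.2857
        System.out.println(fBeta(0.40, 0.2857, 1.0)); // F1 ~ 0.333
    }
}
```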

(12)

}  Question Answering (QA): a QA system takes as input a question in natural language and produces one or more ranked answers from a collection of documents.

(13)

}  QA systems normally adhere to a pipeline architecture composed of three main modules (Harabagiu and Moldovan, 2003), sketched as interfaces after this list:

◦  question analysis – the results are keywords, the expected answer type, the question type, and the question focus

◦  paragraph retrieval – the results are a set of relevant candidate paragraphs/sentences from the document collection

◦  answer extraction – the results are a set of candidate answers ranked using likelihood measures
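A hypothetical set of Java interfaces mirroring this pipeline (all type and method names are illustrative; the original slides contain no code):

```java
import java.util.List;

// Hypothetical types for the three-stage pipeline of
// Harabagiu and Moldovan (2003); names are illustrative only.
record AnalyzedQuestion(List<String> keywords, String answerType,
                        String questionType, String focus) {}
record ScoredAnswer(String text, double likelihood) {}

interface QuestionAnalyzer {
    AnalyzedQuestion analyze(String question);           // question analysis
}
interface ParagraphRetriever {
    List<String> retrieve(AnalyzedQuestion q);           // paragraph retrieval
}
interface AnswerExtractor {                              // answer extraction
    List<ScoredAnswer> extract(AnalyzedQuestion q, List<String> paragraphs);
}
```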

(14)

}  Harabagiu and Moldovan, 2003:

◦  Factoid – “Who discovered oxygen?”, “When did Hawaii become a state?” or “What football team won the World Cup in 1992?”

◦  List – “What countries export oil?” or “What are the regions preferred by the Americans for holidays?”

◦  Definition – “What is a quasar?” or “What is a question-answering system?”

◦  How, Why, hypothetical, semantically constrained, polar (Yes/No) and cross-lingual questions

(15)

}  Person – “What”, “Who”, “Whom”, “With who”

}  Location (City, Country, and Region) – “What state/city”, “From where”, “Where”

}  Organization – “Who produced”, “Who made”

}  Temporal (Date and Year) – “When”

}  Measure (Length, Surface and Other) – “How much”

}  Count – “How many”

}  Yes/No – “Did the girl fear that?”, “Is the car blue?”
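A naive pattern-based classifier over these trigger phrases could look like the following (a sketch; the phrases are taken from the list above, while the priority order and class names are assumptions):

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Naive question-type classifier over the trigger phrases listed above. */
public class QuestionClassifier {
    // Checked in insertion order, so more specific phrases come first.
    private static final Map<String, String> TRIGGERS = new LinkedHashMap<>();
    static {
        TRIGGERS.put("how many", "COUNT");
        TRIGGERS.put("how much", "MEASURE");
        TRIGGERS.put("who produced", "ORGANIZATION");
        TRIGGERS.put("who made", "ORGANIZATION");
        TRIGGERS.put("what state", "LOCATION");   // "What state/city" split in two
        TRIGGERS.put("what city", "LOCATION");
        TRIGGERS.put("from where", "LOCATION");
        TRIGGERS.put("where", "LOCATION");
        TRIGGERS.put("when", "TEMPORAL");
        TRIGGERS.put("with who", "PERSON");
        TRIGGERS.put("whom", "PERSON");
        TRIGGERS.put("who", "PERSON");
        TRIGGERS.put("what", "PERSON");           // listed under Person on the slide
    }

    public static String classify(String question) {
        String q = question.toLowerCase();
        // Polar (yes/no) questions typically start with an auxiliary verb.
        if (q.startsWith("did ") || q.startsWith("is ") || q.startsWith("are "))
            return "YES_NO";
        for (Map.Entry<String, String> e : TRIGGERS.entrySet())
            if (q.contains(e.getKey())) return e.getValue();
        return "OTHER";
    }
}
```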

(16)

}  Sources: local collections, internal organization documents, newspapers, the Internet

}  Closed-domain – deals with questions from a specific domain (medical, baseball, etc.); can exploit domain-specific knowledge (ontologies, rules, disambiguation)

}  Open-domain – general questions about anything; can use general knowledge about the world

(17)

}  The first QA systems were created in the 1960s:

◦  BASEBALL (Green, 1963) – answered questions about baseball games

}  LUNAR (Woods, 1977) – geological analysis of rocks returned by the Apollo moon missions

}  IURES (Cristea, Tufiş, Mihăiescu, 1985) – medical domain, querying the national programs library

}  QUERNAL (Cristea, Tufiș, 1987) – personnel database, drilling and extraction, metallurgy, geography

(18)

}  Powerset: http://www.powerset.com/ (http://www.bing.com/)

}  Assimov the chat bot: http://talkingrobot.org/b/

}  AnswerBus: http://www.answerbus.com/index.shtml

}  NSIR: http://tangra.si.umich.edu/clair/NSIR/html/nsir.cgi

}  START (the first question answering system): http://start.csail.mit.edu/

(19)
(20)
(21)
(22)

The image cannot be displayed. Your computer may not have enough memory to open the image, or the image may have been corrupted. Restart your computer, and then open the file again. If the red x still appears, you may have to delete the image and then insert it again.The image cannot be displayed. Your computer may not have enough memory to open the image, or the image may have been corrupted. Restart your computer, and then open the file again. If the red x still appears, you may have to delete the image and then insert it again.

(23)

}  CLEF (Cross Language Evaluation Forum) – started in 2000, http://www.clef-campaign.org/; covers European languages in both monolingual and cross-language contexts

◦  Coordination: Istituto di Scienza e Tecnologie dell'Informazione, Pisa, Italy

◦  Romanian Institute for Computer Science, Romania

}  TREC (Text REtrieval Conference) – started in 1992, http://trec.nist.gov/

◦  National Institute of Standards and Technology (NIST), Gaithersburg, Maryland, USA

(24)

An excerpt from the gold standard file

(25)

}  Our group has participated in CLEF exercises since 2006:

◦  2006 – Ro–En (English collection) – 9.47% right answers

◦  2007 – Ro–Ro (Romanian Wikipedia) – 12 %

◦  2008 – Ro–Ro (Romanian Wikipedia) – 31 %

◦  2009 – Ro–Ro, En–En (JRC-Acquis) – 47.2 % (48.6%)

◦  2010 – Ro-Ro, En-En, Fr-Fr (JRC-Acquis, Europarl) – 47.5% (42.5%, 27 %)


(26)

[Architecture diagram: the background knowledge is indexed as Lucene index 1; the test data (documents, questions, possible answers) goes through questions processing and answers processing (lemmatization, stop words elimination, NEs identification, Lucene query building); relevant documents are identified and indexed as Lucene indexes 2; the output is partial and global scores per answer.]

(27)

}  The Romanian background knowledge has 161,279 documents in text format

◦  25,033 correspond to the AIDS topic

◦  51,130 to Climate Change topic

◦  85,116 to Music and Society topic

}  The indexing component indexes both the name of each file and the text inside it => Lucene index 1
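A minimal sketch of how such an index could be built with a recent Lucene release (the directory layout, field names, and analyzer choice are assumptions; the slides do not show the actual code):

```java
import java.nio.file.*;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

/** Builds "Lucene index 1" over the background-knowledge text files (sketch). */
public class BackgroundIndexer {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer =
                 new IndexWriter(FSDirectory.open(Paths.get("lucene-index-1")), cfg);
             DirectoryStream<Path> files =
                 Files.newDirectoryStream(Paths.get("background-knowledge"), "*.txt")) {
            for (Path file : files) {
                Document doc = new Document();
                // Both the file name and the text content are indexed, as on the slide.
                doc.add(new StringField("filename", file.getFileName().toString(),
                                        Field.Store.YES));
                doc.add(new TextField("text", Files.readString(file), Field.Store.NO));
                writer.addDocument(doc);
            }
        }
    }
}
```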

(28)

}  The test data was an XML file with 12 test documents:

◦  4 documents for each of the three topics (12 in total)

◦  10 questions for each document (120 in total)

◦  5 possible answers for each question (600 in total)

}  Test data processing involved 3 operations:

◦  extracting documents

◦  processing questions

◦  processing possible answers

(29)

}  The content of <doc> => <topic id>\<reading test id>\1..10

[Figure: an example <doc> value, with the topic id and reading test id parts highlighted]

(30)

}  Stop words elimination

}  Lemmatization

}  Named Entity identification

}  Lucene query building
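The four steps above end in a Lucene query; a rough sketch of the query-building step (the field name, boost value, and method names are assumptions, not the system's actual code):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

/** Builds a Lucene query from the processed question terms (sketch). */
public class QueryBuilder {
    public static Query build(Iterable<String> lemmas, Iterable<String> namedEntities) {
        BooleanQuery.Builder b = new BooleanQuery.Builder();
        // Ordinary lemmas are optional (SHOULD) clauses.
        for (String lemma : lemmas)
            b.add(new TermQuery(new Term("text", lemma)), BooleanClause.Occur.SHOULD);
        // Named entities are boosted, since they anchor the question.
        for (String ne : namedEntities)
            b.add(new BoostQuery(new TermQuery(new Term("text", ne)), 2.0f),
                  BooleanClause.Occur.SHOULD);
        return b.build();
    }
}
```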

(31)

}  Similar to the processing of questions, plus:

}  We use an ontology (Iftene and Balahur, 2008) to eliminate possible answers with a low probability of being the final answer (relation [is_located_in]); a simplified sketch follows below

}  Example: In which European cities has Annie Lennox performed?

}  We eliminate from the list of possible answers those with non-European cities (we replace non-European cities with the value XXXXX)
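A heavily simplified version of that filtering step (the ontology lookup is stubbed with a plain set; all names here are illustrative, not the actual interface of Iftene and Balahur, 2008):

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

/** Replaces answers whose city is not located in Europe with XXXXX (sketch). */
public class AnswerFilter {
    // Stand-in for the [is_located_in] ontology relation.
    private static final Set<String> EUROPEAN_CITIES =
        Set.of("London", "Paris", "Berlin", "Bucharest", "Vienna");

    public static List<String> filter(List<String> candidateCities) {
        return candidateCities.stream()
            .map(city -> EUROPEAN_CITIES.contains(city) ? city : "XXXXX")
            .collect(Collectors.toList());
    }
}
```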

(32)

}  For every question, we index the relevant documents returned by Lucene in the previous step together with the relevant documents saved from the initial test file

(33)

}  Then, in every index, we performed searches using the Lucene queries associated with the possible answers

}  For every answer, we obtained a list of documents with Lucene relevance scores

}  Score2(d, a) is the relevance score of document d when searching with the Lucene query associated with answer a
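Obtaining Score2(d, a) then amounts to running each answer's query against the per-question index and reading off Lucene's relevance scores; roughly (a sketch, with the index path assumed):

```java
import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.*;
import org.apache.lucene.store.FSDirectory;

/** Retrieves Lucene relevance scores for one answer's query (sketch). */
public class AnswerScorer {
    public static void scoreAnswer(Query answerQuery) throws Exception {
        try (DirectoryReader reader =
                 DirectoryReader.open(FSDirectory.open(Paths.get("lucene-indexes-2")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs top = searcher.search(answerQuery, 10);
            for (ScoreDoc sd : top.scoreDocs) {
                // sd.score plays the role of Score2(d, a) for document sd.doc.
                System.out.println("doc=" + sd.doc + " score2=" + sd.score);
            }
        }
    }
}
```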

(34)

}  Results of UAIC’s runs at question answering level:

                      Ro-Ro         En-En
answered right        30    11    19    25
answered wrong        85    19    43    47
total answered       115    30    62    72
unanswered right       0    19    11    12
unanswered wrong       0    66    42    34
unanswered empty       5     5     5     2
total unanswered       5    90    58    48
Overall accuracy    0.25  0.09  0.16  0.21
C@1 measure         0.26  0.16  0.23  0.29
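The C@1 figures in the last row are consistent with the standard definition used in these CLEF evaluations (Peñas and Rodrigo), which rewards leaving a question unanswered over answering it wrongly; with n = 120 questions, n_R questions answered right, and n_U questions left unanswered:

$$C@1 = \frac{1}{n}\left( n_R + n_U \cdot \frac{n_R}{n} \right)$$

For the first run, (30 + 5 · 30/120) / 120 ≈ 0.26, matching the table.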

(35)

}  Yes–no question: http://en.wikipedia.org/wiki/Yes%E2%80%93no_question

}  Question Answering: http://en.wikipedia.org/wiki/Question_answering

}  Lecture 13: Evaluation: Precision and Recall – http://courses.washington.edu/ling473/Lecture13.pdf

}  Precision and Recall of Five Search Engines for Retrieval of Scholarly Information in the Field of Biotechnology: http://www.webology.org/2005/v2n2/a12.html
