(1)

Text-Mining Tutorial

Marko Grobelnik, Dunja Mladenic

J. Stefan Institute, Slovenia

(2)

What is Text-Mining?

„ “…finding interesting regularities in large textual datasets…” (Usama Fayyad, adapted)
„ …where interesting means: non-trivial, hidden, previously unknown and potentially useful
„ “…finding semantic and abstract information from the surface form of textual data…”

(3)

Which areas are active in Text Processing?

„ Data Analysis
„ Computational Linguistics
„ Search & DB
„ Knowledge Representation & Reasoning

(4)

Tutorial Contents

„ Why Text is Easy and Why Tough?

„ Levels of Text Processing

„ Word Level

„ Sentence Level

„ Document Level

„ Document-Collection Level

„ Linked-Document-Collection Level

„ Application Level

„ References to Conferences, Workshops, Books, Products

„ Final Remarks

(5)

Why Text is Tough?

(M.Hearst 97)

„ Abstract concepts are difficult to represent
„ “Countless” combinations of subtle, abstract relationships among concepts
„ Many ways to represent similar concepts
„ E.g. space ship, flying saucer, UFO
„ Concepts are difficult to visualize
„ High dimensionality: tens or hundreds of thousands of features

(6)

Why Text is Easy?

(M.Hearst 97)

„ Highly redundant data
„ …most of the methods count on this property
„ Just about any simple algorithm can get “good” results for simple tasks:

„ Pull out “important” phrases

„ Find “meaningfully” related words

„ Create some sort of summary from documents

(7)

Levels of Text Processing 1/6

„ Word Level

„ Word Properties

„ Stop-Words

„ Stemming

„ Frequent N-Grams

„ Thesaurus (WordNet)

„ Sentence Level

„ Document Level

„ Document-Collection Level

„ Linked-Document-Collection Level

„ Application Level

(8)

Word Properties

„ Relations among word surface forms and their senses:

„ Homonymy: same form, but different meaning (e.g. bank: river bank, financial institution)
„ Polysemy: same form, related meaning (e.g. bank: blood bank, financial institution)
„ Synonymy: different form, same meaning (e.g. singer, vocalist)
„ Hyponymy: one word denotes a subclass of another (e.g. breakfast, meal)

„ Word frequencies in texts follow a power-law distribution:
„ …a small number of very frequent words
„ …a large number of low-frequency words

(9)

Stop-words

„ Stop-words are words that, from a non-linguistic point of view, do not carry information
„ …they have a mainly functional role
„ …usually we remove them to help the methods perform better
„ Stop-word lists are natural-language dependent – examples:
„ English: A, ABOUT, ABOVE, ACROSS, AFTER, AGAIN, AGAINST, ALL, ALMOST, ALONE, ALONG, ALREADY, ALSO, ...

„ Slovenian: A, AH, AHA, ALI, AMPAK, BAJE, BODISI, BOJDA, BRŽKONE, BRŽČAS, BREZ, CELO, DA, DO, ...

„ Croatian: A, AH, AHA, ALI, AKO, BEZ, DA, IPAK, NE, NEGO, ...

(10)

After the stop-words removal

Original text:
Information Systems Asia Web - provides research, IS-related commercial materials, interaction, and even research sponsorship by interested corporations with a focus on Asia Pacific region.
Survey of Information Retrieval - guide to IR, with an emphasis on web-based projects. Includes a glossary, and pointers to interesting papers.

After stop-word removal:
Information Systems Asia Web provides research IS-related commercial materials interaction research sponsorship interested corporations focus Asia Pacific region
Survey Information Retrieval guide IR emphasis web-based projects Includes glossary pointers interesting papers
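A minimal sketch of stop-word removal on the example above; the stop-word list here is a small illustrative subset, not a full English list.

import re

# Illustrative stop-word list (a real list, as on the previous slide, is much longer).
STOP_WORDS = {"a", "about", "and", "by", "even", "for", "of", "on", "the", "to", "with"}

def remove_stop_words(text):
    # Keep alphabetic tokens (hyphenated ones like "IS-related" stay intact),
    # drop the ones on the stop-word list.
    tokens = re.findall(r"[A-Za-z][A-Za-z-]*", text)
    return " ".join(t for t in tokens if t.lower() not in STOP_WORDS)

original = ("Information Systems Asia Web - provides research, IS-related commercial "
            "materials, interaction, and even research sponsorship by interested "
            "corporations with a focus on Asia Pacific region.")
print(remove_stop_words(original))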

(11)

Stemming (I)

„ Different forms of the same word are usually problematic for text data analysis, because they have different spellings and similar meanings (e.g. learns, learned, learning, …)
„ Stemming is the process of transforming a word into its stem (normalized form)

(12)

Stemming (II)

„ For English it is not a big problem – publicly available algorithms give good results
„ The most widely used is the Porter stemmer, available at http://www.tartarus.org/~martin/PorterStemmer/
„ In the Slovenian language, by contrast, 10-20 different forms can correspond to the same word:
„ E.g. (“to laugh” in Slovenian): smej, smejal, smejala, smejale, smejali, smejalo, smejati, smejejo, smejeta, smejete, smejeva, smeješ, smejemo, smejiš, smeje, smejoč, smejta, smejte, smejva
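A small sketch of stemming in practice, assuming the NLTK package (which ships a Porter stemmer implementation) is installed; any other Porter implementation would behave similarly.

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["learns", "learned", "learning", "relational", "conditional"]:
    print(word, "->", stemmer.stem(word))
# The three "learn*" forms all map to the same stem "learn", so they are
# treated as one feature in later processing.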

(13)

Example cascade rules used in English Porter stemmer

„ ATIONAL -> ATE relational -> relate

„ TIONAL -> TION conditional -> condition

„ ENCI -> ENCE valenci -> valence

„ ANCI -> ANCE hesitanci -> hesitance

„ IZER -> IZE digitizer -> digitize

„ ABLI -> ABLE conformabli -> conformable

„ ALLI -> AL radicalli -> radical

„ ENTLI -> ENT differentli -> different

„ ELI -> E vileli -> vile

„ OUSLI -> OUS analogousli -> analogous

(14)

Rules automatically obtained for Slovenian language

„ Machine Learning applied to the Multext-East dictionary (http://nl.ijs.si/ME/)
„ Two example rules:
„ Remove the ending “OM” if the word’s last 3 characters are any of HOM, NOM, DOM, SOM, POM, BOM, FOM. For instance, ALAHOM, AMERICANOM, BENJAMINOM, BERLINOM, ALFREDOM, BEOGRADOM, DICKENSOM, JEZUSOM, JOSIPOM, OLIMPOM, ... but not ALEKSANDROM (ROM -> ER)
„ Replace CEM by EC. For instance, ARABCEM, BAVARCEM, BOVCEM, EVROPEJCEM, GORENJCEM, ... but not FRANCEM (remove EM)

(15)

Phrases in the form of frequent N-Grams

„ A simple way to generate phrases is frequent n-grams:
„ An n-gram is a sequence of n consecutive words (e.g. “machine learning” is a 2-gram)
„ “Frequent n-grams” are the ones which appear in the observed documents MinFreq or more times
„ N-grams are interesting because of a simple and efficient dynamic-programming algorithm (see the sketch below):
„ Given:
„ Set of documents (each document is a sequence of words),
„ MinFreq (minimal n-gram frequency),
„ MaxNGramSize (maximal n-gram length)
„ for Len = 1 to MaxNGramSize do
„ Generate candidate n-grams as sequences of words of size Len using frequent n-grams of length Len-1
„ Delete candidate n-grams with frequency less than MinFreq
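A minimal Python sketch of the level-wise procedure above. The pruning rule used here (a candidate of length Len is counted only if both of its (Len-1)-gram sub-sequences are frequent) is one reasonable reading of “using frequent n-grams of length Len-1”.

from collections import Counter

def frequent_ngrams(documents, min_freq, max_len):
    """documents: list of token lists. Returns {ngram tuple: frequency}."""
    frequent = {}
    prev_level = None                      # frequent (n-1)-grams
    for n in range(1, max_len + 1):
        counts = Counter()
        for doc in documents:
            for i in range(len(doc) - n + 1):
                cand = tuple(doc[i:i + n])
                # keep a candidate only if its two (n-1)-gram sub-sequences are frequent
                if n == 1 or (cand[:-1] in prev_level and cand[1:] in prev_level):
                    counts[cand] += 1
        prev_level = {g: c for g, c in counts.items() if c >= min_freq}
        frequent.update(prev_level)
    return frequent

docs = [["machine", "learning", "for", "text"],
        ["machine", "learning", "methods"],
        ["text", "mining", "and", "machine", "learning"]]
print(frequent_ngrams(docs, min_freq=2, max_len=3))
# ('machine', 'learning') appears 3 times -> a frequent 2-gram (phrase)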

(16)

Generation of frequent n-grams for 50,000 documents from Yahoo

[Bar chart: number of features by n-gram length (y-axis 0 to 1.6M): 1-grams 318K -> 70K, 2-grams 1.4M -> 207K, 3-grams 742K -> 243K, 4-grams 309K -> 252K, 5-grams 262K -> 256K]

(17)

Document represented by n-grams:

1."REFERENCE LIBRARIES LIBRARY

INFORMATION SCIENCE (\#3 LIBRARY INFORMATION SCIENCE) INFORMATION RETRIEVAL (\#2 INFORMATION

RETRIEVAL)"

2."UK"

3."IR PAGES IR RELATED RESOURCES COLLECTIONS LISTS LINKS IR SITES"

4."UNIVERSITY GLASGOW INFORMATION

RETRIEVAL (\#2 INFORMATION RETRIEVAL) GROUP INFORMATION RESOURCES (\#2 INFORMATION RESOURCES) PEOPLE GLASGOW IR GROUP"

5."CENTRE INFORMATION RETRIEVAL (\#2 INFORMATION RETRIEVAL)"

6."INFORMATION SYSTEMS ASIA WEB RESEARCH COMMERCIAL MATERIALS RESEARCH ASIA PACIFIC REGION"

7."CATALOGING DIGITAL DOCUMENTS"

8."INFORMATION RETRIEVAL (\#2

INFORMATION RETRIEVAL) GUIDE IR EMPHASIS INCLUDES GLOSSARY INTERESTING"

9."UNIVERSITY INFORMATION RETRIEVAL (\#2 INFORMATION RETRIEVAL) GROUP"

Original text on the Yahoo Web page:

1. Top:Reference:Libraries:Library and Information Science:Information Retrieval
2. UK Only
3. Idomeneus - IR & DB repository - These pages mostly contain IR related resources such as test collections, stop lists, stemming algorithms, and links to other IR sites.
4. University of Glasgow - Information Retrieval Group - information on the resources and people in the Glasgow IR group.
5. Centre for Intelligent Information Retrieval (CIIR).
6. Information Systems Asia Web - provides research, IS-related commercial materials, interaction, and even research sponsorship by interested corporations with a focus on Asia Pacific region.
7. Seminar on Cataloging Digital Documents
8. Survey of Information Retrieval - guide to IR, with an emphasis on web-based projects. Includes a glossary, and pointers to interesting papers.
9. University of Dortmund - Information Retrieval Group

(18)

WordNet – a database of lexical relations

„ WordNet is the most well developed and widely used lexical database for English
„ …it consists of 4 databases (nouns, verbs, adjectives, and adverbs)
„ Each database consists of sense entries, each consisting of a set of synonyms, e.g.:
„ musician, instrumentalist, player
„ person, individual, someone
„ life form, organism, being

Category   | Unique Forms | Number of Senses
Noun       | 94474        | 116317
Verb       | 10319        | 22066
Adjective  | 20170        | 29881
Adverb     | 4546         | 5677

(19)

WordNet relations

Each WordNet entry is connected with other entries in a graph through relations.

Relations in the database of nouns:

Relation   | Definition                       | Example
Hypernym   | From concepts to superordinates  | breakfast -> meal
Hyponym    | From concepts to subtypes        | meal -> lunch
Has-Member | From groups to their members     | faculty -> professor
Member-Of  | From members to their groups     | copilot -> crew
Has-Part   | From wholes to parts             | table -> leg
Part-Of    | From parts to wholes             | course -> meal
Antonym    | Opposites                        | leader -> follower

(20)

Levels of Text Processing 2/6

„ Word Level
„ Sentence Level
„ Document Level
„ Document-Collection Level
„ Linked-Document-Collection Level
„ Application Level

(21)

Levels of Text Processing 3/6

„ Word Level
„ Sentence Level
„ Document Level
„ Summarization
„ Single Document Visualization
„ Text Segmentation
„ Document-Collection Level
„ Linked-Document-Collection Level
„ Application Level

(22)

Summarization

(23)

Summarization

„ Task: produce a shorter, summary version of an original document
„ Two main approaches to the problem:
„ Knowledge rich – performing semantic analysis, representing the meaning and generating a text satisfying the length restriction
„ Selection based

(24)

Selection based summarization

„ Three main phases:
„ Analyzing the source text
„ Determining its important points
„ Synthesizing an appropriate output
„ Most methods adopt a linear weighting model – each text unit (sentence) U is assessed by:
„ Weight(U) = LocationInText(U) + CuePhrase(U) + Statistics(U) + AdditionalPresence(U)
„ …a lot of heuristics and tuning of parameters (also with ML)
„ …the output consists of the topmost text units (sentences)
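A minimal sketch of the linear weighting model above; the individual scoring functions (location, cue phrases, a crude statistics term) are illustrative stand-ins for the heuristics a real system would tune, and the AdditionalPresence term is omitted.

def summarize(sentences, cue_phrases=("in conclusion", "important"), top_k=2):
    def location(i):                   # earlier sentences score higher
        return 1.0 / (1 + i)
    def cue(s):                        # reward cue phrases
        return sum(1.0 for p in cue_phrases if p in s.lower())
    def statistics(s):                 # crude proxy: longer sentences carry more content
        return len(s.split()) / 20.0
    scored = [(location(i) + cue(s) + statistics(s), i, s)
              for i, s in enumerate(sentences)]
    top = sorted(scored, reverse=True)[:top_k]
    return [s for _, i, s in sorted(top, key=lambda t: t[1])]   # keep original order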

(25)

[Screenshot: example of the selection-based approach in MS Word, showing the selected units and the selection threshold]

(26)

Visualization of a single document

(27)

Why is visualization of a single document hard?

„ Visualizing big text corpora is an easier task because of the large amount of information:
„ ...statistics already starts working
„ ...most known approaches are statistics based
„ Visualization of a single (possibly short) document is a much harder task because:
„ ...we cannot count on statistical properties of the text (lack of data)
„ ...we must rely on the syntactical and logical structure of the document

(28)

Simple approach

1. The text is split into sentences.
2. Each sentence is deep-parsed into its logical form
„ we are using Microsoft’s NLPWin parser
3. Anaphora resolution is performed on all sentences
„ ...all ‘he’, ‘she’, ‘they’, ‘him’, ‘his’, ‘her’, etc. references are replaced by the proper name of the object they refer to
4. From all the sentences we extract [Subject-Predicate-Object] triples (SPO)
5. SPOs form links in the graph
6. ...finally, we draw a graph (see the sketch below).
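A rough sketch of step 4 using spaCy's dependency parser as a publicly available stand-in for the NLPWin parser mentioned above (assumes the spacy package and its en_core_web_sm model are installed); anaphora resolution (step 3) is not handled here.

import spacy

nlp = spacy.load("en_core_web_sm")

def spo_triples(text):
    triples = []
    for sent in nlp(text).sents:
        for tok in sent:
            if tok.pos_ == "VERB":
                subjects = [c for c in tok.children if c.dep_ in ("nsubj", "nsubjpass")]
                objects = [c for c in tok.children if c.dep_ in ("dobj", "attr", "pobj")]
                for s in subjects:
                    for o in objects:
                        triples.append((s.text, tok.lemma_, o.text))
    return triples

# Each triple becomes a labeled edge (subject -> object) in the document graph.
print(spo_triples("Greenspan chairs the Federal Reserve. The Fed sets interest rates."))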

(29)

Clarence Thomas article

(30)

Alan Greenspan article

(31)

Text Segmentation

(32)

Text Segmentation

„ Problem: divide text that has no given structure into segments with similar content
„ Example applications:
„ topic tracking in news (spoken news)
„ identification of topics in large, unstructured text databases

(33)

Algorithm for text segmentation

„ Algorithm:
„ Divide the text into sentences
„ Represent each sentence by the words and phrases it contains
„ Calculate the similarity between pairs of sentences
„ Find a segmentation (sequence of delimiters) so that the similarity between sentences inside the same segment is maximized, and the similarity between segments is minimized
„ …the approach can be defined either as an optimization problem or as a sliding window (see the sketch below)
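A sliding-window sketch of the idea above: a segment boundary is proposed wherever the similarity between the blocks of sentences to its left and right drops below a threshold. The window size and threshold values are illustrative.

from collections import Counter
import math

def cosine(a, b):
    num = sum(a[w] * b[w] for w in a if w in b)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def segment(sentences, window=2, threshold=0.1):
    vecs = [Counter(s.lower().split()) for s in sentences]      # bag-of-words per sentence
    boundaries = []
    for i in range(window, len(vecs) - window + 1):
        left = sum(vecs[i - window:i], Counter())
        right = sum(vecs[i:i + window], Counter())
        if cosine(left, right) < threshold:
            boundaries.append(i)          # propose a segment boundary before sentence i
    return boundaries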

(34)

Levels of Text Processing 4/6

„ Word Level

„ Sentence Level

„ Document Level

„ Document-Collection Level

„ Representation

„ Feature Selection

„ Document Similarity

„ Representation Change (LSI)

„ Categorization (flat, hierarchical)

„ Clustering (flat, hierarchical)

„ Visualization

„ Information Extraction

„ Linked-Document-Collection Level

„ Application Level

(35)

Representation

(36)

Bag-of-words document

representation

(37)

Word weighting

„ In the bag-of-words representation each word is represented as a separate variable having a numeric weight.
„ The most popular weighting schema is normalized word frequency, TFIDF:
„ tf(w) – term frequency (number of occurrences of the word in a document)
„ df(w) – document frequency (number of documents containing the word)
„ N – number of all documents
„ tfidf(w) – relative importance of the word in the document

tfidf(w) = tf(w) · log(N / df(w))

„ The word is more important if it appears several times in a target document
„ The word is more important if it appears in fewer documents
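A direct transcription of the TFIDF formula above into Python (a minimal sketch, with no smoothing or normalization of the resulting vectors).

import math
from collections import Counter

def tfidf_vectors(documents):
    """documents: list of token lists -> list of {word: tfidf weight} dicts."""
    N = len(documents)
    df = Counter(w for doc in documents for w in set(doc))       # document frequency
    vectors = []
    for doc in documents:
        tf = Counter(doc)                                        # term frequency
        vectors.append({w: tf[w] * math.log(N / df[w]) for w in tf})
    return vectors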

(38)

Example document and its vector representation

„ TRUMP MAKES BID FOR CONTROL OF RESORTS Casino owner and real estate developer Donald Trump has offered to acquire all Class B common shares of Resorts International Inc, a spokesman for Trump said. The estate of late Resorts chairman James M. Crosby owns 340,783 of the 752,297 Class B shares.

Resorts also has about 6,432,000 Class A common shares outstanding. Each Class B share has 100 times the voting power of a Class A share, giving the Class B stock about 93 pct of Resorts' voting power.

„ [RESORTS:0.624] [CLASS:0.487] [TRUMP:0.367] [VOTING:0.171]

[ESTATE:0.166] [POWER:0.134] [CROSBY:0.134] [CASINO:0.119]

[DEVELOPER:0.118] [SHARES:0.117] [OWNER:0.102] [DONALD:0.097]

[COMMON:0.093] [GIVING:0.081] [OWNS:0.080] [MAKES:0.078] [TIMES:0.075]

[SHARE:0.072] [JAMES:0.070] [REAL:0.068] [CONTROL:0.065]

[ACQUIRE:0.064] [OFFERED:0.063] [BID:0.063] [LATE:0.062]

[OUTSTANDING:0.056] [SPOKESMAN:0.049] [CHAIRMAN:0.049]

[INTERNATIONAL:0.041] [STOCK:0.035] [YORK:0.035] [PCT:0.022]

[MARCH:0.011]

(39)

Feature Selection

(40)

Feature subset selection

(41)

Feature subset selection

„ Select only the best features (there are different ways to define “the best” – different feature scoring measures):
„ the most frequent
„ the most informative relative to all class values
„ the most informative relative to the positive class value, …

(42)

Scoring individual features

„ InformationGain: IG(W) = Σ_{F ∈ {W, ¬W}} P(F) · Σ_{C ∈ {pos, neg}} P(C|F) · log( P(C|F) / P(C) )
„ CrossEntropyTxt: CE(W) = P(W) · Σ_{C ∈ {pos, neg}} P(C|W) · log( P(C|W) / P(C) )
„ MutualInfoTxt: MI(W) = Σ_{C ∈ {pos, neg}} P(C) · log( P(W|C) / P(W) )
„ WeightOfEvidTxt: WE(W) = Σ_{C ∈ {pos, neg}} P(C) · P(W) · | log( P(C|W) · (1 − P(C)) / ( P(C) · (1 − P(C|W)) ) ) |
„ OddsRatio: OR(W) = log( P(W|pos) · (1 − P(W|neg)) / ( (1 − P(W|pos)) · P(W|neg) ) )
„ Frequency: Freq(W)
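As an illustration, a small sketch of the OddsRatio score above; P(W|pos) and P(W|neg) are estimated as fractions of documents containing the word, clamped away from 0 and 1 to keep the logarithm finite (the clamping constant is an assumption, not part of the slide).

import math

def odds_ratio(word, pos_docs, neg_docs, eps=1e-4):
    """pos_docs / neg_docs: lists of token collections for the two classes."""
    def p(w, docs):
        frac = sum(1 for d in docs if w in d) / len(docs)
        return min(max(frac, eps), 1 - eps)        # clamp away from 0 and 1
    p_pos, p_neg = p(word, pos_docs), p(word, neg_docs)
    return math.log(p_pos * (1 - p_neg) / ((1 - p_pos) * p_neg))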

(43)

Example of the best features

Information Gain

feature score [P(F|pos), P(F|neg)]

LIBRARY 0.46 [0.015, 0.091]

PUBLIC 0.23 [0, 0.034]

PUBLIC LIBRARY 0.21 [0, 0.029]

UNIVERSITY 0.21 [0.045, 0.028]

LIBRARIES 0.197 [0.015, 0.026]

INFORMATION 0.17 [0.119, 0.021]

REFERENCES 0.117 [0.015, 0.012]

RESOURCES 0.11 [0.029, 0.0102]

COUNTY 0.096 [0, 0.0089]

INTERNET 0.091 [0, 0.00826]

LINKS 0.091 [0.015, 0.00819]

SERVICES 0.089 [0, 0.0079]

Odds Ratio

feature score [P(F|pos), P(F|neg)]

IR 5.28 [0.075, 0.0004]

INFORMATION RETRIEVAL 5.13...

RETRIEVAL 4.77 [0.075, 0.0007]

GLASGOW 4.72 [0.03, 0.0003]

ASIA 4.32 [0.03, 0.0004]

PACIFIC 4.02 [0.015, 0.0003]

INTERESTING 4.02 [0.015, 0.0003]

EMPHASIS 4.02 [0.015, 0.0003]

GROUP 3.64 [0.045, 0.0012]

MASSACHUSETTS 3.46 [0.015, ...]

COMMERCIAL 3.46 [0.015,0.0005]

REGION 3.1 [0.015, 0.0007]

(44)

Document Similarity

(45)

Cosine similarity

between document vectors

„ Each document is represented as a vector of weights D = <x>
„ Similarity between two documents is estimated by the similarity between their vector representations (the cosine of the angle between the vectors):

Sim(D1, D2) = Σ_i x_1i · x_2i / ( sqrt(Σ_j x_1j²) · sqrt(Σ_k x_2k²) )
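The same cosine similarity written out over sparse TFIDF dictionaries (a minimal sketch):

import math

def cosine_similarity(d1, d2):
    dot = sum(w * d2[t] for t, w in d1.items() if t in d2)
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0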

(46)

Representation Change:

Latent Semantic Indexing

(47)

Latent Semantic Indexing

„ LSI is a statistical technique that attempts to estimate the hidden content structure within documents:
„ …it uses the linear-algebra technique Singular Value Decomposition (SVD)
„ …it discovers the statistically most significant co-occurrences of terms
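A minimal numpy sketch of LSI via SVD, using the toy term-document matrix from the next slide; keeping only the top two singular values reproduces the two-dimensional document coordinates shown there (up to sign and rounding).

import numpy as np

# Rows: cosmonaut, astronaut, moon, car, truck; columns: d1..d6
A = np.array([[1, 0, 1, 0, 0, 0],
              [0, 1, 0, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [1, 0, 0, 1, 1, 0],
              [0, 0, 0, 1, 0, 1]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
docs_2d = np.diag(s[:k]) @ Vt[:k, :]     # documents projected into 2 latent dimensions
print(docs_2d.round(2))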

(48)

LSI Example

Original document-term matrix:

            d1  d2  d3  d4  d5  d6
cosmonaut    1   0   1   0   0   0
astronaut    0   1   0   0   0   0
moon         1   1   0   0   0   0
car          1   0   0   1   1   0
truck        0   0   0   1   0   1

Rescaled document matrix, reduced into two dimensions:

        d1     d2     d3     d4     d5     d6
Dim1  -1.62  -0.60  -0.04  -0.97  -0.71  -0.26
Dim2  -0.46  -0.84  -0.30   1.00   0.35   0.65

Correlation matrix between the documents in the reduced space:

      d1    d2    d3    d4    d5    d6
d1   1.00   0.8   0.4   0.5   0.7   0.1
d2         1.00   0.9  -0.2   0.2  -0.5
d3               1.00  -0.6  -0.3  -0.9
d4                     1.00   0.9   0.9
d5                           1.00   0.7
d6                                 1.00

High correlation between d2 and d3 although they don’t share any word

(49)

Text Categorization

(50)

Document categorization

[Diagram: labeled documents are fed to machine learning to build a model, which then assigns a category (label) to a new, unlabeled document]

(51)

Automatic Document Categorization Task

„ Given is a set of documents labeled with content categories.

„ The goal is to build a model which automatically assigns the right content categories to new unlabeled documents.

„ Content categories can be:

„ unstructured (e.g., Reuters) or

„ structured (e.g., Yahoo, DMoz, Medline)

(52)

Algorithms for learning document classifiers

„ Popular algorithms for text categorization:
„ Support Vector Machines
„ Logistic Regression
„ Perceptron algorithm
„ Naive Bayesian classifier
„ Winnow algorithm
„ Nearest Neighbour
„ ....

(53)

Perceptron algorithm

Input: set of pre-classified documents

Output: model, one weight for each word from the vocabulary

Algorithm:

„ initialize the model by setting all word weights to 0
„ iterate through the documents N times
„ classify the document X represented as a bag-of-words: if Σ_{i=1..|V|} x_i · w_i > 0 predict the positive class, else predict the negative class
„ if the document classification is wrong, then adjust the weights of all words occurring in the document: w_{t+1} = w_t + sign(trueClass) · β, where β > 0 is the learning rate, sign(positive) = 1 and sign(negative) = -1
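A sketch of the perceptron learner described above, with documents as sparse bag-of-words dictionaries and labels +1/-1; multiplying the update by the feature value x is equivalent to the slide's rule when features are binary.

def train_perceptron(documents, labels, n_iterations=10, beta=1.0):
    weights = {}                                       # one weight per word, initially 0
    for _ in range(n_iterations):
        for doc, label in zip(documents, labels):
            score = sum(weights.get(w, 0.0) * x for w, x in doc.items())
            predicted = 1 if score > 0 else -1
            if predicted != label:                     # wrong -> adjust weights of words in the doc
                for w, x in doc.items():
                    weights[w] = weights.get(w, 0.0) + label * beta * x
    return weights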

(54)

Measuring success -

Model quality estimation

The truth, and ..the whole truth:

„ Classification accuracy: Accuracy(M) = Σ_i P(C_i) · Precision(M, C_i)
„ Precision(M, targetC) = P(true class is targetC | model M predicts targetC)
„ Recall(M, targetC) = P(model M predicts targetC | true class is targetC)
„ Break-even point (the threshold where precision = recall)
„ F-measure (combines precision and recall, where recall = sensitivity):
F_β(M, targetC) = (1 + β²) · Precision(M, targetC) · Recall(M, targetC) / ( β² · Precision(M, targetC) + Recall(M, targetC) )
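A small sketch computing precision, recall and F_β for one target category from predicted and true binary labels (True = the document belongs to targetC):

def precision_recall_f(predicted, true, beta=1.0):
    tp = sum(1 for p, t in zip(predicted, true) if p and t)
    fp = sum(1 for p, t in zip(predicted, true) if p and not t)
    fn = sum(1 for p, t in zip(predicted, true) if not p and t)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f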

(55)

Reuters dataset –

Categorization to flat categories

„ Documents classified by editors into one or more categories
„ Publicly available set of Reuters news stories, mainly from 1987:
„ 120 categories describing the document content, such as: earn, acquire, corn, rice, jobs, oilseeds, gold, coffee, housing, income, ...
„ …since 2000 a new dataset of 830,000 Reuters documents is available for research

(56)

Distribution of documents

(Reuters-21578)

[Bar chart: top 20 categories of Reuters news in 1987-91 and the number of documents in each (0 to 4000): earn, acq, money-fx, crude, grain, trade, interest, wheat, ship, corn, dlr, oilseed, money-supply, sugar, gnp, coffee, veg-oil, gold, nat-gas, soybean, bop]

(57)

Example of Perceptron model for Reuters category “Acquisition”

Feature | Positive Class Weight

STAKE 11.5

MERGER 9.5

TAKEOVER 9

ACQUIRE 9

ACQUIRED 8

COMPLETES 7.5

OWNERSHIP 7.5

SALE 7.5

OWNERSHIP 7.5

BUYOUT 7

ACQUISITION 6.5

UNDISCLOSED 6.5

BUYS 6.5

ASSETS 6

BID 6

BP 6

DIVISION 5.5

(58)

SVM, Perceptron & Winnow

text categorization performance on

Reuters-21578 with different representations

Comparison of algorithms

[Bar chart: break-even point (0 to 1) achieved by SVM, Perceptron and Winnow for different document representations: 1grams-nostem, 2-5grams-nostem, 5grams-nostem, prox-3gr-w10, subobjpred-strings]

(59)

Comparison on using

SVM on stemmed 1-grams

with related results

Comparison on Lewis-split

[Bar chart: break-even point (0.50 to 1.00) per category (acq, corn, crude, earn, grain, interest, money-fx, ship, trade, wheat, MicroAvg.), comparing our SVM on stemmed 1-grams ("Us") with the related results of Thorsten and Susan on the Lewis split]

(60)

Text Categorization into hierarchy of categories

„ There are several hierarchies (taxonomies) of textual documents:
„ Yahoo, DMoz, Medline, …
„ Different people use different approaches:
„ …series of hierarchically organized classifiers
„ …set of independent classifiers just for leaves
„ …set of independent classifiers for all nodes

(61)

Yahoo! hierarchy (taxonomy)

„ human constructed hierarchy of Web-documents

„ exists in several languages

(we use English)

„ easy to access and regularly updated

„ captures most of the Web topics

„ English version includes over 2M pages categorized into 50,000 categories

„ contains about 250Mb of HTML files

(62)

Document to categorize:

CFP for CoNLL-2000

(63)

Some predicted categories

(64)

System architecture

[Diagram: labeled documents from the Yahoo! hierarchy and Web pages go through feature construction into vectors of n-grams; subproblem definition, feature selection and classifier construction yield a Document Classifier, which assigns a category (label) to an unlabeled document]

(65)

Content categories

„ For each content category generate a separate classifier that predicts the probability that a new document belongs to its category

(66)

Considering promising categories only

(classification by Naive Bayes)

P(C_l | Doc) = P(C_l) · Π_{W ∈ Doc} P(W|C_l)^Freq(W,Doc) / Σ_i [ P(C_i) · Π_{W ∈ Doc} P(W|C_i)^Freq(W,Doc) ]

„ A document is represented as a set of word sequences W
„ Each classifier has two distributions: P(W|pos), P(W|neg)
„ Promising category:
„ the calculated P(pos|Doc) is high, meaning that the classifier has P(W|pos) > 0 for at least some W from the document (otherwise the prior probability is returned; P(neg) is about 0.90)

(67)

Summary of experimental results

Domain      | probability | rank | precision | recall
Entertain.  | 0.96        | 16   | 0.44      | 0.80
Arts        | 0.99        | 10   | 0.40      | 0.83
Computers   | 0.98        | 12   | 0.40      | 0.84
Education   | 0.99        |  9   | 0.57      | 0.65
Reference   | 0.99        |  3   | 0.51      | 0.81

(68)

Document Clustering

(69)

Document Clustering

„ Clustering is the process of finding natural groups in data in an unsupervised way (no class labels are preassigned to documents)
„ The most popular clustering methods are:
„ K-Means clustering
„ Agglomerative hierarchical clustering
„ EM (Gaussian Mixture)
„ …

(70)

K-Means clustering

„ Given:
„ a set of documents (e.g. TFIDF vectors),
„ a distance measure (e.g. cosine),
„ K (number of groups)
„ For each of the K groups initialize its centroid with a random document
„ While not converged:
„ each document is assigned to the nearest group (represented by its centroid)
„ for each group calculate a new centroid (group mass point, the average document in the group)
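A minimal K-Means sketch over sparse TFIDF dictionaries with cosine similarity, following the steps above (random initialization, and a fixed number of iterations rather than a convergence test).

import math, random

def cos(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def kmeans(docs, k, iterations=10):
    centroids = [dict(d) for d in random.sample(docs, k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for d in docs:                              # assignment: nearest centroid by cosine
            best = max(range(k), key=lambda j: cos(d, centroids[j]))
            groups[best].append(d)
        for j, group in enumerate(groups):          # update: centroid = average document
            if group:
                acc = {}
                for d in group:
                    for t, w in d.items():
                        acc[t] = acc.get(t, 0.0) + w
                centroids[j] = {t: w / len(group) for t, w in acc.items()}
    return groups, centroids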

(71)

Visualization

(72)

Why text visualization?

„ ...to have a top level view of the topics in the corpora

„ ...to see relationships between the topics in the corpora

„ ...to understand better what’s going on in the corpora

„ ...to show highly structured nature of textual contents in a simplified way

„ ...to show main dimensions of highly dimensional space of textual documents

„ ...because it’s fun!

(73)

Examples of Text Visualization

„ Text visualizations:
„ WebSOM
„ ThemeScape
„ Graph-Based Visualization
„ Tiling-Based Visualization
„ …
„ …a collection of approaches at http://nd.loopback.org/hyperd/zb/

(74)

WebSOM

„ Self-Organizing Maps for Internet Exploration
„ An ordered map of the information space is provided: similar documents lie near each other on the map
„ …an algorithm automatically organizes the documents onto a two-dimensional grid so that related documents appear close to each other
„ …based on Kohonen’s Self-Organizing Maps
„ Demo at http://websom.hut.fi/websom/

(75)

WebSOM visualization

(76)

ThemeScape

„ Graphically displays images based on word similarities and themes in text
„ Themes within the document space appear on the computer screen as a relief map of natural terrain
„ The mountains indicate where themes are dominant – valleys indicate weak themes
„ Themes close in content are close visually, based on the many relationships within the text spaces
„ …similar techniques are used for visualizing stocks (http://www.webmap.com./trademapdemo.html)

(77)

ThemeScape Document visualization

(78)

Graph based visualization

„ The sketch of the algorithm:
1. Documents are transformed into the bag-of-words sparse-vector representation
„ Words in the vectors are weighted using TFIDF
2. The K-Means clustering algorithm splits the documents into K groups
„ Each group consists of similar documents
„ Documents are compared using cosine similarity
3. The K groups form a graph:
„ Groups are nodes in the graph; similar groups are linked
„ Each group is represented by characteristic keywords
4. Using simulated annealing, a graph layout is drawn

(79)

Example of visualizing the EU IST projects corpus

„ Corpus of 1700 EU IST project descriptions
„ Downloaded from the web at http://www.cordis.lu/
„ Each document is a few hundred words long, describing one project financed by the EC
„ ...the idea is to understand the structure of, and relations between, the areas the EC is funding through the projects
„ ...the following slides show different visualizations with the graph-based approach

(80)

Graph based visualization of 1700 IST project descriptions into 2 groups

(81)

Graph based visualization of 1700 IST project descriptions into 3 groups

(82)

Graph based visualization of 1700 IST project descriptions into 10 groups

(83)

Graph based visualization of 1700 IST project descriptions into 20 groups

(84)

How do we extract keywords?

„ Characteristic keywords for a group of documents are the most highly weighted words in the centroid of the cluster
„ ...the centroid of the cluster can be understood as an “average document” for the specific group of documents
„ ...we use the effect provided by the TFIDF weighting schema for weighting the importance of the words
„ ...an efficient solution

(85)

TFIDF words weighting in vector representation

„ In Information Retrieval, the most popular weighting schema is normalized word frequency, TFIDF:
„ tf(w) – term frequency (number of occurrences of the word in a document)
„ df(w) – document frequency (number of documents containing the word)
„ N – number of all documents
„ tfidf(w) – relative importance of the word in the document

tfidf(w) = tf(w) · log(N / df(w))

(86)

Tiling based visualization

„ The sketch of the algorithm:
1. Documents are transformed into the bag-of-words sparse-vector representation
„ Words in the vectors are weighted using TFIDF
2. A hierarchical top-down two-wise K-Means clustering algorithm builds a hierarchy of clusters
„ The hierarchy is an artificial equivalent of a hierarchical subject index (Yahoo-like)
3. The leaf nodes of the hierarchy (bottom level) are used to visualize the documents
„ Each leaf is represented by characteristic keywords
„ Each hierarchical binary split recursively splits the rectangular area into two sub-areas

(87)

Tiling based visualization of 1700 IST project descriptions into 2 groups

(88)

Tiling based visualization of 1700 IST project descriptions into 3 groups

(89)

Tiling based visualization of 1700 IST project descriptions into 4 groups

(90)

Tiling based visualization of 1700 IST project descriptions into 5 groups

(91)

Tiling visualization (up to 50 documents per group) of 1700 IST project descriptions (60 groups)

(92)

ThemeRiver

„ A system that visualizes thematic variations over time across a collection of documents
„ The “river” flows through time, changing width to visualize changes in the thematic strength of temporally collocated documents
„ Themes or topics are represented as colored “currents” flowing within the river that narrow or widen to indicate decreases or increases in the strength of a topic in the associated documents at a specific point in time
„ Described in the paper at http://www.pnl.gov/infoviz/themeriver99.pdf

(93)

ThemeRiver topic stream

(94)

Information Extraction

(slides borrowed from

William Cohen’s Tutorial on IE)

(95)

Extracting Job Openings from the Web

foodscience.com-Job2
  JobTitle: Ice Cream Guru
  Employer: foodscience.com
  JobCategory: Travel/Hospitality
  JobFunction: Food Services
  JobLocation: Upper Midwest
  Contact Phone: 800-488-2611
  DateExtracted: January 8, 2001
  Source: www.foodscience.com/jobs_midwest.htm
  OtherCompanyJobs: foodscience.com-Job1

(96)

IE from Research Papers

(97)

What is “Information Extraction”

As a task: Filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a

"cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a

Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

NAME TITLE ORGANIZATION

(98)

What is “Information Extraction”

As a task: Filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a

"cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a

Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

NAME             | TITLE   | ORGANIZATION
Bill Gates       | CEO     | Microsoft
Bill Veghte      | VP      | Microsoft
Richard Stallman | founder | Free Soft..

IE

(99)

What is “Information Extraction”

As a family

of techniques: Information Extraction =

segmentation + classification + clustering + association

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a

"cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a

Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

Extracted segments (in document order): Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation

aka “named entity extraction”

(100)

What is “Information Extraction”

As a family

of techniques: Information Extraction =

segmentation + classification + association + clustering

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a

"cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a

Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

Extracted segments (in document order): Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation

(101)

What is “Information Extraction”

As a family

of techniques: Information Extraction =

segmentation + classification + association + clustering

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a

"cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a

Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

Extracted segments (in document order): Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation
