Text-Mining Tutorial
Marko Grobelnik, Dunja Mladenic
J. Stefan Institute, Slovenia
What is Text-Mining?
“…finding interesting regularities in
large textual datasets…”
(Usama Fayad, adapted)
…where interesting means: non-trivial,
hidden, previously unknown and potentially useful
“…finding semantic and abstract
information from the surface form of
textual data…”
Which areas are active in Text Processing?
Data Analysis Computational
Linguistics
Search & DB Knowledge Rep. &
Reasoning
Tutorial Contents
Why Text is Easy and Why Tough?
Levels of Text Processing
Word Level
Sentence Level
Document Level
Document-Collection Level
Linked-Document-Collection Level
Application Level
References to Conferences, Workshops, Books, Products
Final Remarks
Why Text is Tough?
(M.Hearst 97)
Abstract concepts are difficult to represent
“Countless” combinations of subtle, abstract relationships among concepts
Many ways to represent similar concepts
E.g. space ship, flying saucer, UFO
Concepts are difficult to visualize
High dimensionality
Tens or hundreds of thousands of
features
Why Text is Easy?
(M.Hearst 97)
Highly redundant data
…most of the methods count on this property
Just about any simple algorithm can get
“good” results for simple tasks:
Pull out “important” phrases
Find “meaningfully” related words
Create some sort of summary from documents
Levels of Text Processing 1/6
Word Level
Words Properties
Stop-Words
Stemming
Frequent N-Grams
Thesaurus (WordNet)
Sentence Level
Document Level
Document-Collection Level
Linked-Document-Collection Level
Application Level
Words Properties
Relations among word surface forms and their senses:
Homonomy: same form, but different meaning (e.g. bank:
river bank, financial institution)
Polysemy: same form, related meaning (e.g. bank: blood bank, financial institution)
Synonymy: different form, same meaning (e.g. singer, vocalist)
Hyponymy: one word denotes a subclass of an another (e.g.
breakfast, meal)
Word frequencies in texts have power distribution:
…small number of very frequent words
…big number of low frequency words
Stop-words
Stop-words are words that from non-linguistic view do not carry information
…they have mainly functional role
…usually we remove them to help the methods to perform better
Natural language dependent – examples:
English: A, ABOUT, ABOVE, ACROSS, AFTER, AGAIN,
AGAINST, ALL, ALMOST, ALONE, ALONG, ALREADY, ALSO, ...
Slovenian: A, AH, AHA, ALI, AMPAK, BAJE, BODISI, BOJDA, BRŽKONE, BRŽČAS, BREZ, CELO, DA, DO, ...
Croatian: A, AH, AHA, ALI, AKO, BEZ, DA, IPAK, NE, NEGO, ...
After the stop-words removal
Information Systems Asia Web provides research IS-related commercial materials
interaction research sponsorship interested
corporations focus Asia Pacific region
Survey Information Retrieval guide IR emphasis web-based
projects Includes glossary pointers interesting papers
Original text
Information Systems Asia Web - provides research, IS-related commercial materials,
interaction, and even research sponsorship by interested
corporations with a focus on Asia Pacific region.
Survey of Information Retrieval -
guide to IR, with an emphasis on web-based projects. Includes a glossary, and pointers to
interesting papers.
Stemming (I)
Different forms of the same word are usually problematic for text data
analysis, because they have different spelling and similar meaning (e.g.
learns, learned, learning,…)
Stemming is a process of transforming
a word into its stem (normalized form)
Stemming (II)
For English it is not a big problem - publicly available algorithms give good results
Most widely used is Porter stemmer at
http://www.tartarus.org/~martin/PorterStemmer/
E.g. in Slovenian language 10-20 different forms correspond to the same word:
E.g. (“to laugh” in Slovenian): smej, smejal, smejala, smejale, smejali, smejalo, smejati, smejejo, smejeta, smejete,
smejeva, smeješ, smejemo, smejiš, smeje, smejoč, smejta, smejte, smejva
Example cascade rules used in English Porter stemmer
ATIONAL -> ATE relational -> relate
TIONAL -> TION conditional -> condition
ENCI -> ENCE valenci -> valence
ANCI -> ANCE hesitanci -> hesitance
IZER -> IZE digitizer -> digitize
ABLI -> ABLE conformabli -> conformable
ALLI -> AL radicalli -> radical
ENTLI -> ENT differentli -> different
ELI -> E vileli - > vile
OUSLI -> OUS analogousli -> analogous
Rules automatically obtained for Slovenian language
Machine Learning applied on Multext-East dictionary (http://nl.ijs.si/ME/)
Two example rules:
Remove the ending “OM” if 3 last char is any of HOM, NOM, DOM, SOM, POM, BOM, FOM. For instance, ALAHOM, AMERICANOM, BENJAMINOM, BERLINOM, ALFREDOM, BEOGRADOM, DICKENSOM,
JEZUSOM, JOSIPOM, OLIMPOM,... but not ALEKSANDROM (ROM -> ER)
Replace CEM by EC. For instance, ARABCEM, BAVARCEM,
BOVCEM, EVROPEJCEM, GORENJCEM, ... but not FRANCEM (remove EM)
Phrases in the form of frequent N-Grams
Simple way for generating phrases are frequent n- grams:
N-Gram is a sequence of n consecutive words (e.g. “machine learning” is 2-gram)
“Frequent n-grams” are the ones which appear in all observed documents MinFreq or more times
N-grams are interesting because of the simple and efficient dynamic programming algorithm:
Given:
Set of documents (each document is a sequence of words),
MinFreq (minimal n-gram frequency),
MaxNGramSize (maximal n-gram length)
for Len = 1 to MaxNGramSize do
Generate candidate n-grams as sequences of words of size Len using frequent n-grams of length Len-1
Delete candidate n-grams with the frequency less then MinFreq
Generation of frequent n-grams for 50,000 documents from Yahoo
# features 1.6M1.4M 1.2M 1M 800 000 600 000 400 000 200 000 0
1-grams 2-grams 3-grams 4-grams 5-grams
318K->70K 1.4M->207K 742K->243K 309K->252K 262K->256K
Document represented by n-grams:
1."REFERENCE LIBRARIES LIBRARY
INFORMATION SCIENCE (\#3 LIBRARY INFORMATION SCIENCE) INFORMATION RETRIEVAL (\#2 INFORMATION
RETRIEVAL)"
2."UK"
3."IR PAGES IR RELATED RESOURCES COLLECTIONS LISTS LINKS IR SITES"
4."UNIVERSITY GLASGOW INFORMATION
RETRIEVAL (\#2 INFORMATION RETRIEVAL) GROUP INFORMATION RESOURCES (\#2 INFORMATION RESOURCES) PEOPLE GLASGOW IR GROUP"
5."CENTRE INFORMATION RETRIEVAL (\#2 INFORMATION RETRIEVAL)"
6."INFORMATION SYSTEMS ASIA WEB RESEARCH COMMERCIAL MATERIALS RESEARCH ASIA PACIFIC REGION"
7."CATALOGING DIGITAL DOCUMENTS"
8."INFORMATION RETRIEVAL (\#2
INFORMATION RETRIEVAL) GUIDE IR EMPHASIS INCLUDES GLOSSARY INTERESTING"
9."UNIVERSITY INFORMATION RETRIEVAL (\#2 INFORMATION RETRIEVAL) GROUP"
Original text on the Yahoo Web page:
1.Top:Reference:Libraries:Library and Information Science:Information Retrieval
2.UK Only
3.Idomeneus - IR \& DB repository - These pages mostly contain IR related resources such as test collections, stop lists, stemming
algorithms, and links to other IR sites.
4.University of Glasgow - Information Retrieval Group - information on the resources and people in the Glasgow IR group.
5.Centre for Intelligent Information Retrieval (CIIR).
6.Information Systems Asia Web - provides research, IS-related commercial materials, interaction, and even research sponsorship by interested corporations with a focus on Asia Pacific region.
7.Seminar on Cataloging Digital Documents 8.Survey of Information Retrieval - guide to IR,
with an emphasis on web-based projects.
Includes a glossary, and pointers to interesting papers.
9.University of Dortmund - Information Retrieval Group
WordNet – a database of lexical relations
WordNet is the most well developed and widely used lexical database for English
…it consist from 4 databases (nouns, verbs, adjectives, and adverbs)
Each database consists from sense entries consisting from a set of synonyms, e.g.:
musician, instrumentalist, player
person, individual, someone
life form, organism, being
5677 4546
Adverb
29881 20170
Adjective
22066 10319
Verb
116317 94474
Noun
Number of Senses Unique
Forms Category
WordNet relations
Each WordNet entry is connected with other entries in a graph through relations.
Relations in the database of nouns:
leader -> follower Opposites
Antonym
course -> meal From parts to wholes
Part-Of
table -> leg From wholes to parts
Has-Part
copilot -> crew From members to their
groups Member-Of
faculty -> professor From groups to their
members Has-Member
meal -> lunch From concepts to
subtypes Hyponym
breakfast -> meal From concepts to
subordinate Hypernym
Example Definition
Relation
Levels of Text Processing 2/6
Word Level
Sentence Level
Document Level
Document-Collection Level
Linked-Document-Collection Level
Application Level
Levels of Text Processing 3/6
Word Level
Sentence Level
Document Level
Summarization
Single Document Visualization
Text Segmentation
Document-Collection Level
Linked-Document-Collection Level
Application Level
Summarization
Summarization
Task: the task is to produce shorter, summary version of an original
document.
Two main approaches to the problem:
Knowledge rich – performing semantic analysis, representing the meaning and generating the text satisfying length
restriction
Selection based
Selection based summarization
Three main phases:
Analyzing the source text
Determining its important points
Synthesizing an appropriate output
Most methods adopt linear weighting model – each text unit (sentence) is assessed by:
Weight(U)=LocationInText(U)+CuePhrase(U)+Statis tics(U)+AdditionalPresence(U)
…a lot of heuristics and tuning of parameters (also with ML)
…output consists from topmost text units
(sentences)
Selected units Selection threshold Example of selection based approach from MS Word
Visualization of a single
document
Why visualization of a single document is hard?
Visualizing of big text corpora is easier task because of the big amount of information:
...statistics already starts working
...most known approaches are statistics based
Visualization of a single (possibly short) document is much harder task because:
...we can not count of statistical properties of the text (lack of data)
...we must rely on syntactical and logical structure of the document
Simple approach
1. The text is split into the sentences.
2. Each sentence is deep-parsed into its logical form
z we are using Microsoft’s NLPWin parser
3. Anaphora resolution is performed on all sentences
z ...all ‘he’, ‘she’, ‘they’, ‘him’, ‘his’, ‘her’, etc. references to the objects are replaced by its proper name
4. From all the sentences we extract [Subject-Predicate- Object triples] (SPO)
5. SPOs form links in the graph
6. ...finally, we draw a graph.
Clarence Thomas article
Alan Greenspan article
Text Segmentation
Text Segmentation
Problem: divide text that has no given structure into segments with similar
content
Example applications:
topic tracking in news (spoken news)
identification of topics in large,
unstructured text databases
Algorithm for text segmentation
Algorithm:
Divide text into sentences
Represent each sentence with words and phrases it contains
Calculate similarity between the pairs of sentences
Find a segmentation (sequence of delimiters), so that the similarity between the sentences inside the same segment is maximized and minimized
between the segments
…the approach can be defined either as
optimization problem or as sliding window
Levels of Text Processing 4/6
Word Level
Sentence Level
Document Level
Document-Collection Level
Representation
Feature Selection
Document Similarity
Representation Change (LSI)
Categorization (flat, hierarchical)
Clustering (flat, hierarchical)
Visualization
Information Extraction
Linked-Document-Collection Level
Application Level
Representation
Bag-of-words document
representation
Word weighting
In bag-of-words representation each word is represented as a separate variable having numeric weight.
The most popular weighting schema is normalized word frequency TFIDF:
Tf(w) – term frequency (number of word occurrences in a document)
Df(w) – document frequency (number of documents containing the word)
N – number of all documents
Tfidf(w) – relative importance of the word in the document
) ) log( (
. )
( df w
tf N w
tfidf =
The word is more important if it appears several times in a target document
The word is more important if it appears in less documents
Example document and its vector representation
TRUMP MAKES BID FOR CONTROL OF RESORTS Casino owner and real estate Donald Trump has offered to acquire all Class B common shares of Resorts International Inc, a spokesman for Trump said. The estate of late Resorts chairman James M. Crosby owns 340,783 of the 752,297 Class B shares.
Resorts also has about 6,432,000 Class A common shares outstanding. Each Class B share has 100 times the voting power of a Class A share, giving the Class B stock about 93 pct of Resorts' voting power.
[RESORTS:0.624] [CLASS:0.487] [TRUMP:0.367] [VOTING:0.171]
[ESTATE:0.166] [POWER:0.134] [CROSBY:0.134] [CASINO:0.119]
[DEVELOPER:0.118] [SHARES:0.117] [OWNER:0.102] [DONALD:0.097]
[COMMON:0.093] [GIVING:0.081] [OWNS:0.080] [MAKES:0.078] [TIMES:0.075]
[SHARE:0.072] [JAMES:0.070] [REAL:0.068] [CONTROL:0.065]
[ACQUIRE:0.064] [OFFERED:0.063] [BID:0.063] [LATE:0.062]
[OUTSTANDING:0.056] [SPOKESMAN:0.049] [CHAIRMAN:0.049]
[INTERNATIONAL:0.041] [STOCK:0.035] [YORK:0.035] [PCT:0.022]
[MARCH:0.011]
Feature Selection
Feature subset selection
Feature subset selection
Select only the best features (different ways to define “the best”-different
feature scoring measures)
the most frequent
the most informative relative to the all class values
the most informative relative to the
positive class value,…
Scoring individual feature
∑ ∑
=WW =
F C posneg P C
F C F P
C P F
P
, , ( )
)
| log (
)
| ( )
(
InformationGain:
CrossEntropyTxt:
MutualInfoTxt:
WeightOfEvidTxt:
OddsRatio:
Frequency:
=
∑
pos negC P C
W C W P
C P W
P
, ( )
)
| log (
)
| ( )
(
=
∑
pos negC P W
C W C P
P
, ( )
)
| log (
) (
)
| ( 1
)(
(
) ( 1
)(
| log (
) ( ) (
, P C P C W
C P W
C W P
P C P
neg pos
C −
∑
−=
)
| ( ))
| ( 1
(
))
| ( 1
( )
| log (
neg W
P pos
W P
neg W
P pos
W P
×
−
−
×
) (W Freq
Example of the best features
Information Gain
feature score [P(F|pos), P(F|neg)]
LIBRARY 0.46 [0.015, 0.091]
PUBLIC 0.23 [0, 0.034]
PUBLIC LIBRARY 0.21 [0, 0.029]
UNIVERSITY 0.21 [0.045, 0.028]
LIBRARIES 0.197 [0.015, 0.026]
INFORMATION 0.17 [0.119, 0.021]
REFERENCES 0.117 [0.015, 0.012]
RESOURCES 0.11 [0.029, 0.0102]
COUNTY 0.096 [0, 0.0089]
INTERNET 0.091 [0, 0.00826]
LINKS 0.091 [0.015, 0.00819]
SERVICES 0.089 [0, 0.0079]
Odds Ratio
feature score [P(F|pos), P(F|neg)]
IR 5.28 [0.075, 0.0004]
INFORMATION RETRIEVAL 5.13...
RETRIEVAL 4.77 [0.075, 0.0007]
GLASGOW 4.72 [0.03, 0.0003]
ASIA 4.32 [0.03, 0.0004]
PACIFIC 4.02 [0.015, 0.0003]
INTERESTING 4.02[0.015, 0.0003]
EMPHASIS 4.02 [0.015, 0.0003]
GROUP 3.64 [0.045, 0.0012]
MASSACHUSETTS 3.46 [0.015, ...]
COMMERCIAL 3.46 [0.015,0.0005]
REGION 3.1 [0.015, 0.0007]
Document Similarity
Cosine similarity
between document vectors
Each document is represented as a vector of weights D = <x>
Similarity between vectors is estimated by the similarity between their vector representations (cosine of the angle between vectors):
∑
∑
= ∑
k k
j j
i
i i
x x
x x D
D
Sim
2 22 1 2
1
, )
(
Representation Change:
Latent Semantic Indexing
Latent Semantic Indexing
LSI is a statistical technique that attempts to estimate the hidden
content structure within documents:
…it uses linear algebra technique Singular- Value-Decomposition (SVD)
…it discovers statistically most significant
co-occurences of terms
LSI Example
Original document-term mantrix1 0
1 0
0 0
truck
0 1
1 0
0 1
car
0 0
0 0
1 1
moon
0 0
0 0
1 0
astronaut
0 0
0 1
0 1
cosmonaut
d6 d5
d4 d3
d2
d1 Rescaled document matrix,
Reduced into two dimensions
0.65 0.35
1.00 -0.30
-0.84 -0.46
Dim2
-0.26 -0.71
-0.97 -0.04
-0.60 -1.62
Dim1
d6 d5
d4 d3
d2 d1
1.00 0.7
0.9 -0.9
-0.5 0.1
d6
1.00 0.9
-0.3 0.2
0.7 d5
1.00 -0.6
-0.2 0.5
d4
1.00 0.9
0.4 d3
1.00 0.8
d2
1.00 d1
d6 d5
d4 d3
d2 d1
High correlation although d2 and d3 don’t share any word
Correlation matrix
Text Categorization
Document categorization
unlabeled document
???
labeled documents
Machine learning
document category (label)
Automatic Document Categorization Task
Given is a set of documents labeled with content categories.
The goal is: to build a model which would automatically assign right content categories to new unlabeled
documents.
Content categories can be:
unstructured (e.g., Reuters) or
structured (e.g., Yahoo, DMoz, Medline)
Algorithms for learning document classifiers
Popular algorithms for text categorization:
Support Vector Machines
Logistic Regression
Perceptron algorithm
Naive Bayesian classifier
Winnow algorithm
Nearest Neighbour
....
Perceptron algorithm
Input: set of pre-classified documents
Output: model, one weight for each word from the vocabulary
Algorithm:
initialize the model by setting word weights to 0
iterate through documents N times
classify the document X represented as bag-of-words predict positive class
else predict negative class
if document classification is wrong then adjust weights of all words occurring in the document
sign(positive) = 1 sign(negative) =-1
0
; )
1 = + ( >
+ w sign trueClass β β wt t
∑
Vi=1 xi w i ≥ 0Measuring success -
Model quality estimation
argetC) Recall(M,t
M,targetC) Precision(
β
argetC) Recall(M,t
) (M,targetC )Precision
β ) (1
(M,targetC F
) M,C Precision(
) C P(
) Accuracy(M
|targetC) targetC
P(
argetC) Recall(M,t
) targetC P(targetC|
M,targetC) Precision(
2 2 β
i i
i
+
×
= +
×
=
=
=
∑
The truth, and ..the whole truth
Classification accuracy
Break-even point (precision=recall)
F-measure (precision, recall
= sensitivity)
Reuters dataset –
Categorization to flat categories
Documents classified by editors into one or more categories
Publicly available set of Reuter news mainly from 1987:
120 categories giving the document content, such as: earn, acquire, corn, rice, jobs,
oilseeds, gold, coffee, housing, income,...
…from 2000 is available new dataset of
830,000 Reuters documents available fo
research
Distribution of documents
(Reuters-21578)
Top 20 categories of Reuter news in 1987-91
0 1000 2000 3000 4000
earnacq money-fx
crudegrain trade
interest
wheatship
corn dlr oilseed mone
y-supp ly
sugar gnp coffee
veg-oil gold
nat-gas
soybean bop Category
Number of Documents
Example of Perceptron model for Reuters category “Acquisition”
Feature Positive
Class Weight ---
STAKE 11.5
MERGER 9.5
TAKEOVER 9
ACQUIRE 9
ACQUIRED 8
COMPLETES 7.5
OWNERSHIP 7.5
SALE 7.5
OWNERSHIP 7.5
BUYOUT 7
ACQUISITION 6.5
UNDISCLOSED 6.5
BUYS 6.5
ASSETS 6
BID 6
BP 6
DIVISION 5.5
…
SVM, Perceptron & Winnow
text categorization performance on
Reuters-21578 with different representations
Comparison of algorithms
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
.\1grams-nostem
.\2-5grams-nostem
.\5grams-nostem
.\prox-3gr-w10
.\su bob
jpred-strings
Representation
Break-even point
SVM Perceptron Winnow
Comparison on using
SVM on stemmed 1-grams
with related results
Comparison on Lewis-split
0.500.55 0.600.65 0.700.75 0.800.85 0.90 0.951.00
acq
corn
crude
earn
grain
interest
money-fx ship
trade
wheat
MicroAvg. category
Break-even point
Us Thorsten Susan
Text Categorization into hierarchy of categories
There are several hierarchies (taxonomies) of textual documents:
Yahoo, DMoz, Medline, …
Different people use different approaches:
…series of hierarchically organized classifiers
…set of independent classifiers just for leaves
…set of independent classifiers for all nodes
Yahoo! hierarchy (taxonomy)
human constructed hierarchy of Web-documents
exists in several languages
(we use English)
easy to access and regularly updated
captures most of the Web topics
English version includes over 2M pages categorized into 50,000 categories
contains about 250Mb of HTML files
Document to categorize:
CFP for CoNLL-2000
Some predicted categories
Feature construction
System architecture
Web
vectors of n-grams
Subproblem definition Feature selection
Classifier construction
labeled documents
(from Yahoo! hierarchy)
unlabeled document document category (label)
??
Document Classifier
Content categories
For each content
category generate a separate classifier that predicts probability for a new document to belong to its category
Considering promising categories only
(classification by Naive Bayes)
∑ ∏
∏
∈
= ∈
i W Doc
Doc W
Freq i
l i
Doc W
Doc W Freq
l
C l
W P C
P
C W P C
P Doc
C
P ( , )
) , (
)
| (
) (
)
| ( )
( )
| (
Document is represented as a set of word sequences W
Each classifier has two distributions: P(W|pos), P(W|neg)
Promising category:
calculated P(pos|Doc) is high meaning that the classifier has
P(W|pos)>0 for at least some W from the document (otherwise, the prior probability is returned, P(neg) is about 0.90)
Summary of experimental results
Domain probability rank precision recall Entertain.
0.96 16 0.44 0.80
Arts
0.99 10 0.40 0.83
Computers
0.98 12 0.40 0.84
Education
0.99 9 0.57 0.65
Reference
0.99 3 0.51 0.81
Document Clustering
Document Clustering
Clustering is a process of finding natural groups in data in a unsupervised way (no class labels preassigned to documents)
Most popular clustering methods are:
K-Means clustering
Agglomerative hierarchical clustering
EM (Gaussian Mixture)
…
K-Means clustering
Given:
set of documents (e.g. TFIDF vectors),
distance measure (e.g. cosine)
K (number of groups)
For each of K groups initialize its centroid with a random document
While not converging
Each document is assigned to the nearest group (represented by its centroid)
For each group calculate new centroid (group mass point, average document in the group)
Visualization
Why text visualization?
...to have a top level view of the topics in the corpora
...to see relationships between the topics in the corpora
...to understand better what’s going on in the corpora
...to show highly structured nature of textual contents in a simplified way
...to show main dimensions of highly dimensional space of textual documents
...because it’s fun!
Examples of Text Visualization
Text visualizations
WebSOM
ThemeScape
Graph-Based Visualization
Tiling-Based Visualization
…
… collection of approaches at
http://nd.loopback.org/hyperd/zb/
WebSOM
Self-Organizing Maps for Internet Exploration
An ordered map of the information space is
provided: similar documents lie near each other on the map
…algorithm that automatically organizes the
documents onto a two-dimensional grid so that related documents appear close to each other
… based on Kohonen’s Self-Organizing Maps
Demo at http://websom.hut.fi/websom/
WebSOM visualization
ThemeScape
Graphically displays images based on word similarities and themes in text
Themes within the document spaces appear on the computer screen as a relief map of natural terrain
The mountains in indicate where themes are dominant - valleys indicate weak themes
Themes close in content will be close visually based on the many relationships within the text spaces.
… similar techniques for visualizing stocks
(
http://www.webmap.com./trademapdemo.html)
ThemeScape Document visualization
Graph based visualization
The sketch of the algorithm:
1.
Documents are transformed into the bag-of- words sparse-vectors representation
– Words in the vectors are weighted using TFIDF
2.
K-Means clustering algorithm splits the documents into K groups
– Each group consists from similar documents
– Documents are compared using cosine similarity
3.
K groups form a graph:
– Groups are nodes in graph; similar groups are linked
– Each group is represented by characteristic keywords
4.
Using simulated annealing draw a graph
Example of visualizing Eu IST projects corpora
Corpus of 1700 Eu IST projects descriptions
Downloaded from the web http://www.cordis.lu/
Each document is few hundred words long describing one project financed by EC
...the idea is to understand the structure and relations between the areas EC is funding
through the projects
...the following slides show different
visualizations with the graph based approach
Graph based visualization of 1700 IST project descriptions into 2 groups
Graph based visualization of 1700 IST project descriptions into 3 groups
Graph based visualization of 1700 IST project descriptions into 10 groups
Graph based visualization of 1700 IST project descriptions into 20 groups
How do we extract keywords?
Characteristic keywords for a group of
documents are the most highly weighted words in the centroid of the cluster
...centroid of the cluster could be understood as an
“average document” for specific group of documents
...we are using the effect provided by the TFIDF
weighting schema for weighting the importance of the words
...efficient solution
TFIDF words weighting in vector representation
In Information Retrieval, the most
popular weighting schema is normalized word frequency TFIDF:
Tf(w) – term frequency (number of word occurrences in a document)
Df(w) – document frequency (number of documents containing the word)
N – number of all documents
Tfidf(w) – relative importance of the word in the document
) ) log( (
. )
( df w
tf N w
tfidf =
Tiling based visualization
The sketch of the algorithm:
1.
Documents are transformed into the bag-of- words sparse-vectors representation
– Words in the vectors are weighted using TFIDF
2.
Hierarchical top-down two-wise K-Means clustering algorithm builds a hierarchy of clusters
– The hierarchy is an artificial equivalent of hierarchical subject index (Yahoo like)
3.
The leaf nodes of the hierarchy (bottom level) are used to visualize the documents
– Each leaf is represented by characteristic keywords
– Each hierarchical binary split splits recursively the rectangular area into two sub-areas
Tiling based visualization of 1700 IST project descriptions into 2 groups
Tiling based visualization of 1700 IST project descriptions into 3 groups
Tiling based visualization of 1700 IST project descriptions into 4 groups
Tiling based visualization of 1700 IST project descriptions into 5 groups
Tiling visualization (up to 50 documents per group) of 1700 IST project descriptions (60 groups)
ThemeRiver
System that visualizes thematic variations over time across a collection of documents
The “river” flows through time, changing width to visualize changes in the thematic strength of
documents temporally collocated
Themes or topics are represented as colored
“currents” flowing within the river that narrow or widen to indicate decreases or increases in the strength of a topic in associated documents at a specific point in time.
Described in paper at
http://www.pnl.gov/infoviz/themeriver99.pdf
ThemeRiver topic stream
Information Extraction
(slides borrowed from
William Cohen’s Tutorial on IE)
Extracting Job Openings from the Web
foodscience.com-Job2 JobTitle: Ice Cream Guru Employer: foodscience.com JobCategory: Travel/Hospitality JobFunction: Food Services JobLocation: Upper Midwest Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: www.foodscience.com/jobs_midwest.htm OtherCompanyJobs: foodscience.com-Job1
IE from Research Papers
What is “Information Extraction”
As a task: Filling slots in a database from sub-segments of text.
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a
"cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open- source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a
Microsoft VP. "That's a super-important shift for us in terms of code access.“
Richard Stallman, founder of the Free Software Foundation, countered saying…
NAME TITLE ORGANIZATION
What is “Information Extraction”
As a task: Filling slots in a database from sub-segments of text.
October 14, 2002, 4:00 a.m. PT
For years, Microsoft CorporationCEOBill Gatesrailed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a
"cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open- source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a
MicrosoftVP. "That's a super-important shift for us in terms of code access.“
Richard Stallman, founderof the Free Software Foundation, countered saying…
NAME TITLE ORGANIZATION Bill Gates CEO Microsoft Bill Veghte VP Microsoft Richard Stallman founder Free Soft..
IE
What is “Information Extraction”
As a family
of techniques: Information Extraction =
segmentation + classification + clustering + association
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a
"cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open- source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a
Microsoft VP. "That's a super-important shift for us in terms of code access.“
Richard Stallman, founder of the Free Software Foundation, countered saying…
Microsoft Corporation CEO
Bill Gates Microsoft Gates Microsoft Bill Veghte Microsoft VP
Richard Stallman founder
Free Software Foundation
aka “named entity extraction”
What is “Information Extraction”
As a family
of techniques: Information Extraction =
segmentation + classification + association + clustering
October 14, 2002, 4:00 a.m. PT
For years, Microsoft CorporationCEOBill Gatesrailed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a
"cancer" that stifled technological innovation.
Today, Microsoftclaims to "love" the open- source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoftwill gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a
MicrosoftVP. "That's a super-important shift for us in terms of code access.“
Richard Stallman, founderof the Free Software Foundation, countered saying…
Microsoft Corporation CEO
Bill Gates Microsoft Gates Microsoft Bill Veghte Microsoft VP
Richard Stallman founder
Free Software Foundation
What is “Information Extraction”
As a family
of techniques: Information Extraction =
segmentation + classification + association + clustering
October 14, 2002, 4:00 a.m. PT
For years, Microsoft CorporationCEOBill Gatesrailed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a
"cancer" that stifled technological innovation.
Today, Microsoftclaims to "love" the open- source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoftwill gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a
MicrosoftVP. "That's a super-important shift for us in terms of code access.“
Richard Stallman, founderof the Free Software Foundation, countered saying…
Microsoft Corporation CEO
Bill Gates Microsoft Gates Microsoft Bill Veghte Microsoft VP
Richard Stallman founder
Free Software Foundation