Towards a semantic peer-to-peer information retrieval system
Mihai Lupu & Cornelius Croitoru
Motivation
Exponential increase of available data
Bad results of term matching techniques (due to synonymy and polysemy)
High computational needs for statistical (“semantical”) approaches.
Already existing experience in p2p file- sharing (napster, gnutella, kazaa,
iMesh…)
Aim
To help the system “understand” the
“meaning” of documents and queries To exploit available files on many
machines
To take advantage of user behavior (i.e.
scientific communities)
Vector Space Model
Legend:
document query
most “relevant”
document because it is the closest one
Latent semantic indexing (LSI)
Generate term-document matrix
Apply Singular Value Decomposition Î 3 new matrices with “special” features
Use the new matrices to compute an
approximation of the original TD matrix
Compute similarity between documents
and query (as in the Vector Space Model)
Sort and return “relevant” documents
Singular value decomposition (SVD)
Original Matrix
U Terms vectors’
matrix
Σ eigen-
values matrix
V
documents’
vectors’
matrix
x x
=
m x n m x r r x r r x n
the folklore view of the eigen-values is that they represent the “features” of the original
Singular value decomposition (SVD)
k
take the first k columns of U, the first k eigen-values and the first k lines of V the k “most important features” of the matrix
Approxi- mated
Matrix
U Terms vectors’
matrix
Σ eigen-
values matrix
V
documents’
vectors’
matrix
m x n m x r r x r r x n
= x x
k k
k
Peer-to-peer
client-server architecture
server
client client
client
client
client
client
client
request reply
Peer-to-peer
Peer-to-peer architecture
server
peer peer peer
peer
peer peer
peer
PeerVOIRE
Objective:
a P2P IR application that overlays a semantic level (LSI) over a p2p infrastructure
Assumption:
the user community behaves according to the small worlds paradigm
PeerVOIRE – Small Worlds
scientific communities, the internet etc.
PeerVOIRE - P2P infrastructure
based on the CAN approach Vector Space Model:
peers, documents or queries are just vectors in some space
documents are lines in the TD matrix
peers are assigned coordinates according to some policy
PeerVOIRE – “semantic” overlay
idea:
instead of using the TD matrix, use its approximation, computed by SVD
problem:
we need information from all the nodes (all terms and all documents in the network!)
solution:
keep an image of the network on each node and update it periodically
small worlds
document1 document2 document3 document4 document5
document1 document3 document6 document2
document6 document7
document4 document5 document9
document10
document12 document10 document9
small worlds
document1 document2 document3 document4 document5
document1 document3 document6 document2
document6 document7
document4 document5 document9
document10 document11 document9
document12 document10 document9
peers
small worlds
document1 document2 document3 document4 document5
document1 document3 document6 document2
document6 document7
document4 document5 document9
document8
document12 document10 document9
documents on a peer may belong to different
areas 1
2 3
4 6
5
peer address
Q: what address (vector) should we assign to these nodes? blue or red?
A: both!
we use multiple realities.
each node may have more than one address.
document1 document2 document3 document4 document5
1
document4 document5 document9
6
adding a new node
adding a new node
adding a new node
compute the distance between the new
adding a new node
if bigger than half the maximum on a certain dimension, slice in half the zone
adding a new node
if bigger than half the maximum on a certain
adding a new node
otherwise, do not slice in half. let both nodes manage the zone
adding a new node
otherwise, do not slice in half; let both
border nodes
all nodes in a zone know everything except the neighboring zones
we need border nodes
routing
complete graph inside the area
reach outside through border nodes
node deletion
voluntary or involuntary
involuntary deletion is treated in a lazy manner
node manages area alone or with others
merging of areas is required in the first case
node is a border node or not
a replacement must be found