
CLUSTERING METHODS IN PROTEIN-PROTEIN INTERACTION NETWORK

Chuan Lin, Young-rae Cho, Woo-chang Hwang, Pengjun Pei, Aidong Zhang

Department of Computer Science and Engineering, State University of New York at Buffalo

Email: {chuanlin, ycho8, whwng2, ppei, azhang}@cse.buffalo.edu

With the completion of a draft sequence of the human genome, the field of genetics stands on the threshold of significant theoretical and practical advances. Crucial to furthering these investigations is a comprehensive understanding of the expression, function, and regulation of the proteins encoded by an organism. It has been observed that proteins seldom act as single isolated species in the performance of their functions; rather, proteins involved in the same cellular processes often interact with each other. Therefore, the functions of uncharacterized proteins can be predicted through comparison with the interactions of similar known proteins. A detailed examination of the protein-protein interaction (PPI) network can thus yield significant new understanding of protein function. Clustering is the process of grouping data objects into sets (clusters) which demonstrate greater similarity among objects in the same cluster than in different clusters. Clustering in the PPI network context groups together proteins which share a larger number of interactions. The results of this process can illuminate the structure of the PPI network and suggest possible functions for members of the cluster which were previously uncharacterized.

This chapter will begin with a brief introduction of the properties of protein-protein interaction networks, including a review of the data which has been generated by both experimental and computational approaches. A variety of methods which have been employed to cluster these networks will then be presented. These approaches are broadly characterized as either distance-based or graph-based clustering methods. Techniques for validating the results of these approaches will also be discussed.

1.1 INTRODUCTION

1.1.1 Protein-Protein Interaction

1.1.1.1 Proteome in Bioinformatics

With the completion of a draft sequence of the human genome, the field of genetics stands on the threshold of significant theoretical and practical advances. Crucial to furthering these investigations is a comprehensive understanding of the expression, function, and regulation of the proteins encoded by an organism [96]. This understanding is the subject of the discipline of proteomics. Proteomics encompasses a wide range of approaches and applications intended to explicate how complex biological processes occur at a molecular level, how they differ in various cell types, and how they are altered in disease states.

Defined succinctly, proteomics is the systematic study of the many and diverse properties of proteins with the aim of providing detailed descriptions of the structure, function, and control of biological systems in health and disease [68]. The field has burst onto the scientific scene with stunning rapidity over the past several years. Figure 1.1 shows the trend in the number of occurrences of the term “proteome” found in PubMed bioinformatics citations over the past decade. This figure strikingly illustrates the rapidly-increasing role played by proteomics in bioinformatics research in recent years.

A particular focus of the field of proteomics is the nature and role of interactions between proteins. Protein-protein interactions play diverse roles in biology and differ based on the composition, affinity, and lifetime of the association. Non-covalent contacts between residue sidechains are the basis for protein folding, protein assembly, and protein-protein interaction [65]. These contacts facilitate a variety of interactions and associations within and between proteins. Based on their diverse structural and functional characteristics, protein-protein interactions can be categorized in several ways [64]. On the basis of their interaction surface, they may be homo-oligomeric or hetero-oligomeric; as judged by their stability, they may be obligate or non-obligate; and as measured by their persistence, they may be transient or permanent. A given protein-protein interaction can fall into any combination of these three categorical pairs. An interaction may also require reclassification under certain conditions; for example, it may be mainly transient in vivo but become permanent under certain cellular conditions.

1.1.1.2 Significance of Protein-protein Interaction

It has been observed that proteins seldom act as single isolated species while performing their functions in vivo [91]. The analysis of annotated proteins reveals that proteins involved in the same cellular processes often interact with each other [86]. The function of an unknown protein may therefore be postulated on the basis of its interactions with proteins of known function. Mapping protein-protein interactions has not only provided insight into protein function but has also facilitated the modeling of functional pathways to elucidate the molecular mechanisms of cellular processes. The study of protein interactions is fundamental to understanding how proteins function within the cell. Characterizing the interactions of proteins in a given cellular proteome will be the next milestone along the road to understanding the biochemistry of the cell.


Figure 1.1. Number of results found in PubMed for proteome.

The result of two or more proteins interacting with a specific functional objective can be demonstrated in several different ways. The measurable effects of protein interactions have been outlined by Phizicky and Fields [74]. Protein interactions can:

• alter the kinetic properties of enzymes; this may be the result of subtle changes at the level of substrate binding or at the level of an allosteric effect;

• act as a common mechanism to allow for substrate channeling;

• create a new binding site, typically for small effector molecules;

• inactivate or destroy a protein; or

• change the specificity of a protein for its substrate through interaction with different binding partners; e.g., demonstrate a new function that neither protein can exhibit alone.

Protein-protein interactions are much more widespread than once suspected, and the degree of regulation that they confer is large. To properly understand their significance in the cell, one needs to identify the different interactions, understand the extent to which they take place in the cell, and determine the consequences of the interaction.

1.1.2 Experimental Approaches for PPI Detection

In early reviews, physicochemical approaches for detecting protein-protein interactions included site-directed mutagenesis or chemical modification of amino acid groups participating in such interactions [52, 66, 79, 84]. The following subsections will discuss these bioinformatic and functional proteomic methods, including detection of protein-protein interactions via the yeast two-hybrid system, mass spectrometry, and protein microarrays.


1.1.2.1 Yeast Two-hybrid System

One of the most common approaches to the detection of pairs of interacting proteins in vivo is the yeast two-hybrid (Y2H) system [7, 36]. The Y2H system, which was developed by Fields and Song [23], is a molecular-genetic tool which facilitates the study of protein-protein interactions [1]. The interaction of two proteins transcriptionally activates a reporter gene, and a color reaction is seen on specific media. This indication can track the interaction between two proteins, revealing “prey” proteins which interact with a known “bait” protein.

The yeast two-hybrid system enables both highly-sensitive detection of protein-protein interactions and screening of genome libraries to ascertain the interaction partners of certain proteins. The system can also be used to pinpoint protein regions mediating the interactions [37]. However, the classic Y2H system has several limitations. First, it cannot, by definition, detect interactions involving three or more proteins and those depending on post-translational modifications except those applied to the budding yeast itself [37]. Second, since some proteins (for example, membrane proteins) cannot be reconstructed in the nucleus, the yeast two-hybrid system is not suitable for the detection of interactions involving these proteins. Finally, the method does not guarantee that an interaction indicated by Y2H actually takes place physiologically.

Recently, numerous modifications of the Y2H approach have been proposed which characterize protein-protein interaction networks by screening each protein expressed in a eukaryotic cell [24]. Drees [19] has proposed a variant which includes the genetic information of a third protein. Zhang et al. [92] have suggested the use of RNA for the investigation of RNA-protein interactions. Vidal et al. [85] used the URA3 gene instead of GAL4 as the reporter gene; this two-hybrid system can be used to screen for ligand inhibition or to dissociate such complexes. Johnson and Varshavsky [43] have proposed a cytoplasmic two-hybrid system which can be used for screening of membrane protein interactions.

Despite the various limitations of the Y2H system, this approach has revealed a wealth of novel interactions and has helped illuminate the magnitude of the protein interactome.

In principle, it can be used in a more comprehensive fashion to examine all possible binary combinations between the proteins encoded by any single genome.

1.1.2.2 Mass Spectrometry Approaches

Another traditional approach to PPI detection is to use quantitative mass spectrometry to analyze the composition of a partially-purified protein complex together with a control purification in which the complex of interest is not enriched.

Mass spectrometry-based protein interaction experiments have three basic components: bait presentation, affinity purification of the complex, and analysis of the bound proteins [2]. Two large-scale studies [25, 35] have been published on the protein-protein interaction network in yeast. Each study attempted to identify all the components that were present in “naturally”-generated protein complexes, which requires essentially pure preparations of each complex [49]. In both approaches, bait proteins were generated that carried a particular affinity tag. In the case studied by Gavin et al. [25], 1,739 TAP-tagged genes were introduced into the yeast genome by homologous recombination. Ho et al. [35] expressed 725 proteins modified to carry the FLAG epitope. In both cases, the proteins were expressed in yeast cells, and complexes were purified using a single immunoaffinity purification step. Both groups resolved the components of each purified complex with a one-dimensional denaturing polyacrylamide gel electrophoresis (PAGE) step. From the 1,167 yeast strains generated by Gavin et al. [25], 589 protein complexes were purified, 232 of which were unique. Ho et al. [35] used 725 protein baits and detected 3,617 interactions that involved 1,578 different proteins.


Mass-spectrometry-based proteomics can be used not only for protein identification and quantification [16, 50, 72, 89], but also for protein analysis, which includes protein profiling [51], post-translational modifications (PTMs) [55, 56] and, in particular, identification of protein-protein interactions.

Compared with two-hybrid approaches, mass-spectrometry-based methods are more effective in characterizing highly abundant, stable complexes. MS-based approaches permit the isolation of large protein complexes and the detection of networks of protein interactions. The two-hybrid system is more suited to the characterization of binary interactions, particularly to the detection of weak or transient interactions.

1.1.2.3 Protein Microarray

Microarray-based analysis is a relatively high-throughput technology which allows the simultaneous analysis of thousands of parameters within a single experiment. The key advantage of the microarray format is the use of a nonporous solid surface, such as glass, which permits precise deposition of capturing molecules (probes) in a highly dense and ordered fashion. The early applications of microarrays and detection technologies were largely centered on DNA-based applications. Today, DNA microarray technology is a robust and reliable method for the analysis of gene function [12]. However, gene expression arrays provide no information on protein post-translational modifications (such as phosphorylation or glycosylation) that affect cell function. To examine expression at the protein level and acquire quantitative and qualitative information about proteins of interest, the protein microarray was developed.

A protein microarray is a piece of glass on which various molecules of protein have been affixed at separate locations in an ordered manner, forming a microscopic array [54]. These arrays are used to identify protein-protein interactions, to identify the substrates of protein kinases, or to identify the targets of biologically-active small molecules. The experimental procedure for protein microarray analysis involves choosing solid supports, arraying proteins on the solid supports, and screening for protein-protein interactions.

Experiments with the yeast proteome microarray have revealed a number of protein-protein interactions which had not previously been identified through Y2H or MS-based approaches. Global protein interaction studies were performed with a yeast proteome chip.

Ge [26] has described a universal protein array which permits quantitative detection of protein interactions with a range of proteins, nucleic acids, and small molecules. Zhu et al. [95] generated a yeast proteome chip from recombinant protein probes of 5,800 open-reading frames.

1.1.3 Computational Methods to Predict Protein-protein Interaction

The yeast two-hybrid system and other experimental approaches provide a useful tool for the detection of protein-protein interactions occurring in many possible combinations between specified proteins. The widespread application of these methods has generated a substantial bank of information about such interactions. However, the data generated can be erroneous, and these approaches are often not completely inclusive of all possible protein-protein interactions. In order to form an understanding of the total universe of potential interactions, including those not detected by these methods, it is useful to develop an approach to predict possible interactions between proteins. The accurate prediction of protein-protein interactions is therefore an important goal in the field of molecular recognition.

A number of approaches to PPI prediction are based on the use of genome data. Pellegrini et al. [71] introduced the first such method, which predicts an interaction between two proteins in a given organism if these two proteins have homologs in another organism. A subsequent extension proposed by Marcotte et al. [57, 58] detects co-localization of two genes in different genomes. Two proteins in different organisms are predicted to interact if they have consecutive homologs in a single organism. Dandekar et al. [17] used the adjacency of genes in various bacterial genomes to predict functional relationships between the corresponding proteins. Proteins whose genes are physically close in the genomes of various organisms are predicted to interact.

Jansen et al. [40] investigated the relationship between protein-protein interaction and mRNA expression levels by analyzing existing yeast data from a variety of sources and identifying general trends. Two different approaches were used to analyze the two types of available expression data; normalized differences were computed for absolute expression levels, while a more standard analysis of profile correlations was applied to relative expression levels. This investigation indicated that a strong relationship exists between expression data and most permanent protein complexes.

Some researchers have used data-mining techniques to extract useful information from large data sources. Oyama et al. [67] used a method termed Association Rules Discovery to identify patterns and other features from accumulated protein-protein interaction data. This research mined data from four different sources. This aggregated data included 4,307 unique protein interaction pairs. General rules were derived from 5,241 features extracted from the functional, primary-structural, and other aspects of proteins. After transforming the traditional protein-based transaction data into interaction-based transaction data, Oyama was able to detect and articulate 6,367 rules. Of these, 5,271 rules had at least one feature pertaining to sequences. As this potential had been suggested by other researchers, these results confirmed the efficacy of this method.

As mentioned above, experimental and computational approaches have generated significant quantities of PPI data, but these data sets are typically incomplete, contradictory, and include many false positives. For improved accuracy, it is therefore necessary to integrate evidence from many different sources when evaluating protein-protein interactions. Jansen et al. [39] proposed a Bayesian approach for integrating interaction information that allows for the probabilistic combination of multiple data sets and demonstrated its application to yeast data. This approach assesses each source of interactions by comparison with samples of known positives and negatives, yielding a statistical measure of reliability. The likelihood of a possible interaction for every protein pair is then predicted by combining each independent data source, weighted according to its reliability. The predictions were validated by TAP (tandem affinity purification) tagging experiments. It was observed that, at given levels of sensitivity, the predictions were more accurate than the existing high-throughput experimental data sets.
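The combination step can be pictured with a small numerical sketch. The snippet below is only an illustration of naive-Bayes-style evidence integration under an independence assumption; the prior odds, source names, and likelihood ratios are invented for the example and are not the values used by Jansen et al. [39].

def posterior_odds(prior_odds, likelihood_ratios):
    """Combine independent evidence sources by multiplying likelihood ratios."""
    odds = prior_odds
    for lr in likelihood_ratios:
        odds *= lr
    return odds

# Illustrative prior odds that a random protein pair interacts.
prior = 1.0 / 600.0

# Illustrative likelihood ratios P(evidence | interacting) / P(evidence | non-interacting),
# one per evidence source, each estimated against gold-standard positives and negatives.
source_likelihood_ratios = {
    "mRNA co-expression": 3.2,
    "co-essentiality": 1.8,
    "high-throughput Y2H hit": 25.0,
}

odds = posterior_odds(prior, source_likelihood_ratios.values())
print("posterior odds %.4f, probability %.4f" % (odds, odds / (1.0 + odds)))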

1.2 PROPERTIES OF PPI NETWORK

Although reductionism has long been the prevailing paradigm guiding the interpretation of experimental results, it has become increasingly evident that a discrete biological function can only rarely be attributed to an individual molecule. Rather, many biological characteristics arise from complex interactions between numerous cellular constituents, such as proteins, DNA, RNA, and small molecules [4, 34, 44, 46]. Therefore, understanding the structure and dynamics of the complex intercellular web of interactions has become a central focus of biological investigation.


Figure 1.2. A graph in which a node has a degree of 5. Figure is adapted from [9]

1.2.1 PPI Network Representation

An investigation of protein-protein interaction mechanisms begins with the representation and characterization of the PPI network structure. The simplest representation takes the form of a mathematical graph consisting of nodes and edges (or links) [88]. Proteins are represented as nodes in such a graph; two proteins which interact physically are represented as adjacent nodes connected by an edge.

Degree

The degree (or connectivity) of a node is the number of other nodes with which it is connected [9]. It is the most elementary characteristic of a node. For example, in the undirected network of Figure 1.2, node A has degree k = 5.

Path, shortest path and mean path

A path between two nodes is a sequence of adjacent nodes. The number of edges in this path is termed the path length, and distances within the network are measured in terms of path length. As there are many alternative paths between two nodes, the shortest path between two specified nodes refers to the path with the smallest number of links. The mean path length of the network represents the average over the shortest paths between all pairs of nodes.

Degree distribution

Graph structures can be described according to numerous characteristics, including the distribution of path lengths, the number of cyclic paths, and various measures to compute clusters of highly-connected nodes [88]. Barabasi and Oltvai [9] introduced the concept of degree distribution, P(k), to quantify the probability that a selected node will have exactly k links. P(k) is obtained by tallying the total number of nodes N(k) with k links and dividing this figure by the total number of nodes N. Different network classes can be distinguished by the degree distribution. For example, a random network follows a Poisson distribution. By contrast, a scale-free network has a power-law degree distribution, indicating that a few hubs bind numerous small nodes. Most biological networks are scale-free, with degree distributions approximating a power law, P(k) ∼ k^{-γ}. When 2 ≤ γ ≤ 3, the hubs play a significant role in the network [9].

Clustering coefficient

In many networks, if node A is connected to B, and B is connected to C, then A has a high probability of direct linkage to C. Watts [90] quantified this phenomenon using the clustering coefficient, C_I = 2n_I / (k_I(k_I − 1)), where n_I is the number of links connecting the k_I neighbors of node I to each other. In this coefficient, n_I indicates the number of triangles that pass through node I, and k_I(k_I − 1)/2 is the total number of triangles that could pass through node I. For example, in Figure 1.2, n_A = 1 and C_A = 1/10, while n_F = 0 and C_F = 0.

The average degree, average path length, and average clustering coefficient depend on the number of nodes and links in the network. However, the degree distribution P(k) and clustering coefficient C(k) functions are independent of the size of the network and represent its generic features. These functions can therefore be used to classify various network types [9].
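These quantities are straightforward to compute directly from an edge list. The short sketch below does so in plain Python for a small example graph of the same shape as the one in Figure 1.2 (the edge list is read from the matrix X given later in Section 1.3.3.1); it is illustrative only.

from itertools import combinations

nodes = list("ABCDEFGH")
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("A", "G"), ("A", "H"),
         ("B", "C"), ("B", "F"), ("E", "F"), ("E", "H"), ("F", "G")]

adjacency = {v: set() for v in nodes}
for u, v in edges:
    adjacency[u].add(v)
    adjacency[v].add(u)

# Degree of each node and degree distribution P(k) = N(k) / N.
degree = {v: len(adjacency[v]) for v in nodes}
N = len(nodes)
P = {k: sum(1 for v in nodes if degree[v] == k) / N for k in sorted(set(degree.values()))}

def clustering_coefficient(v):
    # C_I = 2 n_I / (k_I (k_I - 1)), with n_I the number of links among the neighbors of v.
    k = degree[v]
    if k < 2:
        return 0.0
    n_links = sum(1 for a, b in combinations(adjacency[v], 2) if b in adjacency[a])
    return 2.0 * n_links / (k * (k - 1))

print(degree["A"], P)                                            # node A has degree 5
print(clustering_coefficient("A"), clustering_coefficient("F"))  # 0.1 and 0.0, as in the text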

1.2.2 Characteristics of Protein-Protein Networks

Scale-free network

Recent publications have indicated that protein-protein interactions have the features of a scale-free network [29, 41, 53, 87], meaning that their degree distribution approximates a power law, P(k) ∼ k^{-γ}. In scale-free networks, most proteins participate in only a few interactions, while a few (termed “hubs”) participate in dozens of interactions.

Small-world effect

Protein-protein interaction networks have a characteristic property known as the “small-world effect”, which states that any two nodes can be connected via a short path of a few links. The small-world phenomenon was first investigated as a concept in sociology [61] and is a feature of a range of networks arising in nature and technology, including the Internet [3], scientific collaboration networks [63], the English lexicon [77], metabolic networks [22], and protein-protein interaction networks [78, 87]. Although the small-world effect is a property of random networks, the path length in scale-free networks is much shorter than that predicted by the small-world effect [14, 15]. Therefore, scale-free networks are “ultra-small”. This short path length indicates that local perturbations in metabolite concentrations could permeate an entire network very quickly.

Disassortativity

In protein-protein interaction networks, highly-connected nodes (hubs) seldom link directly to each other [59]. This differs from the assortative nature of social networks, in which well-connected people tend to have direct connections to each other. By contrast, all biological and technological networks have the property of disassortativity, in which highly-connected nodes are infrequently linked to each other.

1.3 CLUSTERING APPROACHES

1.3.1 Significance of Clustering in PPI Network

A cluster is a set of objects which share some common characteristics. Clustering is the process of grouping data objects into sets (clusters) which demonstrate greater similarity among objects in the same cluster than in different clusters. Clustering differs from classification; in the latter, objects are assigned to predefined classes, while clustering defines the classes themselves. Thus, clustering is an unsupervised classification method, meaning that it does not rely on training examples with predefined class labels.

In protein-protein interaction networks, clusters correspond to two types of modules: protein complexes and functional modules. Protein complexes are groups of proteins that interact with each other at the same time and place, forming a single multi-molecular machine. Functional modules consist of proteins that participate in a particular cellular process while binding to each other at different times and places.

Clustering in protein-protein interaction networks therefore involves identifying protein complexes and functional modules. This process has the following analytical benefits:

(1) clarification of PPI network structures and their component relationships;

(2) inference of the principal function of each cluster from the functions of its members;

(3) elucidation of possible functions of members in a cluster through comparison with the functions of other members.

1.3.2 Challenges of Clustering on PPI Networks

The classic clustering approaches follow a protocol termed “pattern proximity after feature selection” [38]. Pattern proximity is usually measured by a distance function defined for pairs of patterns. A simple distance measure can often be used to reflect dissimilarity between two patterns, while other similarity measures can be used to characterize the conceptual similarity between patterns. However, in protein-protein interaction networks, proteins are represented as nodes and interactions are represented as edges. The relationship between two proteins is therefore a simple binary value: 1 if they interact, 0 if they do not. This lack of nuance makes it difficult to define the distance between the two proteins.

Additionally, a high rate of false positives and the sheer volume of data make reliable clustering of PPI networks problematic.

Clustering approaches for PPI networks can be broadly characterized as distance-based or graph-based. Distance-based clustering uses classic clustering techniques and focuses on the definition of the distance between proteins. Graph-based clustering includes approaches which consider the topology of the PPI network. Based on the structure of the network, the density of each subgraph is maximized, or the cost of a cut is minimized, while separating the graph. The following sections will discuss each of these clustering approaches in greater detail.

1.3.3 Distance-based Clustering

1.3.3.1 Distance Measure Based on Coefficient

As discussed in [30], the distance between two nodes (proteins) in a PPI network can be defined as follows. Let X be a set of n elements and let d_ij = d(i, j) be a non-negative real function d: X × X → R+ which satisfies:

(1) d_ij > 0 for i ≠ j;

(2) d_ij = 0 for i = j;

(3) d_ij = d_ji for all i, j, where d is a distance measure and D = {d_ij} is a distance matrix; and

(4) if d_ij satisfies the triangle inequality d_ij ≤ d_ik + d_kj, then d is a metric.

In a PPI network, the binary vectors X_i = (x_i1, x_i2, ..., x_iN) represent the set of protein purifications for N proteins, where x_ik is 1 if the i-th protein interacts with the k-th protein (the k-th protein is present in the i-th purification) and 0 otherwise. If a distance can be determined which fully accounts for known protein complexes, unsupervised hierarchical clustering methods can be used to accurately assemble protein complexes from the data.

Frequently, a distance can be easily obtained from a simple matching coefficient which calculates the similarity between two elements. The similarity value S_ij can be normalized between 0 and 1, and the distance can be derived from d_ij = 1 − S_ij. If the similarity value of two elements is high, the spatial distance between them should be short.

Several suitable measures have been proposed for this purpose. These include the Jaccard coefficient [32]:

S_{mn} = \frac{X_{mn}}{X_{mm} + X_{nn} - X_{mn}}   (1.1)

the Dice coefficient [32]:

S_{mn} = \frac{2 X_{mn}}{X_{mm} + X_{nn}}   (1.2)

the Simpson coefficient [32]:

S_{mn} = \frac{X_{mn}}{\min(X_{mm}, X_{nn})}   (1.3)

the Bader coefficient [8]:

S_{mn} = \frac{X_{mn}^2}{X_{mm} \times X_{nn}}   (1.4)

the Maryland Bridge coefficient [62]:

S_{mn} = \frac{1}{2}\left(\frac{X_{mn}}{X_{mm}} + \frac{X_{mn}}{X_{nn}}\right)   (1.5)

the Korbel coefficient [47]:

S_{mn} = \frac{\sqrt{X_{mm}^2 + X_{nn}^2}}{\sqrt{2}\, X_{mm} X_{nn}} X_{mn}   (1.6)

and the correlation coefficient [20]:

S_{mn} = \frac{X_{mn} - n \bar{X}_m \bar{X}_n}{\sqrt{(X_{mm} - n \bar{X}_m^2)(X_{nn} - n \bar{X}_n^2)}}   (1.7)

where X_ij = X_i • X_j (the dot product of the two vectors) and \bar{X}_i denotes the mean of vector X_i. The value of S_mn ranges from 0 to 1. X_ij is equal to the number of bits “on” in both vectors, and X_ii is equal to the number of bits “on” in one vector. For example, for the case illustrated in Figure 1.2, the matrix X is:

X =
    0 1 1 1 0 0 1 1
    1 0 1 0 0 1 0 0
    1 1 0 0 0 0 0 0
    1 0 0 0 0 0 0 0
    0 0 0 0 0 1 0 1
    0 1 0 0 1 0 1 0
    1 0 0 0 0 1 0 0
    1 0 0 0 1 0 0 0   (1.8)

To calculate the distance between A and B, d_12: X_11 = X_1 • X_1 = 5, X_22 = X_2 • X_2 = 3, and X_12 = X_1 • X_2 = 1. The Jaccard coefficient is then S_12 = 1/(5 + 3 − 1) = 0.1429, and the distance is d_12 = 1 − 0.1429 = 0.8571.
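The coefficient formulas above reduce to a few dot products, as in the following minimal numpy sketch (illustrative only; it uses the matrix X of Equation 1.8 and reproduces the worked Jaccard value for A and B):

import numpy as np

X = np.array([
    [0, 1, 1, 1, 0, 0, 1, 1],
    [1, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 1],
    [0, 1, 0, 0, 1, 0, 1, 0],
    [1, 0, 0, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 1, 0, 0, 0],
])

dot = X @ X.T   # dot[i, j] = X_i . X_j, the number of shared "on" bits

def jaccard(i, j):
    return dot[i, j] / (dot[i, i] + dot[j, j] - dot[i, j])

def dice(i, j):
    return 2 * dot[i, j] / (dot[i, i] + dot[j, j])

def simpson(i, j):
    return dot[i, j] / min(dot[i, i], dot[j, j])

def bader(i, j):
    return dot[i, j] ** 2 / (dot[i, i] * dot[j, j])

a, b = 0, 1                                       # rows for nodes A and B
s = jaccard(a, b)
print("S_12 = %.4f, d_12 = %.4f" % (s, 1 - s))    # 0.1429 and 0.8571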

This group of distance-based approaches uses classic distance measurements, which are not well suited to high-dimensional spaces. In a high-dimensional space, the distances between pairs of nodes tend to be almost identical for a large data distribution [10]. It is therefore hard to attain ideal clustering results using the simplest distance measurements alone.

1.3.3.2 Distance Measure by Network Distance

There are other definitions based on network distance which give more fine-grained distance measurements for these pairs. In the definitions given above, the similarity value will be 0 (and the distance therefore maximal) for any two proteins that do not share an interaction partner. In [75], each edge in the network is assigned a length of 1. The length of the shortest path (distance) between every pair of vertices in the network is calculated to create an all-pairs-shortest-path distance matrix. Each distance in this matrix is then transformed into an “association”, defined as 1/d^2, where d is the shortest-path distance. This transformation emphasizes local associations (short paths) in the subsequent clustering process. The resulting associations range from 0 to 1. The association of a vertex with itself is defined as 1, while the association of vertices that have no connecting path is defined as 0. Two vertices which are more widely separated in the network will have a longer shortest-path distance and thus a smaller association. The association value can therefore serve as a similarity measure for two proteins.
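A minimal sketch of this transformation, assuming an unweighted interaction graph stored as an adjacency dictionary (the graph in the usage example is invented):

from collections import deque

def association_matrix(adjacency):
    """Breadth-first all-pairs shortest paths, then association = 1 / d^2."""
    assoc = {u: {} for u in adjacency}
    for source in adjacency:
        # BFS from source with unit edge lengths.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for target in adjacency:
            if target == source:
                assoc[source][target] = 1.0          # self-association
            elif target in dist:
                assoc[source][target] = 1.0 / dist[target] ** 2
            else:
                assoc[source][target] = 0.0          # no connecting path
    return assoc

adjacency = {"P1": {"P2", "P3"}, "P2": {"P1", "P3"}, "P3": {"P1", "P2", "P4"}, "P4": {"P3"}}
print(association_matrix(adjacency)["P1"]["P4"])      # distance 2, association 0.25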

In [69], the authors consider paths of various lengths between two vertices in a weighted protein interaction network. The weight of an edge reflects its reliability and lies in the range between 0 and 1. The PathStrength of a path is defined as the product of the weights of all the edges on the path. The k-length PathStrength between two vertices is then defined as the sum of the PathStrengths of all k-length paths between the two vertices. The PathStrength of a path captures the probability that a walk along the path reaches its ending vertex. By summing over all these paths, the k-length PathStrength between two vertices captures the strength of the connection between these two vertices by a k-step walk. Since paths of different lengths should have different impacts on the connection between two vertices, the k-length PathStrength is normalized by the k-length maximum possible path strength to give the k-length PathRatio. Finally, the PathRatio measure between two vertices is defined as the sum of the k-length PathRatios between the two vertices for all k > 1. Though this measure is mainly applied in assessing the reliability of detected interactions and predicting potential interactions that are missed by current experiments, it can also be used as a similarity measure for clustering.

Another network distance measure was developed by Zhou [94, 93]. He defined the distance d_ij from node i to node j as the average number of steps a Brownian particle takes to reach j from i.


Consider a connected network of N nodes and M edges. Its node set is denoted by V = {1, ..., N}, and its connection pattern is specified by the generalized adjacency matrix A. If there is no edge between node i and node j, A_ij = 0; if there is an edge between those nodes, A_ij = A_ji > 0, and its value signifies the interaction strength. The set of nearest neighbors of node i is denoted by E_i. As a Brownian particle moves throughout the network, at each time step it jumps from its present position i to a nearest-neighboring position j. When no additional information about the network is known, the jumping probability P_ij = A_ij / Σ_{l=1}^{N} A_il can be assumed. The matrix P is called the transfer matrix.

The node-node distance d_ij from i to j is defined as the average number of steps needed for the Brownian particle to move from i through the network to j. Using simple linear-algebraic calculations, it is obvious that

d_{ij} = \sum_{l=1}^{n} \left[ (I - B(j))^{-1} \right]_{il}   (1.9)

where I is the N × N identity matrix, and matrix B(j) equals the transfer matrix P, with the exception that B_lj(j) ≡ 0 for any l ∈ V. The distances from all the nodes in V to node j can thus be obtained by solving the linear algebraic equation

[I - B(j)] \{d_{1j}, ..., d_{nj}\}^T = \{1, ..., 1\}^T.   (1.10)

For example, in the network shown in Figure 1.3, with node set V = {1, 2, 3, 4}, the adjacency matrix A and transfer matrix P are:

A =
    0 1 1 1
    1 0 1 0
    1 1 0 0
    1 0 0 0

P =
    0    1/3  1/3  1/3
    1/2  0    1/2  0
    1/2  1/2  0    0
    1    0    0    0

B(j) can be derived from P:

B(1) =
    0  1/3  1/3  1/3
    0  0    1/2  0
    0  1/2  0    0
    0  0    0    0

B(2) =
    0    0  1/3  1/3
    1/2  0  1/2  0
    1/2  0  0    0
    1    0  0    0

B(3) =
    0    1/3  0  1/3
    1/2  0    0  0
    1/2  1/2  0  0
    1    0    0  0

B(4) =
    0    1/3  1/3  0
    1/2  0    1/2  0
    1/2  1/2  0    0
    1    0    0    0

The distance between any two nodes can then be calculated with Equation 1.9:

D = {d_ij} =
    8/3    2      2     1
    10/3   4      8/3   13/3
    10/3   8/3    4     13/3
    23/11  27/11  9/11  34/11

Based on this distance measure, Zhou [93] defined a dissimilarity index to quantify the relationship between any two nearest-neighboring nodes. Nearest-neighboring vertices of the same community tend to have a small dissimilarity index, while those belonging to different communities tend to have a high dissimilarity index.


Figure 1.3. Example of distance measure by Brownian particle.

Given two vertices i and j that are nearest neighbors (A_ij > 0), the difference in their perspectives regarding the network can be quantitatively measured. The dissimilarity index Λ(i, j) is defined by the following expression:

\Lambda(i, j) = \frac{\sqrt{\sum_{k \neq i,j}^{n} [d_{ik} - d_{jk}]^2}}{n - 2}   (1.11)

If two nearest-neighboring vertices i and j belong to the same community, then the average distance d_ik from i to any other vertex k (k ≠ i, j) will be quite similar to the average distance d_jk from j to k. This indicates that the perspectives of the network as viewed from i and j will be quite similar. Consequently, Λ(i, j) will be small if i and j belong to the same community and large if they belong to different communities.
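The computation behind Equations 1.10 and 1.11 amounts to a handful of linear solves, as in the following numpy sketch for the 4-node adjacency matrix of Figure 1.3 (illustrative; rows of the resulting matrix index the starting node and columns the target node):

import numpy as np

A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0]], dtype=float)

n = A.shape[0]
P = A / A.sum(axis=1, keepdims=True)         # transfer matrix, P_ij = A_ij / sum_l A_il

# Solve [I - B(j)] {d_1j, ..., d_nj}^T = {1, ..., 1}^T for each target node j,
# where B(j) is P with column j set to zero (Equation 1.10).
D = np.zeros((n, n))
for j in range(n):
    B = P.copy()
    B[:, j] = 0.0
    D[:, j] = np.linalg.solve(np.eye(n) - B, np.ones(n))

def dissimilarity(i, j):
    # Lambda(i, j) of Equation 1.11, following the reconstruction above.
    ks = [k for k in range(n) if k not in (i, j)]
    return np.sqrt(np.sum((D[i, ks] - D[j, ks]) ** 2)) / (n - 2)

print(np.round(D, 3))                         # rows: starting node, columns: target node
print(round(dissimilarity(0, 1), 3))          # nodes 1 and 2 are nearest neighbors (A_12 > 0)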

When this approach is applied to a protein interaction network, clusters of proteins that may be of biological significance can be constructed. Zhou provided three examples of such an application. Most of the proteins in these examples were involved in known functions. It was possible to predict similar biological functions for the few proteins in each cluster which were previously unanalyzed.

1.3.3.3 UVCLUSTER

The UVCLUSTER [6] approach is informed by the observation that the shortest-path distance between protein pairs is typically not very fine-grained, and that many pairs have the same distance value. This method proposes an iterative approach to distance exploration; unlike other distance-based approaches, it converts the set of primary distances into secondary distances. The secondary distance measures the strength of the connection between each pair of proteins when the interactions for all the proteins in the group are considered. Secondary distances are derived by first applying a hierarchical clustering step based on the affinity coefficient to generate N different clustering results. The number of solutions in which any two selected proteins are not in the same cluster is defined as the secondary distance between the two proteins. Defined succinctly, the secondary distance represents the likelihood that two selected proteins will not be in the same cluster.

This approach has four steps:

1. A primary distance d between any two proteins in the PPI network is measured by the minimum number of steps required to connect them. Each valid step is a known, physical protein-protein interaction. Users are allowed to select groups of proteins to be analyzed either by choosing a single protein and establishing a cutoff distance value or by providing the program with a list of proteins.

2. Next, agglomerative hierarchical clustering is applied to the sub-table of primary distances generated in the first step to produce N alternative and equally-valid clustering solutions. The user specifies a value for N before starting the analysis. UVCLUSTER first randomly samples the elements of the dataset and then clusters them according to the group average linkage. The agglomerative process ends when the affinity coefficient (AC) is reached. The AC is defined as follows:

AC = 100[(P_m − C_m)/(P_m − 1)]   (1.12)

where C_m (the cluster mean) is the average of the distances for all elements included in the clusters and P_m (the partition mean) is the average value of the distances for the whole set of selected proteins. The AC value is selected by the user at the start of the process.

3. Once the dataset of N alternative solutions has been obtained, the number of pairs of elements that appear together in the same cluster is counted. A secondary distance d between two elements is defined as the number of solutions in which those two elements do not appear together in the same cluster, divided by the total number of solutions (N). In effect, the secondary distance iteratively resamples the original primary distance data, thus indicating the strength of the connection between two elements. The secondary distance represents the likelihood that a pair of elements will not appear in the same cluster when many alternative clustering solutions are generated (a simplified sketch of this computation is given after this list).

4. After the generation of secondary distance data, the proteins can be clustered using conventional methods such as UPGMA (Unweighted Pair Group Method with Arithmetic Mean) or neighbor-joining. The results of an agglomerative hierarchical clustering process in which UPGMA is applied to the secondary distance data are placed in a second UVCLUSTER output file. A third output file contains a graphical representation of the data in PGM (Portable GreyMap) format. To generate the PGM file, proteins are ordered according to the results described in the second output file.
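The sketch below illustrates the secondary-distance idea only: it perturbs the primary distances to obtain N alternative average-linkage solutions and counts how often each pair fails to co-cluster. The jittering, the fixed distance cut (instead of the AC criterion of Equation 1.12), and the toy primary-distance matrix are all simplifying assumptions, not part of UVCLUSTER itself.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def secondary_distances(primary, n_solutions=100, noise=0.05, cut=2.0, seed=0):
    """primary: symmetric matrix of shortest-path (hop count) distances."""
    rng = np.random.default_rng(seed)
    n = primary.shape[0]
    not_together = np.zeros((n, n))
    for _ in range(n_solutions):
        # Small symmetric jitter so that the N solutions can differ.
        jitter = rng.uniform(0.0, noise, size=(n, n))
        jitter = (jitter + jitter.T) / 2.0
        np.fill_diagonal(jitter, 0.0)
        condensed = squareform(primary + jitter, checks=False)
        labels = fcluster(linkage(condensed, method="average"), t=cut, criterion="distance")
        not_together += labels[:, None] != labels[None, :]
    return not_together / n_solutions    # fraction of solutions in which the pair splits

# Toy primary distances for four proteins.
primary = np.array([[0, 1, 1, 3],
                    [1, 0, 1, 3],
                    [1, 1, 0, 2],
                    [3, 3, 2, 0]], dtype=float)
print(secondary_distances(primary))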

The use of UVCLUSTER offers four significant benefits. First, the involvement of the secondary distance value facilitates the identification of sets of closely-linked proteins. Second, it allows the incorporation of previously-known information in the discovery of proteins involved in a particular process of interest. Third, guided by the AC value, it can establish groups of connected proteins even when some information is currently unavailable. Finally, UVCLUSTER can compare the relative positions of orthologous proteins in two species to determine whether they retain related functions in both of their interactomes.

1.3.3.4 Similarity Learning Method

A similarity learning method that incorporates very limited annotation data is introduced in [70].

The method defines the similarity between two proteins in a probabilistic framework. Edges in the network are regarded as a means of message passing. Each protein propagates its function to neighboring proteins. Meanwhile, each protein receives these function messages from its neighboring proteins to decide its own function. The final probability of a protein having a specific function is therefore a conditional probability defined on its neighbors’ status of having this function annotation. For a certain functional label under consideration, the probability of a protein A having this function is P(A). Another protein B’s probability of having this function by propagation using A as the information source can then be represented as a conditional probability P(B|A). This conditional probability gives the capability of A’s function being transferred to B via the network. The similarity between proteins A and B is defined as the product of two conditional probabilities:


Similarity_AB = P(A|B) × P(B|A).

The problem of estimating the similarity between two proteins is thus changed into estimating the two conditional probabilities. For this purpose, a statistical model is defined to predict the conditional probabilities using topological features. Since, in most organisms, there is a certain amount of annotation data for proteins, some training samples are available.

The method uses a two-step approach:

1. Model training step: known annotation data are used to estimate the parameters in the model.

2. Conditional probability estimation step: the numerical values of the conditional probabilities are calculated using the model and the parameters estimated in the previous step.

An unsupervised clustering method can then be applied to the resulting similarity matrix.

1.3.3.5 Summary

This subsection has provided a review of a series of approaches to distance-based clustering. The first category of approaches uses classic distance measurement methods, which offer a variety of coefficient formulas to compute the distance between proteins in PPI networks. The second class of approaches defines a distance measure based on network distance, including the shortest path length, the combined strength of paths of various lengths, and the average number of steps a Brownian particle takes to move from one vertex to another. The third approach type, exemplified by UVCLUSTER, defines a primary and a secondary distance to establish the strength of the connection between two elements in relation to all the elements in the analyzed dataset. The fourth is a similarity learning approach that incorporates some annotation data. Although these four categories of approaches each involve different methods for distance measurement, they all apply classic clustering approaches to the computed distances between proteins.

1.3.4 Graph-based Clustering

A protein-protein interaction network can be viewed as an unweighted graph, in which an edge between two proteins is either present (1) or absent (0). This section will explore graph-based clustering, another class of approaches to the process of clustering. Graph-based clustering techniques are explicitly presented in terms of a graph, thus converting the process of clustering a dataset into such graph-theoretical problems as finding a minimum cut or maximal subgraphs in the graph G.

1.3.4.1 Finding Dense Subgraphs

The goal of this class of approaches is to identify the densest subgraphs within a graph; specific methods vary in the means used to assess the density of the subgraphs. Five variations on this theme will be discussed in this subsection.

Enumeration of complete subgraphs

This approach identifies all fully connected subgraphs (cliques) by complete enumeration [80]. In general, finding all cliques within a graph is an NP-complete problem.

Figure 1.4. Example of the enumeration of complete subgraphs.

Exceptionally, however, this problem is anti-monotonic, meaning that, if a subset of set A is not a clique, then set A is also not a clique. Because of this property, regions of density can be quickly identified in sparse graphs. In fact, to find cliques of size n, one needs only to enumerate those cliques that are of size n−1. Assume a process which starts from the smallest statistically significant clique size, which is 4 in the case depicted in Figure 1.4. All possible pairs of edges are considered. For example, as shown in Figure 1.4, to examine the edges AB and CD, we must check for the edges AC, AD, BC, and BD. If these edges all exist, the four nodes are fully connected, and a clique ABCD has thus been identified. To test every identified clique ABCD, all known proteins are successively selected; if for protein E the edges EA, EB, EC, and ED all exist, then the clique is expanded to ABCDE. The end result of this process is the generation of cliques which are fully internally connected.

While this approach is simple, it has several drawbacks. The basic assumption underlying the method - that cliques must be fully internally connected - does not accurately reflect the real structure of protein complexes and modules. Dense subgraphs are not necessarily fully connected. In addition, many interactions in the protein network may fail to be detected experimentally, thus leaving no trace in the form of edges.

Monte Carlo optimization

Seeking to address these issues, Spirin and Mirny [80] introduced a new approach which searches for highly-connected rather than fully-connected sets of nodes. This was conceptualized as an optimization problem involving the identification of a set of n nodes that maximizes the objective function Q, defined as follows:

Q(P) = \frac{2m}{n \cdot (n-1)}   (1.13)

The term m enumerates the edges (interactions) among the n nodes in the subgraph P. In this formula, the function Q characterizes the density of a cluster. If the subset is fully connected, Q equals 1; if the subset has no internal edges, Q equals 0. The goal is to find a subset with n nodes which maximizes the objective function Q.

A Monte Carlo approach is used to perform the optimization. The process starts with a connected subset S of n nodes picked randomly from the graph. At each step, nodes are added to or deleted from S, and only those changes which increase the function Q of S are retained. These steps are repeated until the maximum Q is identified; this yields an n-node subset with high density.
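A toy version of this search is easy to write down. The sketch below is a greedy simplification that keeps |S| fixed by swapping one member of S for a neighboring node and accepting only moves that increase Q; a full Monte Carlo scheme would also accept some worsening moves with a temperature-dependent probability. The example graph is invented.

import random

def Q(S, adjacency):
    """Density Q = 2m / (n (n - 1)) of the subgraph induced by the node set S."""
    n = len(S)
    if n < 2:
        return 0.0
    m = sum(1 for u in S for v in adjacency[u] if v in S) // 2
    return 2.0 * m / (n * (n - 1))

def improve_subset(adjacency, S, iterations=200, seed=0):
    rng = random.Random(seed)
    S = set(S)
    best = Q(S, adjacency)
    for _ in range(iterations):
        frontier = {v for u in S for v in adjacency[u]} - S
        if not frontier:
            break
        out_node = rng.choice(sorted(S))
        in_node = rng.choice(sorted(frontier))
        candidate = (S - {out_node}) | {in_node}
        q = Q(candidate, adjacency)
        if q > best:                      # keep only moves that increase Q
            S, best = candidate, q
    return S, best

adjacency = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3, 5}, 5: {4}}
print(improve_subset(adjacency, S={1, 2, 5}))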

Another quality measure used in this approach is the sum of the shortest distances between selected nodes. Correspondingly, a similar Monte Carlo approach is applied to minimize this value. This process proceeds as follows. At time t = 0, a random set of M nodes is selected. For each pair of nodes i, j from this set, the shortest path L_ij between i and j on the graph is calculated. The sum of all shortest paths L_ij for this set is denoted L_0. At each time step, one of the M nodes is randomly selected and replaced by a node chosen randomly from among its neighbors. To assess whether the original node is to be replaced by this neighbor, the new sum of all shortest paths, L_1, is then calculated. If L_1 < L_0, the replacement is accepted with probability 1. If L_1 > L_0, the replacement is accepted with probability exp(−(L_1 − L_0)/T), where T is the effective temperature. At every tenth time step, an attempt is made to replace one of the nodes from the current set with a node that has no edges to the current set. This procedure ensures that the process is not caught in an isolated disconnected subgraph. This process is repeated either until the original set converges to a complete subgraph or for a predetermined number of steps. The tightest subgraph, defined as the subgraph corresponding to the smallest L_0, is then recorded. The recorded clusters are merged and redundant clusters are removed. The use of a Monte Carlo approach allows smaller pieces of a cluster to be identified separately rather than focusing exclusively on the whole cluster. Monte Carlo simulations are therefore well suited to recognizing highly dispersed cliques.

The experiments conducted by Spirin started with the enumeration of all cliques of size 3 and larger in a graph with N = 3,992 nodes and M = 6,500 edges. Additionally, 1,000 random graphs of the same size and degree distribution were constructed for comparison. Using the approach described above, more than 50 protein clusters of sizes from 4 to 35 were identified. In contrast, the random networks contained very few such clusters. This work indicated that real complexes have many more interactions than the tightest complexes found in randomly-rewired graphs. In particular, clusters in a protein network have many more interactions than their counterparts in random graphs.

Redundancies in PPI network

Samanta and Liang [76] took a statistical approach to the clustering of proteins. This approach assumes that two proteins that share a significantly larger number of common neighbors than would arise randomly will have close functional associations. The method first ranks the statistical significance of forming shared partnerships for all protein pairs in the PPI network, using the p-value to rank the statistical significance of the relationship between two proteins. The two proteins with the smallest p-value are then combined and are considered to be in the same cluster. This process is repeated until a threshold is reached. The steps of the algorithm are described in more detail as follows:

First, the p-values [81] for all possible protein pairs are computed and stored in a matrix. The formula for computing the p-value for two proteins is shown in Equation 1.14:

P(N, n_1, n_2, m) = \frac{\binom{N}{m}\binom{N-m}{n_1-m}\binom{N-n_1}{n_2-m}}{\binom{N}{n_1}\binom{N}{n_2}}
                  = \frac{\binom{n_1}{m}\binom{N-n_1}{n_2-m}}{\binom{N}{n_2}}
                  = \frac{(N-n_1)!\,(N-n_2)!\,n_1!\,n_2!}{N!\,m!\,(n_1-m)!\,(n_2-m)!\,(N-n_1-n_2+m)!}   (1.14)


Figure 1.5. If the element (m, n) has the lowest p-value, a cluster is formed with proteins m and n. Rows/columns m and n are therefore merged, with the new p-value of the merged row/column being the geometric mean of the separate p-values of the corresponding elements. Figure is adapted from [76].

where N is the number of proteins in the network, the proteins in the pair have n_1 and n_2 neighbors, respectively, and m is the number of neighbors shared by both proteins. This formula is symmetric with respect to interchange of n_1 and n_2. It is a ratio in which the denominator is the total number of ways that two proteins can have n_1 and n_2 neighbors. In the numerator, the first term represents the number of ways by which m common neighbors can be chosen from all N proteins. The second term represents the number of ways by which the n_1 − m remaining neighbors can be selected from the remaining N − m proteins. The last term represents the number of ways by which the n_2 − m remaining neighbors can be selected, none of which can match any of the n_1 neighbors of the first protein.

Second, the protein pair with the lowest p-value is designated as the first group in the cluster. As illustrated in Figure 1.5, the rows and columns for these two proteins are merged into one row and one column. The probability values for this new group are the geometric means of the two original probabilities (or the arithmetic means of the log P values). This process is repeated until a threshold is reached, adding elements to increase the size of the original cluster. The protein pair with the second-lowest p-value is then selected to generate the next cluster.
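Equation 1.14 can be evaluated directly with exact binomial coefficients; the following sketch also checks the equivalent factorial form on its last line (the numerical inputs are illustrative):

from math import comb, factorial

def p_value(N, n1, n2, m):
    """P(N, n1, n2, m) of Equation 1.14: the probability that two proteins with
    n1 and n2 neighbors share exactly m of them in a network of N proteins."""
    return comb(n1, m) * comb(N - n1, n2 - m) / comb(N, n2)

def p_value_factorial_form(N, n1, n2, m):
    # The expanded factorial form on the last line of Equation 1.14.
    num = factorial(N - n1) * factorial(N - n2) * factorial(n1) * factorial(n2)
    den = (factorial(N) * factorial(m) * factorial(n1 - m) *
           factorial(n2 - m) * factorial(N - n1 - n2 + m))
    return num / den

print(p_value(100, 10, 12, 4))                  # illustrative numbers
print(p_value_factorial_form(100, 10, 12, 4))   # same value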

As mentioned in Section 1.3.2, a high rate of false positives typically creates significant noise which disrupts the clustering of protein complexes and functional modules. This method overcomes this difficulty by using a statistical technique that forms reliable functional associations between proteins from noisy interaction data. The statistical significance of forming shared partnerships for all protein pairs in the interaction network is ranked. This approach is grounded in the hypothesis that two proteins with a significantly larger number of common interaction partners in the measured dataset than would arise randomly will also have close functional links [76].

To validate this hypothesis, all possible protein pairs were ranked in the order of their probabilities. For comparison, the corresponding probabilities were examined for a random network with the same number of nodes and edges but with different connections. The connections in the random network were generated from a uniform distribution. The comparison suggests that the associations in the real dataset contain biologically meaningful information. It also indicates that such low-probability associations did not arise simply from the scale-free nature of the network.

Molecular complex detection (MCODE)

Molecular complex detection (MCODE), proposed by Bader and Hogue [8], is an effective approach for detecting densely-connected regions in large protein-protein interaction networks. This method weights a vertex by local neighborhood density, chooses a few seeds with high weight, and isolates the dense regions according to given parameters. The MCODE algorithm operates in three steps: vertex weighting, complex prediction, and optional postprocessing to filter or add proteins to the resulting complexes according to certain connectivity criteria.

In the first step, all vertices are weighted based on their local network density using the highest k-core of the vertex neighborhood. The core-clustering coefficient of a vertex v is defined to be the density of the highest k-core of the vertices connected directly to v together with v itself (the immediate neighborhood of v). Compared with the traditional clustering coefficient, the core-clustering coefficient amplifies the weighting of heavily-interconnected graph regions while removing the many less-connected vertices that are usually part of a biomolecular interaction network. For each vertex v, the weight of v is:

w = k × d   (1.15)

where d is the density of the highest k-core graph from the set of vertices including all the vertices directly connected with v and vertex v itself. For example, using the example provided in Figure 1.2, the 2-core weight of node A is 2 × (2 × 5)/(5 × (5 − 1)) = 1. It should be noted that node D is not included in the 2-core node set because the degree of node D is 1.

The second step of the algorithm is the prediction of molecular complexes. With a vertex-weighted graph as input, the highest-weighted vertex is selected as the seed of a complex. Once a vertex is included, its neighbors are recursively inspected to determine whether they are part of the complex, and the seed is thus expanded into a complex until a threshold is encountered. The algorithm assumes that complexes cannot overlap (this condition is more fully addressed in step three), so a vertex is not checked more than once. This process stops when, as governed by the specified threshold, no additional vertices can be added to the complex. The vertices included in the complex are marked as having been examined. This process is repeated for the next-highest unexamined weighted vertex in the network. In this manner, the densest regions of the network are identified. The vertex weight threshold parameter defines the density of the resulting complex.

Post-processing occurs optionally in the third step of the algorithm. Complexes are filtered out if they do not contain at least a 2-core node set. The algorithm may be run with the “fluff” option, which increases the size of the complex according to a given fluff parameter between 0.0 and 1.0. For every vertex v in the complex, its neighbors are added to the complex if they have not yet been examined and if the neighborhood density (including v) is higher than the given fluff parameter. Vertices that are added by the fluff parameter are not marked as examined, so there can be overlap among predicted complexes when the fluff parameter is set.

Evaluated using the Gavin [25] and MIPS [60] data sets, MCODE effectively finds densely-connected regions of a molecular interaction network based solely on connectivity data. Many of these regions correspond to known molecular complexes.

Summary

This subsection has introduced a series of graph-based clustering approaches which are structured to maximize the density of subgraphs. The first approach examined seeks to identify fully-connected subgraphs within the network. The second approach improves upon this method by optimizing a density function to find highly-connected rather than fully-connected subgraphs. The third approach merges pairs of proteins with the lowest p-values, indicating that those proteins have a strong relationship, to identify the dense subgraphs within the network. The final approach discussed assigns each vertex a weight representing its local density in the graph and uses the vertex with the highest weight as the seed from which to generate a dense subgraph. These approaches all use the topology of the graph to find dense subgraphs within the network and to maximize the density of each subgraph.

1.3.4.2 Finding Minimum Cut

A second category of graph-based clustering approaches generates clusters by trimming or cutting a series of edges to divide the graph into several unconnected subgraphs. Any edge which is removed should be the least important (minimum) in the graph, thus minimizing the informational cost of removing the edges. Here, “least important” is judged by the structure of the graph; it does not mean that the interaction between the two proteins involved is biologically unimportant.

This subsection will present several techniques which are based upon this method.

Highly connected subgraph (HCS) algorithm

The highly-connected subgraph or HCS method [33] is a graph-theoretic algorithm which separates a graph into several subgraphs using minimum cuts. The resulting subgraphs satisfy a specified density threshold. Despite its interest in density, this method differs from the approaches discussed earlier, which seek to identify the densest subgraphs.

Rather, it exploits the inherent connectivity of the graph and cuts the most unimportant edges to find highly-connected subgraphs.

Some graph-theoretic concepts should first be defined at this point. The edge-connectivity k(G) of a graph G is the minimum number k of edges whose removal results in a disconnected graph. If k(G) = l, then G is termed an l-connected or l-connectivity graph. For example, in Figure 1.6., the graph G is a 2-connectivity graph because at least two edges (dashed lines in the graph) must be cut to produce a disconnected graph. A highly connected subgraph (HCS) is defined as a subgraph whose edge-connectivity exceeds half the number of its vertices. For example, in Figure 1.6., graph G1 is a highly connected subgraph because its edge-connectivity k(G1) = 3 is more than half the number of its vertices. A cut in a graph is a set of edges whose removal disconnects the graph. A minimum cut (abbreviated mincut) is a cut with a minimum number of edges. Thus, a cut S is a minimum cut of a non-trivial graph G iff |S| = k(G). The length of a path between two vertices is the number of edges in the path. The distance d(u, v) between vertices u and v in graph G is the minimum length of a path connecting them, if such a path exists; otherwise d(u, v) = ∞. The diameter of a connected graph G, denoted diam(G), is the longest distance between any two vertices in G. The degree of a vertex v in a graph, denoted deg(v), is the number of edges incident to the vertex.
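As a quick, hypothetical illustration of these definitions in Python (using the networkx library and a complete graph on four vertices, not the graph of Figure 1.6.):

import networkx as nx

G1 = nx.complete_graph(4)                   # illustrative example graph
k = nx.edge_connectivity(G1)                # minimum number of edges whose removal disconnects G1
print(k)                                    # 3
print(k > G1.number_of_nodes() / 2)         # True: G1 is a highly connected subgraph
print(nx.diameter(G1))                      # 1: longest distance between any two vertices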


Figure 1.6. An example of applying the HCS algorithm to a graph. Minimum cut edges are denoted by broken lines. Figure is adapted from [33]

The algorithm identifies highly-connected subgraphs as clusters. The HCS algorithm is detailed below, and Figure 1.6. contains an example of its application. Graph G is first separated into two subgraphs G1 and G2, where G1 is a highly connected subgraph and G2 is not. Subgraph G2 is then separated into subgraphs G3 and G4. This process produces three highly connected subgraphs, G1, G3, and G4, which are considered clusters.

HCS(G(V, E)) algorithm
begin
    (H, H̄, C) ← MINCUT(G)
    if G is highly connected
        then return(G)
    else
        HCS(H)
        HCS(H̄)
end
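In the pseudocode, MINCUT(G) returns the two subgraphs H and H̄ induced by a minimum cut C of G. A minimal executable Python sketch of this recursion, assuming the networkx library and a connected input graph (the guard for very small subgraphs is an added simplification, not part of the original formulation), is:

import networkx as nx

def is_highly_connected(G):
    # Edge-connectivity must exceed half the number of vertices.
    return nx.edge_connectivity(G) > G.number_of_nodes() / 2

def hcs(G, clusters):
    if G.number_of_nodes() < 3 or is_highly_connected(G):
        clusters.append(set(G.nodes()))
        return
    cut = nx.minimum_edge_cut(G)                # edges of a minimum cut of G
    H = G.copy()
    H.remove_edges_from(cut)                    # removing the cut splits G in two
    for comp in nx.connected_components(H):
        hcs(G.subgraph(comp).copy(), clusters)  # recurse on each side of the cut

# Usage: clusters = []; hcs(G, clusters)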

The HCS algorithm generates solutions with desirable properties for clustering. The algorithm has low polynomial complexity and is efficient in practice. Heuristic improvements made to the initial formulation have allowed this method to generate useful solutions for problems with thousands of elements in a reasonable computing time.

Restricted Neighborhood Search Clustering Algorithm (RNSC)

In [45], King et al. proposed a cost-based local search algorithm based on the tabu search metaheuristic [31]. In this algorithm, a clustering of a graph G = (V, E) is defined as a partitioning of the node set V. The process begins with an initial random or user-input clustering and defines a cost function. Nodes are then randomly added to or removed from clusters to find a partition with minimum cost. The cost function is based on the number of invalid connections.


Figure 1.7. An example of the RNSC approach.

An invalid connection incident with v is a connection that exists between v and a node in a different cluster, or, alternatively, a connection that does not exist between v and a node u in the same cluster as v.


Consider a node v in a graph G and a clustering C of the graph. Let α_v be the number of invalid connections incident with v. The naive cost function of C is then defined as

C_n(G, C) = \frac{1}{2} \sum_{v \in V} \alpha_v,    (1.16)

where V is the set of nodes in G. For a vertex v in G with a clustering C, let β_v be the size of the following set: v itself, any node connected to v, and any node in the same cluster as v. This measure reflects the size of the area that v influences in the clustering. The scaled cost function of C is defined as:

C_s(G, C) = \frac{|V| - 1}{3} \sum_{v \in V} \frac{\alpha_v}{\beta_v}.    (1.17)

For example, in Figure 1.7., if the eight vertices are grouped into two clusters as shown, the naive cost function C_n(G, C) = 2 and the scaled cost function C_s(G, C) = 20/9.

Both cost functions seek to define a clustering scenario in which the nodes in a cluster are all connected to one another and there are no other connections between two clusters.
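As a concrete illustration, the following Python sketch (assuming the networkx library; the function name rnsc_costs is hypothetical) evaluates both cost functions for an arbitrary partition of a graph:

import networkx as nx

def rnsc_costs(G, clusters):
    # clusters: a list of node sets partitioning the nodes of G.
    cluster_of = {v: i for i, c in enumerate(clusters) for v in c}
    naive, scaled = 0.0, 0.0
    for v in G:
        neighbors = set(G[v])
        co_members = clusters[cluster_of[v]] - {v}
        # alpha_v: edges leaving v's cluster plus missing edges inside it
        alpha = len(neighbors - co_members) + len(co_members - neighbors)
        # beta_v: v itself, its neighbors, and its cluster co-members
        beta = len({v} | neighbors | co_members)
        naive += alpha
        scaled += alpha / beta
    return naive / 2, (G.number_of_nodes() - 1) / 3 * scaled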

The RNSC approach searches for a low-cost clustering solution by optimizing an initial state. Starting with an initial clustering defined randomly or by user input, the method iteratively moves a node from one cluster to another in a random manner. Since RNSC is randomized, different runs on the same input data will produce different clustering results.

To achieve high accuracy in predicting true protein complexes, the RNSC output is filtered according to a maximum P-value selected for functional homogeneity, a minimum density value, and a minimum size. Only clusters that satisfy these three criteria are presented as predicted protein complexes.

Super paramagnetic clustering (SPC)

The super-paramagnetic clustering (SPC) method uses an analogy to the physical properties of an inhomogeneous ferromagnetic model to find tightly-connected clusters in a large graph [11, 27, 28]. Every node on the graph is assigned a Potts spin variable S_i = 1, 2, ..., q.

The value of this spin variable S_i engages in thermal fluctuations which are determined by the temperature T and the spin values of the neighboring nodes. Two nodes connected by an edge are likely to have the same spin value. Therefore, the spin value of each node tends to align itself with that of the majority of its neighbors.

The SPC procedure proceeds via the following steps:

1. Assign to each point x_i a q-state Potts spin variable S_i.

2. Find the nearest neighbors of each point according to a selected criterion; measure the average nearest-neighbor distance a.

3. Calculate the strength of the nearest-neighbor interactions using Eq. (1.18).

J_{ij} = J_{ji} = \frac{1}{\hat{K}} \exp\left( -\frac{\| \vec{x}_i - \vec{x}_j \|^2}{2a^2} \right),    (1.18)

where \hat{K} is the average number of neighbors per site.

4. Use an efficient Monte Carlo procedure with Eq. (1.19) to calculate the susceptibility χ.

\chi = \frac{N}{T}\left( \langle m^2 \rangle - \langle m \rangle^2 \right), \qquad m = \frac{(N_{max}/N)\,q - 1}{q - 1},    (1.19)

where N_{max} = \max\{N_1, N_2, \ldots, N_q\} and N_\mu is the number of spins with the value \mu.

5. Identify the range of temperatures corresponding to the superparamagnetic phase, between T_fs, the temperature of maximal χ, and the (higher) temperature T_ps where χ diminishes abruptly. Cluster assignment is performed at T_clus = (T_fs + T_ps)/2.

6. Once the J_ij have been determined, the spin-spin correlation function can be obtained by a Monte Carlo procedure. Measure at T = T_clus the spin-spin correlation function, ⟨δ_{S_i,S_j}⟩, for all pairs of neighboring points x_i and x_j.

7. Clusters are identified according to a thresholding procedure. If ⟨δ_{S_i,S_j}⟩ > θ, points x_i and x_j are defined as “friends”. Then all mutual friends (including friends of friends, etc.) are assigned to the same cluster.

The SPC algorithm is robust in conditions with noise and initialization errors and has been shown to identify natural and stable clusters with no requirement for pre-specifying the number of clusters. Additionally, clusters of any shape can be identified.
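A minimal Python sketch of steps 3 and 7 follows (assuming numpy and networkx; the Monte Carlo estimation of the spin-spin correlations in steps 4–6 is omitted and its output is taken as given, and all names are illustrative):

import numpy as np
import networkx as nx

def interaction_strengths(points, neighbors, a):
    # points: dict id -> coordinate array; neighbors: dict id -> list of neighbor ids.
    K_hat = np.mean([len(nbrs) for nbrs in neighbors.values()])      # average neighbors per site
    J = {}
    for i, nbrs in neighbors.items():
        for j in nbrs:
            d2 = np.sum((points[i] - points[j]) ** 2)
            J[(i, j)] = J[(j, i)] = np.exp(-d2 / (2 * a ** 2)) / K_hat   # Eq. (1.18)
    return J

def friends_clusters(correlations, theta=0.5):
    # correlations: dict (i, j) -> estimated spin-spin correlation at T_clus.
    friends = nx.Graph([(i, j) for (i, j), c in correlations.items() if c > theta])
    return list(nx.connected_components(friends))    # mutual friends share a cluster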

Markov clustering

The Markov clustering (MCL) algorithm was designed specifically for application to simple and weighted graphs [82] and was initially used in the field of computational graph clustering [83]. The MCL algorithm finds cluster structures in graphs by a mathematical bootstrapping procedure. The MCL algorithm simulates random walks within a graph by the alternation of expansion and inflation operations.


Figure 1.8. (A) Example of a protein-protein similarity graph for seven proteins (A-F); circles represent proteins (nodes) and lines (edges) represent detected BLASTp similarities with E-values (also shown). (B) Weighted transition matrix for the seven proteins shown in (A). (C) Associated column-stochastic Markov matrix for the seven proteins shown in (A). Figure is adapted from [21].

Expansion refers to taking the power of a stochastic matrix using the normal matrix product. Inflation corresponds to taking the Hadamard power of a matrix (taking powers entrywise), followed by a scaling step, so that the resulting matrix is again stochastic.
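The two operators can be sketched in a few lines of Python (assuming numpy; the parameter values, the convergence test, and the omission of the “neutral” diagonal entries are illustrative simplifications):

import numpy as np

def expand(M, power=2):
    return np.linalg.matrix_power(M, power)          # normal matrix product

def inflate(M, r=2.0):
    M = np.power(M, r)                               # Hadamard (entrywise) power
    return M / M.sum(axis=0, keepdims=True)          # rescale columns to sum to 1

def mcl(M, power=2, r=2.0, max_iter=100):
    M = M / M.sum(axis=0, keepdims=True)             # start from a column-stochastic matrix
    for _ in range(max_iter):
        M_next = inflate(expand(M, power), r)
        if np.allclose(M, M_next):                   # stop once the matrix no longer changes
            break
        M = M_next
    return M                                         # clusters are read off this limit matrix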

Enright et al. [21] employed the MCL algorithm for the assignment of proteins to families. A protein-protein similarity graph is represented as described in Section 2 and as illustrated in Figure 1.8.A. Nodes in the graph represent proteins that are desirable clustering candidates, while edges within the graph are weighted according to a sequence similarity score obtained from an algorithm such as BLAST [5]. Therefore, the edges represent the degree of similarity between these proteins.

A Markov matrix (as shown in Figure 1.8.B) is then constructed in which each entry in the matrix represents a similarity value between two proteins. Diagonal elements are set arbitrarily to a “neutral” value and each column is normalized to produce a column total of 1. This Markov matrix is then provided as input to the MCL algorithm.

As noted above, the MCL algorithm simulates random walks within a graph by alternating two operators: expansion and inflation. The structure of the MCL algorithm is described by the flowchart in Figure 1.9.. After parsing and normalization of the similarity matrix,
