Evaluation of Image Segmentation

1 Introduction

This technical report describes several performance measures that are used in the literature to evaluate image segmentation.

2 Scientific problem

The segmentation problem:

• two labels: one class of interest and one class of background;

• multiple labels: several classes of interest and one class of background (e.g. in tumour segmentation, the core and the edema represent two classes of interest; taking into account the background class, a 3-label segmentation problem is obtained).

The segmentation of an image, $I$, can be defined as the partitioning of $I$ into $L$ subregions, $R_1, R_2, \ldots, R_L$, such that:

• $\bigcup_{l=1}^{L} R_l = I$

• $\forall \tilde{x} \in R_l: S(\tilde{x}) = l$, $\forall l \in \{1, 2, \ldots, L\}$

• $R_l$ is a connected set, $\forall l \in \{1, 2, \ldots, L\}$

• $R_{l_1} \cap R_{l_2} = \emptyset$, $\forall l_1, l_2$, $l_1 \neq l_2$

• $Q(R_l) = True$, $\forall l \in \{1, 2, \ldots, L\}$

• $Q(R_{l_1} \cup R_{l_2}) = False$ for each pair of adjacent regions $R_{l_1}$, $R_{l_2}$.

where $Q$ is a logical predicate defined on the points of the considered region ($Q$ is used for characterizing the objects of the image).

In the case of two regions ($L = 2$: background and foreground) $R_b$ and $R_f$ (from a segmentation $S$), an element is defined to be on the boundary of $R_b$ and $R_f$ if at least one of its connected neighbours belongs to a different category ($R_f$ or $R_b$). The boundary elements in an image form a set $B_S$.

3 Evaluation of segmentation

In order to evaluate the quality of segmentation algorithms, various methodologies are possible. The criteria that distinguish these practices concern

• the reference segmentation to which the computed segmentation is compared, or

• the working principles of the evaluation methods [Bauer et al., 2009].


3.1 Reference segmentation

In the first case, the aim of evaluation is to compare (by using similarity measures or distance-based measures) the computed segmentations to segmentations created by human experts [Niessen et al., 1998], [Gerig et al., 2001]. Even though the segmentations produced by humans (medical experts) may not represent a true gold standard, this framework is the most objective one.

A second case regards a ranking system in which the computed segmentations receive ranks given by human experts. Such an approach is subjective and requires numerous experts and a large amount of time to analyse and rate all the images [Shimizu and Nawano, 2004].

A third case is represented by a comparison of segmentation methods based on their common agreement (in [Warfield et al., 2004] the STAPLE algorithm is suggested). Even if this evaluation provides good results, some risks appear (e.g., algorithms can obtain good scores simply by producing the same errors as other methods, which also leads to a high common agreement).

3.2 Working principles

Another criterion for classifying the evaluation methods regards their working principles [Cardenes et al., 2007], [Cárdenes et al., 2009]. Depending on whether or not a human visually analyses the produced segmentation, the evaluation methods can be classified as:

1. subjective methods — that require a human evaluator; the quality of these methods depends upon the standards used by different experts to evaluate the segmentations, or upon the order in which the evaluators observe the segmentation results;

2. objective methods — that do not require a human evaluator. Taking into account the impact of the segmentations, these objective methods can be classified as:

(a) System-level evaluation methods — the segmentation method is analysed by involving it into an overall system; the performance of the system will highlight the segmentation quality;

(b) Direct evaluation methods — the segmentation method is analysed independently. By considering the object of examination, the direct evaluation methods are classified as:

i. analytical evaluation methods — based only on intrinsic features of the segmentation algorithms (e.g. processing strategy (parallel, sequential, iterative, or mixed), processing complexity, resource efficiency, and segmentation resolution); these methods aim to analyse an algorithm in terms of principles, requirements, complexity etc., without reference to a concrete implementation of the algorithm or to test data. For example, one can define the time complexity of an algorithm, or its response to a theoretical data model such as a step edge [Zhang et al., 2008].

ii. empirical evaluation methods — based on the study of the results:

A. discrepancy (or supervised) methods – compare the results with a reference image (gold standard or ground truth); measures such as the number of misclassified voxels, the position of misclassified voxels, the number of objects in the image, or feature values of segmented objects can be used to evaluate the quality of segmentation.

• based on the number of misclassified pixels versus the reference image, with penalties weighted proportionally to the distance to the closest correctly classified pixel for that region [10-14];

• based on the differences in the feature values measured from the segmented images and the reference image [15-20]. These methods have been extended to accommodate the problem when the number of objects differs between the segmented and reference images [21-25];


• evaluation of edge-based image segmentation methods [26-28,68,30-35]

• Everingham et al. [36] proposed a method to comprehensively evaluate segmentation algorithms using the Pareto front. Instead of using a single discrepancy metric and evaluating effectiveness in a discrepancy space, it performs evaluation in a multi-dimensional fitness/cost space with multiple discrepancy metrics.

B. goodness (stand-alone or unsupervised) methods – based on the study of the results themselves; they evaluate a segmented image based on how well it matches a broad set of characteristics of segmented images as desired by humans. These methods evaluate algorithms by computing a “goodness” metric on the segmented image without a priori knowledge of the desired segmentation result.

4 Supervised evaluation

It is very important to establish how we define similar regions or segmentations. The obtained segmentations and their boundaries could be compact, discontinuous, smooth, etc. In order to define a similarity measure, several aspects have to be considered: colour, texture, motion. When a distance metric is involved in the evaluation process, the minimum, the mean, or the maximum of such a measure could be considered.

4.1 Prerequisites

Firstly, we give the definitions of the four basic cardinalities of the so-called confusion matrix, namely the true positives (TP), the false positives (FP), the true negatives (TN), and the false negatives (FN), for crisp segmentations.

• 2D images: Let $I(\tilde{x}) : \mathbb{R}^2 \to \mathbb{R}$ be a 2D (medical) image (composed of pixels), and $S(\tilde{x}) : \mathbb{R}^2 \to \Omega$, $\Omega = \{0, 1, 2, \ldots, L-1\}$, be an $L$-ary decision segmentation of the image $I(\tilde{x})$.

• 3D images: Let $I(\tilde{x}) : \mathbb{R}^3 \to \mathbb{R}$ be a 3D (medical) image (composed of voxels), and $S(\tilde{x}) : \mathbb{R}^3 \to \Omega$, $\Omega = \{0, 1, 2, \ldots, L-1\}$, be an $L$-ary decision segmentation of the image $I(\tilde{x})$.

Suppose that we have two segmentations:

• the reference segmentation (gold standard or ground truth)$^1$ – $S_r$;

• the computed segmentation (by a specific algorithm) – $S_c$.

Each image point (pixel/voxel) can be classified as follows:

• true positive: $(S_r(\tilde{x})$ is $1) \wedge (S_c(\tilde{x})$ is $1)$;

• false positive: $(S_r(\tilde{x})$ is $0) \wedge (S_c(\tilde{x})$ is $1)$;

• true negative: $(S_r(\tilde{x})$ is $0) \wedge (S_c(\tilde{x})$ is $0)$;

• false negative: $(S_r(\tilde{x})$ is $1) \wedge (S_c(\tilde{x})$ is $0)$.

Each of these segmentations is composed of $k$ segments/regions/classes (e.g. if $k = 2$, the two segments are the class of interest and the background; if $k = 3$, two classes of interest and the background represent the possible segments).

In the case of $k = 2$ segments, the confusion matrix can be represented as in Table 1.

$^1$ In order to call the reference segmentation “ground truth” we have to be certain that it is. Manual reference segmentations drawn by experts normally only approximate the ground truth; in that case they can be used as a gold standard, but not as the ground truth itself.


                                          real segments
                              positive (interest)   negative (background)
computed    positive (interest)         TP                   FP
segments    negative (background)       FN                   TN

Table 1: Confusion matrix

4.2 Measures based on spatial overlap

These metrics compute the overlap between regions. They quantify the similarity of two segmentations and are useful when volume changes are of importance for the image analysis.

4.2.1 Dice coefficient

• Definition: This similarity coefficient [Dice, 1945] is computed as the ratio between the number of pixels/voxels belonging to the intersection of the two segmentations and the average of their sizes.

$$\mathrm{Coeff}_{Dice}(S_c, S_r) = \frac{2|S_r \cap S_c|}{|S_r| + |S_c|} = \frac{2\,TP}{2\,TP + FP + FN} \quad (1)$$

• u.m.: Percent

• Range: Its values range between 0 (no overlap) and 1 (perfect agreement).

• Utility: It is able to quantify reproducibility (repeatability).

• Other names: It is also called the overlap index. A mathematical equivalent of the Dice coefficient is the $F_\beta$ measure (defined by Eq. 2, with $\beta = 1$):

$$F_\beta = \frac{(\beta^2 + 1)\,\mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2\,\mathrm{Precision} + \mathrm{Recall}} \quad (2)$$

Another possible reference to the Dice coefficient is the term Symmetric Volume Overlap (SVO) [Campadelli et al., 2009, Campadelli et al., 2008, Campadelli et al., 2010, Casiraghi et al., 2009], defined as 1 minus the symmetric volume difference (SVD):

$$SVD = 1 - \mathrm{Coeff}_{Dice} = 1 - \frac{2|S_c \cap S_r|}{|S_c| + |S_r|} \quad (3)$$

$\mathrm{Coeff}_{Dice}$ is a special case of the kappa statistic when the number of true negatives is significantly larger than the other three cardinalities.

• Code: Visceral
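As an illustration, the following minimal Python sketch (the function names and the use of NumPy boolean masks are our own assumptions, not the Visceral implementation) computes the four cardinalities of Section 4.1 and the Dice coefficient for two binary masks; later sketches in this report reuse the `cardinalities` helper.

```python
import numpy as np

def cardinalities(seg, ref):
    """Confusion-matrix cardinalities (Section 4.1) for two binary masks
    of equal shape (True/1 = class of interest, False/0 = background)."""
    seg, ref = np.asarray(seg, bool), np.asarray(ref, bool)
    tp = int(np.sum(seg & ref))
    fp = int(np.sum(seg & ~ref))
    fn = int(np.sum(~seg & ref))
    tn = int(np.sum(~seg & ~ref))
    return tp, fp, fn, tn

def dice(seg, ref):
    """Dice coefficient (Eq. 1): 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn, _ = cardinalities(seg, ref)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 1.0  # two empty masks agree perfectly
```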


4.2.2 Jaccard coefficient

• Definition: The Jaccard index (JAC) [Jaccard, 1912] is computed as the ratio between the intersection (of two possible segmentations) and their union.

$$\mathrm{Coeff}_{Jaccard}(S_c, S_r) = \frac{|S_r \cap S_c|}{|S_r \cup S_c|} = \frac{TP}{TP + FP + FN} \quad (4)$$

• u.m.: Percent

• Range: Its values range between 0 (no overlap) and 1 (perfect agreement).

• Utility: In [Taha and Hanbury, 2015] it is noted that $\mathrm{Coeff}_{Jaccard}$ and $\mathrm{Coeff}_{Dice}$ are equal at the extrema 0 and 1, and that $\mathrm{Coeff}_{Dice} > \mathrm{Coeff}_{Jaccard}$ between those limits. Furthermore, the two metrics are related by:

$$\mathrm{Coeff}_{Jaccard} = \frac{\mathrm{Coeff}_{Dice}}{2 - \mathrm{Coeff}_{Dice}}$$

Due to this relation, both coefficients evaluate the same aspects of segmentation, providing no supplementary information if both are used as validation metrics.

• Other names: Volumetric overlap error is the corresponding error measure [Ruskó et al., 2009] ($VOE = 1 - \mathrm{Coeff}_{Jaccard}$).

• Code: Visceral
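Under the same assumptions as the Dice sketch above (binary NumPy masks, illustrative function names), the Jaccard index and its algebraic link to the Dice coefficient can be written as:

```python
def jaccard(seg, ref):
    """Jaccard index (Eq. 4): TP / (TP + FP + FN)."""
    tp, fp, fn, _ = cardinalities(seg, ref)
    denom = tp + fp + fn
    return tp / denom if denom > 0 else 1.0

# relation noted in [Taha and Hanbury, 2015]:
# jaccard(seg, ref) == dice(seg, ref) / (2 - dice(seg, ref))
```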

Dice and Jaccard coefficients are sensitive to misplacement of the segmentation label, but they are relatively insensitive to volumetric under- and overestimation. Shape infidelity is only captured if the deviation is volumetrically impactful: a thin panhandle will not result in a large deviation from one. The Dice coefficient is currently more popular than the Jaccard coefficient, since Jaccard is numerically more sensitive to mismatch when there is reasonably strong overlap; Dice values “look nicer” because they are higher for the same pair of segmentations. A drawback of both is that they are unsuitable for comparing segmentation accuracy on objects that differ in size [Rohlfing et al., 2004].

4.2.3 True positive rate & co

• Definition: Recall is computed as the ratio between the number of pixels/voxels correctly identified as positive in the segmented image and the total number of positive pixels/voxels in the reference image.

$$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (5)$$

In conjunction with Precision (defined as $\frac{TP}{TP + FP}$), Recall is used to compute the F-measure.

Specificity is computed as the ratio between the number of pixels/voxels correctly identified as negative in the segmented image and the total number of negative pixels/voxels in the reference image.

$$\mathrm{Specificity} = \frac{TN}{TN + FP} \quad (6)$$

These two measures depend on the size of the segments. Two other measures are related to these metrics, namely Fallout and the false negative rate (FNR). They are defined by:

$$\mathrm{Fallout} = \frac{FP}{FP + TN} = 1 - \mathrm{Specificity} \quad (7)$$

$$\mathrm{FNR} = \frac{FN}{FN + TP} = 1 - \mathrm{Recall} \quad (8)$$

• u.m.: Percent

• Range: $[0, 1]$

• Utility: Since the last two measures are the complements of Specificity and Recall, only one of the pairs (Recall, Specificity) or (Fallout, FNR) should be used to evaluate and analyse the performance of a segmentation.

• Other names: Recall is also called Sensitivity or True Positive Rate. Specificity is also called True Negative Rate. Fallout is also called the false positive rate (FPR).

• Code: Visceral
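A compact sketch of the four rates, again reusing the hypothetical `cardinalities` helper from the Dice example (our illustration, not the Visceral code):

```python
def overlap_rates(seg, ref):
    """Recall/Specificity (Eqs. 5-6) and their complements (Eqs. 7-8)."""
    tp, fp, fn, tn = cardinalities(seg, ref)
    recall = tp / (tp + fn)           # true positive rate (sensitivity)
    specificity = tn / (tn + fp)      # true negative rate
    return {"recall": recall,
            "specificity": specificity,
            "fallout": 1.0 - specificity,  # false positive rate, Eq. 7
            "fnr": 1.0 - recall}           # false negative rate, Eq. 8
```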

4.2.4 Global consistency error

Another frequently used measure for evaluating segmentation performance is the Global Consistency Error (GCE) [Martin et al., 2001]. An error-based measure is the “opponent” of a similarity measure (two segmentations are identical if an error-based measure is 0).

• Definition: This measure is computed as an average over the error of the pixels/voxels belonging to two segmentations. It is able to compare partitions of the same image and is tolerant of one partition refining the other (e.g. by splitting regions or merging them together). Consider an image $I$ (that contains $n$ pixels/voxels: $n = |I|$) and a segmentation $S$ of it. We denote by $R(S, p_i)$ the set of all points (pixels/voxels) that belong to the same region of the segmentation $S$ as $p_i$. For two segmentations $S_c$ and $S_r$, the (non-symmetric) error (called local refinement error in [Martin et al., 2001]) at a point (pixel/voxel) $p_i$, $LRE(S_c, S_r, p_i)$, is defined as

$$LRE(S_c, S_r, p_i) = \frac{|R(S_c, p_i) \setminus R(S_r, p_i)|}{|R(S_c, p_i)|} \quad (9)$$

The GCE between segmentations can be defined as a mean over the errors of all points (pixels/voxels):

$$GCE(S_1, S_2) = \frac{1}{|I|} \min\left\{ \sum_{i=1}^{|I|} LRE(S_1, S_2, p_i),\ \sum_{i=1}^{|I|} LRE(S_2, S_1, p_i) \right\} \quad (10)$$

By using the cardinalities introduced in Section 4.1, GCE can be expressed as follows:

$$GCE(S_c, S_r) = \frac{1}{|I|} \min\left\{ \frac{FN(FN + 2TP)}{TP + FN} + \frac{FP(FP + 2TN)}{TN + FP},\ \frac{FP(FP + 2TP)}{TP + FP} + \frac{FN(FN + 2TN)}{TN + FN} \right\} \quad (11)$$

• u.m.: Percent

• Range: Its values range between 0 (perfect agreement) and 1 (most different).

• Utility: This measure is able to quantify the consistency between image segmentations of differing granularities. It has the advantage of being tolerant of (label) refinement. Using this measure makes sense when the two segmentations being compared have a similar number of segments [Unnikrishnan et al., 2007].

• Code: Visceral
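Eq. 10 can be evaluated without visiting every pixel explicitly: for a pixel carrying labels $(a, b)$ in the two segmentations, $LRE = (|R_a| - n_{ab}) / |R_a|$, where $n_{ab}$ is the corresponding contingency-table count. A minimal sketch based on this observation (the function name and the contingency-table strategy are our own):

```python
import numpy as np

def gce(s1, s2):
    """Global Consistency Error (Eq. 10) for two label images of equal shape."""
    a = np.asarray(s1).ravel()
    b = np.asarray(s2).ravel()
    # contingency table: n_ab[i, j] = #pixels with label i in s1 and j in s2
    _, ia = np.unique(a, return_inverse=True)
    _, ib = np.unique(b, return_inverse=True)
    n_ab = np.zeros((ia.max() + 1, ib.max() + 1))
    np.add.at(n_ab, (ia, ib), 1)
    row = n_ab.sum(axis=1, keepdims=True)  # |R(s1, p)| for each s1-region
    col = n_ab.sum(axis=0, keepdims=True)  # |R(s2, p)| for each s2-region
    # each cell contributes n_ab * LRE to the per-direction sums
    lre12 = np.sum(n_ab * (row - n_ab) / row)
    lre21 = np.sum(n_ab * (col - n_ab) / col)
    return min(lre12, lre21) / a.size
```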


4.2.5 False positive Dice and False Negative Dice

Another possibility to analyse two segmentations is to use the False Positive Dice (FPD) and False Negative Dice (FND) measures [Babalola et al., 2009].

• Definition: False Positive Dice (FPD) is defined by Eq. 12 and False Negative Dice (FND) is defined by Eq. 13:

$$FPD(S_c, S_r) = \frac{2|S_c \cap \overline{S_r}|}{|S_c| + |S_r|} \quad (12)$$

$$FND(S_c, S_r) = \frac{2|\overline{S_c} \cap S_r|}{|S_c| + |S_r|} \quad (13)$$

where $\overline{S_c}$ and $\overline{S_r}$ are the complements of the segmentation result and of the gold standard, respectively.

• u.m.: Percent

• Range: [0,1]

• Utility: The FPD helps to evaluate over-segmentation, while the FND gives a measure of under-segmentation.

• Other names: -

• Code: none
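Since $|S_c \cap \overline{S_r}| = FP$ and $|\overline{S_c} \cap S_r| = FN$, both measures follow directly from the cardinalities; a short illustrative sketch reusing the hypothetical `cardinalities` helper:

```python
def fpd_fnd(seg, ref):
    """False Positive Dice and False Negative Dice (Eqs. 12-13)."""
    tp, fp, fn, _ = cardinalities(seg, ref)
    size_sum = (tp + fp) + (tp + fn)  # |Sc| + |Sr|, assumed non-empty
    return 2 * fp / size_sum, 2 * fn / size_sum
```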

4.3 Measures based on distance

These metrics compute the distance between two segmented regions (surfaces/volumes) by taking into account the point (pixel/voxel) location. They quantify the dissimilarity of two segmentations and they are useful when the contours (the shapes of the boundary of the structure) are of importance for the image analysis.

A distance value of 0 corresponds to a perfect match between the computed segmentation and the ground truth, while greater values indicate higher errors.

4.3.1 Mean symmetric surface distance (MSSD)

The mean symmetric surface distance (MSSD) [Babalola et al., 2009] takes into account the error between the surfaces (boundaries) of two segmentations ($S_c$ and $S_r$) by using distances between their boundary pixels/voxels.

• Definition: The border points (pixels/voxels) of the segmentation and of the reference are determined first. These are defined as those points (pixels/voxels) in the object that have at least one neighbour (among all the nearest neighbours) that does not belong to the object. For each point (pixel/voxel) in these sets, the closest point (pixel/voxel) in the other set is determined (using the Euclidean distance on real-world coordinates, thus taking into account the generally different resolutions in the different scan directions). All these distances are stored, for border pixels/voxels from both the reference and the segmentation. The average of all these distances gives the mean symmetric absolute surface distance.

The surface distance is defined for a point (pixel/voxel). If $p_i$ is the $i$-th surface point (pixel/voxel) on $S_c$ ($p_i \in B_{S_c}$), the distance from $p_i$ to the closest point (pixel/voxel) on $S_r$ is:

$$d(p_i, S_r) = \min_{p_j \in B_{S_r}} d(p_i, p_j) \quad (14)$$

where $d(p_i, p_j)$ is the Euclidean distance between the points (pixels/voxels), incorporating the real spatial resolution of the image. The distance has an associated sign: negative means that the particular point (pixel/voxel) in $S_c$ lies within the surface defined by $S_r$, positive otherwise.

The mean symmetric surface distance (MSSD) is the average of all the distances from points on the boundary of the computed segmentation to the boundary of the ground truth, and from points on the boundary of the ground truth to the boundary of the computed segmentation:

$$MSSD(S_c, S_r) = \frac{1}{|B_{S_c}| + |B_{S_r}|} \left( \sum_{p_i \in B_{S_c}} d(p_i, S_r) + \sum_{p_j \in B_{S_r}} d(p_j, S_c) \right) \quad (15)$$

• u.m.: Millimeters

• Range: This measure is 0 for a perfect segmentation; the greater the distance, the worse the segmentation.

• Utility: This distance provides a measure of the average mutual distance between edges of the two surfaces.

• Code: none
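A possible implementation strategy uses Euclidean distance transforms: the distance of every border voxel of one segmentation to the other boundary is read off a precomputed distance map. The sketch below computes the unsigned variant of Eq. 15 with SciPy (the helper names and the erosion-based boundary extraction are our own choices; `sampling` carries the real-world voxel spacing, and the `boundary` and `surface_distances` helpers are reused by later sketches):

```python
import numpy as np
from scipy import ndimage

def boundary(mask):
    """Border voxels: object voxels with at least one background neighbour."""
    mask = np.asarray(mask, bool)
    return mask & ~ndimage.binary_erosion(mask)

def surface_distances(seg, ref, spacing=None):
    """All boundary-to-nearest-boundary distances, in both directions."""
    bs, br = boundary(seg), boundary(ref)
    # distance of every voxel to the nearest boundary voxel of the other set
    dt_ref = ndimage.distance_transform_edt(~br, sampling=spacing)
    dt_seg = ndimage.distance_transform_edt(~bs, sampling=spacing)
    return np.concatenate([dt_ref[bs], dt_seg[br]])

def mssd(seg, ref, spacing=None):
    """Mean symmetric (absolute, unsigned) surface distance, Eq. 15."""
    return float(surface_distances(seg, ref, spacing).mean())
```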

4.3.2 Hausdorff distance and Average Hausdorff distance

• Definition: In order to compute this distance, some preliminary concepts must be explained.

The distance between a point $p$ belonging to a segmentation $S_c$ and another segmentation $S_r$ is defined as:

$$d(p, S_r) = \min_{p_i \in S_r} \|p - p_i\| \quad (16)$$

where $\|\cdot\|$ denotes some norm, e.g. the Euclidean distance.

The directed Hausdorff distance between two segmentations ($S_c$ and $S_r$) is given by the maximal distance from a point in the first segmentation to the nearest point in the other one:

$$d(S_c, S_r) = \max_{p_i \in S_c} d(p_i, S_r) \quad (17)$$

This distance is in general not symmetrical [Aspert et al., 2002]; $d(S_c, S_r)$ can be called the forward distance and $d(S_r, S_c)$ the backward distance.

In order to compute the symmetrical Hausdorff distance between two segmentations, for each point in $S_c$ the minimum distance to all points in $S_r$ is computed, and vice versa (for each pixel/voxel on the boundary of $S_c$ there is guaranteed to be a pixel/voxel of $S_r$ within a distance of at most SHD, and vice versa). Then, the maximum over the set of minimum distances represents the SHD. The symmetrical Hausdorff distance [Hausdorff, 1914] is defined as follows:

$$SHD(S_c, S_r) = \max\{d(S_c, S_r), d(S_r, S_c)\} \quad (18)$$

The symmetrical distance allows a better evaluation of the errors between segmentations (the computation of a “one-sided” error can lead to significantly underestimated distance values).


The Average Hausdorff Distance (AHD) is based on the mean of the point-to-set distances over all the points of the first set:

$$d_a(S_c, S_r) = \frac{1}{|S_c|} \sum_{p_i \in S_c} d(p_i, S_r) \quad (19)$$

$$AHD(S_c, S_r) = \max\{d_a(S_c, S_r), d_a(S_r, S_c)\} \quad (20)$$

• u.m.: Millimeters

• Range: This measure is 0 for a perfect segmentation; the greater the distance, the worse the segmentation.

• Utility: The Hausdorff distance is in fact the maximum symmetric surface distance. The HD is generally sensitive to outliers. Because noise and outliers are common in medical segmentations, it is not recommended to use the HD directly [Zhang and Lu, 2004]. However, the quantile method proposed by Huttenlocher et al. [Huttenlocher et al., 1993] is one way to handle outliers: the HD is defined to be the $q$-th quantile of the distances instead of the maximum, so that possible outliers are excluded, where $q$ is selected depending on the application and the nature of the measured point sets. The AHD is known to be stable and less sensitive to outliers than the HD.

Because the plain HD returns the true maximum error, it is required for applications such as surgical planning, where the worst-case error is more important than the average error.

• Other names: Also known as the maximum symmetric surface distance.

• Code: Visceral
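The same distance-transform machinery from the MSSD sketch yields the Hausdorff-type distances; setting `quantile < 1` gives the Huttenlocher-style quantile variant. Note that Eq. 19 as written averages over all points of the set; restricting the computation to boundary voxels, as below, is a common variant for segmentation masks and is our assumption here:

```python
import numpy as np
from scipy import ndimage

def hausdorff(seg, ref, spacing=None, quantile=1.0):
    """SHD (Eq. 18) and AHD (Eq. 20) on boundary voxels.

    quantile=1.0 reproduces the maximum; quantile<1 excludes outliers."""
    bs, br = boundary(seg), boundary(ref)
    dt_ref = ndimage.distance_transform_edt(~br, sampling=spacing)
    dt_seg = ndimage.distance_transform_edt(~bs, sampling=spacing)
    fwd, bwd = dt_ref[bs], dt_seg[br]  # forward / backward distance sets
    shd = max(np.quantile(fwd, quantile), np.quantile(bwd, quantile))
    ahd = max(fwd.mean(), bwd.mean())
    return float(shd), float(ahd)
```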

4.3.3 Mahalanobis distance

• Definition: The Mahalanobis Distance (MHD) [Mahalanobis, 1936] takes into account the correlation among all the points (pixels/voxels) of the set to which the two considered points (pixels/voxels) belong:

$$MHD(p_1, p_2) = \sqrt{(p_1 - p_2)^T \mathrm{Cov}^{-1} (p_1 - p_2)} \quad (21)$$

where $\mathrm{Cov}$ denotes the covariance matrix associated with all the points (pixels/voxels) of the set.

If we want to compute the Mahalanobis distance between two sets, the means over all their elements are considered:

$$MHD(S_c, S_r) = \sqrt{(\mu_{S_c} - \mu_{S_r})^T \mathrm{Cov}^{-1} (\mu_{S_c} - \mu_{S_r})} \quad (22)$$

where $\mathrm{Cov}$ represents the covariance matrix associated with the two segmentations.

• u.m.: Millimeters

• Range: This measure is 0 for a perfect segmentation; the greater the distance, the worse the segmentation.

• Utility: Note that if $\mathrm{Cov}$ is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance. This measure is able to automatically assign weights based on the statistical variation in the data.

It gives each dimension “equal” weight and also accounts for the correlation between different dimensions.

• Code: Visceral
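A sketch of Eq. 22 for two binary masks, using the point coordinates as the feature vectors; the text does not fully specify how $\mathrm{Cov}$ is formed, so pooling the coordinates of both sets is our assumption:

```python
import numpy as np

def mahalanobis_sets(seg, ref):
    """Mahalanobis distance between two point sets (Eq. 22)."""
    pc = np.argwhere(seg)  # coordinates of the points of Sc
    pr = np.argwhere(ref)  # coordinates of the points of Sr
    diff = pc.mean(axis=0) - pr.mean(axis=0)
    cov = np.cov(np.vstack([pc, pr]).T)  # pooled covariance (assumption)
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))
```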


4.3.4 JCd

• Definition: In [Cárdenes et al., 2009] a new similarity measure is presented. It is based on the distance [Pichon et al., 2004] from the misclassified points (pixels/voxels):

$$d(r, S_c, S_r) = \begin{cases} 0 & \text{if } r \in S_c \cap S_r \\ \min_{p_i \in S_c} \|r - p_i\| & \text{if } r \in S_r - S_c \\ \min_{p_j \in S_r} \|r - p_j\| & \text{if } r \in S_c - S_r \end{cases} \quad (23)$$

$$JCd(S_c, S_r) = \frac{|S_c \cap S_r|}{|S_c \cap S_r| + \sum_i d^2(p_i, S_c, S_r) + \sum_j d^2(p_j, S_r, S_c)} \quad (24)$$

where the $p_i$ are the misclassified points (pixels/voxels) of $S_c$ that should be classified as $S_r$, and the $p_j$ are the points (pixels/voxels) of $S_r$ that should be classified as $S_c$.

• u.m.: Percent

• Range: [0,1]

• Utility: The new measure aims to penalise more heavily those points (pixels/voxels) that are more distant from their class in the gold standard. This penalisation is performed by weights associated with the misclassified points (pixels/voxels). In [Cárdenes et al., 2009], for a point (pixel/voxel) $p_i$, the square of the Euclidean distance to the nearest point (pixel/voxel) of the same class as $p_i$ is used to weight each misclassified point (pixel/voxel).

• Code: none
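Distance transforms also make JCd straightforward: `distance_transform_edt` on the complement of a mask gives, for every voxel, the distance to the nearest voxel of that mask, which is exactly the case analysis of Eq. 23. A sketch (function name our own):

```python
import numpy as np
from scipy import ndimage

def jcd(seg, ref, spacing=None):
    """JCd (Eq. 24) with squared-distance penalties for misclassified voxels."""
    seg, ref = np.asarray(seg, bool), np.asarray(ref, bool)
    inter = np.sum(seg & ref)
    dt_seg = ndimage.distance_transform_edt(~seg, sampling=spacing)
    dt_ref = ndimage.distance_transform_edt(~ref, sampling=spacing)
    pen_fn = np.sum(dt_seg[ref & ~seg] ** 2)  # r in Sr - Sc: distance to Sc
    pen_fp = np.sum(dt_ref[seg & ~ref] ** 2)  # r in Sc - Sr: distance to Sr
    denom = inter + pen_fn + pen_fp
    return float(inter / denom) if denom > 0 else 1.0
```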

4.3.5 Root mean square symmetric surface distance

• Definition: The root mean square symmetric surface distance (RMSD) [Bauer et al., 2009] is defined as:

$$RMSD(S_c, S_r) = \sqrt{\frac{1}{|B_{S_c}| + |B_{S_r}|} \left( \sum_{p_1 \in B_{S_c}} d^2(p_1, B_{S_r}) + \sum_{p_2 \in B_{S_r}} d^2(p_2, B_{S_c}) \right)} \quad (25)$$

where $d(p, S)$ represents the distance defined by Eq. 14.

• u.m.: Millimeters

• Range: This measure is similar to the mean symmetric surface distance, but it accumulates the squared distances between the two sets of border points (pixels/voxels).

• Utility: The RMSD is highly correlated with the average distance, but has the advantage that large deviations from the true contour are punished more strongly [Bauer et al., 2009].

• Code: none
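Reusing the hypothetical `surface_distances` helper from the MSSD sketch, Eq. 25 reduces to one line:

```python
import numpy as np

def rmsd(seg, ref, spacing=None):
    """Root mean square symmetric surface distance (Eq. 25)."""
    d = surface_distances(seg, ref, spacing)
    return float(np.sqrt(np.mean(d ** 2)))
```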

4.4 Measures based on connectivity

• Definition: In [Cárdenes et al., 2009] a new measure is presented that takes into account the connectivity of the two given segmentations. Firstly, two sets of points (pixels/voxels) are connected if every point (pixel/voxel) of one set has one or more neighbours (a given neighbourhood must be defined) in the other set. Secondly, the number of connected components is computed for each set ($S_c$ and $S_r$) and for each class $c$ (class/classes of interest or background): $N^c_{S_c}$ and $N^c_{S_r}$, respectively. Then, the connectivity coefficient for a class $c$ is determined as:

$$CC_c(S_c, S_r) = \frac{\min\{N^c_{S_c}, N^c_{S_r}\}}{N^c_{S_c} + N^c_{S_r}} \quad (26)$$

• u.m.: Percent

• Range: The range of $CC_c$ is $[0, 1]$.

• Utility: not specified

• Code: none
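For binary per-class masks, the component counts in Eq. 26 can be obtained with `scipy.ndimage.label` (a sketch; the neighbourhood structure is left at SciPy's default):

```python
from scipy import ndimage

def connectivity_coefficient(seg_c, ref_c):
    """Connectivity coefficient CC_c (Eq. 26) for the binary mask of one class."""
    _, n_seg = ndimage.label(seg_c)  # number of connected components in Sc
    _, n_ref = ndimage.label(ref_c)  # number of connected components in Sr
    return min(n_seg, n_ref) / (n_seg + n_ref)
```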

4.5 Measures based on boundaries

• Definition: In [Cárdenes et al., 2009] a similarity measure called the boundary Jaccard coefficient (BJC) is introduced, based on the boundaries of the segmentations. If we denote by $B^c_S$ the boundary of a segmentation $S$ ($S_c$ or $S_r$) for a class $c$ (class/classes of interest or the background), BJC is computed as:

$$BJC_c(S_c, S_r) = \frac{|B^c_{S_c} \cap B^c_{S_r}|}{|B^c_{S_c} \cup B^c_{S_r}|} \quad (27)$$

• u.m.: Percent

• Range: $[0, 1]$

• Utility: not specified

• Code: none
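A sketch of Eq. 27 for one class, reusing the erosion-based `boundary` helper from the MSSD sketch (an assumption about how the boundaries are extracted):

```python
import numpy as np

def boundary_jaccard(seg_c, ref_c):
    """Boundary Jaccard coefficient (Eq. 27) for the binary mask of one class."""
    bs, br = boundary(seg_c), boundary(ref_c)
    union = np.sum(bs | br)
    return float(np.sum(bs & br) / union) if union > 0 else 1.0
```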

4.6 Measures based on volumes

These measures are useful when the purpose of the segmentation is to identify changes in size. They are sensitive to mis-estimations of the segmented volume more than anything else. In the case of a single measurement, the volume error (two times the volume difference over the volume sum) can be used, while in the case of multiple measurements (when an average result is of interest) the absolute volume error can be used.

4.6.1 Volumetric similarity

• Definition: The similarity of two segmentations can be computed by taking into account their volumes.

In order to determine such a measure, a distance between two volumes must be defined (the similarity being 1 minus this distance). A possible definition [Taha and Hanbury, 2015] for the volumetric distance is the ratio of the absolute volume difference to the sum of the compared volumes:

$$VS(S_c, S_r) = 1 - \frac{\big||S_c| - |S_r|\big|}{|S_c| + |S_r|} = 1 - \frac{|FN - FP|}{2\,TP + FP + FN} \quad (28)$$

• u.m.:Percent

• Range: [0,1]


• Utility: This similarity is based on the absolute volume only; the overlap between the segments is not considered at all (the volumetric similarity can attain its maximum value even when the overlap is zero).

• Code: Visceral

4.6.2 Relative volume difference

• Definition: Another distance between the segmented volumes is their difference in size as a fraction of the size of the reference:

$$RVD(S_c, S_r) = \frac{|S_c| - |S_r|}{|S_r|} \quad (29)$$

• u.m.: Percent

• Range: $[-1, \infty)$

• Utility: This distance is not symmetric (it is not a metric; its values are signed). A value of 0 does not imply that the two segmentations are identical or that they actually overlap with each other. Taking this remark into account, it is recommended [Bauer et al., 2009] to use this measure in combination with other measures. When several quality measures are used to evaluate two segmentations, the absolute value of this measure should be used (e.g. in a ranking/scoring system).

• Code: none
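Both volume-based measures need only the two object volumes; a joint sketch (illustrative names, assuming a non-empty reference):

```python
import numpy as np

def volume_measures(seg, ref):
    """Volumetric similarity (Eq. 28) and relative volume difference (Eq. 29)."""
    vc, vr = int(np.sum(seg != 0)), int(np.sum(ref != 0))
    vs = 1.0 - abs(vc - vr) / (vc + vr)
    rvd = (vc - vr) / vr  # signed and asymmetric, as noted above
    return vs, rvd
```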

4.7 Measures based on pair counting

In order to describe some pair-counting based metrics, the four basic classical cardinalities must be defined for the pair-counting framework.

We suppose that a set of points $X$ is partitioned in two ways and that $P$ is the set of $\binom{n}{2}$ tuples formed by all pairs of objects (in $X \times X$). Depending on the placement in the partitions of each object of a tuple $(p_1, p_2) \in P$, four groups of tuples are possible, the sizes of the categories being $a$, $b$, $c$ and $d$, respectively:

• $p_1$ and $p_2$ are placed in the same subset in both partitions $S_c$ and $S_r$;

• $p_1$ and $p_2$ are placed in the same subset in $S_r$, but in different subsets in $S_c$;

• $p_1$ and $p_2$ are placed in the same subset in $S_c$, but in different subsets in $S_r$;

• $p_1$ and $p_2$ are placed in different subsets in both partitions $S_c$ and $S_r$.

The first and the last groups indicate an agreement (a+d), while the second and the third group indicate the disagreement (b+c) between the two partitions. More details about how these pair-counting cardinalities can be computed based on the classical cardinalities can be found in [Taha and Hanbury, 2015].

4.7.1 Rand Index (RI)

• Definition: A similarity measure able to evaluate both clusterings and classifications (because it is not based on labels) is the Rand Index (RI), proposed in [Rand, 1971].

$$RI(S_c, S_r) = \frac{a + d}{a + b + c + d} \quad (30)$$

Another way to compute the Rand Index is to consider all the points (pixels/voxels) of the image ($p_i$, $i = 1, \ldots, \|I\|$) and the labels associated with these points by the two segmentations, $l^c_i$ and $l^r_i$ respectively, and to compute the fraction of pairs of points having the same label relationship in $S_c$ and $S_r$:

$$R(S_c, S_r) = \frac{1}{\binom{\|I\|}{2}} \sum_{i, j,\, i \neq j} \left[ \mathbb{I}\left( l^c_i = l^c_j \wedge l^r_i = l^r_j \right) + \mathbb{I}\left( l^c_i \neq l^c_j \wedge l^r_i \neq l^r_j \right) \right] \quad (31)$$

where $\mathbb{I}$ is the indicator function and the denominator is the number of possible unique pairs among the $\|I\|$ data points [Unnikrishnan et al., 2007]. In [Unnikrishnan et al., 2007] it is noted that the number of unique labels in $S_c$ and $S_r$ is not restricted to be equal.

• u.m.: Percent

• Range: $[0, 1]$

• Utility: The main idea of the Rand Index is to count the pairs of pixels that have compatible label relationships in the two segmentations being compared [Unnikrishnan et al., 2007].

• Code: Visceral
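The cardinalities $a$, $b$, $c$, $d$ never need to be enumerated pair by pair; they follow from a contingency table of the two label images. A sketch (function names are ours; `pair_counts` is reused by the ARI sketch below):

```python
import numpy as np
from math import comb

def pair_counts(s1, s2):
    """Pair-counting cardinalities a, b, c, d (Section 4.7)."""
    x, y = np.asarray(s1).ravel(), np.asarray(s2).ravel()
    _, ix = np.unique(x, return_inverse=True)
    _, iy = np.unique(y, return_inverse=True)
    n_ij = np.zeros((ix.max() + 1, iy.max() + 1), dtype=np.int64)
    np.add.at(n_ij, (ix, iy), 1)
    a = sum(comb(int(v), 2) for v in n_ij.ravel())           # together in both
    same1 = sum(comb(int(v), 2) for v in n_ij.sum(axis=1))   # together in s1
    same2 = sum(comb(int(v), 2) for v in n_ij.sum(axis=0))   # together in s2
    b, c = same2 - a, same1 - a       # together in only one partition
    d = comb(x.size, 2) - a - b - c   # separated in both
    return a, b, c, d

def rand_index(s1, s2):
    """Rand Index (Eq. 30): fraction of agreeing pairs."""
    a, b, c, d = pair_counts(s1, s2)
    return (a + d) / (a + b + c + d)
```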

4.7.2 Adjusted Rand Index (ARI)

• Definition: The Adjusted Rand Index (ARI), proposed by Hubert and Arabie [Hubert and Arabie, 1985], is a modification of the Rand Index that includes a correction for chance. It is given by:

$$ARI(S_c, S_r) = \frac{2(ad - bc)}{b^2 + c^2 + 2ad + (a + d)(b + c)} \quad (32)$$

• u.m.: Percent

• Range: at most 1 (negative values are possible)

• Utility: The adjusted index has the property of having expected value 0 and maximum value 1. While the unadjusted Rand index lies in $[0, 1]$, the adjusted index can take on a wider range of values, increasing the sensitivity of the measure [Unnikrishnan et al., 2005, Unnikrishnan and Hebert, 2005].

• Code: Visceral
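Reusing the hypothetical `pair_counts` helper, Eq. 32 becomes a one-liner; scikit-learn's `adjusted_rand_score` computes the same statistic and can serve as a cross-check:

```python
def adjusted_rand_index(s1, s2):
    """Adjusted Rand Index (Eq. 32)."""
    a, b, c, d = pair_counts(s1, s2)
    return 2 * (a * d - b * c) / (b * b + c * c + 2 * a * d + (a + d) * (b + c))

# cross-check:
# from sklearn.metrics import adjusted_rand_score
# adjusted_rand_score(s1.ravel(), s2.ravel())
```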

4.7.3 Structural Similarity index (SSIM)

For grey-level images, a well-known image similarity measure is the Structural Similarity index (SSIM) [Wang et al., 2004], which takes the luminance, contrast and structure of two images $A$ and $B$ into account:

$$SSIM(A, B) = \frac{(2\mu_A \mu_B + C_1)(2\sigma_{AB} + C_2)}{(\mu_A^2 + \mu_B^2 + C_1)(\sigma_A^2 + \sigma_B^2 + C_2)} \quad (33)$$

where the constants are set to $C_1 = (0.01 \times 255)^2$ and $C_2 = (0.03 \times 255)^2$. The SSIM index is applied locally using an $11 \times 11$ circular-symmetric Gaussian weighting function, and the mean over the image is used as the final similarity measure.

5 Comparison of statistical validation metrics

Popovic has introduced three criteria for comparing two segmentation validation measures: bias, consistency and discriminancy. In order to compare two segmentation results $R_1$ and $R_2$ under a validation metric $f(S_c, S_r)$, the following relation is introduced: if $f$ rates $R_1$ over $R_2$, then we say that $f(R_1, S_r) \succ f(R_2, S_r)$. In [Popovic et al., 2007], $\Lambda$ is defined as the domain of $S_c$ and $S_r$, and $N$ as the total number of possible instances of $S_c$ in $\Lambda$.

Following [Popovic et al., 2007], we have:

• Bias: The segmentation validation metric $f(S_c, S_r)$ is biased if there exists a pair $\{R_1, S_r\}$ and $\{R_2, S_r\}$ such that $f(R_1, S_r) = f(R_2, S_r)$ and $R_1 \succ R_2$.

• Consistency: Two segmentation validation metrics $f(R, S_r)$ and $g(R, S_r)$ are consistent if there exists no pair $\{R_1, S_r\}$ and $\{R_2, S_r\}$ such that $f(R_1, S_r) \succ f(R_2, S_r)$ and $g(R_1, S_r) \prec g(R_2, S_r)$.

• Discriminancy: A segmentation validation metric $f(S_c, S_r)$ is more discriminating than $g(S_c, S_r)$ if there exists a pair $\{S_1, S_r\}$ and $\{S_2, S_r\}$ such that $f(S_1, S_r) \neq f(S_2, S_r)$ and $g(S_1, S_r) = g(S_2, S_r)$, and there exists no pair $\{S_1, S_r\}$ and $\{S_2, S_r\}$ such that $f(S_1, S_r) = f(S_2, S_r)$ and $g(S_1, S_r) \neq g(S_2, S_r)$.

Because validation metrics rarely exhibit total consistency, probabilistic measures have been defined in [Popovic et al., 2007]:

• Degree of bias: If a segmentation validation metric $f(S_c, S_r)$ has $N_b$ pairs $\{S_i, S_r\}$ and $\{S_j, S_r\}$ such that $f(S_i, S_r) = f(S_j, S_r)$ and $S_i \neq S_j$, the degree of bias is defined as:

$$B = \frac{N_b}{N} \quad (34)$$

• Degree of consistency: If for two segmentation validation metrics $f(S_c, S_r)$ and $g(S_c, S_r)$ there exist $N_c$ pairs $\{S_i, S_r\}$ and $\{S_j, S_r\}$ such that $f(S_i, S_r) \succ f(S_j, S_r)$ and $g(S_i, S_r) \prec g(S_j, S_r)$, the degree of consistency is defined as:

$$C = \frac{N - N_c}{N} \quad (35)$$

• Degree of discriminancy: If for two validation metrics $f(S_c, S_r)$ and $g(S_c, S_r)$ there exist $N_{fg}$ pairs $\{S_i, S_r\}$ and $\{S_j, S_r\}$ such that $f(S_i, S_r) \neq f(S_j, S_r)$ and $g(S_i, S_r) = g(S_j, S_r)$, and $N_{gf}$ pairs $\{S_i, S_r\}$ and $\{S_j, S_r\}$ such that $g(S_i, S_r) \neq g(S_j, S_r)$ and $f(S_i, S_r) = f(S_j, S_r)$, the degree of discriminancy is defined as:

$$D = \frac{N_{fg}}{N_{gf}} \quad (36)$$

In [Popovic et al., 2007] it is also noted that the degree of consistency lies in the range $[0, 1]$ and the degree of discriminancy in the range $[0, \infty)$; the degrees of consistency are symmetrical, while the degrees of discriminancy are reciprocal ($D_{fg} = D_{gf}^{-1}$).

6 Unsupervised evaluation

Notation:

$I$ - initial image

$S_c$ - computed segmented image

$F_I(p)$ - feature of pixel $p$ in image $I$ (gray level, colour value (RGB), others)

$|I|$ - size of $I$ (measured in number of pixels)

The performance of algorithms in solving the image segmentation task has been evaluated taking into account four important criteria (originally proposed in [Haralick and Shapiro, 1985]):


• the regions must be uniform and homogeneous (with respect to some image characteristic) – this criterion has been evaluated by computing various measures of intra-region uniformity;

• adjacent regions must be different (with respect to the same characteristic considered by the previous criterion) – this criterion has been evaluated by computing various measures of inter-region uniformity;

• the interior of each region must be simple and without holes – this criterion has been evaluated by computing various semantic measures;

• the boundaries of each region must be simple, not ragged (in fact, they must be spatially accurate) – this criterion has also been evaluated by computing various semantic measures.

Because the evaluation of the last two criteria requires some semantic information about the images to be segmented, we will discuss only the first two criteria in what follows.

6.1 Intra-region uniformity

6.1.1 Measures based on colour error

Intra-region visual error $E_{intra}$. One of the most intuitive criteria able to quantify the quality of a segmentation result is intra-region uniformity. Levine and Nazif defined a criterion that calculates the uniformity of a region characteristic based on the variance of that characteristic [Levine and Nazif, 1985]. $E_{intra}$ represents the proportion of misclassified pixels in an image.

$$E_{intra} = 1 - \frac{1}{|I|} \sum_{l=1}^{L} \frac{\sum_{s \in R_l} \left[ F_I(s) - \frac{1}{|R_l|} \sum_{t \in R_l} F_I(t) \right]^2}{\left[ \max_{s \in R_l} F_I(s) - \min_{s \in R_l} F_I(s) \right]^2} \quad (37)$$

Characteristics:

• Range: not specified

• Optimisation direction: maximisation

• Other names: intra-region uniformity (max); intra-region disparity (min)

• Advantages: not specified

• Disadvantages: not specified

Internal and External Contrasts. The contrast can be defined at many levels: between two points/pixels/voxels, between two regions, or at the level of the entire image.

The contrast between two points $c(p_1, p_2)$ of an image $I$ is defined as:

$$c(p_1, p_2) = |F_I(p_1) - F_I(p_2)| \quad (38)$$

At the level of a region, two contrasts are possible: internal contrast and external contrast. These contrasts take into account the internal and external contrast of the regions, measured in the neighbourhood of each pixel.

If we denote by $N_r(p)$ the neighbourhood of the point/pixel/voxel $p$ (a neighbourhood of a point $p$ is a set $N_r(p)$ consisting of all points $q$ such that $distance(p, q) \le r$, where the number $r$ is called the radius of $N_r(p)$), by $F(p)$ the point/pixel/voxel feature (e.g. intensity) and by $F_{MAX}$ the maximum feature (e.g. maximum intensity), the internal contrast $InternalContrast$ and the external contrast $ExternalContrast$ of the region $R_l$ (with $l \in \{1, 2, \ldots, L\}$) are respectively:

$$InternalContrast(R_l) = \frac{1}{|R_l|} \sum_{p \in R_l} \max_{n \in N_r(p) \cap R_l} c(p, n) \quad (39)$$

$$ExternalContrast(R_l) = \frac{1}{|Border_{R_l}|} \sum_{p \in Border_{R_l}} \max_{n \in N_r(p),\, n \notin R_l} c(p, n) \quad (40)$$

where $|Border_{R_l}|$ represents the length of the border of region $R_l$.

Internal contrast measures the uniformity of each region. It is defined as the average $MaxContrast$ in that region, where $MaxContrast$ is the largest luminance difference between a pixel and its neighbouring pixels in the same region.

The contrast of a region $R_l$ belongs to the $[0, 1]$ range and is defined as:

$$C(R_l) = \begin{cases} 1 - \frac{InternalContrast(R_l)}{ExternalContrast(R_l)} & \text{if } 0 < InternalContrast(R_l) < ExternalContrast(R_l) \\ ExternalContrast(R_l) & \text{if } InternalContrast(R_l) = 0 \\ 0 & \text{otherwise} \end{cases} \quad (41)$$

The global contrast is defined as:

$$C(I) = \frac{1}{|I|} \sum_{l=1}^{L} |R_l| \, C(R_l) \quad (42)$$

Characteristics of $C(I)$:

• Range: $[0, 1]$

• Optimisation direction: maximisation

• Other names:

• Advantages: not specified

• Disadvantages: does not correctly take strongly textured regions into account.

Discrepancy (D). $D$ measures the gray-level difference between the original image and the output image after segmentation. It was proposed to evaluate thresholding-based segmentation techniques that separate the foreground object from the background [Weszka and Rosenfeld, 1978].

$$D = \sum_{p \in I} \left( F_I(p) - F_{S_c}(p) \right)^2 \quad (43)$$

Characteristics:

• Range: $[0, 255^2]$

• Optimisation direction: minimisation

• Other names:

• Advantages: not specified

• Disadvantages: not specified


6.1.2 Measures based on squared colour error

Within-class variance $\sigma_w^2$. An intra-region uniformity measure, the within-class variance $\sigma_w^2$ is the sum of the squared colour errors of the foreground object and the background, weighted by their respective sizes.

$$\sigma_w^2 = \sum_{l=1}^{L} \frac{|R_l|}{|I|} \, err_F^2(R_l) \quad (44)$$

where $err_F^2(R_l)$ represents the squared colour error of region $R_l$ and is defined as:

$$err_F^2(R_l) = \sum_{p \in R_l} \left( F(p) - \overline{F}(R_l) \right)^2 \quad (45)$$

$$\overline{F}(R_l) = \frac{\sum_{p \in R_l} F(p)}{|R_l|}, \quad \forall l = 1, 2, \ldots, L \quad (46)$$

Characteristics:

• Range: not specified

• Optimisation direction: minimisation

• Other names:

• Advantages: not specified

• Disadvantages: not specified
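A direct sketch of Eqs. 44-46 for a grey-level image and an integer label map (illustrative function name):

```python
import numpy as np

def within_class_variance(image, labels):
    """Within-class variance (Eq. 44)."""
    image = np.asarray(image, dtype=float)
    total = 0.0
    for l in np.unique(labels):
        region = image[labels == l]
        err2 = np.sum((region - region.mean()) ** 2)  # Eq. 45 with the Eq. 46 mean
        total += region.size / image.size * err2      # weight |R_l| / |I|
    return total
```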

Square error of colour $D(I)$. $D(I)$ is a measure of intra-region uniformity. In [Rosenberger and Chehdi, 2000], $D(I)$ is computed as the average squared colour error of each region, weighted by its size:

$$D(I) = \frac{1}{L} \sum_{l=1}^{L} \frac{|R_l| \, err_F^2(R_l)}{|S_c|} \quad (47)$$

Characteristics:

• Range: not specified

• Optimisation direction: minimisation

• Other names: intra-region uniformity

• Advantages: not specified

• Disadvantages: not specified

Normalised uniformity $NU$. The normalised uniformity measure $NU$ is a region uniformity measure and is the normalised sum of the squared colour errors of the foreground object and the background:

$$NU = 1 - \frac{err_F^2(R_{obj}) + err_F^2(R_{background})}{Z} \quad (48)$$

$NU$ was used in the context of an evaluation approach based on a set of segmentation measures that constitute a performance vector ($PV$). The $PV$ vector stores the factors characterizing the segmentation, including region uniformity, region contrast, line contrast, line connectivity, and texture [Levine and Nazif, 1985]. $NU$ improved upon $PV$ by enhancing the region uniformity measure in $PV$ to a normalised region uniformity measure [Sahoo et al., 1988].

Characteristics:

• Range: not specified

• Optimisation direction: not specified

• Other names:

• Advantages: not specified

• Disadvantages: Defined for two regions only.

Gray-level uniformity $U$. Levine and Nazif [Levine and Nazif, 1985] included weighting coefficients in the uniformity formula for images with more than two regions. The gray-level uniformity measure $U$ describes intra-region uniformity. For a gray-scale image, $U$ is based on the weighted sum of the squared gray-level errors of the regions:

$$U = 1 - \sum_{l=1}^{L} \frac{err_F^2(R_l)\, \omega_l}{Z} \quad (49)$$

where:

$\omega_l$ is a weighting factor (it could be defined as $\omega_l = \frac{|R_l|}{|I|}$) and

$Z$ is a normalisation factor, $Z = (F_{max} - F_{min})^2$, where $F_{max}$ and $F_{min}$ denote the maximum and minimum feature values of the points/pixels/voxels in the test image, respectively.

Characteristics:

• Range: not specified

• Optimisation direction: not specified

• Other names:

• Advantages: not specified

• Disadvantages: not specified

Average squared colour error measures F, F′ and Q. The methods $F$, $F'$ and $Q$ are based on the average squared colour error of each region. They involve different penalties in order to deal with the over-segmentation and under-segmentation problems.

Liu and Yang [Liu and Yang, 1994] proposed an evaluation function inspired by the qualitative criteria for good image segmentation established by Haralick and Shapiro [Haralick and Shapiro, 1985]$^2$ that requires no user-defined parameters. In [Borsotti et al., 1998] the authors identified some limitations of this evaluation function and proposed two new functions ($F'$ and $Q$).

$^2$ A good segmentation must satisfy two properties:

• A region has to contain only one primitive (a texture or a constant gray level) to guarantee that there is no under-segmentation. Thus, a region is characterized by the stability of the statistics inside it.

• Two adjacent regions have to contain two different primitives to guarantee that there is no over-segmentation. This corresponds to a disparity of statistics between these two regions.


$F$ measures the average squared colour error of the segments, penalising over-segmentation by a weight proportional to the square root of the number of regions. It is user-independent (no parameter has to be fixed by the user) and image-independent (any contents and type of image can be evaluated) [Liu and Yang, 1994]. $F$ uses the sum of the squared colour errors of the regions, each averaged by the square root of the region size:

$$F(I) = \sqrt{L} \sum_{l=1}^{L} \frac{err_F^2(R_l)}{\sqrt{|R_l|}} \quad (50)$$

|Rl| (50) F was proposed to improve F, because F was found to have a bias towards over-segmentation, which is the characteristic of producing many more regions than desired within a single real-world object. SinceF favors segmentations with a large number of small regions, F extendedF by penalizing segmentations that have many small regions of the same size [Borsotti et al., 1998].

F(I) = 1 103|Sc|

vu

utM axArea

a=1

N R1+a1(a)

L l=1

err2F(Rl)

|Rl| (51) where:N R(a)= number of regions in the segmented imageScwith area=a;

Similar to $F$, $F'$ uses the sum of the squared colour errors of the regions, each averaged by the square root of the region size.

$Q$ is an improved version of $F$, since it deals with both the over-segmentation and the under-segmentation problems [Borsotti et al., 1998]:

$$Q(I) = \frac{\sqrt{L}}{10^3 |S_c|} \sum_{l=1}^{L} \left[ \frac{err_F^2(R_l)}{1 + \log |R_l|} + \left( \frac{NR(|R_l|)}{|R_l|} \right)^2 \right] \quad (52)$$

$Q$ uses the sum of the squared colour errors of the regions, each averaged by the logarithm of the region size. In order to deal with very small regions, a constant (equal to 1) is added to this logarithm.
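A sketch of Eqs. 50 and 52 for a grey-level image and an integer label map (function names are ours; following common implementations, the logarithm in $Q$ is taken base 10, which is an assumption):

```python
import numpy as np

def liu_yang_F(image, labels):
    """Liu-Yang evaluation function F (Eq. 50)."""
    image = np.asarray(image, dtype=float)
    regions = np.unique(labels)
    total = sum(np.sum((image[labels == l] - image[labels == l].mean()) ** 2)
                / np.sqrt(np.sum(labels == l)) for l in regions)
    return np.sqrt(regions.size) * total

def borsotti_Q(image, labels):
    """Borsotti et al. evaluation function Q (Eq. 52)."""
    image = np.asarray(image, dtype=float)
    regions, areas = np.unique(labels, return_counts=True)
    nr = dict(zip(*np.unique(areas, return_counts=True)))  # NR(area)
    total = 0.0
    for l, area in zip(regions, areas):
        region = image[labels == l]
        err2 = np.sum((region - region.mean()) ** 2)
        total += err2 / (1.0 + np.log10(area)) + (nr[area] / area) ** 2
    return np.sqrt(regions.size) / (1e3 * image.size) * total
```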

Characteristics:

• Range: not specified

• Optimisation direction: not specified

• Other names: -

• Advantages: not specified

• Disadvantages: not specified

6.1.3 Measures based on texture

Busyness. The evaluation method $Busy$ measures the “busyness” of an image, assuming that a “smoother” image is preferred. The “busyness” of an image is computed either as the sum of the absolute values of the 4- (or 8-) neighbour Laplacians, or from the gray-level co-occurrence matrix of the image. Both metrics actually measure the texture or edges across the whole image, so $Busy$ effectively only measures global texture uniformity, not individual region uniformity. $Busy$ is based on the measure of “busyness” in the image, with the assumption that the ideal objects and background are not strongly textured and have simple compact shapes [Weszka and Rosenfeld, 1978].

Characteristics:

• Range: not specified

• Optimisation direction: not specified
