{"id":1896,"date":"2025-10-26T17:43:16","date_gmt":"2025-10-26T16:43:16","guid":{"rendered":"https:\/\/xeddixx.cluster029.hosting.ovh.net\/?page_id=1896"},"modified":"2025-10-27T09:43:44","modified_gmt":"2025-10-27T08:43:44","slug":"clustering-problems","status":"publish","type":"page","link":"https:\/\/www.francq.info\/index.php\/clustering-problems\/","title":{"rendered":"Clustering Problems"},"content":{"rendered":"\n<h4 class=\"wp-block-heading has-text-align-center\"><strong>Abstract<\/strong><\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">The clustering problems are a class of optimization problems where the goal is to group a set of objects in different groups, each object being assigned in one group only.<\/p>\n\n\n\n<h5 class=\"wp-block-heading\" id=\"contents\">Table of Contents<\/h5>\n\n\n\n<p class=\"refsec1 wp-block-paragraph\"><a href=\"#sec1\" rel=\"nofollow\">1. Description<\/a><\/p>\n\n\n\n<p class=\"refsec1 wp-block-paragraph\"><a href=\"#sec2\" rel=\"nofollow\">2. Constraints<\/a><\/p>\n\n\n\n<p class=\"refsec1 wp-block-paragraph\"><a href=\"#sec3\" rel=\"nofollow\">3. Criteria<\/a><\/p>\n\n\n\n<p class=\"refsec1 wp-block-paragraph\"><a href=\"#sec4\" rel=\"nofollow\">4. Algorithm Evaluation<\/a><\/p>\n\n\n\n<p class=\"refsec2 wp-block-paragraph\"><a href=\"#sec4_1\" rel=\"nofollow\">4.1. Adjusted&nbsp;Rand&nbsp;index<\/a><\/p>\n\n\n\n<p class=\"refsec2 wp-block-paragraph\"><a href=\"#sec4_2\" rel=\"nofollow\">4.2. Precision<\/a><\/p>\n\n\n\n<p class=\"refsec2 wp-block-paragraph\"><a href=\"#sec4_3\" rel=\"nofollow\">4.3. Recall<\/a><\/p>\n\n\n\n<p class=\"refsec1 wp-block-paragraph\"><a href=\"#sec5\" rel=\"nofollow\">5. Related<\/a><\/p>\n\n\n\n<p class=\"refsec1 wp-block-paragraph\"><a href=\"#references\" rel=\"nofollow\">References<\/a><\/p>\n\n\n\n<h5 class=\"wp-block-heading\" id=\"sec1\">1. Description<a href=\"#contents\" rel=\"nofollow\">\u2191<\/a><\/h5>\n\n\n\n<p class=\"wp-block-paragraph\">A clustering problem, sometimes called cluster analysis, is the task to assigning a set of objects into groups (called clusters) according some criteria, each object being assigned in one group only. In general, the criteria is to group similar objects in the same cluster (using some similarity measure), where each cluster can contain as many objects as necessary. But the goal may vary.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For example in the bin packing problem, each cluster can only group a maximal size (giving by the sum of the sizes of the object already group together), and the goal is to use as few cluster as possible.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The rigorous definition of the problem begins with a set&nbsp;\\(\\mathcal{O}\\) of&nbsp;\\(|\\mathcal{O}|\\) objects to place in a set <mathml>\\(G\\)<\/mathml> of \\(|\\mathcal{G}|\\) groups:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\\[\\mathcal{O}=\\{o_{o}\\}\\quad\\quad\\forall\\,o=1,\\cdots,|\\mathcal{O}|\\]\n\n\n\n<p class=\"wp-block-paragraph\">\\[\\mathcal{G}=\\{g_{g}\\}\\quad\\quad\\forall\\,g=1,\\cdots,|\\mathcal{G}|\\]\n\n\n\n<p class=\"wp-block-paragraph\">Depending of the problem, the number of available groups can be infinite (in the bin packing problem for example) or limited (such as when some tasks must be assigned to a fixed number of workers).&nbsp;<\/p>\n\n\n\n<h5 class=\"wp-block-heading\" id=\"sec2\">2. Constraints<a href=\"#contents\" rel=\"nofollow\">\u2191<\/a><\/h5>\n\n\n\n<p class=\"wp-block-paragraph\">The clustering problems introduce several constraints, but it is always possible to define them as a member of one of the three constraints classes, namely the \u201ccapacity constraints\u201d class, the \u201chomogeneity constraints\u201d and the \u201cprofile constraints\u201d classes.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The \u201ccapacity constraints\u201d&nbsp;\\(\\mathcal{C}_{c}\\) constitute the first class. It regroups the constraints related to a given \u201ccapacity\u201d of each group. In the bin packing problem for example, each group has a maximal size. These constraints may also introduce a minimal or maximal number of objects that a group must have. In the case the assignment of tasks to workers, one constraint could be a homogeneous repartition of the task between them.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The \u201chomogeneity constraints\u201d&nbsp;\\(\\mathcal{C}_{h}\\) are the second class. These constraints specify the conditions that must be respected by an object when it is put in an existing group. This may be, for example, a minimal similarity with the objects already assigned.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The \u201cprofile constraints\u201d&nbsp;\\(\\mathcal{C}_{p}\\)&nbsp;are the last class. To obtain a valid solution, the clustering must respect the \u201cwishes\u201d of each object. For example, if the goal is to assign students to work groups, a constraint could be to avoid that a given genre is represented only once in a group.<\/p>\n\n\n\n<h5 class=\"wp-block-heading\" id=\"sec3\">3. Criteria<a href=\"#contents\" rel=\"nofollow\">\u2191<\/a><\/h5>\n\n\n\n<p class=\"wp-block-paragraph\">The criteria that should be optimized for the clustering problem vary from one problem to another. For some problems, the criteria are well defined (for example in the bin packing problem where the number of clusters should be minimized). But, in the case of the document clustering, where the goal is to regroup \u201csimilar\u201d documents in the same topic, the problem is&nbsp;<em>fuzzy:<\/em>&nbsp;is it not easy to express the fact that \u201ctwo documents are similar\u201d. When several criteria must be optimized, the clustering problem is a&nbsp;<a href=\"\/index.php\/optimisation-problems#11\" rel=\"nofollow\">multiple objective one<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Some clustering algorithms work with one type of problems (criteria) only. The k-Means algorithm is designed to minimize the within-cluster sum of squares&nbsp;[<a href=\"#1\" rel=\"nofollow\">1<\/a>]. Other methods, such as the&nbsp;<a href=\"\/index.php\/gga\" rel=\"nofollow\">Grouping Genetic Algorithms (GGA)<\/a>, may solve a whole range of problems.<\/p>\n\n\n\n<h5 class=\"wp-block-heading\" id=\"sec4\">4. Algorithm Evaluation<a href=\"#contents\" rel=\"nofollow\">\u2191<\/a><\/h5>\n\n\n\n<p class=\"wp-block-paragraph\">To evaluate the performance of an algorithm, it is necessary to have an \u201cideal\u201d solution for a given problem instance and some measures to compare the \u201ccomputed\u201d solution provided by the algorithm with this \u201cideal\u201d solution. Two situations exist:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>The quality of a solution depends on a computation based on the clustering. For example, in the bin packing problem, the number of clusters needed to group the objects.<\/li>\n\n\n\n<li>The quality of a solution depends on the particular assignment of the objects.<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\">In the first case, a comparison between the computation done for the \u201ccomputed\u201d and \u201cideal\u201d solution evaluates the performance of the algorithm. For the second situation, we need some measures to compare the \u201ccomputed\u201d assignments to the ideal ones (each object is assigned to an \u201cideal cluster\u201d in the \u201cideal\u201d solution). At least, three measures can be used for this latest comparison:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>the adjusted&nbsp;Rand&nbsp;index,&nbsp;\\(g\\), used in conventional clustering problems;<\/li>\n\n\n\n<li>the precision,&nbsp;\\(P_G\\), adapted from the precision of the information retrieval problems&nbsp;[<a href=\"#2\" rel=\"nofollow\">2<\/a>];<\/li>\n\n\n\n<li>the recall,\u00a0\\(r_G\\), adapted from the recall of the information retrieval problems\u00a0[<a href=\"#2\" rel=\"nofollow\">2<\/a>].<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">The adjusted&nbsp;Rand&nbsp;Index is used to measure the global quality of a solution. When a clustering is not correct, the measures of precision and recall are necessary to determine the reasons. In fact, there is a relation between recall and precision:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>A high value of the recall and a low value of the precision indicate that the resulting clustering has too few clusters: most objects that belongs to the same ideal cluster are grouped together, but are also grouped with objects of other ideal clusters.<\/li>\n\n\n\n<li>A low value of the recall and a high value of the precision indicate that the resulting clustering has too many clusters: most objects that are grouped together belongs to the same ideal cluster, but all the objects of the same ideal cluster are not grouped together.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">A given object,&nbsp;\\(o_o\\), is assigned to a computed cluster,&nbsp;\\(c(o_{o})\\)&nbsp;by the algorithm, and corresponds to a ideal cluster,&nbsp;\\(\\tau(o_{o})\\). The number of objects in the cluster,&nbsp;\\(c_c\\), is denoted&nbsp;\\(|c_{c}|\\). \\(|\\tau(o_{o})|\\)&nbsp;represents the number of objects of \u201cideal\u201d cluster&nbsp;\\(\\tau(o_{o})\\). Moreover,&nbsp;\\(|\\tau(o_{o})\\cap c(o_{o})\\)&nbsp;is the number of objects that are in the same computed and ideal cluster as object&nbsp;\\(o_{o}\\).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To verify whether an object and a computed cluster belong to the same ideal cluster, it is only necessary to verify whether the two ideal cluster sets,&nbsp;\\(\\tau\\), are identical&nbsp;\\(\\tau(o_{o})=\\tau(c_{c})\\)&nbsp;because each one contains one element only.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The measures can be illustrated by means of the example in Table&nbsp;1.<\/p>\n\n\n\n<figure class=\"wp-block-table figtab\"><table class=\"has-fixed-layout\"><tbody><tr><td>Ideal Clustering<\/td><td>Computed Clustering<\/td><\/tr><tr><td>\\(\\{o_{1},o_{2},o_{3}\\}\\in t_{1}\\\\\\{o_{4},o_{5},o_{6}\\}\\in t_{2}\\\\\\{o_{7},o_{8},o_{9}\\}\\in t_{3}\\)<\/td><td>\\(c_{1}=\\{o_{1},o_{2}\\}\\\\c_{2}=\\{o_{3},o_{4},o_{5},o_{6}\\}\\\\c_{3}=\\{o_{7},o_{8}\\}\\\\c_{4}=\\{o_{9}\\}\\)<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"has-text-align-center legende wp-block-paragraph\">Table 1. Example of an ideal and a computed clustering.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Notice that, while there are three ideal cluster, the computed solution provides four clusters in this example.<\/p>\n\n\n\n<h6 class=\"wp-block-heading\" id=\"sec4_1\">4.1. Adjusted&nbsp;Rand&nbsp;index<a href=\"#contents\" rel=\"nofollow\">\u2191<\/a><\/h6>\n\n\n\n<p class=\"wp-block-paragraph\">The aim of the adjusted&nbsp;Rand&nbsp;index is to establish an overall comparison between the computed and the ideal clustering. It is based on the&nbsp;Rand&nbsp;index&nbsp;[<a href=\"#3\" rel=\"nofollow\">3<\/a>], where a comparison is made between the assignments of each pair of objects in the ideal and the computed clustering. Roughly speaking, the&nbsp;Rand&nbsp;index computes the percentage of pairs of objects for which both classification methods, the computed and the ideal one, agree.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Let&nbsp;\\(a\\)&nbsp;be the number of pairs of objects in the same topic and the same computed cluster, let&nbsp;\\(b\\)&nbsp;be the number of pairs of objects in the same ideal cluster but not in the same computed cluster, let&nbsp;\\(c\\)&nbsp;be the number of pairs of objects in the same computed cluster but not in the same ideal cluster, and let&nbsp;\\(d\\)&nbsp;be the number of pairs of objects in different computed clusters and in different ideal clusters. For the example in Table&nbsp;1:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\\begin{array}{ccl}<br>a=5&amp;&amp;\\{(o_{1},o_{2}),(o_{4},o_{5}),(o_{4},o_{6}),(o_{6},o_{5}),(o_{7},o_{8})\\}\\\\b=4&amp;&amp;\\{(o_{1},o_{3}),(o_{2},o_{3}),(o_{7},o_{9}),(o_{8},o_{9})\\}\\\\c=3&amp;&amp;\\{(o_{3},o_{4}),(o_{3},o_{5}),(o_{3},o_{6})\\}\\\\d=24&amp;&amp;\\{(o_{1},o_{4}),(o_{1},o_{5}),(o_{1},o_{6}),(o_{1},o_{7}),(o_{1},o_{8}),(o_{1},o_{9}),(o_{2},o_{4}),(o_{2},o_{5}),(o_{2},o_{6}),(o_{2},o_{7}),(o_{2},o_{8}),(o_{2},o_{9}),\\\\&amp;&amp;(o_{3},o_{7}),(o_{3},o_{8}),(o_{3},o_{9}),(o_{4},o_{7}),(o_{4},o_{8}),(o_{4},o_{9}),(o_{5},o_{7}),(o_{5},o_{8}),(o_{5},o_{9}),(o_{6},o_{7}),(o_{6},o_{8}),(o_{6},o_{9})\\}<br>\\end{array}<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The\u00a0Rand\u00a0index\u00a0[<a href=\"#3\" rel=\"nofollow\">3<\/a>]\u00a0is defined by:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\\begin{array}{ccl}<br>g_{\\mathrm{Rand}}&amp;=&amp;\\frac{a+d}{a+b+c+d}\\\\&amp;=&amp;\\frac{5+24}{5+4+3+24}=0.81<br>\\end{array}<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">A problem with the&nbsp;Rand&nbsp;index is that two randomly computed clustering have not a constant index, for example zero.&nbsp;Hubert&nbsp;and&nbsp;Arabie&nbsp;therefore introduce the adjusted&nbsp;Rand&nbsp;index&nbsp;[<a href=\"#4\" rel=\"nofollow\">4<\/a>], which is based on the assumption that the process is the generalized hypergeometric distribution, i.e. the ideal and computed clustering are selected at random so that the number of objects in both clustering is fixed. Table&nbsp;2&nbsp;introduces some notations for the computation where&nbsp;\\(n_{ij}\\)&nbsp;represents the number of objects of an ideal cluster,&nbsp;\\(t_i\\), to be grouped into a computed cluster,&nbsp;\\(c_j\\).<\/p>\n\n\n\n<figure class=\"wp-block-table figtab\"><table class=\"has-fixed-layout\"><tbody><tr><td>Ideal \\ Computed<\/td><td>\\(c_1\\)<\/td><td>\\(c_2\\)<\/td><td>\u22ef<\/td><td>\\(c_{|\\mathcal{C}|}\\)<\/td><td>Sums<\/td><\/tr><tr><td>\\(t_1\\)<\/td><td>\\(n_{11}\\)<\/td><td>\\(n_{12}\\)<\/td><td>\u22ef<\/td><td>\\(\\)<\/td><td>\\(n_{1.}\\)<\/td><\/tr><tr><td>\\(t_2\\)<\/td><td>\\(n_{21}\\)<\/td><td>\\(n_{22}\\)<\/td><td>\u22ef<\/td><td>\\(\\)<\/td><td>\\(n_{2.}\\)<\/td><\/tr><tr><td>\u22ee<\/td><td>\u22ee<\/td><td>\u22ee<\/td><td><\/td><td>\u22ee<\/td><td>\u22ee<\/td><\/tr><tr><td>\\(t_{|\\mathcal{T}|}\\)<\/td><td>\\(n_{|\\mathcal{T}|1}\\)<\/td><td>\\(n_{|\\mathcal{T}|2}\\)<\/td><td>\u22ef<\/td><td>\\(n_{|\\mathcal{T}||\\mathcal{C}|}\\)<\/td><td>\\(n_{|\\mathcal{T}|.}\\)<\/td><\/tr><tr><td>Sums<\/td><td>\\(n_{.1}\\)<\/td><td>\\(n_{.2}\\)<\/td><td>\u22ef<\/td><td>\\(n_{.|\\mathcal{C}|}\\)<\/td><td>\\(n\\)<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"has-text-align-center legende wp-block-paragraph\">Table 2. Notation for comparing two clustering.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The general form of the adjusted&nbsp;Rand&nbsp;index is:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\\[\\frac{\\mathrm{index}-\\mathrm{expected}\\,\\mathrm{index}}{\\max\\,\\mathrm{index}-\\mathrm{expected}\\,\\mathrm{index}}\\]\n\n\n\n<p class=\"wp-block-paragraph\">It has been show that the adjusted&nbsp;Rand&nbsp;index can be written&nbsp;[<a href=\"#4\" rel=\"nofollow\">4<\/a>]:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\\[g=\\frac{{\\displaystyle \\sum_{i,j}}\\left(\\begin{array}{c} n_{ij}\\\\ 2 \\end{array}\\right)-\\frac{{\\displaystyle \\sum_{i}}\\left(\\begin{array}{c} n_{i.}\\\\ 2 \\end{array}\\right){\\displaystyle \\sum_{j}}\\left(\\begin{array}{c} n_{.j}\\\\ 2 \\end{array}\\right)}{\\left(\\begin{array}{c} n\\\\ 2 \\end{array}\\right)}}{\\frac{{\\displaystyle \\sum_{i}}\\left(\\begin{array}{c} n_{i.}\\\\ 2 \\end{array}\\right)+{\\displaystyle \\sum_{j}}\\left(\\begin{array}{c} n_{.j}\\\\ 2 \\end{array}\\right)}{2}-\\frac{{\\displaystyle \\sum_{i}}\\left(\\begin{array}{c} n_{i.}\\\\ 2 \\end{array}\\right){\\displaystyle \\sum_{j}}\\left(\\begin{array}{c} n_{.j}\\\\ 2 \\end{array}\\right)}{\\left(\\begin{array}{c} n\\\\ 2 \\end{array}\\right)}}\\]\n\n\n\n<p class=\"wp-block-paragraph\">Let us illustrate the adjusted&nbsp;Rand&nbsp;index on the example shown in Table&nbsp;1. Table&nbsp;3&nbsp;shows the different values of the elements defined in Table&nbsp;2.<\/p>\n\n\n\n<figure class=\"wp-block-table figtab\"><table class=\"has-fixed-layout\"><tbody><tr><td>Ideal \\ Computed<\/td><td>\\(c_1\\)<\/td><td>\\(c_2\\)<\/td><td>\\(c_3\\)<\/td><td>\\(c_4\\)<\/td><td>Sums<\/td><\/tr><tr><td>\\(t_{1}\\)<\/td><td>2<\/td><td>1<\/td><td>0<\/td><td>0<\/td><td>3<\/td><\/tr><tr><td>\\(t_{2}\\)<\/td><td>0<\/td><td>3<\/td><td>0<\/td><td>0<\/td><td>3<\/td><\/tr><tr><td>\\(t_{3}\\)<\/td><td>0<\/td><td>0<\/td><td>2<\/td><td>1<\/td><td>3<\/td><\/tr><tr><td>Sums<\/td><td>2<\/td><td>4<\/td><td>2<\/td><td>1<\/td><td>9<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"has-text-align-center legende wp-block-paragraph\">Table 3. Adjusted&nbsp;Rand&nbsp;index computed on the example.&nbsp;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Let us compute the adjusted&nbsp;Rand&nbsp;index:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\\[g=\\frac{5-\\frac{9\\cdot8}{36}}{\\frac{9+8}{2}-\\frac{9\\cdot8}{36}}=0.46\\]\n\n\n\n<p class=\"wp-block-paragraph\">We can see that the&nbsp;Rand&nbsp;index is greater than the adjusted&nbsp;Rand&nbsp;index. In fact, the range of values is greater for the adjusted&nbsp;Rand&nbsp;index (\\([-1,+1]\\)) than for the&nbsp;Rand&nbsp;index (\\([0,+1]\\)), which makes it a better measure. Many different clustering measures were studied in&nbsp;[<a href=\"#5\" rel=\"nofollow\">5<\/a>], and the recommendation is to use adjusted&nbsp;Rand&nbsp;index.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Let us consider now the examples shown in Table&nbsp;4. The two computed clustering are different but have the same quality: all the objects are correctly grouped except object&nbsp;o6&nbsp;which is put either into&nbsp;c1&nbsp;(clustering n\u00b01) or in&nbsp;c2&nbsp;(clustering n\u00b02) rather than to remain alone.<\/p>\n\n\n\n<figure class=\"wp-block-table figtab\"><table class=\"has-fixed-layout\"><tbody><tr><td>Ideal Clustering<\/td><td>Computed Clustering n\u00b01<\/td><td>Computed Clustering n\u00b02<\/td><\/tr><tr><td>\\(\\begin{array}{c} \\{o_{1},o_{2},o_{3}\\}\\in t_{1}\\\\ \\{o_{4},o_{5}\\}\\in t_{2}\\\\ \\{o_{6}\\}\\in t_{3} \\end{array}\\)<\/td><td>\\(\\begin{array}{l} c_{1}=\\{o_{1},o_{2},o_{3},o_{6}\\}\\\\ c_{2}=\\{o_{4},o_{5}\\} \\end{array}\\)<\/td><td>\\(\\begin{array}{l} c_{1}=\\{o_{1},o_{2},o_{3}\\}\\\\ c_{2}=\\{o_{4},o_{5},o_{6}\\} \\end{array}\\)<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"has-text-align-center legende wp-block-paragraph\">Table 4. Another example.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The adjusted&nbsp;Rand&nbsp;Indexes,&nbsp;\\(g_1\\)&nbsp;and&nbsp;\\(g_2\\), for both computed clustering can be computed:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\\begin{array}{ccl}g_{1}&amp;=&amp;0.587\\\\g_{2}&amp;=&amp;0.706\\end{array}<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">It can be seen that the two values are not equal also if the real quality of both computed clustering is identical. This is called the&nbsp;<em>disturbed measurement problem<\/em>. This means that a little variation of the adjusted&nbsp;Rand&nbsp;Index between two clustering does not automatically implies that one of the solution is better than the other one.<\/p>\n\n\n\n<h6 class=\"wp-block-heading\" id=\"sec4_2\">4.2. Precision<a href=\"#contents\" rel=\"nofollow\">\u2191<\/a><\/h6>\n\n\n\n<p class=\"wp-block-paragraph\">This measure is derived from the concept of precision in the information retrieval systems&nbsp;[<a href=\"#2\" rel=\"nofollow\">2<\/a>]. The precision of the clustering is the average of the precisions of each object. The precision,&nbsp;\\(P_{G}(o_{o})\\), of a given object,&nbsp;\\(o_o\\), is the fraction of the other objects in its computed cluster that are issued from the same ideal cluster:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\\[P_{G}(o_{o})=\\left\\{ \\begin{array}{ll} \\frac{|\\mathcal{\\mathcal{S}}_{\\tau(o_{o})}\\cap c(o_{o})|-1}{|c(o_{o})|-1} &amp; |c(o_{o})|&gt;1\\\\ 1 &amp; |c(o_{o})|=1 \\end{array}\\right.\\]\n\n\n\n<p class=\"wp-block-paragraph\">where&nbsp;\\(\\mathcal{\\mathcal{S}}_{\\tau(o_{o})}\\) represents the set of objects attached to the same ideal cluster than object&nbsp;\\(o_o\\).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Let us consider the example in Table&nbsp;1. The precision&nbsp;\\(P_{G}(o_{1})=1\\)&nbsp;because the object,&nbsp;\\(o_2\\), in the computed cluster of&nbsp;\\(o_1\\)&nbsp;is also in the same ideal cluster. The precision&nbsp;\\(P_{G}(o_{3})=0\\)&nbsp;because none of the objects of computed cluster of&nbsp;\\(o_3\\)&nbsp;are in the same ideal cluster. The precision&nbsp;\\(P_{G}(o_{4})=\\frac{2}{3}\\)&nbsp;because only two objects (\\(o_5\\)&nbsp;and&nbsp;\\(o_6\\)) of computed cluster of&nbsp;\\(o_4\\)&nbsp;are also in the same topic. After computing the precision of each object, the precision of the clustering,&nbsp;\\(P_{G}\\), is the average of the precisions of the objects:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"{\">\\begin{array}{ccl}P_{G}&amp;=&amp;\\frac{1}{9}\\sum_{o=1}^{9}P_{G}(o_{o})\\\\&amp;=&amp;\\frac{1}{9}\\cdot[1+1+0+\\frac{2}{3}+\\frac{2}{3}+\\frac{2}{3}+1+1+1]\\\\&amp;=&amp;0.78\\end{array}<\/p>\n\n\n\n<h6 class=\"wp-block-heading\" id=\"sec4_3\">4.3. Recall<a href=\"#contents\" rel=\"nofollow\">\u2191<\/a><\/h6>\n\n\n\n<p class=\"wp-block-paragraph\">This measure is derived from the concept of recall in the information retrieval systems&nbsp;[<a href=\"#2\" rel=\"nofollow\">2<\/a>]. The clustering recall is the average of the recalls of each object. The recall,&nbsp;\\(r_{G}(o_{o})\\), of a given object,&nbsp;\\(o_o\\), is the fraction of the other objects of the same ideal cluster that were grouped into the same computed cluster:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\\[r_{G}(s_{k})=\\left\\{ \\begin{array}{ll} \\frac{|\\mathcal{\\mathcal{S}}_{\\tau(o_{o})}\\cap c(o_{o})|-1}{|\\mathcal{\\mathcal{S}}_{\\tau(o_{o})}|-1} &amp; |\\tau(o_{o})|&gt;1\\\\ 1 &amp; |\\tau(o_{o})|=1 \\end{array}\\right.\\]\n\n\n\n<p class=\"wp-block-paragraph\">where \\(\\mathcal{\\mathcal{S}}_{\\tau(o_{o})}\\) represents the set of objects attached to the same ideal cluster than object&nbsp;\\(o_o\\).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Let us again consider the example in Table&nbsp;1. Recall&nbsp;\\(r_{G}(o_{1})=\\frac{1}{2}\\)&nbsp;because only one object,&nbsp;\\(o_2\\), of the same ideal cluster as&nbsp;\\(o_1\\)&nbsp;is also in the same computed cluster. The recall&nbsp;\\(r_{G}(o_{3})=0\\) because none of the objects of the same ideal cluster as&nbsp;\\(o_3\\)&nbsp;(\\(o_1\\)&nbsp;and&nbsp;\\(o_2\\)) are in the same computed cluster. The recall&nbsp;\\(r_{G}(o_{4})=1\\)&nbsp;because all the objects of the same ideal cluster as&nbsp;\\(o_4\\)&nbsp;(\\(o_5\\)&nbsp;and&nbsp;\\(o_6\\)) are also in the same virtual community. After computing the recall of each object, the recall of the clustering,&nbsp;\\(r_G\\), is the average of the recalls of the objects:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\\begin{array}{ccl}r_{G}&amp;=&amp;\\frac{1}{9}\\sum_{o=1}^{9}r_{G}(o_{o})\\\\&amp;=&amp;\\frac{1}{9}\\cdot[\\frac{1}{2}+\\frac{1}{2}+0+1+1+1+\\frac{1}{2}+\\frac{1}{2}+0]\\\\&amp;=&amp;0.56\\end{array}<\/p>\n\n\n\n<h5 class=\"wp-block-heading\" id=\"sec5\">5. Related<a href=\"#contents\" rel=\"nofollow\">\u2191<\/a><\/h5>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"\/index.php\/gga\" rel=\"nofollow\">Grouping Genetic Algorithms<\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"\/index.php\/sgga\" rel=\"nofollow\">Similarity-based Grouping Genetic Algorithm<\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"\/index.php\/nngga\" rel=\"nofollow\">Nearest Neighbors Grouping Genetic Algorithm<\/a><\/p>\n\n\n\n<h5 class=\"wp-block-heading\" id=\"references\">References<a href=\"#contents\" rel=\"nofollow\">\u2191<\/a><\/h5>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"1\">[1]&nbsp;James B. MacQueen, \u201cSome Methods for Classification and Analysis of Multivariate Observations\u201d, In&nbsp;<em>Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability<\/em>,&nbsp;<em>Volume 1: Statistics<\/em>, pp. 281\u2014297, 1967.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"2\">[2]&nbsp;Ricardo Baeza-Yates&nbsp;&amp;&nbsp;Berthier Ribeiro-Neto,&nbsp;<em>Modern Information Retrieval: The Concepts and Technology behind Search<\/em>, Addison-Wesley, 2011.&nbsp;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"3\">[3]&nbsp;William M. Rand, \u201cObjective criteria for the evaluation of clustering methods\u201d,&nbsp;<em>Journal of the American Statistical Association<\/em>, 66(336), pp. 846\u2014850, 1971.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"4\">[4]&nbsp;Lawrence Hubert&nbsp;&amp;&nbsp;Phipps Arabie, \u201cComparing partitions\u201d,&nbsp;<em>Journal of Classification<\/em>, 2(1), pp. 193\u2014218, 1985.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"5\">[5]&nbsp;Glenn W. Milligan&nbsp;&amp;&nbsp;Martha C. Cooper, \u201cA study of the comparability of external criteria for hierarchical cluster analysis\u201d,&nbsp;<em>Multivariate Behavioral Research<\/em>, 21(4), pp. 441\u2014458, 1986.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Abstract The clustering problems are a class of optimization problems where the goal is to group a set of objects in different groups, each object being assigned in one group only. Table of Contents 1. Description 2. Constraints 3. Criteria 4. Algorithm Evaluation 4.1. Adjusted&nbsp;Rand&nbsp;index 4.2. Precision 4.3. Recall 5. Related References 1. Description\u2191 A [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-1896","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/www.francq.info\/index.php\/wp-json\/wp\/v2\/pages\/1896","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.francq.info\/index.php\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/www.francq.info\/index.php\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/www.francq.info\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.francq.info\/index.php\/wp-json\/wp\/v2\/comments?post=1896"}],"version-history":[{"count":55,"href":"https:\/\/www.francq.info\/index.php\/wp-json\/wp\/v2\/pages\/1896\/revisions"}],"predecessor-version":[{"id":2227,"href":"https:\/\/www.francq.info\/index.php\/wp-json\/wp\/v2\/pages\/1896\/revisions\/2227"}],"wp:attachment":[{"href":"https:\/\/www.francq.info\/index.php\/wp-json\/wp\/v2\/media?parent=1896"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}