{"id":2116,"date":"2025-10-26T17:44:21","date_gmt":"2025-10-26T16:44:21","guid":{"rendered":"https:\/\/xeddixx.cluster029.hosting.ovh.net\/?page_id=2116"},"modified":"2025-10-28T09:49:32","modified_gmt":"2025-10-28T08:49:32","slug":"similarity-measure","status":"publish","type":"page","link":"https:\/\/www.francq.info\/index.php\/similarity-measure\/","title":{"rendered":"Similarity Measure"},"content":{"rendered":"\n<h4 class=\"wp-block-heading has-text-align-center\">Abstract<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">Several algorithms dedicated to information-science problems (for example document clustering) need a similarity measure. This article presents such a measure for the\u00a0<a href=\"\/index.php\/tensor-space-model\/\" data-type=\"page\" data-id=\"2182\" rel=\"nofollow\">tensor space model<\/a>\u00a0which takes the different concept categories into account.<\/p>\n\n\n\n<h5 class=\"wp-block-heading\" id=\"contents\">Table of Contents<\/h5>\n\n\n\n<p class=\"refsec1 wp-block-paragraph\"><a href=\"#sec1\" rel=\"nofollow\">1. Introduction<\/a><\/p>\n\n\n\n<p class=\"refsec1 wp-block-paragraph\"><a href=\"#sec2\" rel=\"nofollow\">2. Concept Categories: A Similarity Perspective<\/a><\/p>\n\n\n\n<p class=\"refsec2 wp-block-paragraph\"><a href=\"#sec2_1\" rel=\"nofollow\">2.1. Token Concepts<\/a><\/p>\n\n\n\n<p class=\"refsec2 wp-block-paragraph\"><a href=\"#sec2_2\" rel=\"nofollow\">2.2. Metadata Concepts<\/a><\/p>\n\n\n\n<p class=\"refsec2 wp-block-paragraph\"><a href=\"#sec2_3\" rel=\"nofollow\">2.3. Structure Concepts<\/a><\/p>\n\n\n\n<p class=\"refsec2 wp-block-paragraph\"><a href=\"#sec2_4\" rel=\"nofollow\">2.4. Link Concepts<\/a><\/p>\n\n\n\n<p class=\"refsec1 wp-block-paragraph\"><a href=\"#sec3\" rel=\"nofollow\">3. Some Vector Similarity Measures<\/a><\/p>\n\n\n\n<p class=\"refsec2 wp-block-paragraph\"><a href=\"#sec3_1\" rel=\"nofollow\">3.1. 
Classical Cosine Similarity<\/a><\/p>\n\n\n\n<p class=\"refsec2 wp-block-paragraph\"><a href=\"#sec3_2\" rel=\"nofollow\">3.2. Adapted Cosine Similarity<\/a><\/p>\n\n\n\n<p class=\"refsec2 wp-block-paragraph\"><a href=\"#sec3_3\" rel=\"nofollow\">3.3. Overlap Similarity<\/a><\/p>\n\n\n\n<p class=\"refsec1 wp-block-paragraph\"><a href=\"#sec4\" rel=\"nofollow\">4. Tensor Space Model Similarity<\/a><\/p>\n\n\n\n<p class=\"refsec2 wp-block-paragraph\"><a href=\"#sec4_1\" rel=\"nofollow\">4.1. Token Similarity<\/a><\/p>\n\n\n\n<p class=\"refsec2 wp-block-paragraph\"><a href=\"#sec4_2\" rel=\"nofollow\">4.2. Metadata Similarity<\/a><\/p>\n\n\n\n<p class=\"refsec2 wp-block-paragraph\"><a href=\"#sec4_3\" rel=\"nofollow\">4.3. Structure Similarity<\/a><\/p>\n\n\n\n<p class=\"refsec2 wp-block-paragraph\"><a href=\"#sec4_4\" rel=\"nofollow\">4.4. Link Similarity<\/a><\/p>\n\n\n\n<p class=\"refsec2 wp-block-paragraph\"><a href=\"#sec4_5\" rel=\"nofollow\">4.5. Similarity Aggregation<\/a><\/p>\n\n\n\n<p class=\"refsec1 wp-block-paragraph\"><a href=\"#references\" rel=\"nofollow\">References<\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"#notes\" rel=\"nofollow\">Notes<\/a><\/p>\n\n\n\n<h5 class=\"wp-block-heading\" id=\"sec1\">1. 
Introduction<a href=\"#contents\" rel=\"nofollow\">\u2191<\/a><\/h5>\n\n\n\n<p class=\"wp-block-paragraph\">Several tasks tackled in the\u00a0<a href=\"\/index.php\/galilei\" rel=\"nofollow\">GALILEI framework<\/a>\u00a0need a similarity measure to compare two objects, including:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The\u00a0<a href=\"\/index.php\/clustering-problems\/\" data-type=\"page\" data-id=\"1896\" rel=\"nofollow\">document and profile clustering<\/a>.<\/li>\n\n\n\n<li>The validation processes for the\u00a0<a href=\"\/index.php\/profile-descriptions\" rel=\"nofollow\">profile computing methods<\/a>\u00a0and for the\u00a0<a href=\"\/index.php\/group-descriptions\" rel=\"nofollow\">topic and community of interests description computing methods<\/a>.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">A similarity measure,&nbsp;\\(sim(o_{z},o_{v})\\in[0,1]\\), is a function that compares two objects,&nbsp;\\(o_z\\)&nbsp;and&nbsp;\\(o_v\\), such that it returns 0 if the objects have nothing in common and 1 if they are identical.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In the\u00a0<a 
href=\"\/index.php\/tensor-space-model\/\" data-type=\"page\" data-id=\"2182\" rel=\"nofollow\">tensor space model<\/a>, each object,\u00a0\\(o_v\\), is described by a tensor, \\(\\hat{o}_{v}=[o_{ij,v}]\\),\u00a0where\u00a0\\(o_{ij,v}\\) represents the weight of the concept\u00a0\\(c_j\\)\u00a0for a vector associated with the meta-concept\u00a0\\(c_i\\).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In the case of\u00a0<a href=\"\/index.php\/document-descriptions\" rel=\"nofollow\">document descriptions<\/a>, all these weights are greater than or equal to\u00a00. But for the\u00a0<a href=\"\/index.php\/profile-descriptions\" rel=\"nofollow\">profile descriptions<\/a> and the\u00a0<a href=\"\/index.php\/group-descriptions\" rel=\"nofollow\">topic and community of interests descriptions<\/a>, which are computed with some linear combination of document descriptions, some objects may be represented by tensors with negative weights. The similarity measure must therefore take these negative weights into account.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The discussion below, which leads to the proposed similarity measure, is based on document descriptions. But, because negative weights are handled, this similarity measure can be used to compute similarities between profiles, between topics and between communities of interests, or between pairs of these objects (for example to compute the similarity between a document and a community of interests).<\/p>\n\n\n\n<h5 class=\"wp-block-heading\" id=\"sec2\">2. Concept Categories: A Similarity Perspective<a href=\"#contents\" rel=\"nofollow\">\u2191<\/a><\/h5>\n\n\n\n<p class=\"wp-block-paragraph\">In the\u00a0<a href=\"\/index.php\/tensor-space-model\/\" data-type=\"page\" data-id=\"2182\" rel=\"nofollow\">tensor space model<\/a>, each concept is associated with a given type and a given category. The categories represent the \u201cnature\u201d of the concepts (token, metadata, structure and link), i.e. 
different concepts from the same category have the same \u201cnature\u201d. Let us consider them with regard to the similarity between the corresponding objects (such as documents).<\/p>\n\n\n\n<h6 class=\"wp-block-heading\" id=\"sec2_1\">2.1. Token Concepts<a href=\"#contents\" rel=\"nofollow\">\u2191<\/a><\/h6>\n\n\n\n<p class=\"wp-block-paragraph\">Using token concepts (mostly index terms) to compare two documents is the basis of the classical vector space model. The underlying hypothesis is simple: two documents sharing some tokens that don\u2019t appear very often in the whole corpus may deal with related topics, i.e. they are similar\u00a0[<a href=\"#1\" rel=\"nofollow\">1<\/a>]. Even if we assume\u00a0<a href=\"\/index.php\/tensor-space-model#sec2_2\" data-type=\"page\" data-id=\"2182\" rel=\"nofollow\">token independence<\/a>, the appearance of the same token in two objects (such as documents or profiles) tells us something about the similarity between these objects.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Since document content may be in several languages, one approach consists in treating the languages (English, French, etc.) and the language-independent&nbsp;<em>meaningful entities<\/em>&nbsp;(such as names, cities, organizations, etc.) differently. Another approach considers all terms, whatever their language, as one single subspace (or concept type). The latter choice is motivated by simplicity and by the fact that most terms written similarly in different languages have the same meaning<sup data-fn=\"8c80cb39-2b5e-41c9-8049-55fb3c608e1d\" class=\"fn\"><a href=\"#8c80cb39-2b5e-41c9-8049-55fb3c608e1d\" id=\"8c80cb39-2b5e-41c9-8049-55fb3c608e1d-link\">1<\/a><\/sup>.<\/p>\n\n\n\n<h6 class=\"wp-block-heading\" id=\"sec2_2\">2.2. 
Metadata Concepts<a href=\"#contents\" rel=\"nofollow\">\u2191<\/a><\/h6>\n\n\n\n<p class=\"wp-block-paragraph\">A metadata element (often described as \u201cdata about data\u201d) is supposed to convey descriptive information about an object (such as a document). Examples of metadata include an author, a title, a place, etc. Computer programs can see a metadata element as a sort of index of a given object (or of a given part of it). Here is an example of an XML fragment with three metadata elements:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">&lt;dc:creator&gt; Pascal Francq &lt;\/dc:creator&gt;<br>&lt;dc:description lang='en'&gt; Vector &lt;\/dc:description&gt;<br>&lt;dc:description lang='fr'&gt; Vecteur &lt;\/dc:description&gt;<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">From the similarity point of view, objects (such as documents) sharing similar metadata are probably highly related (for example if two documents are written by the same authors). It is particularly useful to treat the metadata separately (and not as usual text content) since there are standards defining them, which simplifies the comparison between objects. The Dublin Core MetaData Initiative (DCMI) is probably the best-known effort&nbsp;[<a href=\"#2\" rel=\"nofollow\">2<\/a>].<\/p>\n\n\n\n<h6 class=\"wp-block-heading\" id=\"sec2_3\">2.3. Structure Concepts<a href=\"#contents\" rel=\"nofollow\">\u2191<\/a><\/h6>\n\n\n\n<p class=\"wp-block-paragraph\">Since knowledge is stored, some structural rules are necessary to organize its digital form (file format, database schema, etc.). Even the simple text file proposes some \u201cspecial characters\u201d to separate paragraphs. In fact, structural rules emerge as soon as knowledge is written down. 
Punctuation marks and divisions of documents (such as chapters) are structural rules.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">At first glance, we may identify three categories of structural rules.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">First, the rules that structure and organize the content (mostly written language). Elements such as punctuation marks and divisions come immediately to mind. But other examples exist, such as most tags defined by the DocBook standard&nbsp;[<a href=\"#3\" rel=\"nofollow\">3<\/a>]:<\/p>\n\n\n\n<p class=\"algo wp-block-paragraph\">&lt;chapter&gt;<\/p>\n\n\n\n<p class=\"algo2 wp-block-paragraph\" id=\"algo2\">&lt;title&gt;Bourgeois and Proletarians&lt;\/title&gt;<\/p>\n\n\n\n<p class=\"algo2 wp-block-paragraph\">&lt;para&gt; The history of all hitherto existing society is the history of class struggles. &lt;\/para&gt;<\/p>\n\n\n\n<p class=\"algo wp-block-paragraph\">&lt;\/chapter&gt;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Second, the rules that specify how the content should be presented (on screen or on paper). Most tags defined by the well-known HTML standard illustrate this kind of rule:<\/p>\n\n\n\n<p class=\"algo wp-block-paragraph\">&lt;html&gt;<\/p>\n\n\n\n<p class=\"algo2 wp-block-paragraph\">&lt;body&gt;<\/p>\n\n\n\n<p class=\"algo3 wp-block-paragraph\">This is a sentence with a word in &lt;i&gt;italic&lt;\/i&gt; and a word in &lt;b&gt;bold&lt;\/b&gt;.<\/p>\n\n\n\n<p class=\"algo2 wp-block-paragraph\">&lt;\/body&gt;<\/p>\n\n\n\n<p class=\"algo wp-block-paragraph\">&lt;\/html&gt;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Third, the rules that provide some semantic information on the content. In particular, the XML technologies use tags to convey information to make content machine-readable. 
Here is an example of an XML document:<\/p>\n\n\n\n<p class=\"algo wp-block-paragraph\">&lt;?xml version=\"1.0\" standalone=\"no\"?&gt;<\/p>\n\n\n\n<p class=\"algo wp-block-paragraph\">&lt;discography xmlns=\"http:\/\/music.org\/\"&gt; <\/p>\n\n\n\n<p class=\"algo2 wp-block-paragraph\">&lt;group groupID=\"Deep Purple\"&gt; <\/p>\n\n\n\n<p class=\"algo3 wp-block-paragraph\">&lt;groupName&gt; Deep Purple &lt;\/groupName&gt; <\/p>\n\n\n\n<p class=\"algo3 wp-block-paragraph\">&lt;album&gt; <\/p>\n\n\n\n<p class=\"algo4 wp-block-paragraph\">&lt;albumName&gt; In Rock &lt;\/albumName&gt;<\/p>\n\n\n\n<p class=\"algo5 wp-block-paragraph\"> &lt;published&gt; 1970 &lt;\/published&gt; <\/p>\n\n\n\n<p class=\"algo3 wp-block-paragraph\">&lt;\/album&gt; <\/p>\n\n\n\n<p class=\"algo2 wp-block-paragraph\">&lt;\/group&gt; <\/p>\n\n\n\n<p class=\"algo wp-block-paragraph\">&lt;\/discography&gt;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">A particular semantic rule set is called a semantic model, and its elements can be seen as semantic concepts (the tags \u201cgroup\u201d, \u201cgroupName\u201d, etc., in the above example). With regard to the similarity between objects, semantic models can be useful. We may suppose that objects (such as documents) that share the same structure concepts are related to the same kind of topics. Indeed, the XML standard encourages the reuse of common semantic models, called XML schemas, in different documents to describe the same kind of content.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Of course, not every structure rule conveys useful information. For example, presentation-related rules (such as HTML tags) are irrelevant regarding the similarity between two objects. It can therefore be interesting to specify which structure rules constitute useful structure concepts.<\/p>\n\n\n\n<h6 class=\"wp-block-heading\" id=\"sec2_4\">2.4. 
Link Concepts<a href=\"#contents\" rel=\"nofollow\">\u2191<\/a><\/h6>\n\n\n\n<p class=\"wp-block-paragraph\">As explained elsewhere, a link from an object&nbsp;\\(o_z\\)&nbsp;to another object&nbsp;\\(o_v\\)&nbsp;(for example a hyperlink from one Web page to another) may be seen as&nbsp;\\(o_z\\)&nbsp;granting a certain authority to&nbsp;\\(o_v\\). Taking this idea further, one may suppose that two documents containing identical links are probably related. For example, two academic papers citing the same reference books are probably related to the same scientific field.<\/p>\n\n\n\n<p class=\"has-text-align-center figtab wp-block-paragraph\"><img loading=\"lazy\" decoding=\"async\" width=\"300\" height=\"120\" class=\"wp-image-2137\" style=\"width: 300px;\" src=\"http:\/\/xeddixx.cluster029.hosting.ovh.net\/wp-content\/uploads\/2025\/10\/simlink.png\" alt=\"\" srcset=\"https:\/\/www.francq.info\/wp-content\/uploads\/2025\/10\/simlink.png 405w, https:\/\/www.francq.info\/wp-content\/uploads\/2025\/10\/simlink-300x120.png 300w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/p>\n\n\n\n<p class=\"has-text-align-center legende wp-block-paragraph\">Figure 1. Example of a link structure.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The objects and the links form a graph. Let us suppose that we have eleven academic papers forming the graph given in Figure 1. As just discussed, we can suppose that&nbsp;\\(d_1\\),&nbsp;\\(d_2\\),&nbsp;\\(d_3\\)&nbsp;and&nbsp;\\(d_4\\)&nbsp;are probably similar since they share a common set of references (\\(d_5\\),&nbsp;\\(d_6\\),&nbsp;\\(d_7\\),&nbsp;\\(d_8\\),&nbsp;\\(d_9\\)&nbsp;and&nbsp;\\(d_{10}\\)). But what can be said concerning the similarity between these four documents and&nbsp;\\(d_{11}\\)? 
Is the fact that&nbsp;\\(d_{11}\\)&nbsp;cites&nbsp;\\(d_{10}\\), which is also cited by documents cited by documents cited by&nbsp;\\(d_1\\),&nbsp;\\(d_2\\),&nbsp;\\(d_3\\)&nbsp;and&nbsp;\\(d_4\\), enough to conclude something about the similarity between&nbsp;\\(d_1\\),&nbsp;\\(d_2\\),&nbsp;\\(d_3\\)&nbsp;and&nbsp;\\(d_4\\), and&nbsp;\\(d_{11}\\)? Probably not.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Therefore, we may suppose that, when there are too many intermediate links between two objects and a common link, nothing can be concluded regarding their similarity.<\/p>\n\n\n\n<h5 class=\"wp-block-heading\" id=\"sec3\">3. Some Vector Similarity Measures<a href=\"#contents\" rel=\"nofollow\">\u2191<\/a><\/h5>\n\n\n\n<h6 class=\"wp-block-heading\" id=\"sec3_1\">3.1. Classical Cosine Similarity<a href=\"#contents\" rel=\"nofollow\">\u2191<\/a><\/h6>\n\n\n\n<p class=\"wp-block-paragraph\">In the classical vector space model, a similarity,&nbsp;\\(sim(d_{m},d_{n})\\in[0,1]\\), between two documents,&nbsp;\\(d_m\\)&nbsp;and&nbsp;\\(d_n\\), is defined as the cosine between their vectors:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\\[sim(d_{m},d_{n})=\\frac{{\\displaystyle \\sum_{i}}d_{i,m}\\cdot d_{i,n}}{\\sqrt{{\\displaystyle \\sum_{i}}d_{i,m}^{2}}\\cdot\\sqrt{{\\displaystyle \\sum_{i}}d_{i,n}^{2}}}\\]<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">where\u00a0\\(d_{i,m}\\geq0\\) and\u00a0\\(d_{i,n}\\geq0\\)\u00a0represent the weights of the corresponding vector elements computed using the\u00a0<a href=\"\/index.php\/tensor-space-model#sec4\" data-type=\"page\" data-id=\"2182\" rel=\"nofollow\"><em>tf<\/em>\u00a0and\u00a0<em>idf<\/em>\u00a0factors<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">I proposed an adaptation for any pair of objects,&nbsp;\\(o_z\\)&nbsp;and&nbsp;\\(o_v\\), represented by vectors that may have negative weights&nbsp;[<a href=\"#4\" rel=\"nofollow\">4<\/a>]. 
Let us define the factor&nbsp;\\(n_{i,z,v}\\)&nbsp;as:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\\[n_{i,z,v}=\\left\\{ \\begin{array}{cc} 0 &amp; o_{i,z}&lt;0\\textrm{ and }o_{i,v}&lt;0\\\\ 1 &amp; \\textrm{Otherwise} \\end{array}\\right.\\]<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">A first measure \\(s(o_{z},o_{v})\\in[-1,+1]\\)&nbsp;to compare two objects may be defined by:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\\[s(o_{z},o_{v})=\\frac{{\\displaystyle \\sum_{i}}n_{i,z,v}\\cdot o_{i,z}\\cdot o_{i,v}}{\\sqrt{{\\displaystyle \\sum_{i}}o_{i,z}^{2}}\\cdot\\sqrt{{\\displaystyle \\sum_{i}}o_{i,v}^{2}}}\\]<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">where \\(o_{i,z}\\)\u00a0and\u00a0\\(o_{i,v}\\)\u00a0represent the weights of the corresponding vectors computed using the\u00a0<a href=\"\/index.php\/tensor-space-model#sec4\" rel=\"nofollow\"><em>tf<\/em>\u00a0and\u00a0<em>idf<\/em> factors<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The introduction of the factor\u00a0\\(n_{i,z,v}\\)\u00a0takes negative weights in object descriptions into account (such as for\u00a0<a href=\"\/index.php\/profile-descriptions\" rel=\"nofollow\">profile descriptions<\/a>). In fact, if two objects have negative weights for a specific feature (meaning they are not related to it), then without this factor the product adds a positive value and the corresponding similarity increases. But knowing that two objects are not related to an identical feature does not mean that they are related to the same subject. 
The factor\u00a0\\(n_{i,z,v}\\)\u00a0therefore avoids this situation.\u00a0<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To obtain a similarity,&nbsp;\\(sim(o_{z},o_{v})\\in[0,1]\\), the previous measure is adapted:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\\begin{align}sim(o_{z},o_{v})&amp;=&amp;0.5+\\frac{{\\displaystyle \\sum_{i}}n_{i,z,v}\\cdot o_{i,z}\\cdot o_{i,v}}{2\\cdot\\sqrt{{\\displaystyle \\sum_{i}}o_{i,z}^{2}}\\cdot\\sqrt{{\\displaystyle \\sum_{i}}o_{i,v}^{2}}}\\end{align}<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This similarity assumes the independence of the features, i.e. the set of feature vectors is linearly independent and forms a basis for the subspace of interest.<\/p>\n\n\n\n<h6 class=\"wp-block-heading\" id=\"sec3_2\">3.2. Adapted Cosine Similarity<a href=\"#contents\" rel=\"nofollow\">\u2191<\/a><\/h6>\n\n\n\n<p class=\"wp-block-paragraph\">I propose an adaptation of the cosine similarity for two vectors,\u00a0\\(\\vec{o}_{i,z}\\) and\u00a0\\(\\vec{o}_{j,v}\\), of tensors representing objects,\u00a0\\(\\hat{o}_{z}\\) and\u00a0\\(\\hat{o}_{v}\\). 
Let us first define two vectors\u00a0\\(\\vec{o'}_{i,z}\\) and\u00a0\\(\\vec{o'}_{j,v}\\)\u00a0by using the\u00a0<a href=\"\/index.php\/tensor-space-model#sec5\" rel=\"nofollow\">tensor operations<\/a>\u00a0and the\u00a0<a href=\"\/index.php\/tensor-space-model#sec4\" rel=\"nofollow\">concept weighting<\/a>:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\\begin{align}\\vec{o'}_{i,z}&amp;=&amp;W_{\\mathcal{O}}(\\hat{o}_{z})^{T}\\cdot\\vec{e}_{i}\\\\\\vec{o'}_{j,v}&amp;=&amp;W_{\\mathcal{O}}(\\hat{o}_{v})^{T}\\cdot\\vec{e}_{j}\\end{align}<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">where&nbsp;\\(\\vec{e}_{i}\\)&nbsp;is the coordinate vector corresponding to concept&nbsp;\\(c_i\\),&nbsp;\\(\\vec{e}_{j}\\)&nbsp;the coordinate vector of concept&nbsp;\\(c_j\\) and&nbsp;\\(M^{T}\\)&nbsp;represents the transpose of matrix&nbsp;\\(M\\).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The vectors\u00a0\\(\\vec{o'}_{i,z}\\)\u00a0and\u00a0\\(\\vec{o'}_{j,v}\\)\u00a0correspond to the vectors\u00a0\\(\\vec{o}_{i,z}\\)\u00a0and\u00a0\\(\\vec{o}_{j,v}\\)\u00a0when the classical\u00a0<a href=\"\/index.php\/tensor-space-model#sec4\" rel=\"nofollow\"><em>tf<\/em>\u00a0and\u00a0<em>idf<\/em> factors<\/a>\u00a0are used to compute the weights. We can now adapt the similarity measure defined in section\u00a0<a href=\"#sec3_1\" rel=\"nofollow\">3.1<\/a>. 
Let us define the factor\u00a0\\(n_{k,i,j,z,v}\\)\u00a0as:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\\[n_{k,i,j,z,v}=\\left\\{ \\begin{array}{cc} 0 &amp; o'_{ik,z}&lt;0\\textrm{ and }o'_{jk,v}&lt;0\\\\ 1 &amp; \\textrm{Otherwise} \\end{array}\\right.\\]<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">where&nbsp;\\(o'_{ik,z}\\)&nbsp;and&nbsp;\\(o'_{jk,v}\\)&nbsp;represent the elements of vectors&nbsp;\\(\\vec{o'}_{i,z}\\)&nbsp;and \\(\\vec{o'}_{j,v}\\).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The similarity measure,&nbsp;\\(sim_{c}(\\vec{o}_{i,z},\\vec{o}_{j,v})\\in[0,1]\\),&nbsp;is then defined by\u200a<sup data-fn=\"3737af8b-5b98-4d5a-ac0c-d2e3eee80500\" class=\"fn\"><a href=\"#3737af8b-5b98-4d5a-ac0c-d2e3eee80500\" id=\"3737af8b-5b98-4d5a-ac0c-d2e3eee80500-link\">2<\/a><\/sup>:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\\begin{align}sim_{c}(\\vec{o}_{i,z},\\vec{o}_{j,v})&amp;=&amp;0.5+\\frac{{\\displaystyle \\sum_{k}}n_{k,i,j,z,v}\\cdot o'_{ik,z}\\cdot o'_{jk,v}}{2\\cdot\\sqrt{{\\displaystyle \\sum_{k}}{o'}_{ik,z}^{2}}\\cdot\\sqrt{{\\displaystyle \\sum_{k}}{o'}_{jk,v}^{2}}}\\end{align}<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This measure defines three intervals:<\/p>\n\n\n\n<dl>\n<dt>\\(>0.5\\)<\/dt> <dd>A majority of discriminant concepts relevant to describe one vector (positive weights) are relevant to describe the second one (positive weights).<\/dd>\n\n\n\n<dt>\\(=0.5\\)<\/dt><dd>  Nothing can be concluded about the similarity of the vectors (they have no concepts in common).<\/dd>\n\n\n\n<dt>\\(&lt;0.5\\)<\/dt> <dd>A majority of discriminant concepts relevant to describe one vector (positive weights) are irrelevant to describe the second one (negative weights), or vice versa.<\/dd>\n<\/dl>\n\n\n\n<h6 class=\"wp-block-heading\" id=\"sec3_3\">3.3. 
Overlap Similarity<a href=\"#contents\" rel=\"nofollow\">\u2191<\/a><\/h6>\n\n\n\n<p class=\"wp-block-paragraph\">Another possible similarity measure between two vectors,\u00a0\\(\\vec{o}_{i,z}\\) and\u00a0\\(\\vec{o}_{j,v}\\), of tensors representing objects,\u00a0\\(\\hat{o}_{z}\\)\u00a0and\u00a0\\(\\hat{o}_{v}\\), consists in computing a weighted ratio of overlapping concepts. As for the adapted cosine similarity, let us first define two vectors\u00a0\\(\\vec{o'}_{i,z}\\)\u00a0and\u00a0\\(\\vec{o'}_{j,v}\\)\u00a0by using the\u00a0<a href=\"\/index.php\/tensor-space-model#sec5\" rel=\"nofollow\">tensor operations<\/a>\u00a0and the\u00a0<a href=\"\/index.php\/tensor-space-model#sec4\" rel=\"nofollow\">concept weighting<\/a>:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\\begin{align}\\vec{o'}_{i,z}&amp;=&amp;W_{\\mathcal{O}}(\\hat{o}_{z})^{T}\\cdot\\vec{e}_{i}\\\\\\vec{o'}_{j,v}&amp;=&amp;W_{\\mathcal{O}}(\\hat{o}_{v})^{T}\\cdot\\vec{e}_{j}\\end{align}<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Again, to avoid comparing concepts that have negative weights in both vectors, let us define the factor&nbsp;\\(n_{k,i,j,z,v}\\)&nbsp;as:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\\[n_{k,i,j,z,v}=\\left\\{ \\begin{array}{cc} 0 &amp; o'_{ik,z}&lt;0\\textrm{ and }o'_{jk,v}&lt;0\\\\ 1 &amp; \\textrm{Otherwise} \\end{array}\\right.\\]<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">where&nbsp;\\(o'_{ik,z}\\)&nbsp;and&nbsp;\\(o'_{jk,v}\\)&nbsp;represent the elements of vectors&nbsp;\\(\\vec{o'}_{i,z}\\)&nbsp;and&nbsp;\\(\\vec{o'}_{j,v}\\).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Such a weighted ratio,&nbsp;\\(s_{o}(\\vec{o}_{i,z},\\vec{o}_{j,v})\\in[-1,1]\\), can be computed with<sup data-fn=\"18bff711-4ad2-4700-8aeb-72b7dcae925e\" class=\"fn\"><a href=\"#18bff711-4ad2-4700-8aeb-72b7dcae925e\" id=\"18bff711-4ad2-4700-8aeb-72b7dcae925e-link\">3<\/a><\/sup>:<\/p>\n\n\n\n<p 
class=\"wp-block-paragraph\">\\begin{align}s_{o}(\\vec{o}_{i,z},\\vec{o}_{j,v})&amp;=&amp;\\frac{\\sum_{k}n_{k,i,j,z,v}\\cdot o'_{ik,z}\\cdot o'_{jk,v}}{\\sum_{k}n_{k,i,j,z,v}\\cdot(o'_{ik,z}\\vee o'_{jk,v})^{2}}\\end{align}<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">where&nbsp;\\(\\vee\\)&nbsp;is the maximum operator.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This measure is adapted to obtain a similarity,&nbsp;\\(sim_{o}(\\vec{o}_{i,z},\\vec{o}_{j,v})\\in[0,1]\\):<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\\begin{align}sim_{o}(\\vec{o}_{i,z},\\vec{o}_{j,v})&amp;=&amp;0.5+\\frac{s_{o}(\\vec{o}_{i,z},\\vec{o}_{j,v})}{2}\\end{align}<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">As for the adapted cosine similarity, this measure defines three intervals:<\/p>\n\n\n\n<dl>\n<dt>\\(>0.5\\)<\/dt>\u2003<dd>A majority of discriminant concepts relevant to describe one vector (positive weights) are relevant to describe the second one (positive weights).<\/dd>\n\n\n\n<dt>\\(=0.5\\)<\/dt><dd>\u2003Nothing can be concluded about the similarity of the vectors (they have no concepts in common).<\/dd>\n\n\n\n<dt>\\(&lt;0.5\\)<\/dt><dd>\u2003A majority of discriminant concepts relevant to describe one vector (positive weights) are irrelevant to describe the second one (negative weights), or vice versa.<\/dd>\n<\/dl>\n\n\n\n<h5 class=\"wp-block-heading\" id=\"sec4\">4. Tensor Space Model Similarity<a href=\"#contents\" rel=\"nofollow\">\u2191<\/a><\/h5>\n\n\n\n<p class=\"wp-block-paragraph\">As explained in section&nbsp;<a href=\"#sec2\" rel=\"nofollow\">2<\/a>, each concept category provides some clues on the similarity between the corresponding objects. 
This suggests that we may define the similarity between two objects,&nbsp;\\(o_z\\)&nbsp;and&nbsp;\\(o_v\\),&nbsp;by:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\\[sim(o_{z},o_{v})=A(sim_{t}(o_{z},o_{v}),sim_{m}(o_{z},o_{v}),sim_{s}(o_{z},o_{v}),sim_{l}(o_{z},o_{v}))\\]<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">where&nbsp;\\(sim_{t}(o_{z},o_{v})\\),&nbsp;\\(sim_{m}(o_{z},o_{v})\\),&nbsp;\\(sim_{s}(o_{z},o_{v})\\) and&nbsp;\\(sim_{l}(o_{z},o_{v})\\)&nbsp;represent a similarity based respectively on their token, metadata, structure and link concepts, and&nbsp;\\(A\\)&nbsp;is some aggregating function of these similarities.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To comply with the convention adopted in the introduction, we impose that&nbsp;\\(sim(o_{z},o_{v})\\in[0,1]\\).&nbsp;<\/p>\n\n\n\n<h6 class=\"wp-block-heading\" id=\"sec4_1\">4.1. Token Similarity<a href=\"#contents\" rel=\"nofollow\">\u2191<\/a><\/h6>\n\n\n\n<p class=\"wp-block-paragraph\">Token contents are described through a set of vectors,&nbsp;\\(\\vec{o}_{i,z}\\), each vector being associated with a meta-concept,&nbsp;\\(c_i\\), whose elements represent the importance of the tokens (possibly zero if a token is not contained or has no discriminant value).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We may suppose that a meta-concept&nbsp;\\(c_{i}\\), and therefore each vector&nbsp;\\(\\vec{o}_{i,z}\\), embodies a particular context (usually of text content). 
For a document, such contexts could be an abstract, the body, the conclusion, etc.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Here, at least three assumptions are possible:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>We assume that comparing the tokens from any two contexts (for example an abstract and a body, or a conclusion and an acknowledgment) provides some useful information regarding the similarity between the corresponding objects.<\/li>\n\n\n\n<li>We assume that only tokens associated with the same context provide useful information regarding the similarity between the corresponding objects.<\/li>\n\n\n\n<li>We assume that some pairs of contexts provide a clue regarding the similarity between the corresponding objects (for example an abstract and a body), while others do not (for example a conclusion and a bibliography). Moreover, we may associate with each pair of contexts a weight that quantifies its importance regarding the object similarity.<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\">Intuitively, the third option sounds the most logical. After all, a human being can recognize that two articles are related by reading the abstract of the first one, and the body or the conclusion of the other one. But it has a practical drawback: all the possible pairs to compare must be defined (possibly with a corresponding weight). This is very difficult to do in practice.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The second choice is certainly the simplest one, but it seems a little too limited. Therefore, the first choice appears to be the best solution (in particular if all pairs of contexts are supposed to be of the same importance).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To compute the similarity between two vectors,&nbsp;\\(\\vec{o}_{i,z}\\) and&nbsp;\\(\\vec{o}_{j,v}\\), representing some tokens, the adapted cosine similarity can be used (section&nbsp;<a href=\"#sec3_2\" rel=\"nofollow\">3.2<\/a>). 
To handle the different pairs of vectors, we take a linear combination of their similarities, each weighted by the number of concepts used in its computation.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Let us suppose that two vectors,&nbsp;\\(\\vec{o}_{i,z}\\)&nbsp;and&nbsp;\\(\\vec{o}_{j,v}\\),&nbsp;have&nbsp;\\(c_{i,j,z,v}\\)&nbsp;common positively weighted token concepts. We define the token similarity as:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\\[sim_{t}(o_{z},o_{v})=\\frac{\\sum_{c_{i}\\in\\mathfrak{T}}\\sum_{c_{j}\\in\\mathfrak{T}}c_{i,j,z,v}\\cdot sim_{c}(\\vec{o}_{i,z},\\vec{o}_{j,v})}{\\sum_{c_{i}\\in\\mathfrak{T}}\\sum_{c_{j}\\in\\mathfrak{T}}c_{i,j,z,v}}\\]<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">where&nbsp;\\(\\mathfrak{T}\\) represents the set of token concepts.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Note that, if the objects are described by just one token vector, the similarity is the classical one, computed as the cosine between the corresponding vectors weighted by the\u00a0<a href=\"\/index.php\/tensor-space-model#sec4\" rel=\"nofollow\">tf\u00a0and\u00a0idf factors<\/a>.<\/p>\n\n\n\n<h6 class=\"wp-block-heading\" id=\"sec4_2\">4.2. Metadata Similarity<a href=\"#contents\" rel=\"nofollow\">\u2191<\/a><\/h6>\n\n\n\n<p class=\"wp-block-paragraph\">To define a similarity measure related to the metadata concepts, it is necessary to know how these concepts will be represented in terms of vectors in an object description. When documents are described, each metadata element (for example \u201cdc:creator\u201d) is a meta-concept associated with a vector representing its value (such as the terms \u201cPascal\u201d and \u201cFrancq\u201d).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To compute the metadata similarity, a first choice could be to use the same measure as for the token similarity. However, a different measure is defined here. 
To understand why, let us start with some metadata describing two different documents:<\/p>\n\n\n\n<p class=\"algo wp-block-paragraph\">&lt;dc:creator&gt; Marx &lt;\/dc:creator&gt;<\/p>\n\n\n\n<p class=\"algo wp-block-paragraph\">&lt;dc:creator&gt; Marx, Engels &lt;\/dc:creator&gt;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The two values are not identical, yet it is clear that they are related since the same author appears in both tags. Intuitively, if two objects (such as documents) were defined only by these two metadata elements, their similarity should be&nbsp;\\(0.5\\).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In fact, we may suppose the similarity should be greater than&nbsp;\\(0.5\\)&nbsp;if \u201cMarx\u201d is more discriminant than \u201cEngels\u201d, and lower than&nbsp;\\(0.5\\)&nbsp;otherwise. Moreover, the following two metadata elements are certainly different even if they share an index term:<\/p>\n\n\n\n<p class=\"algo wp-block-paragraph\"><strong>&lt;dc:publisher&gt;<\/strong><\/p>\n\n\n\n<p class=\"algo2 wp-block-paragraph\">World Health Organization (WHO)<\/p>\n\n\n\n<p class=\"algo wp-block-paragraph\">&lt;\/dc:publisher&gt;<\/p>\n\n\n\n<p class=\"algo wp-block-paragraph\">&lt;groupName&gt; The WHO &lt;\/groupName&gt;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Accordingly, we may suppose that different metadata cannot be compared, i.e. 
the metadata similarity depends only on the similarities between vectors corresponding to the same meta-concept (metadata):<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\\begin{align}sim_{m}(\\vec{o}_{i,z},\\vec{o}_{j,v})=0&amp;&amp;i\\neq j\\end{align}<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To compute the similarity between two metadata vectors&nbsp;\\(\\vec{o}_{i,z}\\)&nbsp;and&nbsp;\\(\\vec{o}_{i,v}\\), the overlap similarity is appropriate (section&nbsp;<a href=\"#sec3_3\" rel=\"nofollow\">3.3<\/a>):<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\\begin{align}sim_{m}(\\vec{o}_{i,z},\\vec{o}_{i,v})&amp;=&amp;sim_{o}(\\vec{o}_{i,z},\\vec{o}_{i,v})\\end{align}<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To manage the different metadata, a linear combination can be used to define the metadata similarity:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\\[sim_{m}(o_{z},o_{v})=\\frac{\\sum_{c_{i}\\in\\mathfrak{M}}c_{i,z,v}\\cdot sim_{m}(\\vec{o}_{i,z},\\vec{o}_{i,v})}{\\sum_{c_{i}\\in\\mathfrak{M}}c_{i,z,v}}\\]<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">where&nbsp;\\(c_{i,z,v}\\)&nbsp;is the number of common positively weighted concepts in&nbsp;\\(\\vec{o'}_{i,z}\\)&nbsp;and&nbsp;\\(\\vec{o'}_{i,v}\\), and \\(\\mathfrak{M}\\)&nbsp;represents the set of metadata concepts.<\/p>\n\n\n\n<h6 class=\"wp-block-heading\" id=\"sec4_3\">4.3. Structure Similarity<a href=\"#contents\" rel=\"nofollow\">\u2191<\/a><\/h6>\n\n\n\n<p class=\"wp-block-paragraph\">As for the metadata concepts, the question of how structure is represented with vectors in an object description must be answered. In the\u00a0<a href=\"\/index.php\/tensor-space-model\" rel=\"nofollow\">tensor space model<\/a>, semantic models (for example an XML schema) are structure concept types, each semantic model defining a set of possible structure concepts of that type (including the neutral one). 
When documents are described, each neutral concept of a given structure type is a meta-concept associated with a vector representing the semantic concepts of that type used to describe the document semantics. Since it makes no sense to compare structure concepts from different semantic models (because they represent distinct domains), we decide that the structure similarity depends only on the similarities between vectors corresponding to the same meta-concept:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\\begin{align}sim_{s}(\\vec{o}_{i,z},\\vec{o}_{j,v})=0&amp;&amp;i\\neq j\\end{align}<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">If we assume that the structure concepts are independent, the similarity between two vectors,&nbsp;\\(\\vec{o}_{i,z}\\) and&nbsp;\\(\\vec{o}_{i,v}\\), associated with a meta-concept,&nbsp;\\(c_i\\), can be computed with the adapted cosine similarity (section&nbsp;<a href=\"#sec3_2\" rel=\"nofollow\">3.2<\/a>). To manage the different vectors corresponding to semantic models, a linear combination is once again used. Let us suppose that two vectors,&nbsp;\\(\\vec{o}_{i,z}\\)&nbsp;and&nbsp;\\(\\vec{o}_{i,v}\\),&nbsp;have&nbsp;\\(c_{i,z,v}\\)&nbsp;common positively weighted semantic concepts. We define the structure similarity as:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\\[sim_{s}(o_{z},o_{v})=\\frac{\\sum_{c_{i}\\in\\mathfrak{S}}c_{i,z,v}\\cdot sim_{c}(\\vec{o}_{i,z},\\vec{o}_{i,v})}{\\sum_{c_{i}\\in\\mathfrak{S}}c_{i,z,v}}\\]<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">where&nbsp;\\(\\mathfrak{S}\\)&nbsp;represents the set of structure concepts.<\/p>\n\n\n\n<h6 class=\"wp-block-heading\" id=\"sec4_4\">4.4. Link Similarity<a href=\"#contents\" rel=\"nofollow\">\u2191<\/a><\/h6>\n\n\n\n<p class=\"wp-block-paragraph\">In the\u00a0<a href=\"\/index.php\/tensor-space-model\" rel=\"nofollow\">tensor space model<\/a>, link schemes (such as URI or DOI) are link concept types, each scheme defining a concept for every possible link of that type, plus a neutral one. 
When documents are described, each neutral concept of a given scheme type is a meta-concept associated with a vector representing the links of that type contained in the document (for example hyperlinks in Web pages). Since it makes no sense to compare different link schemes, we decide that the link similarity depends only on the similarities between vectors corresponding to the same meta-concept (same scheme):<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\\begin{align}sim_{l}(\\vec{o}_{i,z},\\vec{o}_{j,v})=0&amp;&amp;i\\neq j\\end{align}<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">As explained in section&nbsp;<a href=\"#sec2_4\" rel=\"nofollow\">2.4<\/a>, two objects are probably related if they share a given set of common links in their immediate neighborhood in the corresponding graph. We may suppose that if an object contains two links to a document and another object contains only one link to this document, the objects are less similar than if the second object also contained two links to the document.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Moreover, we may also suppose that links that appear everywhere (for example a generic scientific book) convey less information regarding the similarity than links that appear very often only in a subset of objects (for example an academic paper in scientific articles).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To compute the similarity&nbsp;\\(sim_{l}(\\vec{o}_{i,z},\\vec{o}_{i,v})\\), a first step is to build these neighborhoods for the two vectors \\(\\vec{o}_{i,z}\\) and&nbsp;\\(\\vec{o}_{i,v}\\). The neighborhood of the vector&nbsp;\\(\\vec{o}_{i,z}\\)&nbsp;is determined with a simple propagation approach in the graph formed by all links appearing in the vectors associated with the meta-concept,&nbsp;\\(c_i\\). Moreover, if the vector&nbsp;\\(\\vec{o}_{i,v}\\)&nbsp;contains a link to the object&nbsp;\\(o_{z}\\)&nbsp;(\\(o_{iz,v}\\neq0\\)), then&nbsp;\\(o_{z}\\)&nbsp;is added to its own neighborhood. 
<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Figure&nbsp;2&nbsp;shows how this neighborhood is built when the number of steps is limited to \\(3\\). Note that loops can occur (a document has a link to another document that has a link to the first one).<\/p>\n\n\n\n<p class=\"has-text-align-center figtab wp-block-paragraph\"><img loading=\"lazy\" decoding=\"async\" width=\"250\" height=\"250\" class=\"wp-image-2167\" style=\"width: 250px;\" src=\"http:\/\/xeddixx.cluster029.hosting.ovh.net\/wp-content\/uploads\/2025\/10\/simlink2.png\" alt=\"\" srcset=\"https:\/\/www.francq.info\/wp-content\/uploads\/2025\/10\/simlink2.png 189w, https:\/\/www.francq.info\/wp-content\/uploads\/2025\/10\/simlink2-150x150.png 150w\" sizes=\"auto, (max-width: 250px) 100vw, 250px\" \/><\/p>\n\n\n\n<p class=\"has-text-align-center legende wp-block-paragraph\">Figure 2. Example of neighborhood.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Each neighborhood of a vector&nbsp;\\(\\vec{o}_{i,z}\\)&nbsp;can be seen as a vector,&nbsp;\\(\\vec{o''}_{i,z}\\), where an element \\(o''_{ik,z}\\)&nbsp;represents the weight of a link (represented by the concept&nbsp;\\(c_k\\)) in this neighborhood. This weight depends on the weight of each occurrence of the link in each vector and on the number of times it appears. The next two examples, on an imaginary graph limited to three nodes (\\(d_1\\),&nbsp;\\(d_2\\) and&nbsp;\\(d_3\\)), illustrate this principle:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Suppose that&nbsp;\\(d_1\\)&nbsp;has only&nbsp;\\(2\\)&nbsp;links to&nbsp;\\(d_2\\),&nbsp;which itself has only&nbsp;\\(3\\)&nbsp;links to&nbsp;\\(d_3\\). The neighborhood of&nbsp;\\(d_1\\)&nbsp;contains the following pairs (link, weight):&nbsp;\\(\\{(d_{2},2),(d_{3},6)\\}\\).<\/li>\n\n\n\n<li>Suppose that&nbsp;\\(d_1\\)&nbsp;has only&nbsp;\\(2\\)&nbsp;links to&nbsp;\\(d_2\\)&nbsp;and one link to&nbsp;\\(d_3\\), and&nbsp;\\(d_2\\)&nbsp;has one link to&nbsp;\\(d_3\\). 
The neighborhood of&nbsp;\\(d_1\\)&nbsp;contains the following pairs (link, weight):&nbsp;\\(\\{(d_{2},2),(d_{3},3)\\}\\).<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\">Once the \u201cneighbor vectors\u201d are computed, the overlap similarity (section&nbsp;<a href=\"#sec3_3\" rel=\"nofollow\">3.3<\/a>) can be used to compare the two neighborhoods:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\\begin{align}sim_{l}(\\vec{o}_{i,z},\\vec{o}_{i,v})&amp;=&amp;sim_{o}(\\vec{o''}_{i,z},\\vec{o''}_{i,v})\\end{align}<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To manage the different vectors corresponding to link concepts, a linear combination is again used. Let us suppose that two vectors, \\(\\vec{o}_{i,z}\\)&nbsp;and&nbsp;\\(\\vec{o}_{i,v}\\), have&nbsp;\\(c_{i,z,v}\\)&nbsp;common positively weighted link concepts. We define the link similarity as:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\\[sim_{l}(o_{z},o_{v})=\\frac{\\sum_{c_{i}\\in\\mathfrak{L}}c_{i,z,v}\\cdot sim_{l}(\\vec{o}_{i,z},\\vec{o}_{i,v})}{\\sum_{c_{i}\\in\\mathfrak{L}}c_{i,z,v}}\\]<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">where&nbsp;\\(\\mathfrak{L}\\) represents the set of link concepts.<\/p>\n\n\n\n<h6 class=\"wp-block-heading\" id=\"sec4_5\">4.5. Similarity Aggregation<a href=\"#contents\" rel=\"nofollow\">\u2191<\/a><\/h6>\n\n\n\n<p class=\"wp-block-paragraph\">Once the category similarities are computed, they must be aggregated to obtain the global object similarity. The problem of aggregating different similarities can be seen as a multi-criteria decision problem.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In fact, we can formulate the object similarity as follows: given a set of object similarities, how should these similarities be ranked according to four different criteria (the category similarities)? Aggregating with the product or the sum supposes that the criteria (category similarities) are independent. 
There exist several multi-criteria decision methods that take the dependency between the criteria partially into account.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The Choquet integral is one of these methods, and it performs better than classical aggregating methods such as the weighted sum. The basic idea is to take into account not only the importance of each criterion (the different similarities in our case), but also the interactions between them. For example, one may suppose that a high structure similarity alone is not very important (small weight for the structure criterion), but that it becomes important if the token similarity is high too (large weight for the interaction between the token and structure criteria).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">These interactions may also be negative (two criteria are important as long as they are not important at the same time). One problem with the Choquet integral is its huge number of parameters (14 for 4 criteria). Therefore, the Choquet integral is often used with respect to a 2-additive capacity&nbsp;[<a href=\"#5\" rel=\"nofollow\">5<\/a>]. This supposes that interactions among more than two criteria are not taken into account. 
In this case, the object similarity is defined by:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\\begin{align}sim(o_{z},o_{v})&amp;=&amp;\\sum_{i,j\\in\\{t,m,s,l\\}|I_{ij}&gt;0}(sim_{i}(o_{z},o_{v})\\wedge sim_{j}(o_{z},o_{v}))I_{ij}+\\\\&amp;&amp;\\sum_{i,j\\in\\{t,m,s,l\\}|I_{ij}&lt;0}(sim_{i}(o_{z},o_{v})\\vee sim_{j}(o_{z},o_{v}))\\left|I_{ij}\\right|+\\\\&amp;&amp;\\sum_{i\\in\\{t,m,s,l\\}}sim_{i}(o_{z},o_{v})\\left[\\phi(i)-\\frac{1}{2}\\sum_{j\\neq i}\\left|I_{ij}\\right|\\right]\\end{align}<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">where&nbsp;\\(\\wedge\\) is the minimum operator,&nbsp;\\(\\vee\\) is the maximum operator,&nbsp;\\(I_{ij}\\)&nbsp;represents the weight of the interaction between two criteria,&nbsp;\\(i\\)&nbsp;and&nbsp;\\(j\\), and&nbsp;\\(\\phi(i)\\) represents the weight of one criterion,&nbsp;\\(i\\).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">It is still necessary to specify 10 parameters (the weights for the 4 criteria,&nbsp;\\(\\phi(i)\\), and their 6 interactions,&nbsp;\\(I_{ij}\\)). 
To ensure that the Choquet integral remains in&nbsp;\\([0,1]\\)&nbsp;(if all the category similarities are in&nbsp;\\([0,1]\\)), the following constraints must be respected:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\\begin{align}|I_{ij}|\\leq1&amp;&amp;\\forall i,j|i\\neq j\\\\0\\leq\\phi(i)\\leq1&amp;&amp;\\forall i\\\\\\sum_{i}\\phi(i)=1&amp;&amp;\\end{align}<\/p>\n\n\n\n<h5 class=\"wp-block-heading\" id=\"references\">References<a href=\"#contents\" rel=\"nofollow\">\u2191<\/a><\/h5>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"1\">[1]&nbsp;Ricardo Baeza-Yates&nbsp;&amp;&nbsp;Berthier Ribeiro-Neto,&nbsp;<em>Modern Information Retrieval: The Concepts and Technology behind Search<\/em>, Addison-Wesley, 2011.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"2\">[2]&nbsp;Dublin Core Metadata Initiative,&nbsp;<em><a href=\"http:\/\/dublincore.org\/documents\/dces\" target=\"_blank\" rel=\"noreferrer noopener\">Dublin Core Metadata Element Set, Version 1.1: Reference Description<\/a><\/em>, 1999.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"3\">[3]&nbsp;Norman Walsh&nbsp;&amp;&nbsp;Leonard Muellner,&nbsp;<em>Docbook 5.0: The Definitive Guide<\/em>, O\u2019Reilly, 2008.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"4\">[4]&nbsp;Pascal Francq,&nbsp;<em><a href=\"https:\/\/dipot.ulb.ac.be\/dspace\/bitstream\/2013\/211315\/1\/1d611cd4-a2ac-404e-b055-3e0330e9a263.txt\" target=\"_blank\" rel=\"noreferrer noopener\">Collaborative and Structured Search: An Integrated Approach for Sharing Documents Among Users<\/a><\/em>, PhD Thesis, Universit\u00e9 libre de Bruxelles, 2003.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"5\">[5]&nbsp;Michel Grabisch, \u201cL\u2019utilisation de l\u2019int\u00e9grale de Choquet en aide multicrit\u00e8re \u00e0 la d\u00e9cision\u201d,&nbsp;<em>Newsletter of the European Working Group \u201cMulticriteria Aid for Decisions\u201d<\/em>, 3(14), pp. 
5\u201310, 2006.<\/p>\n\n\n\n<h5 class=\"wp-block-heading\" id=\"notes\">Notes<\/h5>\n\n\n<ol class=\"wp-block-footnotes\"><li id=\"8c80cb39-2b5e-41c9-8049-55fb3c608e1d\">If some translation mechanisms exist, they could be used during the indexing step to directly describe documents with terms of one language only. <a href=\"#8c80cb39-2b5e-41c9-8049-55fb3c608e1d-link\" aria-label=\"Jump to footnote reference 1\">\u21a9\ufe0e<\/a><\/li><li id=\"3737af8b-5b98-4d5a-ac0c-d2e3eee80500\">The concept weighting implies that each vector\u00a0\\(\\vec{o}_{i,z}\\)\u00a0is normalized to build\u00a0\\(\\vec{o'}_{i,z}\\): its elements are divided by the largest weight. In practice, to compute the cosine similarity, this operation is not necessary since the normalization factors appear in both the numerator and the denominator. <a href=\"#3737af8b-5b98-4d5a-ac0c-d2e3eee80500-link\" aria-label=\"Jump to footnote reference 2\">\u21a9\ufe0e<\/a><\/li><li id=\"18bff711-4ad2-4700-8aeb-72b7dcae925e\">The concept weighting implies that each vector\u00a0\\(\\vec{o}_{i,z}\\)\u00a0is normalized to build\u00a0\\(\\vec{o'}_{i,z}\\): its elements are divided by the largest weight. In practice, to compute the metadata similarity, this operation is not necessary since the normalization factors appear in both the numerator and the denominator. <a href=\"#18bff711-4ad2-4700-8aeb-72b7dcae925e-link\" aria-label=\"Jump to footnote reference 3\">\u21a9\ufe0e<\/a><\/li><\/ol>","protected":false},"excerpt":{"rendered":"<p>Abstract Several algorithms dedicated to information science related problems (for example document clustering) need some existing similarity measures. This article presents such a measure for the\u00a0tensor space model\u00a0which takes the different concept categories into account. Table of Contents 1. Introduction 2. Concept Categories: A Similarity Perspective 2.1. Token Concepts 2.2. Metadata Concepts 2.3. 
Structure Concepts [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":"[{\"content\":\"If some translation mechanisms exist, they could be used during the indexing step to directly describe documents with terms of one language only.\",\"id\":\"8c80cb39-2b5e-41c9-8049-55fb3c608e1d\"},{\"content\":\"The concept weighting implies that each vector\u00a0\\\\(\\\\vec{o}_{i,z}\\\\)\u00a0is normalized to build\u00a0\\\\(\\\\vec{o'}_{i,z}\\\\): its elements are divided by the largest weight. In practice, to compute the cosine similarity, this operation is not necessary since the normalization factors appear in both the numerator and the denominator.\",\"id\":\"3737af8b-5b98-4d5a-ac0c-d2e3eee80500\"},{\"content\":\"The concept weighting implies that each vector\u00a0\\\\(\\\\vec{o}_{i,z}\\\\)\u00a0is normalized to build\u00a0\\\\(\\\\vec{o'}_{i,z}\\\\): its elements are divided by the largest weight. 
In practice, to compute the metadata similarity, this operation is not necessary since the normalization factors appear in both the numerator and the denominator.\",\"id\":\"18bff711-4ad2-4700-8aeb-72b7dcae925e\"}]"},"class_list":["post-2116","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/www.francq.info\/index.php\/wp-json\/wp\/v2\/pages\/2116","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.francq.info\/index.php\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/www.francq.info\/index.php\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/www.francq.info\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.francq.info\/index.php\/wp-json\/wp\/v2\/comments?post=2116"}],"version-history":[{"count":61,"href":"https:\/\/www.francq.info\/index.php\/wp-json\/wp\/v2\/pages\/2116\/revisions"}],"predecessor-version":[{"id":2243,"href":"https:\/\/www.francq.info\/index.php\/wp-json\/wp\/v2\/pages\/2116\/revisions\/2243"}],"wp:attachment":[{"href":"https:\/\/www.francq.info\/index.php\/wp-json\/wp\/v2\/media?parent=2116"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}