WO2013029905A1

WO2013029905A1 - A computer implemented method to identify semantic meanings and use contexts of social tags

Info

Publication number: WO2013029905A1
Application number: PCT/EP2012/065019
Authority: WO
Inventors: Iván CANTADOR; David VALLET; Pablo CASTELLS; Paulo Villegas
Original assignee: Telefonica, S.A.
Priority date: 2011-08-26
Filing date: 2012-08-01
Publication date: 2013-03-07

Abstract

A computer implemented method to identify semantic meanings and use contexts of social tags. In the computer implemented method of the invention said social tags are associated to resources by users, and it comprises performing by computing means said identification by applying a clustering strategy based on Lorenz curves and Gini coefficients and employing said identification for social tag semantic disambiguation.

Description

A computer implemented method to identify semantic meanings and use contexts of social tags

Field of the art

The present invention generally relates to a computer implemented method to identify semantic meanings and use contexts of social tags and more particularly to a computer implemented method that comprises performing by computing means said identification by applying a clustering strategy based on Lorenz curves and Gini coefficients and employing said identification for social tag semantic disambiguation.

Prior State of the Art

In the so called Social Web or Web 2.0, systems facilitate the creation of diverse formats of user generated content. People upload and share multimedia objects, post comments and reviews, rate and tag resources, maintain personal bookmarks, communicate online with contacts, and contribute to wiki-style repositories.

Among these formats of user generated content, social tagging has become a popular practice as a lightweight means to classify and exchange information. Users create or upload content (resources), annotate it with freely chosen words (tags), and share these annotations with others. In this context, the nature of tagged resources is manifold: photos (Flickr), music tracks (Last.fm), video clips (YouTube), movies (MovieLens), Web pages (Delicious), scientific articles (CiteULike), to name a few.

In a social tagging system, the whole set of tags constitutes an unstructured collaborative knowledge classification scheme that is commonly known as folksonomy [21]. This implicit classification serves various purposes, such as resource organization, promotion, and sharing with friends or with the public. Studies have shown, however, that tags are generally chosen by users to reflect their interests. Golder and Huberman [11] analyzed tags on Delicious, and found that (1) the overwhelming majority of tags identify the topics of the tagged resources, and (2) almost all tags are added for personal use, rather than for the benefit of the community. These findings lend support to the idea of using tags to derive precise user profiles and resource descriptions, and bring new research opportunities on such functionalities as automatic tag suggestion [9][13], personalized Information Retrieval (IR) [12][18][23][25], and tag-powered Recommender Systems (RS) [6][7][17][19][20][26], among others.

Despite the above advantages, tags are free text, and thus suffer from various vocabulary problems [16]. Ambiguity of the tags arises as users apply the same tag in different domains and semantic contexts. At the opposite end, the lack of synonym control can lead to different tags being used for the same concept, precluding collocation. Moreover, multilinguality also obstructs the achievement of a consensus vocabulary, since several tags written in different languages can express the same concept.

To cope with the above problems, folksonomy-based information retrieval and filtering engines have to identify and exploit the underlying semantic meanings of the tags. In fact, work in this direction has recently proliferated in the literature. Information theory and clustering strategies based on tag co-occurrences [3][15][20][24], and linking, transformation and enrichment of folksonomies with structured knowledge bases, such as ontologies, WordNet and Wikipedia [1][7][8][22], represent the main approaches that are being investigated.

Social tagging systems help users organize and share content. However, the way users can access the resources is limited to searching and browsing through the collection. User-centered approaches, such as personalized search and recommendation, are not primarily supported by these systems. These functionalities are proven to provide a better user experience, by facilitating access to huge amounts of content, which, in the case of social tagging systems, is created and annotated by the community of users.

Recent works in the research literature have investigated the adaptation of personalization [12][18][25][23] and recommendation techniques [6][17][19][26] to social tagging systems, but they have a main limitation: they do not deal with the ambiguity of tags. For instance, given a tag such as sf, existing systems do not discern between the two main meanings of that tag: San Francisco (the Californian city) and Science Fiction (the literary genre). This phenomenon occurs too frequently to be ignored by a social tagging system. As an example, as of February 2011, Wikipedia contains over 189K disambiguation entries.

Tag ambiguity is being investigated in the literature. There are seminal approaches that attempt to identify the real meaning of a tag by linking it with structured knowledge bases [1][8][22]. These approaches, however, rely on the availability of external knowledge resources, and so far are preliminary, and have not been applied to personalization and recommendation. Other works are based on the concept of co-occurrence, that is, on extracting the real meaning of a tag by analyzing the occurrence of a tag with others in describing different resources. Typically, these approaches involve the application of clustering techniques over the co-occurrence information gathered from the folksonomy [24]. Such tag clustering techniques have been applied in a small number of recent personalization and recommendation approaches [9][20]. The main advantage of these approaches is that an external knowledge source is not required. Nonetheless, they present several problems:

1. Lack of scalability. Current approaches are not incremental: small changes in the folksonomy imply re-computing the clusters within the whole folksonomy. This lack of scalability is undesired for a social tagging system, as its community of users is constantly adding new resources and annotations, resulting in a highly dynamic folksonomy.

2. Lack of generalization capabilities. Many disambiguation techniques implement trainable algorithms, which theoretically can be adapted to the idiosyncrasies of each folksonomy. However, it is often the case that this adaptability comes with a cost in terms of generalization, and systems trained/adapted to one domain do not work well when applied to another domain.

3. Need of a stop criterion. Current approaches need the definition of a stop criterion for the clustering process. For instance, a hierarchical clustering [20] would need to establish the proper level at which clusters are selected, whereas an approach using a partitional clustering technique such as k-means needs to define beforehand how many clusters to build [9]. These values are difficult to define without proper evaluation, and have a definite impact on the outcome of the clustering process, and ultimately, on a semantic disambiguation or contextualization approach. Moreover, such proposals [9][20] define and evaluate the above parameter values over static test collections, and thus may not be easily adjustable over real social tagging systems.

4. Lack of explicit contextualization. Current approaches do not use the cluster based information to build explicitly disambiguated user and resource models. This information is rather incorporated into their retrieval and filtering algorithms, and cannot be exploited by other systems. Thus, these approaches do not offer a real contextualization of the tags, since they do not extract the context in which tags are used. For instance, a desired outcome of a disambiguation approach would be to provide a new contextualized tag description of the user's interests rather than his original raw tag values. Following the above example, the tag sf would be properly contextualized if it is defined within one of its possible meanings, such as 'sf|San_Francisco' and 'sf|Science_Fiction'. Recent approaches have investigated the contextualization of folksonomies [3], but lack a proper user and resource model, and require humans to manually label each context.

Description of the Invention

It is necessary to offer an alternative to the state of the art which covers the gaps found therein, particularly related to the lack of proposals which really allow discerning different semantic meanings of a tag within a particular folksonomy employing efficient disambiguation techniques.

To that end, the present invention provides a computer implemented method to identify semantic meanings and use contexts of social tags, wherein said social tags are associated to resources by users.

Contrary to the known proposals, the computer implemented method of the invention characteristically comprises performing by computing means said identification by applying a clustering strategy based on Lorenz curves and Gini coefficients and employing said identification for social tag semantic disambiguation.

Other embodiments of the method of the first aspect of the invention are described according to appended claims 2 to 30, which include using such computer implemented method within a system for online processing of a folksonomy created by users in a social network facilitated by user devices enabling common annotation of items, and in a subsequent section related to the detailed description of several embodiments.

Brief Description of the Drawings

The previous and other advantages and features will be more fully understood from the following detailed description of embodiments, with reference to the attached drawings, which must be considered in an illustrative and non-limiting manner, in which:

Figure 1 shows the architecture of the proposed invention.

Figure 2 shows an example of a Lorenz curve.

Figure 3 shows a Lorenz curve used to calculate the Gini coefficient.

Figure 4 shows a graphic representation of distances between a tag t and a set of similar tags, according to an embodiment of the present invention.

Figure 5 shows two examples of a graphic representation of distances between a tag t and a set of similar tags, when t is not ambiguous, according to an embodiment of the present invention.

Figure 6 shows a graphic representation of distances between a tag t and a set of similar tags, when t is ambiguous with two possible semantic contexts C₁ and C₂, according to an embodiment of the present invention.

Figure 7 shows an example of a graph associated to tags semantically related to tag sf, computed from a Delicious dataset. The tags are grouped according to the clusters obtained by the proposed semantic contextualization approach. Triangle nodes correspond to centroids of the clusters.

Figure 8 shows the Lorenz curve associated to tag sf, according to an embodiment of the present invention.

Figure 9 shows the Lorenz curve associated to the tag losangeles within the semantic scope of tag sf.

Figure 10 shows the Lorenz curve associated to the tag fantasy within the semantic scope of tag sf.

Figure 11 shows an example of a graph associated to tags semantically related to tag web, computed from a Delicious dataset. The tags are grouped according to the clusters obtained by the proposed semantic contextualization approach. Triangle nodes correspond to centroids of the clusters.

Figure 12 shows the Lorenz curve associated to tag web, according to an embodiment of the present invention.

Figure 13 shows the Lorenz curve associated to the tag ajax within the semantic scope of tag web.

Figure 14 shows the Lorenz curve associated to the tag firefox within the semantic scope of tag web.

Detailed Description of Several Embodiments

The proposed invention is capable of measuring relative ambiguities of social tags. It exploits a novel idea based on the Lorenz curve [14] and Gini coefficient [10], which are mathematical artifacts that measure statistical dispersion in a distribution. Such artifacts are adapted to a vector space where semantic similarities and distances between tags can be computed.

This approach is also capable of identifying semantic meanings and use contexts of a social tag within a folksonomy. The meanings and contexts of a particular tag are defined as sets of semantically related tags, and can be represented by means of single "centroid" tags. The presented approach allows measuring a priori probabilities of the different meanings and contexts of a tag, and establishing degrees of belonging between related tags and the meanings and contexts.

As an application of the above or similar strategies, a novel user profiling model based on semantically contextualized social tags is presented. It is proposed to explicitly transform the tags of user and resource annotations within a folksonomy into "pseudo-tags", obtained from the merging of the original tags with (representative tags of) the semantic contexts in which they were used.

The architecture of the proposed invention (called "the system" from now on in this section) is shown in Figure 1. Its mechanisms can be summarized in the following points:

1. The system retrieves tag annotations of users and resources from a certain folksonomy, typically stemming from a social tagging system, and stores them into a database in the form of triples [user, tag, resource]. This representation allows defining tag-based user and item profiles. The process is numbered as stage 1 in the figure.

2. From the collected annotations (profiles), the system computes semantic distances between pairs of tags, which in general will be based on co-occurrences of tags in user or item profiles. This process is numbered as stage 2 in the figure.

3. According to the computed semantic distances, for each tag t, the system takes the T tags most similar to t, i.e. those with which t has the lowest distances.

4. Each tag t and its T most similar tags are passed to a module that analyzes and measures their relative ambiguities.

5. Once semantic distances and ambiguities of the tags have been computed, the system applies a novel clustering strategy based on the Lorenz curve and Gini coefficient to automatically obtain the different semantic meanings and use contexts of the tags. The process is numbered as stage 4 in the figure.

6. In each cluster, a "centroid" tag can be selected as representative of the different semantic meanings and contexts existing in the cluster. Popularity values of the clusters, and degrees of confidence of assignment of tags to clusters can also be computed.

7. According to the semantic meanings and contexts, the system transforms each tag of a particular user or item profile into a "pseudo-tag" that consists of the union of the original tag and the centroid of its most appropriate context within the profile annotations. It is numbered as stage 5 in the figure.

8. The generated contextualized tag-based user and resource profiles could be finally exploited by personalized IR and RS approaches. This process is numbered as stage 6 in the figure.
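As a rough illustration of the pseudo-tag transformation in the list above, the following Python sketch merges each profile tag with the centroid of one of its contexts. The function name `contextualize_profile`, the `contexts` data layout and, in particular, the rule used to pick the "most appropriate" context (largest overlap with the other tags of the profile) are assumptions made for this example only; the text above does not fix them.

```python
from typing import Dict, List, Set


def contextualize_profile(profile_tags: List[str],
                          contexts: Dict[str, Dict[str, Set[str]]]) -> List[str]:
    """Turn raw tags into 'tag|centroid' pseudo-tags.

    `contexts` maps a tag to {centroid tag: set of tags in that context}.
    Assumption: the most appropriate context is the one sharing the most
    tags with the rest of the profile.
    """
    profile = set(profile_tags)
    result = []
    for tag in profile_tags:
        tag_contexts = contexts.get(tag)
        if not tag_contexts:
            result.append(tag)  # no known context: keep the raw tag
            continue
        others = profile - {tag}
        # sorted() makes tie-breaking deterministic (alphabetical centroid)
        best = max(sorted(tag_contexts),
                   key=lambda c: len(tag_contexts[c] & others))
        result.append(f"{tag}|{best}")
    return result


# Hypothetical contexts for the tag "sf"
contexts = {
    "sf": {
        "sanfrancisco": {"california", "bayarea", "travel"},
        "sciencefiction": {"scifi", "fantasy", "books"},
    }
}
print(contextualize_profile(["sf", "fantasy", "books"], contexts))
```

In this toy profile the tag sf is accompanied by fantasy and books, so it is merged with the sciencefiction centroid; tags without known contexts are left unchanged.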

The presented invention is built upon a vector space where similarities between social tags can be computed. Markines et al. [15] present and empirically compare various general folksonomy-based similarity measures derived from established information theoretic, statistical, and practical measures. In the following, some of these similarity metrics are listed as representative examples; in principle any of them may work, though their behaviour needs to be tested with appropriate test sets. For a given folksonomy, consider t₁, t₂ as two social tags, and R_{t₁} and R_{t₂} the sets of resources annotated respectively with t₁ and t₂.

- Overlap similarity: $\mathrm{sim}(t_1, t_2) = \dfrac{|R_{t_1} \cap R_{t_2}|}{\min(|R_{t_1}|, |R_{t_2}|)}$

- Dice similarity: $\mathrm{sim}(t_1, t_2) = \dfrac{2\,|R_{t_1} \cap R_{t_2}|}{|R_{t_1}| + |R_{t_2}|}$

- Cosine similarity: $\mathrm{sim}(t_1, t_2) = \dfrac{|R_{t_1} \cap R_{t_2}|}{\sqrt{|R_{t_1}| \cdot |R_{t_2}|}}$

Once a similarity metric is chosen, it is applied to all combinations of tag pairs in the folksonomy, to create a confusion matrix. Note that from such similarity measures, tag semantic distances can be straightforwardly defined, for example as dist(t₁, t₂) = 1 − sim(t₁, t₂) ∈ [0, 1].
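The similarity measures named above and the derived distance can be sketched in Python as follows. The sketch assumes the usual set-based definitions of overlap, Dice and cosine similarity over the sets of resources annotated with each tag; the resource identifiers are invented for illustration.

```python
import math
from typing import Set


def overlap(a: Set[str], b: Set[str]) -> float:
    """Shared resources relative to the smaller resource set."""
    return len(a & b) / min(len(a), len(b)) if a and b else 0.0


def dice(a: Set[str], b: Set[str]) -> float:
    """Twice the shared resources over the sum of both set sizes."""
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0


def cosine(a: Set[str], b: Set[str]) -> float:
    """Cosine of the binary resource vectors of the two tags."""
    return len(a & b) / math.sqrt(len(a) * len(b)) if a and b else 0.0


def distance(similarity: float) -> float:
    """Semantic distance derived from a similarity in [0, 1]."""
    return 1.0 - similarity


# Hypothetical resource sets for two tags
r_sf = {"r1", "r2", "r3", "r4"}
r_scifi = {"r3", "r4", "r5"}

for name, measure in (("overlap", overlap), ("dice", dice), ("cosine", cosine)):
    s = measure(r_sf, r_scifi)
    print(f"{name}: sim={s:.3f} dist={distance(s):.3f}")
```

All three measures stay within [0, 1], so the derived distance does as well.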

The proposed semantic ambiguity metric for social tags is based on the analysis of an adaptation of the Lorenz curve and Gini coefficient.

The Gini coefficient (G) [10], developed by the Italian statistician, demographer and sociologist Corrado Gini (1884-1965), is a statistical dispersion metric that measures the degree of income inequality in a society. The coefficient ranges from 0 to 1. A value of 0 expresses total equality, i.e. a society in which every member has exactly the same income. A value of 1, conversely, represents the maximal inequality, i.e. one member receives all the income and the rest of the members do not receive anything.

It is usually defined mathematically based on the Lorenz curve [14], which plots the proportion of the total income of the population (y axis) that is cumulatively earned by the bottom x% of the population, as shown in Figure 2. Each point on the curve is equal to the percentage of income due to a given percentage of the population. The curve starts at the origin (0, 0) and ends at the point (1, 1). If the income were distributed with perfect equality, the curve would coincide with the line at 45 degrees that passes through the origin (e.g., 40% of the population receives 40% of revenue). If there were perfect inequality (i.e. a certain person gathers all the income), the curve would coincide with the horizontal axis up to the point (1, 0), where it would jump to the point (1, 1). In general, the curve is in an intermediate position between these two extremes, as shown in Figure 2.

As shown in Figure 3, the Gini coefficient can be defined as the ratio of the area that lies between the line of equality and the Lorenz curve (marked "A" in the diagram) over the total area under the line of equality (marked "A" and "B" in the diagram); i.e. G=A/(A+B).

With the previous formulation, and having a discrete probability function, the Gini coefficient can then be calculated by using the formula of Brown, defined as:

$$G = \left| 1 - \sum_{k=0}^{n-1} (X_{k+1} - X_k)(Y_{k+1} + Y_k) \right|$$

where G is the Gini coefficient, X represents the cumulative proportion of the independent variable (share of people from lowest to highest incomes), and Y represents the cumulative proportion of the dependent variable (share of income earned).
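A minimal Python sketch of this calculation is given below. It assumes the standard form of Brown's formula, G = |1 − Σ (X_k − X_{k−1})(Y_k + Y_{k−1})|, with X_0 = Y_0 = 0 and equally weighted members; the function name `gini_brown` is chosen for this example.

```python
from typing import Sequence


def gini_brown(values: Sequence[float]) -> float:
    """Gini coefficient of non-negative values via Brown's formula."""
    ordered = sorted(values)          # lowest to highest, as on the Lorenz curve
    total = sum(ordered)
    n = len(ordered)
    if n == 0 or total == 0:
        return 0.0                    # nothing to distribute: treat as equality
    acc = 0.0
    x_prev = y_prev = 0.0
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        x = k / n                     # cumulative share of the population
        y = cumulative / total        # cumulative share of the income
        acc += (x - x_prev) * (y + y_prev)
        x_prev, y_prev = x, y
    return abs(1.0 - acc)


print(gini_brown([1, 1, 1, 1]))   # equal incomes
print(gini_brown([0, 0, 0, 1]))   # one member receives everything
```

With a finite number n of members, the second case yields (n − 1)/n rather than exactly 1, since the discrete Lorenz curve cannot coincide with the horizontal axis all the way to x = 1.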

From this point, the existence of a vector space in which social tags can be represented is assumed. For example, in an N-dimensional space, with N being the number of resources annotated with tags of a certain folksonomy, each tag t can be represented by a vector R_t = (w_{t,1}, ..., w_{t,N}), where w_{t,n} is a weight associated to tag t and resource r_n, e.g. the number of users who annotated r_n with t (see [15] for a description of different ways to define the weights w_{t,n}, and different tag similarity measures sim(t₁, t₂) ∈ [0, 1] associated with them). Once a metric is chosen, e.g. from among the ones mentioned above, a procedure to define semantic distances between tags is performed. In Figure 4, a tag t is linked to similar tags t_i with undirected edges. The lengths of such edges are proportional to the distances between the corresponding pairs of tags.

The present invention proposes a semantic ambiguity measure of a tag t based on the Gini coefficient of a Lorenz curve in which the independent variable takes as values the set of tags t_i most similar to t (ordered by increasing semantic distance from t), and in which the dependent variable corresponds to the cumulative distances from t to those tags.

First, consider the case of a given tag t that is not ambiguous (i.e. it has a single semantic context). Computation of the distances towards its closest tags could lead to situations similar to those shown in Figure 5. These distances may be smaller or larger than those between another non-ambiguous tag and its corresponding most similar tags, as the left and right hand sides of the figure show, but in both cases, there should be a relatively smooth increment in distance to any pair of consecutive tags t_i and t_{i+1} (when ordered by increasing distance). Moreover, the distances from any other tag t_i in the set to the rest of the set should retain the same overall shape (that is, no significant discontinuities): since they all refer to a single meaning/context, they are all closely related to each other. In the Lorenz curve, if the tags are ordered by their cumulative distances, no abrupt changes are expected in the accumulated distance for each pair of consecutive tags.
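The adaptation of the Lorenz curve to tag distances described above can be sketched as follows in Python. The function names and the distance values are illustrative assumptions; the `largest_jump` helper is only a simple way to make "abrupt changes" visible and is not part of the described method.

```python
from typing import Dict, List, Tuple


def lorenz_curve(distances: Dict[str, float]) -> List[Tuple[float, float]]:
    """Lorenz curve points for one tag.

    `distances` maps each similar tag to its distance from the tag under
    study. x is the share of similar tags (ordered by increasing distance),
    y is the share of the accumulated distance.
    """
    ordered = sorted(distances.values())
    if not ordered:
        return [(0.0, 0.0)]
    total = sum(ordered)
    n = len(ordered)
    points = [(0.0, 0.0)]
    cumulative = 0.0
    for k, d in enumerate(ordered, start=1):
        cumulative += d
        y = cumulative / total if total > 0 else k / n
        points.append((k / n, y))
    return points


def largest_jump(points: List[Tuple[float, float]]) -> float:
    """Largest increase in y between two consecutive curve points."""
    return max((y2 - y1 for (_, y1), (_, y2) in zip(points, points[1:])),
               default=0.0)


# Hypothetical distances: a smooth profile versus a two-level profile
smooth = {"a": 0.40, "b": 0.42, "c": 0.44, "d": 0.46}
split = {"a": 0.10, "b": 0.12, "c": 0.80, "d": 0.85}

print(lorenz_curve(smooth))
print(largest_jump(lorenz_curve(smooth)), largest_jump(lorenz_curve(split)))
```

The smooth profile produces a curve close to the 45 degree line, while the two-level profile shows a visibly larger step once the distant tags start contributing.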

Now consider the case of a certain tag t that is ambiguous, i.e. it has several meanings or semantic contexts. To simplify the explanation, and without loss of generality, assume t has two different meanings (semantic contexts) C₁ and C₂. In the set of tags {t_i} most "similar" to t, some of the tags should be related to the context C₁ and others should be related to the context C₂. If so, by representing such set of tags in the corresponding vector space, we should have a situation like the one shown in Figure 6. For the tag t, the shape of its distances to the rest of the tags should be more or less similar to the previous, non-ambiguous case. That is, there should not be a big difference in distance between t and any pair of consecutive tags t_i and t_{i+1}. On the other hand, for a "similar" tag (t₁ or t₂ in Figure 6), the distances to tags within its corresponding semantic context (C₁ or C₂ respectively) should be small, whilst its distances to tags of the other context should be significantly larger. This situation, characteristic of an ambiguous tag, will be exploited by the proposal presented herein. In the Lorenz curve of a tag belonging to C₁ (or C₂), it is expected:

- small contributions in the cumulative distance from tags of C₁ (or C₂) and thus a smooth slope in the curve, and

- larger contributions in the cumulative distance from tags of the other context C₂ (or C₁) and thus a significant change in the curve slope.

Translated into the context of the Lorenz curve defined above, for an unambiguous tag t, the following two facts are expected:

- The Lorenz curve of tag t should tend to the 45 degree line passing through the origin. That is, the Gini coefficient value should be close to 0.

- The Gini coefficient value for the remaining tags t_i in the constructed set should tend to 0, although in general to a lesser extent than for tag t. There should not be abrupt changes in the Lorenz curves' slopes, since there would not exist large differences between the distances of consecutive tag pairs: all tags are alike, so no matter which one we take as the center, there will not be big inequalities.

For an ambiguous tag t, the following two facts are expected:

- As for the unambiguous tag case, the Lorenz curve of tag t should also tend to the 45 degree line passing through the origin. That is, the Gini coefficient value should still tend to 0.

- Differently to the unambiguous tag case, the Lorenz curves of tags t_i should show more abrupt slopes, where slope changes may correspond to different contexts of the ambiguous tag. The Gini coefficient values of such tags should be closer to 1.

Given the above observations, the ambiguity of a tag is defined according to the following general expression:

$$\mathrm{amb}(t) = \bigl(1 - G(t)\bigr) \cdot \frac{\sum_{t_i \in \mathrm{similar}(t)} G(t_i)}{L}$$

where G(x) is the Gini coefficient of tag x, and similar(x) is the set of L tags most similar to x.

Note that the previous expression is a general formulation to model the facts explained above concerning ambiguous and unambiguous tags. Modifications of its terms can be taken into consideration for a better distinction of different semantic ambiguity degrees.
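As a small numeric illustration, the sketch below assumes that the general expression combines (1 − G(t)) with the mean Gini coefficient of the tags in similar(t). The function name and the Gini values are hypothetical and chosen only to mirror the two cases discussed earlier.

```python
from typing import Sequence


def ambiguity(g_tag: float, g_similar: Sequence[float]) -> float:
    """Ambiguity of a tag from its own Gini coefficient and those of its
    most similar tags (assumed form: (1 - G(t)) * mean of G(t_i))."""
    if not g_similar:
        return 0.0
    return (1.0 - g_tag) * sum(g_similar) / len(g_similar)


# Hypothetical Gini values
ambiguous_case = ambiguity(0.05, [0.40, 0.50, 0.45])    # neighbours show abrupt curves
unambiguous_case = ambiguity(0.05, [0.05, 0.10, 0.06])  # neighbours show smooth curves

print(ambiguous_case, unambiguous_case)
```

Under this reading, a tag scores high when its own curve is close to the line of equality while the curves of its neighbours are strongly unequal.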

With the proposed adaptation of the Lorenz curve to semantic distances between tags, and the given notion of semantic ambiguity of tags based on the Gini coefficient, an efficient clustering strategy that identifies semantic contexts of a tag within a particular folksonomy is presented. For each tag t ∈ T, its T₁ most similar tags {t_i} are selected and then, for each of these new tags t_i, its corresponding T₂ most similar tags {t_{i,j}} are selected. With all the obtained tags (at most 1 + T₁ · T₂), to obtain the different semantic contexts (clusters) of tag t, the clustering approach takes into account the ascending semantic distance order in all the Lorenz curves for each item in the tagset, and establishes the number of clusters (i.e. the clustering stop criterion) automatically through the identification of ambiguous tags by the proposed tag ambiguity metric based on the Gini coefficient.
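The two-level selection of candidate tags can be sketched in Python as below. The nested-dictionary layout of `dist`, the function names and the toy distances are assumptions for illustration; ties are broken alphabetically only to keep the example deterministic.

```python
from typing import Dict, List, Set

Distances = Dict[str, Dict[str, float]]


def most_similar(tag: str, dist: Distances, k: int) -> List[str]:
    """The k tags with the lowest distance to `tag`."""
    neighbours = sorted(dist.get(tag, {}).items(), key=lambda kv: (kv[1], kv[0]))
    return [other for other, _ in neighbours[:k]]


def candidate_tagset(tag: str, dist: Distances, t1: int, t2: int) -> Set[str]:
    """Tag, its t1 most similar tags, and the t2 most similar tags of each."""
    tagset = {tag}
    for ti in most_similar(tag, dist, t1):
        tagset.add(ti)
        tagset.update(most_similar(ti, dist, t2))
    return tagset


# Hypothetical distances around the tag "sf"
dist = {
    "sf": {"sanfrancisco": 0.2, "scifi": 0.3, "food": 0.7},
    "sanfrancisco": {"sf": 0.2, "california": 0.25, "food": 0.5},
    "scifi": {"sf": 0.3, "fantasy": 0.2, "books": 0.4},
}
print(sorted(candidate_tagset("sf", dist, t1=2, t2=2)))
```

Because the starting tag usually reappears among the neighbours of its own neighbours, the resulting set is smaller than a naive count of all selected tags would suggest.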

To better understand the present approach to semantically cluster social tags, an example is presented where the ambiguity is measured and the semantic contexts of two tags from a dataset obtained from the Delicious social bookmarking system are identified. These two tags are:

- sf, an acronym that commonly refers to two different meanings: San Francisco, the city in California, USA; and Science Fiction, the literary genre.

- web, a term that refers to the hypertext system operating over the Internet. Although this tag is not semantically ambiguous per se, it is used by Delicious users to annotate bookmarks under different contexts, such as Web design, Web browsers, the Web 2.0, etc., thereby implying different sub-meanings of the term.

Example 1

This example corresponds to the tag "sf". Figure 7 shows a graph where the nodes correspond to tags semantically similar to sf, and the edges link pairs of tags with high semantic similarities (i.e. with low semantic distances). The nodes have been grouped according to the semantic contexts identified with the clustering approach, which is described below. It can be seen that some of these contexts (clusters) are related to San Francisco city, and others are related to the Science Fiction genre.

The previous clusters are generated from an analysis of the Lorenz curves of sf and its most similar tags. Figure 8 shows the Lorenz curve of sf. In the x axis, where tags are ordered by increasing distance to sf, those tags belonging to the same context have a tendency to be located contiguously. Figures 9 and 10, on the other hand, show the Lorenz curves for two tags in the set: "losangeles" and "fantasy", which respectively belong to the San Francisco and Science Fiction contexts. In Figure 9, for the tag losangeles, it can be seen that the tags belonging to the San Francisco context (California, sanfrancisco, etc.) are located in the x axis at the left side of sf, while the rest of the tags appear at the right side. In Figure 10, for the tag fantasy, an analogous behavior can be observed; tags that strongly describe the Science Fiction context (sci-fi, sciencefiction, fiction, etc.) are located in the x axis at the left side of sf, while the rest of the tags appear at the right side. Moreover, as expected, the slopes of the curves experience significant changes at their values around the ambiguous tag sf.

The following table shows ambiguity values for social tags semantically related to sf. Note that these values are calculated for each tag within its own semantic contexts. Hence, for example, food, toread, fiction and fantasy are more ambiguous than sf, since they are likely to be used in Delicious in more diverse semantic contexts. In any case, it is important to note that the differences in ambiguity values between sf and sciencefiction, scifi, sci-fi and sanfrancisco support the claim that sf is the most ambiguous tag.

The next table shows 7 semantic contexts of the tag sf in Delicious, which are identified with the proposed approach. Contexts 1 and 2 are formed by tags related to the main meanings of sf: San Francisco city and Science Fiction genre. Context 3 is also related to the Californian city, but it is focused on restaurants (dining, eating places) in that city. Contexts 4 and 5 have vocabulary related to (Science Fiction) writing (i.e. authoring) and books (i.e. the written pieces). Context 6 is associated to events and conferences in San Francisco, and finally, Context 7 has tags related to articles about both the city and the literary genre. This example shows that the clustering technique found finer-grained semantic contexts than just the two generic clusters of tags related to the two main meanings of sf.

The performed clustering approach works as follows:

To determine the semantic contexts (clusters) of a certain tag t, the Lorenz curves for its most similar tags are analyzed, and the number of times any possible pair of tags appears at the left side of t in the x axis of a curve is counted. Those tags co-occurring frequently in those left sides (above an empirical threshold) are grouped together.

In each cluster, the most ambiguous tag is selected as the cluster centroid. The representation of a full semantic context with this single tag will be very useful in the proposed contextualized profiling model, as explained later.
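The counting and grouping steps above can be sketched in Python as follows. Two points are assumptions of this sketch rather than statements of the text: tags whose pair count reaches the threshold are grouped by taking connected components, and the distances and ambiguity values are toy data.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Set

Distances = Dict[str, Dict[str, float]]


def cluster_contexts(target: str, similar: List[str], dist: Distances,
                     min_count: int) -> List[Set[str]]:
    """Group tags that often co-occur on the left side of `target`."""
    pair_counts = defaultdict(int)
    for center in similar:
        limit = dist[center][target]
        # tags closer to `center` than `target` lie at its left on the x axis
        left = sorted(o for o in similar
                      if o != center and dist[center][o] < limit)
        for a, b in combinations(left, 2):
            pair_counts[(a, b)] += 1
    adjacency = {s: set() for s in similar}
    for (a, b), count in pair_counts.items():
        if count >= min_count:
            adjacency[a].add(b)
            adjacency[b].add(a)
    clusters, seen = [], set()
    for start in similar:            # connected components (assumed grouping)
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(adjacency[node] - component)
        seen |= component
        clusters.append(component)
    return clusters


def centroid(cluster: Set[str], ambiguity: Dict[str, float]) -> str:
    """Most ambiguous tag of the cluster (alphabetical order breaks ties)."""
    return max(sorted(cluster), key=lambda t: ambiguity[t])


# Toy data: two contexts around "sf"
groups = [["sanfrancisco", "california", "bayarea"], ["scifi", "fantasy", "fiction"]]
tags = [t for g in groups for t in g]
dist = {a: {"sf": 0.5} for a in tags}
for a in tags:
    for b in tags:
        if a != b:
            same = any(a in g and b in g for g in groups)
            dist[a][b] = 0.2 if same else 0.9

clusters = cluster_contexts("sf", tags, dist, min_count=1)
print([sorted(c) for c in clusters])
```

On real data the threshold `min_count` would have to be tuned empirically, as the text notes; the value 1 only suits this six-tag toy set.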

Degrees of belonging of each tag to its cluster can also be computed by means of the semantic similarity or distance between the tag and the cluster centroid.

Example 2

As a second example, some results obtained with the present approach for the tag "web" are presented, which, in principle, has a single meaning, and thus is less ambiguous than "sf". Figure 11 shows a graph where the nodes correspond to tags semantically similar to web, and the edges link pairs of tags with high semantic similarities. The nodes have been grouped according to the clusters identified with the present approach. It can be seen that these clusters are not associated to different meanings of web, but to different semantic contexts in which that tag is utilized for social bookmarking in Delicious.

Figure 12 shows the Lorenz curve of web. In the x axis, those tags belonging to the same context tend to be located contiguously. Figures 13 and 14, on the other hand, show the Lorenz curves for tags "ajax" and "firefox", which belong to two different contexts in which "web" is used in Delicious: tagging of Web pages related to AJAX (Asynchronous JavaScript And XML) technology, and tagging of Web pages related to Web browsers, such as Firefox. In Figure 13, for the tag ajax, it can be seen that the tags belonging to the AJAX context (javascript, js, xml, etc.) are located in the x axis at the left side of web, while the rest of the tags appear at the right side. In Figure 14, for the tag firefox, an analogous behavior can be observed; tags that strongly describe the Web browser context (browsers, ie) are located in the x axis at the left side of web, while the rest of the tags appear at the right side.

The following table shows ambiguity values for social tags semantically related to web. Note that these values are calculated for each tag within its own semantic contexts. Hence, for example webdesign, tools and development are more ambiguous than web, since they are likely to be used in Delicious on more diverse semantic contexts.

Finally, the next table shows 9 semantic contexts of the tag web in Delicious, which are identified with the proposed approach. Context 1 is formed by tags related to Web design topics. Context 2 is related to the Social Web. Context 3 has vocabulary related to software tools and utilities available in the Web. Context 4 describes graphic and design topics of Web sites. Context 5 is associated to AJAX technology, which allows the creation of dynamic Web pages. Context 6 covers the generic context of Web sites, and Contexts 7, 8 and 9 describe more specific contexts, such as the development of Web applications, Web browsers, and tutorials, documentation and references in the Web.

A folksonomy F can be defined as a tuple F = {T, U, R, A}, where T = {t_1, ..., t_L} is the set of tags that comprise the vocabulary expressed by the folksonomy, U = {u_1, ..., u_M} and R = {r_1, ..., r_N} are respectively the sets of users and resources that annotate and are annotated with tags of T, and A = {(u_m, t_l, r_n)} ⊆ U × T × R is the set of tag assignments (TA) of each tag t_l to resources r_n by users u_m.

Based on this notation, the simplest way to represent the profile of user u_m is through a set of TA {(u_m, t, r) ∈ A | t ∈ T, r ∈ R}. Similarly, the profile of resource r_n can be represented as a set of TA {(u, t, r_n) ∈ A | u ∈ U, t ∈ T}. This representation can be adapted to a vector model as follows. The profile of user u_m is defined by a vector u_m = (u_{m,1}, ..., u_{m,L}), where u_{m,l} = |{(u_m, t_l, r) ∈ A | r ∈ R}| is the number of times the user has annotated resources with tag t_l. Similarly, the profile of resource r_n is defined by a vector r_n = (r_{n,1}, ..., r_{n,L}), where r_{n,l} = |{(u, t_l, r_n) ∈ A | u ∈ U}| is the number of times the resource has been annotated with tag t_l.

In this vector profile representation, alternative approaches to defining the weights u_{m,l} and r_{n,l} have been presented in the literature [6][18][23][25], considering for example TF-IDF or BM25 term weighting schemas [4], well known in the IR research field.
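As an illustration of the count-based vector profiles defined above, the following sketch builds user and resource profiles from a set of tag assignments. The triples, user names and function names are invented for the example, and the profiles are stored sparsely (tags with a zero count are simply absent).

```python
from collections import Counter


def user_profile(assignments, user):
    """u_{m,l}: number of resources the user annotated with each tag."""
    return Counter(t for (u, t, r) in assignments if u == user)


def resource_profile(assignments, resource):
    """r_{n,l}: number of users who annotated the resource with each tag."""
    return Counter(t for (u, t, r) in assignments if r == resource)


# Hypothetical tag assignments (user, tag, resource); A is a set of triples.
A = {
    ("alice", "java", "r1"),
    ("alice", "java", "r2"),
    ("alice", "web", "r2"),
    ("bob", "java", "r1"),
}

print(user_profile(A, "alice"))     # java counted twice, web once
print(resource_profile(A, "r1"))    # java counted twice (alice and bob)
```

Other weighting schemas such as TF-IDF or BM25 would replace the raw counts while keeping the same profile structure.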

Unlike previous approaches, this document does not propose new tag weighting schemas, but an alternative concept of "tag" in the profile models. Instead of considering raw tags in folksonomy-based user and resource profiles, the objective is to explicitly contextualize each tag with the semantic context(s) in which that tag was assigned to a resource by a particular user or by the folksonomy community.

A number of generic models to contextualize the tags of a folksonomy-based user or resource profile are presented. To better explain the proposed models, the example profile shown in the next table, which contains tags extracted from the Delicious system, will be used.

From this point on, a single folksonomy will be considered (e.g. one extracted from the Delicious system), and it will be assumed that, from that folksonomy, semantic distances between pairs of tags have already been computed and semantic contexts for each tag have been obtained.
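For readers who want a concrete picture of the precomputed tag distances assumed here, the sketch below shows one possible way to obtain them. It assumes that each tag is represented as a vector over resources, weighted by the number of users who annotated the resource with that tag, and that the distance is one minus the cosine similarity; cosine is only one of the admissible similarity functions, and the function names and data are hypothetical.

```python
import math
from collections import defaultdict


def tag_vectors(assignments):
    """Map each tag to a sparse vector {resource: number of distinct users}."""
    users = defaultdict(set)            # (tag, resource) -> users
    for u, t, r in assignments:
        users[(t, r)].add(u)
    vectors = defaultdict(dict)
    for (t, r), who in users.items():
        vectors[t][r] = len(who)
    return dict(vectors)


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(k, 0) for k, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def distance(a, b):
    """Semantic distance as 1 - similarity."""
    return 1.0 - cosine(a, b)


# Hypothetical tag assignments (user, tag, resource).
A = {
    ("alice", "java", "r1"),
    ("bob", "java", "r1"),
    ("alice", "programming", "r1"),
    ("alice", "web", "r2"),
}
vectors = tag_vectors(A)
print(distance(vectors["java"], vectors["programming"]))  # tags share r1
print(distance(vectors["java"], vectors["web"]))          # no shared resource
```

Because the representation is sparse, only the resources actually annotated with a tag take part in the computation.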

The first semantic contextualization model consists in selecting for each tag its closest context by comparing (e.g. by using the cosine-based similarity function [4]) the set of tags in the profile with the sets of tags belonging to the different semantic contexts of the tag.

The resultant single-context model profile would be formed by a set of "pseudo-tags" consisting of the union of the raw tag values and the selected semantic contexts.

In the given example, the tag "java" (within the profile [j2ee, web, tag]) may be compared with contexts like:

- programming: programming, dev, development, code, coding, j2ee, etc.

- library: library, libraries, libs, integration, frameworks, etc.

- opensource: opensource, open_source, open-source, etc.

- gui: gui, ui, interface, usability, etc.

From these semantic contexts, the closest one, called "programming", would be selected. Then, with that context, the tag "java" would be transformed into the contextualized tag "java|programming", as shown in the next table. The weight of this pseudo-tag could be the original one or another weight re-computed from the similarity between the tag and the selected context.

Tag     Weight    Contextualized tag    Weight
java    0.29      java|programming      0.29
j2ee    0.17      j2ee|java             0.17
web     0.42      web|web2.0            0.42
tag     0.12      tag|web2.0            0.12
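The selection of the closest context in the single-context model can be sketched as follows. This is a minimal illustration, assuming that the profile and each context are compared as plain tag sets with a set-based cosine similarity; the context lists are truncated to the tags quoted above (the "etc." members are unknown), and the function names are hypothetical.

```python
import math


def set_cosine(a, b):
    """Cosine similarity between two tag sets treated as binary vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def closest_context(profile_tags, contexts):
    """Name of the context whose tag set is most similar to the profile.

    Ties (including the all-zero case) resolve to the first context listed.
    """
    return max(contexts, key=lambda name: set_cosine(profile_tags, contexts[name]))


def contextualize(tag, profile_tags, contexts):
    """Build the pseudo-tag 'tag|context' from the rest of the profile."""
    others = set(profile_tags) - {tag}
    return f"{tag}|{closest_context(others, contexts)}"


# Contexts of "java", limited to the tags listed in the text.
java_contexts = {
    "programming": {"programming", "dev", "development", "code", "coding", "j2ee"},
    "library": {"library", "libraries", "libs", "integration", "frameworks"},
    "opensource": {"opensource", "open_source", "open-source"},
    "gui": {"gui", "ui", "interface", "usability"},
}
profile = ["java", "j2ee", "web", "tag"]
print(contextualize("java", profile, java_contexts))
```

In this toy setting only the "programming" context shares a tag (j2ee) with the rest of the profile, so it is the one selected.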

Another variant, called the extended single-context model, follows a slightly different approach. Instead of creating contextualized tags, it keeps the original tags of the initial non-contextualized profile and extends them with the centroid tags of the selected semantic contexts. Thus, for example, the tag "java" would lead to including "programming" as a new tag of the profile, as shown in the next table. With this extension, the semantic context java-programming would be strengthened in the profile. The weights of the new tags would be set taking into account the weights of the original tags and their similarities with the selected contexts.
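The extension step can be illustrated with the short sketch below. The text does not fix the weighting formula, so the sketch assumes, purely for illustration, that each added centroid tag receives the original tag's weight multiplied by its similarity with the selected context; the similarity value 0.5 and the function name are invented.

```python
def extend_profile(profile, selected):
    """Add centroid tags to a profile without altering the original tags.

    profile:  {tag: weight}
    selected: {tag: (centroid_tag, similarity)} for the chosen contexts.
    Assumed weighting: original weight * similarity, accumulated per centroid.
    """
    extended = dict(profile)
    for tag, (centroid, similarity) in selected.items():
        extended[centroid] = extended.get(centroid, 0.0) + profile[tag] * similarity
    return extended


profile = {"java": 0.29, "j2ee": 0.17, "web": 0.42, "tag": 0.12}
# Hypothetical similarity between "java" and its selected context.
selected = {"java": ("programming", 0.5)}

extended = extend_profile(profile, selected)
print(extended)
```

The original four tags keep their weights, and "programming" is added as a fifth tag of the profile.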

The first instance of a multiple-context model follows the same principle as the single-context model presented above. However, instead of creating pseudo-tags joining each original tag in the profile with the centroid tag of its closest context, this model creates multiple pseudo-tags for each original tag by considering the C closest semantic contexts. Thus, for example, as shown in the next table, the tag "java" could be transformed into the contextualized tags "java|programming" and "java|tools".

Tag     Weight    Contextualized tag    Weight
java    0.29      java|programming      0.20
j2ee    0.17      java|tools            0.14
web     0.42      j2ee|java             0.13
tag     0.12      web|web2.0            0.23
                  web|webdev            0.21
                  tag|web2.0            0.08

Analogously to the single-context model, an extended multiple-context model is possible in which, instead of being converted into artificial pseudo-tags, the centroids of the semantic contexts closest to the original tags are directly incorporated into the profile. Following the previous example, the tags "programming" and "tools", which are the centroids of the contexts closest to "java" (within the profile [j2ee, web, tag]), would appear in the extended profile, as shown in the following table.
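The construction of multiple pseudo-tags from the C closest contexts can be sketched as follows. The weighting rule is an assumption made for the example (the original weight is shared among the selected contexts in proportion to their similarity) and is not claimed to reproduce the weights of the table above; the similarity values and the function name are likewise invented.

```python
def multi_context(tag, weight, similarities, c=2):
    """Create up to c pseudo-tags 'tag|context' for the closest contexts.

    similarities: {context_centroid: similarity with the profile}
    Assumed weighting: the original weight is split proportionally to the
    similarities of the selected contexts.
    """
    top = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)[:c]
    total = sum(sim for _, sim in top)
    if total == 0:
        return {tag: weight}            # no usable context: keep the raw tag
    return {f"{tag}|{ctx}": weight * sim / total for ctx, sim in top}


# Hypothetical similarities between the profile and the contexts of "java".
sims = {"programming": 0.6, "tools": 0.4, "gui": 0.1}
pseudo = multi_context("java", 0.29, sims, c=2)
print(pseudo)
```

For the extended multiple-context variant, the same selection step would be used, but the centroid tags themselves ("programming", "tools") would be added to the profile instead of the pseudo-tags.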

Advantages of the invention

There are seminal studies on the ambiguity of social tags that show the existence of, and attempt to identify, different semantic meanings of a tag within a particular folksonomy [2][5]. Recent works go a step further and present ambiguity measures of individual tags [9] and sets of tags [24]. These measures are based on information theoretic metrics, such as entropy and Kullback-Leibler divergence, which are computed on the whole tag set of the folksonomy. The ambiguity metric described in this document is more efficient since it is computed from a reduced set of semantically related tags. In the literature, there are techniques that cluster the tag space by grouping those tags that co-occur more often in resource annotations of a folksonomy [3][20]. The present approach is also built upon a clustering basis, but it is more efficient than previous approaches because (1) it does not need to cluster the whole tag space to obtain the semantic contexts of a tag, and (2) it does not need to define a stop criterion to determine the optimum number of contexts for the given tag.

There exist folksonomy-based information retrieval and filtering (recommendation) approaches that attempt to semantically contextualize the use of tags [3][12][17][19][20][25][26], but they are based on the (implicit) consideration and incorporation of contextualization information into the retrieval and filtering algorithms, and do not explicitly modify the user and resource profiles with such contextual information.

The described tag disambiguation approach offers the following improvements over existing tag disambiguation techniques:

- More flexible than approaches which make use of external knowledge bases, such as ontologies, WordNet and Wikipedia. The present approach is capable of measuring tag semantic ambiguities, and identifying different semantic meanings and user contexts of tags by only exploiting the TA of an input folksonomy.

- Adaptable to different semantic distances between tags. The present approach is built upon a vector space where similarities between tags can be measured, and it is independent of the utilized tag similarity and distance metrics.

- More scalable to folksonomy dynamics. The present approach measures the ambiguity of a tag within a reduced set of semantically similar tags, instead of depending on the whole folksonomy. Thus, it can be adapted to changes of the folksonomy (e.g. by means of new tag annotations) in a straightforward way, without strong computational requirements.

When compared to other contextualized folksonomy-based information retrieval and recommendation systems, the present approach for semantic tag contextualization and user profiling includes the following benefits:

- Configurable to obtain different sizes (number of tags) of the clusters. The present approach accepts an input parameter that allows obtaining various degrees of specificity on the semantics underlying the tag meanings and contexts.

- Provides a clustering stop criterion independent of any parameter setting. Unlike standard hierarchical and partitional (e.g., k-means) clustering techniques, the present approach automatically determines the optimum number of clusters.

- Greater portability of contextualization information. Since the tags are explicitly contextualized in the user and resource profiles, the resultant profiles can be exploited by other systems. By contrast, existing approaches implicitly or explicitly incorporate the contextual information within the retrieval and filtering algorithms.

A person skilled in the art could introduce changes and modifications in the embodiments described without departing from the scope of the invention as it is defined in the attached claims.

ACRONYMS

AJAX Asynchronous JavaScript And XML

IR Information Retrieval

RS Recommender Systems

TA Tag Assignment

REFERENCES

[1] Angeletou, S., Sabou, M., and Motta, E. 2009. Improving Folksonomies Using Formal Knowledge: A Case Study on Search. In Proceedings of the 4th Asian Semantic Web Conference (ASWC 2009), 276-290.

[2] Au Yeung, C. M., Gibbins, N., and Shadbolt, N. 2007. Understanding the Semantics of Ambiguous Tags in Folksonomies. In Proceedings of the International Workshop on Emergent Semantics and Ontology Evolution (ESOE2007), 108-121.

[3] Au Yeung, C. M., Gibbins, N., and Shadbolt, N. 2009. Contextualising Tags in Collaborative Tagging Systems. In Proceedings of the 20th ACM Conference on Hypertext and Hypermedia (Hypertext 2009), 251-260.

[4] Baeza-Yates, R. A. and Ribeiro-Neto, B. 1999. Modern Information Retrieval. Addison-Wesley Longman Publishing Co., Inc.

[5] Begelman, G., Keller, P., and Smadja, F. 2006. Automated Tag Clustering: Improving Search and Exploration in the Tag Space. In Proceedings of the WWW06 Collaborative Web Tagging Workshop.

[6] Cantador, I., Bellogin, A., and Vallet, D. 2010. Content-based Recommendation in Social Tagging Systems. In Proceedings of the 4th ACM Conference on Recommender Systems (RecSys 2010), 237-240.

[7] Cantador, I., Szomszor, M., Alani, H., Fernandez, M., and Castells, P. 2008. Enriching Ontological User Profiles with Tagging History for Multi-Domain Recommendations. In Proceedings of the 1st International Workshop on Collective Semantics: Collective Intelligence and the Semantic Web (CISWeb 2008), 5-19.

[8] Garcia-Silva, A., Szomszor, M., Alani, H., and Corcho, O. 2009. Preliminary Results in Tag Disambiguation using DBpedia. In Proceedings of the 1st International Workshop on Collective Knowledge Capturing and Representation (CKCaR 2009).

[9] Gemmell, J., Ramezani, M., Schimoler, T., Christiansen, L., and Mobasher, B. 2009. The Impact of Ambiguity and Redundancy on Tag Recommendation in Folksonomies. In Proceedings of the 3rd ACM Conference on Recommender Systems (RecSys 2009), 45-52.

[10] Gini, C. 1912. Variabilità e mutabilità. Memorie di Metodologica Statistica, Pizetti, E., Salvemini, T. (Eds.).

[11] Golder, S. A., and Huberman, B. A. 2006. Usage Patterns of Collaborative Tagging Systems. Journal of Information Science 32(2), 198-208.

[12] Hotho, A., Jaschke, R., Schmitz, C., and Stumme, G. 2006. Information Retrieval in Folksonomies: Search and Ranking. In Proceedings of the 5th International Semantic Web Conference (ISWC 2006), 411-426.

[13] Jaschke, R., Marinho, L., Hotho, A., Schmidt-Thieme, L., and Stumme, G. 2008. Tag Recommendations in Social Bookmarking Systems. AI Communications 21(4), 231-247.

[14] Lorenz, M. O. 1905. Methods of Measuring the Concentration of Wealth. Publications of the American Statistical Association 9(70), 209-219.

[15] Markines, B., Cattuto, C., Menczer, F., Benz, D., Hotho, A., and Stumme, G. 2009. Evaluating Similarity Measures for Emergent Semantics of Social Tagging. In Proceedings of the 18th International Conference on World Wide Web (WWW 2009), 641-650.

[16] Mathes, A. 2004. Folksonomies - Cooperative Classification and Communication through Shared Metadata. Computer Mediated Communication - LIS590CMC (Doctoral Seminar), Graduate School of Library and Information Science, University of Illinois Urbana-Champaign, IL, USA, December 2004.

[17] Niwa, S., Doi, T., and Honiden, S. 2006. Web Page Recommender System based on Folksonomy Mining for ITNG '06 Submissions. In Proceedings of the 3rd International Conference on Information Technology: New Generations (ITNG 2006), 388-393.

[18] Noll, M. G., and Meinel, C. 2007. Web Search Personalization via Social Bookmarking and Tagging. In Proceedings of the 6th International Semantic Web Conference (ISWC 2007), 367-380.

[19] Sen, S., Vig, J., and Riedl, J. 2009. Tagommenders: Connecting Users to Items through Tags. In Proceedings of the 18th international Conference on World Wide Web (WWW 2009), 671-680.

[20] Shepitsen, A., Gemmell, J., Mobasher, B., and Burke, R. 2008. Personalized Recommendation in Social Tagging Systems using Hierarchical Clustering. In Proceedings of the 2nd ACM Conference on Recommender Systems (RecSys 2008), 259-266.

[21] Smith, G. 2004. Folksonomy: Social Classification. http://atomiq.org/archives/2004/08/folksonomy_social_classification.html

[22] Specia, L., and Motta, E. 2007. Integrating Folksonomies with the Semantic Web. In Proceedings of the 4th European Conference on the Semantic Web (ESWC 2007), 624-639.

[23] Vallet, D., Cantador, I., and Jose, J. M. 2010. Personalizing Web Search with Folksonomy-Based User and Item Profiles. In Proceedings of the 32nd European Conference on Information Retrieval (ECIR 2010), 420-431.

[24] Weinberger, K. Q., Slaney, M., and Van Zwol, R. 2008. Resolving Tag Ambiguity. In Proceedings of the 16th ACM International Conference on Multimedia (MM 2008), 111-120.

[25] Xu, S., Bao, S., Fei, B., Su, Z., and Yu, Y. 2008. Exploring Folksonomy for Personalized Search. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2008), 155-162.

[26] Zanardi, V., and Capra, L. 2008. Social Ranking: Uncovering Relevant Content using Tag-based Recommender Systems. In Proceedings of the 2nd ACM Conference on Recommender Systems (RecSys 2008), 51-58.

Claims

1. - A computer implemented method to identify semantic meanings and use contexts of social tags, wherein said social tags are associated to resources by users, characterised in that it comprises performing by computing means said identification by applying a clustering strategy based on Lorenz curves and Gini coefficients and employing said identification for social tag semantic disambiguation.

2. - A computer implemented method as per claim 1, comprising retrieving said social tags from a folksonomy and storing said social tags in a database, further including in said database an identification of the resource that has been tagged and of the user that has tagged said resource, for each of said social tags.

3. - A computer implemented method as per claim 2, comprising computing semantic distances between pairs of said social tags stored in said database.

4. - A computer implemented method as per claim 3, comprising employing at least one of the following similarity techniques to compute said semantic distances: overlap similarity, Jaccard similarity, Dice similarity and Cosine similarity.

5. - A computer implemented method as per claim 4, comprising calculating said semantic distance between said pair of said social tags according to the following expression:

dist(t_1, t_2) = 1 - sim(t_1, t_2)

where

dist is the distance between two social tags;

t_1 and t_2 are said two social tags; and

sim is the result of employing one of said similarity techniques.

6.- A computer implemented method as per claim 3, 4 or 5, wherein said social tag semantic disambiguation comprises measuring relative semantic ambiguities between a certain social tag and at least part of said stored social tags.

7. - A computer implemented method as per claim 6, comprising determining said at least part of said stored social tags according to said computation of semantic distances.

8. - A computer implemented method as per claim 7, wherein said at least part of said stored social tags are the ones that have the lowest distances with said certain social tag.

9. - A computer implemented method as per any of previous claims 2 to 8, comprising representing said social tags in an N-dimensional vector space, N being the number of resources annotated with said social tags of said folksonomy.

10. - A computer implemented method as per claim 9, comprising representing each of said social tags by means of an N-dimensional vector according to the following expression:

R_t = (w_{t,1}, ..., w_{t,N})

where

R_t is said N-dimensional vector for a social tag t; and

w_{t,n} is a weight associated to said social tag t and a resource r_n.

11. - A computer implemented method as per claim 10, wherein said weight is the number of users who annotated said resource r_n with said social tag t.

12. - A computer implemented method as per any of previous claims 6 to 11, comprising measuring said relative semantic ambiguities between said certain social tag and said at least part of said stored social tags based on the Gini coefficient of a Lorenz curve in which the independent variable corresponds to a set of values determined by said at least part of said stored social tags and in which the dependent variable gets as values the cumulative distances given by said semantic distances.

13. - A computer implemented method as per claim 12, wherein said set of values are ordered by increasing semantic distance from said certain social tag.

14. - A computer implemented method as per claims 12 or 13, comprising calculating an ambiguity parameter of said certain social tag according to the following expression: amb(t) =

where

amb(t) is the value of said ambiguity parameter of said certain social tag;

t is said certain social tag;

G(t) is said Gini coefficient of said certain social tag t;

t_i are said at least part of said stored social tags; and

|similar(t)| is the number of said at least part of said stored social tags.

15. - A computer implemented method as per any of previous claims 6 to 14, wherein said clustering strategy comprises:

- selecting the most similar social tags for a given social tag constituting a first tagset;

- selecting, for each of the items in said first tagset, the most similar social tags constituting a second tagset;

- obtaining said use contexts, or clusters, of said given social tag based on the ascending semantic distance order in all the Lorenz curves for each item in said second tagset; and

- establishing the number of said clusters through the identification of ambiguous social tags according to said relative semantic ambiguities.

16. - A computer implemented method as per claim 15, comprising determining said clusters of said given social tag by counting the number of times that a pair of social tags belonging to said first or second tagsets appear at the left side of said given social tag in the horizontal axis of the Lorenz curves of said social tags belonging to said first or second tagsets; if said number of times is superior to an empirical threshold said pair of social tags are grouped in a same use context, or cluster.

17. - A computer implemented method as per claims 15 or 16, comprising selecting the most ambiguous tag of each cluster as a centroid of said cluster.

18. - A computer implemented method as per claim 17, comprising computing degrees of belonging of each social tag to its cluster by means of said semantic distance between each social tag and said centroid of its cluster.

19. - A computer implemented method as per any of previous claims 15 to 18, comprising contextualizing a social tag with said use context.

20. - A computer implemented method as per any of previous claims, comprising extending a tag-based profile, said tag-based profile comprising a list of social tags assigned to a resource by a user with a corresponding weight for each of said social tags, by adding for each of said social tags at least its use context.

21.- A computer implemented method as per claim 20, comprising maintaining said corresponding weight for each social tag in said extended tag-based profile into a database of tag profiles stored at a server.

22. - A computer implemented method as per any of previous claims, wherein said method is implemented in a server which on the reception of an annotation as free text inserted by a user at a client device and sent to said server, following the procedures of the method, disambiguates said annotation and stores the disambiguated tag in the server database as the true annotation intended by said user.

23. - A computer implemented method as per any of claims 1 to 22, wherein said method is implemented in a server which on the reception of an annotation as free text inserted by a user at a client device and sent to said server, following the procedures of the method, disambiguates said annotation, said disambiguation generating candidates for disambiguation, the method further comprising sending back, by means of said server, said candidates for disambiguation to said user client device, where they are shown to the user so that the user can confirm which of the interpreted terms is the intended meaning, and this confirmation is sent back to the server, the method comprising said server storing in the server database the confirmed candidate for disambiguation.

24. - A computer implemented method as per any of claims 22 or 23, comprising using the server database with stored disambiguated annotations to create a user profile, said user profile comprising, for each user, the set of tags annotated by the user together with their disambiguated meaning, and the time and date when the annotations were made.

25. - A computer implemented method as per claim 24, comprising employing said created user profile to provide an annotation recommendation service where, when a user starts entering an annotation at a user device, each partial input is sent to the server, where it is compared with the annotations in the user profile, and partial matches to either the free-text tags or the disambiguated meanings are sent back to the user device as suggestions for finalizing the annotation.

26. - A computer implemented method, as per the previous claim, comprising using the date in which annotations were made, as stored in the server database, and/or the number of occurrences of each tag in the user profile stored in the database, and/or the number of occurrences of each disambiguated concept in the user profile stored in the database, and/or the weight assigned to each tag, to implement a weighting function that ranks tag suggestions before sending them to the user, so that said tag suggestions are presented ordered by rank at the user device.

27. - A computer implemented method, as per the previous claim, comprising using, to implement said weighting function, in addition to the information of the user profile stored in the server database, also an aggregation of all user profiles stored in the server database, said aggregation comprising all the user annotations, their disambiguated meaning and their time of creation.

28. - A computer implemented method, as per claim 26, comprising using, to implement said weighting function, in addition to the information of the user profile stored in the server database, also an aggregation of user profiles of a subset of all user profiles stored in the server database, said aggregation comprising the annotations of said subset of users, their disambiguated meaning and their time of creation, wherein said subset of users is selected according to a determined criterion that creates groups of users, said subset consisting of the group or groups of users to which the user being provided the service belongs, according to said determined criterion.

29.- A computer implemented method as per any of previous claims, comprising classifying said resources in a database according to said use contexts of said social tags.

30.- A computer implemented method as per any of previous claims, comprising assigning at least part of said resources to a plurality of user devices, when users of said plurality of user devices constitute a closed group, according to use context of said social tags, said use context being chosen by said users.