CN111931041A - Label recommendation method and device, electronic equipment and storage medium - Google Patents

Publication number
CN111931041A
Authority
CN
China
Prior art keywords
label
target
target resource
user
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010636335.8A
Other languages
Chinese (zh)
Inventor
陈程
王贺
周鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Zhuoer Digital Media Technology Co ltd
Original Assignee
Wuhan Zhuoer Digital Media Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Zhuoer Digital Media Technology Co ltd
Priority to CN202010636335.8A
Publication of CN111931041A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a label recommendation method, a label recommendation device, electronic equipment and a storage medium, wherein the label recommendation method comprises the following steps: determining a first candidate label of a target resource according to labels of similar resources of the target resource; according to the label of the target resource marked by the similar user of the target user, determining a second candidate label of the target resource; determining a third candidate label of the target resource according to the content of the target resource; outputting a recommended label for the target user to label the target resource according to the first candidate label, the second candidate label and the third candidate label; wherein the determination of the similar resources comprises: calculating the similarity between the subject of the target resource and the subjects of the alternative resources, and taking m1 alternative resources with the maximum similarity to the subject of the target resource as the similar resources, wherein m1 is a positive integer; the target resource includes, but is not limited to, an image or a document.

Description

Label recommendation method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of information technologies, and in particular, to a tag recommendation method and apparatus, an electronic device, and a storage medium.
Background
Tags are a major and effective way of organizing network information resources in the second generation internet era, allowing users to annotate various resources in a network with customized keywords, i.e., tags, in order to efficiently organize, retrieve, and utilize these resources. The label is created by the user without any restriction.
Current label recommendation methods usually rely on a single recommendation technique when recommending labels to a user who is labeling a resource, and the recommendation results are often not accurate or reasonable enough.
Disclosure of Invention
The embodiment of the invention provides a label recommendation method and device, electronic equipment and a storage medium.
The technical scheme of the embodiment of the invention is realized as follows:
the embodiment of the invention provides a tag recommendation method, which comprises the following steps:
determining a first candidate label of a target resource according to labels of similar resources of the target resource;
according to the label of the target resource marked by the similar user of the target user, determining a second candidate label of the target resource;
determining a third candidate label of the target resource according to the content of the target resource;
outputting a recommended label for the target user to label the target resource according to the first candidate label, the second candidate label and the third candidate label;
wherein the determination of the similar resources comprises: calculating the similarity between the subject of the target resource and the subjects of the alternative resources, and taking m1 alternative resources with the maximum similarity to the subject of the target resource as the similar resources, wherein m1 is a positive integer; the target resource includes, but is not limited to, an image or a document.
In the above scheme, the method includes:
and determining the similar users according to the similarity between the theme of the label of the target user and the theme of the label of the alternative user, wherein the alternative user is one or more users with known labels.
In the foregoing solution, the determining the similar user according to the similarity between the theme of the tag of the target user and the theme of the tag of the candidate user includes:
taking m2 alternative users with the largest similarity to the subject of the target user's tag as the similar users, wherein m2 is a positive integer.
In the foregoing solution, the determining a third candidate tag of a target resource according to content of the target resource includes:
determining a target word capable of reflecting the content characteristics in the content of the target resource;
and taking the target word as a third candidate label of the target resource, wherein the content of the target resource is text information describing the target resource.
An embodiment of the present invention further provides a tag recommendation apparatus, including:
the candidate label determining module is used for determining a first candidate label of the target resource according to the label of the similar resource of the target resource; according to the label of the target resource marked by the similar user of the target user, determining a second candidate label of the target resource; determining a third candidate label of the target resource according to the content of the target resource;
a recommended label determining module, configured to output a recommended label for the target user to label the target resource according to the first candidate label, the second candidate label, and the third candidate label;
a similar resource determining module, configured to calculate similarities between topics of the target resources and topics of candidate resources, and take m1 candidate resources with the largest similarity to the topic of the target resources as the similar resources, where m1 is a positive integer; the target resource includes, but is not limited to, an image or a document.
In the above scheme, the apparatus further comprises:
and the similar user determining module is used for determining the similar users according to the similarity between the theme of the label of the target user and the theme of the label of the alternative user, wherein the alternative user is one or more users with known labels.
In the foregoing solution, the similar user determining module is specifically configured to use m2 alternative users with the greatest similarity to the topic of the tag of the target user as the similar users, where m2 is a positive integer.
In the foregoing solution, the candidate tag determining module is specifically configured to determine a target word that can reflect the content characteristics in the content of the target resource; and taking the target word as a third candidate label of the target resource, wherein the content of the target resource is text information describing the target resource.
An embodiment of the present invention further provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for implementing the tag recommendation method provided by the embodiment of the invention when the executable instructions stored in the memory are executed.
The embodiment of the invention also provides a computer-readable storage medium, which stores executable instructions, and when the executable instructions are executed by a processor, the tag recommendation method provided by the embodiment of the invention is realized.
According to the embodiments of the invention, the recommended labels are determined using three label recommendation methods. First, labels are recommended from three perspectives (the resource, the user, and the content of the resource), which alleviates the data sparsity of any single recommendation method, improves data consistency, increases the probability that the recommended labels cover the label the user wants to select, and improves the accuracy of the recommended labels. Second, because labels from three label sources are used for recommendation, the labels describe the resource more comprehensively and accurately, which improves the quality of the recommended labels. Third, because labels are recommended from both the user's perspective and the resource's perspective, the recommended labels are closer to the target resource in content while better matching the user's preferences and needs, which improves both accuracy and user experience. Fourth, recommending labels while the user is labeling a resource reduces the user's labeling burden and increases the user's willingness to label resources.
Drawings
Fig. 1 is a schematic flowchart of a tag recommendation method according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of another tag recommendation method according to an embodiment of the present invention;
Fig. 3 is a schematic diagram illustrating a principle of an LDA-based personalized tag recommendation method according to an embodiment of the present invention;
Fig. 4 is a schematic flowchart of an LDA-based personalized tag recommendation method according to an embodiment of the present invention;
Fig. 5 is a schematic diagram illustrating a principle of an LDA-based personalized tag recommendation method according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a tag recommendation device according to an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be further described in detail with reference to the accompanying drawings, the described embodiments should not be construed as limiting the present invention, and all other embodiments obtained by a person of ordinary skill in the art without creative efforts shall fall within the protection scope of the present invention.
It will be appreciated by those skilled in the art that while the following description refers to numerous technical details of embodiments of the present invention, this is by way of example only, and not by way of limitation, to illustrate the principles of the invention. The present invention can be applied to places other than the technical details exemplified below as long as they do not depart from the principle and spirit of the present invention.
In addition, in order to avoid limiting the description of the present specification to a great extent, in the description of the present specification, it is possible to omit, simplify, and modify some technical details that may be obtained in the prior art, as would be understood by those skilled in the art, and this does not affect the sufficiency of disclosure of the present specification.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein is for the purpose of describing embodiments of the invention only and is not intended to be limiting of the invention.
The following describes a tag recommendation method provided by an embodiment of the present invention. Referring to fig. 1, fig. 1 is a schematic flowchart of a tag recommendation method according to an embodiment of the present invention; the tag recommendation method provided by the embodiment of the invention comprises the following steps:
step S101: determining a first candidate label of a target resource according to labels of similar resources of the target resource;
step S102: according to the label of the target resource marked by the similar user of the target user, determining a second candidate label of the target resource;
step S103: determining a third candidate label of the target resource according to the content of the target resource;
step S104: and outputting a recommended label for the target user to label the target resource according to the first candidate label, the second candidate label and the third candidate label.
In the disclosed embodiment of the present invention, the target resources include, but are not limited to: an image or a document. The target user refers to a user who currently marks the target resource.
Specifically, in an embodiment, step S101 includes: calculating the similarity between each label of the similar resources and the target resource, and taking the n1 labels with the greatest similarity as the first candidate labels.
In another embodiment, step S101 includes: counting how often each label occurs in the label set of the similar resources, and taking the n1 most frequent labels as the first candidate labels. The label set of the similar resources contains all labels of the similar resources.
Specifically, in an embodiment, step S102 includes: calculating the similarity between each label that the similar users applied to the target resource and the target resource, and taking the n2 labels with the greatest similarity as the second candidate labels.
In another embodiment, step S102 includes: counting how often each label occurs in the set of labels that the similar users applied to the target resource, and taking the n2 most frequent labels as the second candidate labels. In another embodiment, when the similar users have not labeled the target resource, all labels of the similar users are ranked by their similarity to the target resource, and the second candidate labels are determined from that ranking.
The labels of the similar resources, and the labels that similar users applied to the target resource, need not be ranked by the methods above; other related algorithms can be used to rank the labels, with the candidate labels determined from the ranking result. For example, the labels are ranked by similarity from high to low and the top N labels are taken as the candidate labels, where N is any positive integer.
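The frequency-based variant of steps S101 and S102 can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `top_tags_by_frequency` and the sample tags are hypothetical and do not come from the patent.

```python
from collections import Counter

def top_tags_by_frequency(tag_lists, n):
    """Return the n most frequent tags across several tag lists.

    tag_lists: one list of tags per similar resource (step S101), or per
    similar user's annotation of the target resource (step S102).
    Ties keep first-seen order, which is how Counter.most_common behaves.
    """
    counts = Counter(tag for tags in tag_lists for tag in tags)
    return [tag for tag, _ in counts.most_common(n)]

# Hypothetical labels of three similar resources.
similar_resource_tags = [
    ["python", "tutorial", "nlp"],
    ["python", "nlp"],
    ["python", "lda"],
]
first_candidates = top_tags_by_frequency(similar_resource_tags, 2)
```

The same helper would serve the second candidate labels if it is fed the tags that similar users applied to the target resource.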
The content of the target resource is text information describing the target resource. In an embodiment, step S103 includes: taking the n3 keywords that occur most frequently in the content of the target resource as the third candidate labels.
In another embodiment, step S103 includes: determining target words that reflect the content characteristics in the content of the target resource, and taking the target words as the third candidate labels of the target resource.
Specifically, the third candidate labels of the target resource may be determined using TF-IDF (Term Frequency-Inverse Document Frequency): the n3 words with the largest TF-IDF values in the text content of the target resource are used as the third candidate labels. TF-IDF is a statistical measure of how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval, text mining, and user modeling. The TF-IDF value increases with the number of times a word occurs in the document and decreases with the number of documents in the corpus that contain the word.
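A minimal sketch of TF-IDF-based selection of the third candidate labels follows. It assumes one common TF-IDF variant (tf = count / document length, idf = log(N / df)); the patent does not say which variant is used, and the function name and sample corpus are hypothetical.

```python
import math
from collections import Counter

def tfidf_top_terms(target_tokens, corpus_tokens, n):
    """Rank the words of one document by TF-IDF and return the top n.

    target_tokens: tokenised text describing the target resource.
    corpus_tokens: list of tokenised documents; it must include the target
    document so that every target word has a document frequency >= 1.
    """
    n_docs = len(corpus_tokens)
    doc_freq = Counter()
    for doc in corpus_tokens:
        doc_freq.update(set(doc))  # count each word once per document
    tf = Counter(target_tokens)
    scores = {
        w: (c / len(target_tokens)) * math.log(n_docs / doc_freq[w])
        for w, c in tf.items()
    }
    # Highest score first; alphabetical order breaks ties deterministically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:n]

# Hypothetical tokenised corpus; the first document is the target resource.
corpus = [
    ["lda", "topic", "model", "topic"],
    ["topic", "tag"],
    ["model", "tag", "user"],
]
third_candidates = tfidf_top_terms(corpus[0], corpus, 2)
```

Here "lda" outranks "topic" even though "topic" occurs twice, because "lda" appears in only one document of the corpus.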
Step S104 includes: after the first candidate labels, the second candidate labels and the third candidate labels are obtained, all candidate labels can be ranked using a related algorithm, and the final recommended labels are determined according to the ranking result. The candidate label set includes the first candidate labels, the second candidate labels and the third candidate labels.
Specifically, in an embodiment, the ratio of first, second and third candidate labels among the recommended labels may be set directly to 1:1:1, and the recommended label set is formed by selecting the same number of labels from the first, second and third candidate label lists. In another embodiment, the frequency of occurrence of each candidate label in the candidate label set is calculated, and the n4 most frequent labels are selected as the final recommended labels. In another embodiment, the KL distance (Kullback-Leibler divergence) between candidate labels is calculated, the labels are sorted by KL distance, and the labels with the smallest KL distance are selected as the final recommended labels.
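Two of the merging strategies for step S104 can be sketched as follows. This is an illustration only: the function names are hypothetical, and the round-robin variant skips duplicates, so it is an approximate reading of the 1:1:1 ratio rather than a definitive one.

```python
from collections import Counter
from itertools import zip_longest

def merge_round_robin(first, second, third, n):
    """Take tags from the three ranked lists in turn (about 1:1:1), skipping duplicates."""
    merged = []
    for group in zip_longest(first, second, third):
        for tag in group:
            if tag is not None and tag not in merged:
                merged.append(tag)
    return merged[:n]

def merge_by_frequency(first, second, third, n):
    """Keep the n tags that occur most often in the combined candidate set."""
    counts = Counter(first + second + third)
    return [tag for tag, _ in counts.most_common(n)]

# Hypothetical candidate lists from steps S101 to S103.
first = ["python", "nlp"]
second = ["nlp", "lda"]
third = ["topic", "python"]

by_ratio = merge_round_robin(first, second, third, 3)
by_frequency = merge_by_frequency(first, second, third, 2)
```

The frequency variant favours tags that several sources agree on, while the round-robin variant guarantees that each source contributes.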
In the first aspect of this embodiment, by combining tag recommendation methods based on three angles of users, resources, and resource contents, the data consistency is improved, and the defect of data sparsity of a single tag recommendation technology is reduced, so that the quality of tags is improved. In the second aspect, by adopting tags from more tag sources, the description of the tags on resources is more comprehensive, and the quality of the recommended tags is improved. In the third aspect, by combining the user-based and resource-based tag recommendation methods, the recommended tags are more in line with the requirements of the users, and the accuracy of the tags is improved. In the fourth aspect, by providing the recommended labels for the users, the labeling burden of the users is reduced, the enthusiasm of the users for labeling is improved, and the labeling efficiency is improved.
In some embodiments, the determination of similar resources may further include: determining the similar resources according to the similarity between the resource information of the target resources and the resource information of the alternative resources, wherein the alternative resources are one or more resources with known labels; the resource information includes at least one of:
a subject;
text content;
author information;
an attributive column;
information of the target reader.
Specifically, in an embodiment, the determining of the similar resource includes: one or more keywords with the highest occurrence frequency in the text content of the resources are extracted as a keyword set of the resources, the similarity between the keyword set of the target resources and the keyword set of the alternative resources is calculated, and m1 alternative resources with the maximum similarity are used as similar resources. m1 is a positive integer.
The similarity between the keyword set of the target resource and the keyword set of an alternative resource is calculated as follows: each resource is regarded as a document and its keywords as the words in the document; a topic model is constructed to obtain the topic probability distribution of each resource's keywords; and the distance between the topic probability distributions of the target resource and the alternative resource is calculated. The smaller the distance, the greater the similarity between the two resources.
In one embodiment, the author information includes, but is not limited to, the name of the author, the age of the author, and the work information of the author. The determination of similar resources includes: and taking the alternative resource which is the same as the author information of the target resource as the similar resource.
In an embodiment, the determination of similar resources comprises: and taking the alternative resource which is the same as the attributive column of the target resource as the similar resource.
In one embodiment, the information of the target reader includes, but is not limited to, reading preferences of the target reader. The determination of similar resources includes: and taking the alternative resource which is the same as the target reader information of the target resource as the similar resource.
In an embodiment, the determination of similar resources comprises: and calculating the similarity between the subject of the target resource and the subject of the alternative resource, and taking the m1 alternative resources with the maximum similarity as similar resources. m1 is a positive integer.
As shown in fig. 2, another tag recommendation method provided in this embodiment includes:
step S201: determining a first candidate label of a target resource according to labels of similar resources of the target resource; wherein the determination of the similar resources comprises: and calculating the similarity between the subject of the target resource and the subject of the alternative resource, and taking the m1 alternative resources with the maximum similarity as similar resources. m1 is a positive integer.
Step S202: according to the label of the target resource marked by the similar user of the target user, determining a second candidate label of the target resource;
step S203: determining a third candidate label of the target resource according to the content of the target resource;
step S204: and outputting a recommended label for the target user to label the target resource according to the first candidate label, the second candidate label and the third candidate label.
Specifically, the determination of the similar resources includes: computing a topic probability distribution matrix for each resource in the resource set. Each resource is regarded as a document and the labels of the resource as the words in the document; on this basis, modeling with LDA (Latent Dirichlet Allocation) yields the probability distribution of each resource over topics, i.e., the probability that a given topic appears in each resource. The similarity between resources is then calculated from their topic probability distributions. The resource set includes at least the target resource and the alternative resources.
In some embodiments, the determination of similar users includes: determining the similar users according to the similarity between the topics of the target user's labels and the topics of the alternative users' labels, wherein the alternative users are one or more users with known labels.
Specifically, the user set includes at least the target user and the alternative users, and a topic probability distribution matrix is calculated for each user in the user set. Each user is regarded as a document and the labels used by the user as the words in the document. LDA modeling is performed using Python to obtain the topic probability distribution matrix of the users. The similarity between users is then calculated from their topic probability distributions, and the m2 alternative users with the greatest similarity are selected as the similar users, where m2 is a positive integer.
LDA is an unsupervised probabilistic topic model that is often used to model large-scale document collections. Its basic assumption is that when a user writes a document, certain topics already exist in the user's mind; given those topics, the user selects words, with certain probabilities, from the word pool of each topic to express it, so the whole document is equivalent to a mixture of different topics. LDA is essentially a three-level Bayesian probabilistic model with a document-topic-word hierarchy, as shown in fig. 3. In the model, W represents a word and is the only observable variable, M represents the whole document set, N represents the total number of words contained in each document, K represents the number of topics, and α and β are the parameters of the prior distributions over the document-topic probability distribution θ and the topic-word probability distribution φ, respectively.
The similarity of probability distributions is calculated as follows: the distance between two probability distributions is computed using the JS (Jensen-Shannon) divergence, the symmetrized form of the KL (Kullback-Leibler) divergence. With base-2 logarithms, JS divergence values lie in the interval [0, 1]. The closer the JS divergence is to 0, the smaller the distance between the two probability distributions and the more similar they are; the closer it is to 1, the larger the distance and the more dissimilar they are.
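As a minimal sketch, the JS divergence can be computed from the KL divergence against the midpoint distribution. The code below uses the standard definition with base-2 logarithms, which is what bounds the result to [0, 1]; the function names are illustrative and not taken from the patent.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs, so the value lies in [0, 1]."""
    # The midpoint m is positive wherever p or q is, so no division by zero occurs.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Identical distributions give 0; distributions with disjoint support give 1.
same = js_divergence([0.2, 0.8], [0.2, 0.8])
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])
```

Unlike the KL divergence, the result does not depend on the order of the two arguments, which is why it is convenient as a distance between topic distributions.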
According to the embodiment of the invention, the topic model is introduced into the similarity calculation process, the traditional calculation mode is replaced by taking the topic probability distribution as the calculation basis, and the deep semantic knowledge is added, so that the problems of semantic ambiguity of the label and the like are relieved, and the recommendation result of the label is greatly improved.
In some embodiments, prior to determining the candidate tag, the method further comprises: a data set is constructed.
Specifically, the user-generated content (UGC) text of users is automatically crawled, and the crawled data is then processed: it is segmented using the NLPIR (Natural Language Processing and Information Retrieval sharing platform) Chinese word segmentation system of the Chinese Academy of Sciences, and a resource-word corpus and a user-label corpus are established. The resource-word corpus includes at least the correspondence between resources and the labels of the resources, and the user-label corpus includes at least the correspondence between users and the labels they use to label resources.
In this embodiment, segmenting the data with the NLPIR Chinese word segmentation system makes the segmentation more consistent with the semantics and more accurate, which improves the accuracy of label recommendation.
With reference to the above embodiments of the present invention, an exemplary application of the embodiments of the present invention in a practical application scenario will be described below.
This example provides an LDA-based personalized label recommendation method. Label recommendation here means that when a user wants to label a resource, the system recommends a series of related labels by combining information such as the user's labeling history, the content features of the resource, and the labels already in the system. The principle of the method is shown in fig. 5, and its basic steps are shown in fig. 4:
Step S401: collecting and preprocessing data.
Specifically, the client grants the labeling system permission to call its API, allowing the system to automatically crawl the UGC text content of individual users. The crawled data is processed with the NLPIR Chinese word segmentation system of the Chinese Academy of Sciences as follows: book introductions and irregular labels are segmented, and words with no practical meaning, along with some special symbols, are filtered out using a stop-word list. Labels containing English are uniformly converted to lower case. In addition, proper nouns are an important part of describing resource features and a key source for label recommendation; using NLPIR's user-defined dictionary function, such words can be added to a custom dictionary so that they are preserved during word segmentation. Finally, a resource-word corpus and a user-label corpus are established from the processed data. The resource-word corpus includes at least the correspondence between resources and the labels of the resources, and the user-label corpus includes at least the correspondence between users and the labels they use to label resources. In this example, segmenting the data with the NLPIR Chinese word segmentation system makes the segmentation more consistent with the semantics and more accurate, which improves the accuracy of label recommendation.
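The preprocessing in step S401 can be sketched as below, with two loud assumptions. First, the regex tokenizer is only a stand-in for the NLPIR segmenter and does not segment Chinese text. Second, the stop-word list, the function names and the shape of the input records (user, resource, tag triples plus a description per resource) are hypothetical.

```python
import re

STOP_WORDS = {"the", "a", "of", "and"}  # hypothetical stop-word list

def tokenize(text):
    """Stand-in tokenizer: lowercases and splits on non-word characters.

    The source uses the NLPIR segmenter for Chinese; this regex split is
    only a placeholder so the sketch runs without that dependency.
    """
    return [t for t in re.split(r"\W+", text.lower()) if t and t not in STOP_WORDS]

def build_corpora(annotations, descriptions):
    """Build the two corpora used for topic modelling.

    annotations: iterable of (user, resource, tag) triples.
    descriptions: dict mapping resource -> introduction text.
    Returns (resource -> words, user -> lowercased tags).
    """
    resource_words = {r: tokenize(text) for r, text in descriptions.items()}
    user_tags = {}
    for user, _resource, tag in annotations:
        user_tags.setdefault(user, []).append(tag.lower())
    return resource_words, user_tags

# Hypothetical crawled data.
descriptions = {"book1": "The Art of Topic Models"}
annotations = [("u1", "book1", "LDA"), ("u1", "book1", "NLP"), ("u2", "book1", "lda")]
resource_words, user_tags = build_corpora(annotations, descriptions)
```

Each value in the two dictionaries is then a "document" in the sense of the topic model: a resource's introduction words, or the tags a user has applied.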
Step S402: and constructing and training a model.
Specifically, the user set U contains m users, the resource set R contains n resources, and the tag set T contains p tags. Any resource r is represented by n introduction words w. Each resource is treated as a document and its introduction words as the words in that document; on this basis, LDA modeling yields a resource-topic model, from which the resource-topic probability distribution and the topic-word probability distribution can be obtained. In this example, only the probability distribution of resources over topics is used. The topic model is trained using Python and its LDA toolkit, with the number of topics k = 15 and LDA parameters alpha = 50/k and beta = 0.01. Training yields the resource-topic probability distribution, that is, the probability of each topic appearing in each resource.
Similarly, each user is treated as a document and the labels used by that user as the words in the document. LDA modeling in Python yields a user-topic model, with the number of topics k = 5 and LDA parameters alpha = 50/k and beta = 0.01. Training yields the user-topic probability distribution matrix.
Step S403: calculate the similarity.
KL (Kullback-Leibler) divergence, also known as KL distance, is commonly used to measure the distance between two probability distributions. Because KL divergence is asymmetric, its symmetric form, the JS (Jensen-Shannon) divergence, is often used for ease of calculation.
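The training step can be sketched with a small collapsed Gibbs sampler. This is a minimal illustration, not the Python LDA toolkit used in the example: the function name `train_lda`, the choice of sampler, the iteration count, and the toy documents are all assumptions. Only alpha = 50/k and beta = 0.01 come from the text.

```python
import random


def train_lda(docs, k, alpha, beta, iters=100, seed=0):
    """Collapsed Gibbs sampling for LDA; returns one topic distribution per document."""
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    v = len(vocab)

    doc_topic = [[0] * k for _ in docs]       # tokens of doc d assigned to topic t
    topic_word = [[0] * v for _ in range(k)]  # times word w is assigned to topic t
    topic_total = [0] * k                     # tokens assigned to topic t overall
    assignment = []

    # Random initial topic for every token.
    for d, doc in enumerate(docs):
        topics = []
        for word in doc:
            t = rng.randrange(k)
            topics.append(t)
            doc_topic[d][t] += 1
            topic_word[t][word_id[word]] += 1
            topic_total[t] += 1
        assignment.append(topics)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                w = word_id[word]
                t = assignment[d][i]
                # Remove the token, then resample its topic from the conditional.
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + beta)
                    / (topic_total[j] + v * beta)
                    for j in range(k)
                ]
                t = rng.choices(range(k), weights=weights)[0]
                assignment[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1

    # Smoothed estimate of P(topic | document); each row sums to 1.
    return [
        [(doc_topic[d][j] + alpha) / (len(doc) + k * alpha) for j in range(k)]
        for d, doc in enumerate(docs)
    ]


# Toy corpus: each resource is a list of introduction words.
resources = [
    ["history", "war", "empire"],
    ["war", "army", "history"],
    ["recipe", "cooking", "kitchen"],
]
k = 2  # toy value; the example uses k = 15 for resources and k = 5 for users
resource_topic = train_lda(resources, k, alpha=50 / k, beta=0.01)
```

The same function could be applied to the user-label corpus by passing each user's labels as one document, which is how the user-topic matrix is described above.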
In this example, JS (Jensen-Shannon) divergence is used to compute the distance between topic probability distributions. Based on the probability distributions of the resources over topics, the JS divergence is computed by formula (1), where p and q are two probability distributions and the value of the formula lies in the interval [0, 1]. The closer the JS divergence is to 0, the closer the two distributions are; the closer it is to 1, the farther apart they are.
To facilitate subsequent calculation, the similarity of two probability distributions is computed with formula (2):
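For reference, formula (1) can be written out under the standard definition of JS divergence. The base-2 logarithm is an assumption, chosen because it yields the [0, 1] range stated above, and M is an auxiliary symbol introduced here for the average of the two distributions.

$$
\mathrm{JS}(p \,\|\, q) = \frac{1}{2}\,\mathrm{KL}(p \,\|\, M) + \frac{1}{2}\,\mathrm{KL}(q \,\|\, M),
\qquad M = \frac{p + q}{2},
\qquad \mathrm{KL}(p \,\|\, M) = \sum_{i} p_i \log_2 \frac{p_i}{M_i}
$$

Unlike KL divergence, this quantity is symmetric: swapping p and q leaves the value unchanged.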
$$\mathrm{sim}(a, b) = \frac{1}{1 + D(a, b)} \qquad (2)$$
where sim(a, b) is the similarity between resources a and b, and D(a, b) is the distance between the probability distributions of a and b. The 1 added to the denominator prevents problems when the distance is 0. The larger the sim value, the more similar resources a and b are.
Step S404: generate the recommended labels.
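The distance and similarity computations can be sketched as follows. The function names are hypothetical, and the base-2 logarithm is an assumption made so that the divergence stays within the [0, 1] range described above.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Symmetric Jensen-Shannon divergence between two distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def similarity(p, q):
    """sim = 1 / (1 + D), with D taken to be the JS divergence."""
    return 1.0 / (1.0 + js_divergence(p, q))


identical = similarity([0.5, 0.5], [0.5, 0.5])
disjoint = similarity([1.0, 0.0], [0.0, 1.0])
```

With these definitions, identical distributions give a similarity of 1 and distributions with no overlap give 0.5, so the similarity always lies in [0.5, 1].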
Recommendation based on similar resources: the basic idea is collaborative filtering based on similar resources. After the resource similarities are calculated, the resources are sorted in descending order of similarity, and the m1 resources most similar to the target resource r ∈ R are taken as its neighbor resource set. The topic probability of each label is taken as the label's weight, the labels of the neighbor resources are ranked by weighted similarity, and finally the n1 labels with the highest similarity are recommended as first candidate labels.
Recommendation based on similar users: after the similarity between users is calculated, the m2 users who are most similar to the target user u ∈ U and who have labeled the target resource are taken as the neighbor user set. The topic probability of each label is taken as the label's weight, the labels that the neighbor users applied to the target resource are ranked by weighted similarity, and finally the n2 labels with the highest similarity are recommended as second candidate labels.
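Both neighbor-based strategies follow the same pattern, which can be sketched with one helper. The text does not give the exact scoring formula, so the score used here (neighbor similarity multiplied by the label's weight, summed over neighbors) is an assumption, as are the function name and the sample data.

```python
def recommend_from_neighbors(similarity, neighbor_tags, tag_weight, m, n):
    """Return the n best tags taken from the m most similar neighbors.

    similarity:    neighbor id -> similarity to the target resource or user
    neighbor_tags: neighbor id -> tags carried (or applied) by that neighbor
    tag_weight:    tag -> weight, e.g. its topic probability
    """
    neighbors = sorted(similarity, key=lambda x: (-similarity[x], x))[:m]
    scores = {}
    for nb in neighbors:
        for tag in neighbor_tags.get(nb, ()):
            # Assumed scoring rule: similarity-weighted sum of tag weights.
            scores[tag] = scores.get(tag, 0.0) + similarity[nb] * tag_weight.get(tag, 0.0)
    return sorted(scores, key=lambda t: (-scores[t], t))[:n]


# Hypothetical data for the resource-based case (m1 = 2, n1 = 2).
sims = {"r1": 0.9, "r2": 0.8, "r3": 0.1}
tags = {"r1": ["history", "war"], "r2": ["history"], "r3": ["cooking"]}
weights = {"history": 0.5, "war": 0.4, "cooking": 0.9}
first_candidates = recommend_from_neighbors(sims, tags, weights, m=2, n=2)
```

For the user-based case, `similarity` would hold user-to-user similarities restricted to users who labeled the target resource, and `neighbor_tags` would hold the labels each of them applied to it.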
Content-based recommendation: the recommendation is made using TF-IDF. The TF-IDF values of the words in the text describing the target resource are calculated, and the n3 words with the largest TF-IDF value are recommended as the third candidate tags.
The basic idea of TF-IDF is: if a word occurs very frequently in a document and very rarely in other documents, the word is important to the document and the distinguishing capability is good.
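The content-based step can be sketched as below. TF-IDF has several common variants and the text does not say which one is used; this sketch assumes relative term frequency with a smoothed logarithmic IDF, and the function name and toy corpus are hypothetical.

```python
import math
from collections import Counter


def tfidf_top_words(target_doc, corpus, n3):
    """Return the n3 words of target_doc with the highest TF-IDF value."""
    counts = Counter(target_doc)
    n_docs = len(corpus)
    scores = {}
    for word, count in counts.items():
        tf = count / len(target_doc)
        df = sum(1 for doc in corpus if word in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed IDF (assumed variant)
        scores[word] = tf * idf
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(scores, key=lambda w: (-scores[w], w))[:n3]


corpus = [
    ["lda", "topic", "model", "topic"],
    ["book", "topic"],
    ["book", "history"],
]
third_candidates = tfidf_top_words(corpus[0], corpus, n3=2)
```

In this toy corpus "topic" ranks first because its higher term frequency outweighs its lower IDF, while "lda" and "model" tie and are ordered alphabetically.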
The topic probabilities of the first candidate labels, the topic probabilities of the second candidate labels, and the TF-IDF values of the third candidate labels are taken as weights and normalized. The similarity of the candidate labels is then calculated and ranked by weight, and the n4 labels with the highest similarity are selected as the final recommended labels.
The method combines three approaches: content-based label recommendation, user-based collaborative filtering, and resource-based collaborative filtering. LDA is introduced into the similarity calculation, so that topic probability distributions replace the traditional basis of calculation and add deeper semantic knowledge. Recommended labels are generated from similar resources and similar users, keywords of the resource content are extracted as content-based recommended labels, and the three results are finally fused to recommend labels to the user. First, label recommendation provides references and suggestions to the user, which reduces the user's burden and increases the user's enthusiasm for labeling. Second, fusing several methods across three label sources mitigates the data sparsity of any single recommendation method, improves data consistency, and improves label quality. Third, introducing LDA into the similarity calculation, with topic probability distributions as the basis of calculation, adds deeper semantic knowledge and improves the accuracy of label recommendation. Fourth, segmenting the data with the NLPIR Chinese word segmentation system improves the quality of the corpus and thus the accuracy of label recommendation. The LDA-based hybrid label recommendation method alleviates, to some extent, problems such as the semantic ambiguity of labels.
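The fusion step can be sketched as follows. The text does not specify the normalization or how the three lists are combined, so this sketch assumes that each source's weights are scaled to sum to 1 and that a label's final score is the sum of its normalized weights across sources. The function name and sample weights are hypothetical.

```python
def fuse_candidates(sources, n4):
    """Merge weighted candidate lists and return the n4 best labels.

    sources: list of dicts mapping label -> raw weight
             (topic probability or TF-IDF value).
    """
    fused = {}
    for weights in sources:
        total = sum(weights.values())
        if total <= 0:
            continue  # nothing usable from this source
        for tag, weight in weights.items():
            # Normalize within the source so the three scales are comparable.
            fused[tag] = fused.get(tag, 0.0) + weight / total
    return sorted(fused, key=lambda t: (-fused[t], t))[:n4]


first = {"history": 0.6, "novel": 0.4}   # from similar resources
second = {"novel": 0.5, "classic": 0.5}  # from similar users
third = {"history": 2.0, "war": 2.0}     # TF-IDF values from the content
final_labels = fuse_candidates([first, second, third], n4=3)
```

Normalizing per source keeps TF-IDF values, which are not probabilities, from dominating the topic-probability weights of the other two lists.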
Continuing with the description of a tag recommendation device 60 provided in embodiments of the present invention, in some embodiments, a tag recommendation device may be implemented as a software module. Referring to fig. 6, fig. 6 is a schematic structural diagram of a tag recommendation device 60 according to an embodiment of the present invention, where the tag recommendation device 60 according to the embodiment of the present invention includes:
a candidate tag determining module 610, configured to determine, according to tags of similar resources of a target resource, a first candidate tag of the target resource; according to the label of the target resource marked by the similar user of the target user, determining a second candidate label of the target resource; determining a third candidate label of the target resource according to the content of the target resource;
a recommended label determining module 620, configured to output a recommended label for the target user to label the target resource according to the first candidate label, the second candidate label, and the third candidate label.
In some embodiments, the apparatus 60 further comprises:
a similar resource determining module, configured to determine a similar resource according to similarity between the resource information of the target resource and resource information of an alternative resource, where the alternative resource is one or more resources with known tags; the resource information includes at least one of:
a topic;
text content;
author information;
the column to which the resource belongs;
target reader information.
In some embodiments, the similar resource determining module is specifically configured to calculate the similarity between the topic of the target resource and the topics of the alternative resources, and to take the m1 alternative resources with the greatest similarity to the topic of the target resource as the similar resources, where m1 is a positive integer; the target resource includes, but is not limited to, an image or a document.
In some embodiments, the apparatus 60 further comprises:
a similar user determining module, configured to determine the similar users according to the similarity between the topic of the labels of the target user and the topic of the labels of the alternative users, where the alternative users are one or more users with known labels.
In some embodiments, the similar user determination module is specifically configured to use m2 alternative users with the greatest similarity to the topic of the tag of the target user as the similar users, where m2 is a positive integer.
In some embodiments, the candidate tag determining module 610 is specifically configured to determine a target word that can reflect the content characteristics in the content of the target resource; and taking the target word as a third candidate label of the target resource, wherein the content of the target resource is text information describing the target resource.
The tag recommendation device determines the recommended tags using three tag recommendation methods. First, tags are recommended from three perspectives (resources, users, and resource content), which mitigates the data sparsity of any single recommendation method, improves data consistency, increases the probability that the recommended tags cover the tags the user wants to select, and improves recommendation accuracy. Second, using tags from three tag sources makes the description of the resource by the tags more comprehensive and accurate, which improves the quality of the recommended tags. Third, because tags are recommended from both the user perspective and the resource perspective, the recommended tags better match the user's orientation and needs while remaining close to the content of the target resource, which improves the accuracy of the recommended tags and the user experience. Fourth, recommending tags while the user is labeling a resource reduces the user's labeling burden and increases the user's enthusiasm for labeling resources.
An embodiment of the present invention further provides an electronic device, where the electronic device includes:
a memory for storing executable instructions;
a processor, configured to implement the tag recommendation method provided by the embodiment of the invention when executing the executable instructions stored in the memory.
The hardware structure of an electronic device implementing the tag recommendation method provided in the embodiments of the present invention is described in detail below. The electronic device includes, but is not limited to, a server or a terminal. Referring to fig. 7, fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. The electronic device 70 includes at least one processor 701 and a memory 702, and optionally may further include at least one communication interface 703. The components of the electronic device 70 are coupled together by a bus system 704, which is used to implement connection and communication between these components. In addition to a data bus, the bus system 704 includes a power bus, a control bus, and a status signal bus. For clarity of illustration, however, the various buses are all labeled as the bus system 704 in fig. 7.
It will be appreciated that the memory 702 can be either volatile memory or nonvolatile memory, and can include both volatile and nonvolatile memory. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Ferromagnetic Random Access Memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disk, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be disk storage or tape storage. The volatile memory can be Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memory 702 described in connection with the embodiments of the invention is intended to comprise, without being limited to, these and any other suitable types of memory.
The memory 702 in embodiments of the present invention is used to store various types of data to support the operation of the electronic device 70. Examples of such data include any computer program for operating on the electronic device 70, as well as stored sample data, prediction models, and the like. A program implementing a method of an embodiment of the present invention may be contained in the memory 702.
The method disclosed in the above embodiments of the present invention may be applied to the processor 701, or implemented by the processor 701. The processor may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general purpose processor, a Digital Signal Processor (DSP), another programmable logic device, a discrete gate or transistor logic device, discrete hardware components, or the like. The processor may implement or perform the methods, steps, and logic blocks disclosed in embodiments of the present invention. A general purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed by the embodiment of the invention can be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software modules may be located in a storage medium; the storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the method in combination with its hardware.
In an exemplary embodiment, the electronic device 70 may be implemented by one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), general purpose processors, controllers, Micro Controller Units (MCUs), microprocessors, or other electronic components for performing the above-described methods.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms. The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment. In addition, all the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The embodiment of the invention also provides a computer-readable storage medium, which stores executable instructions, and when the executable instructions are executed by a processor, the tag recommendation method provided by the embodiment of the invention is realized.
In some embodiments, the computer-readable storage medium may be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; or it may be any of various devices including one or any combination of the above memories. The computer may be any of a variety of computing devices, including intelligent terminals and servers.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may correspond, but do not necessarily have to correspond, to files in a file system, and may be stored in a portion of a file that holds other programs or data, such as in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
The above description is only an example of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present invention are included in the protection scope of the present invention.

Claims (10)

1. A tag recommendation method, comprising:
determining a first candidate label of a target resource according to labels of similar resources of the target resource;
according to the label of the target resource marked by the similar user of the target user, determining a second candidate label of the target resource;
determining a third candidate label of the target resource according to the content of the target resource;
outputting a recommended label for the target user to label the target resource according to the first candidate label, the second candidate label and the third candidate label;
wherein the determination of the similar resources comprises: calculating the similarity between the topic of the target resource and the topics of the alternative resources, and taking the m1 alternative resources with the greatest similarity to the topic of the target resource as the similar resources, wherein m1 is a positive integer; the target resource includes, but is not limited to, an image or a document.
2. The method of claim 1, comprising:
determining the similar users according to the similarity between the topic of the labels of the target user and the topic of the labels of the alternative users, wherein the alternative users are one or more users with known labels.
3. The method of claim 2, wherein the determining the similar users according to the similarity between the topic of the labels of the target user and the topic of the labels of the alternative users comprises:
taking the m2 alternative users with the greatest similarity to the topic of the target user's labels as the similar users, wherein m2 is a positive integer.
4. The method of claim 1, wherein determining the third candidate tag for the target resource according to the content of the target resource comprises:
determining a target word capable of reflecting the content characteristics in the content of the target resource;
and taking the target word as a third candidate label of the target resource, wherein the content of the target resource is text information describing the target resource.
5. A tag recommendation device, comprising:
a candidate label determining module, configured to determine a first candidate label of a target resource according to labels of similar resources of the target resource; determine a second candidate label of the target resource according to the labels with which similar users of a target user have marked the target resource; and determine a third candidate label of the target resource according to the content of the target resource;
a recommended label determining module, configured to output a recommended label for the target user to label the target resource according to the first candidate label, the second candidate label, and the third candidate label;
a similar resource determining module, configured to calculate the similarity between the topic of the target resource and the topics of the alternative resources, and to take the m1 alternative resources with the greatest similarity to the topic of the target resource as the similar resources, where m1 is a positive integer; wherein the target resource includes, but is not limited to, an image or a document.
6. The tag recommendation device of claim 5, further comprising:
a similar user determining module, configured to determine the similar users according to the similarity between the topic of the labels of the target user and the topic of the labels of the alternative users, wherein the alternative users are one or more users with known labels.
7. The tag recommendation device according to claim 6, wherein the similar user determining module is specifically configured to take the m2 alternative users with the greatest similarity to the topic of the labels of the target user as the similar users, wherein m2 is a positive integer.
8. The tag recommendation device according to claim 5, wherein the candidate tag determination module is specifically configured to determine a target word that can reflect the content characteristics in the content of the target resource; and taking the target word as a third candidate label of the target resource, wherein the content of the target resource is text information describing the target resource.
9. An electronic device, comprising:
a memory for storing executable instructions;
a processor for implementing the method of any one of claims 1-4 when executing executable instructions stored in the memory.
10. A computer readable storage medium storing executable instructions that, when executed by a processor, implement the method of any of claims 1-4.
CN202010636335.8A 2020-07-03 2020-07-03 Label recommendation method and device, electronic equipment and storage medium Pending CN111931041A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010636335.8A CN111931041A (en) 2020-07-03 2020-07-03 Label recommendation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111931041A true CN111931041A (en) 2020-11-13

Family

ID=73312361

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010636335.8A Pending CN111931041A (en) 2020-07-03 2020-07-03 Label recommendation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111931041A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101578600A (en) * 2006-05-02 2009-11-11 皇家飞利浦电子股份有限公司 System and method for associating a category label of one user with a category label defined by another user
CN105142028A (en) * 2015-07-29 2015-12-09 华中科技大学 Television program content searching and recommending method oriented to integration of three networks
CN106126669A (en) * 2016-06-28 2016-11-16 北京邮电大学 User collaborative based on label filters content recommendation method and device
CN107679242A (en) * 2017-10-30 2018-02-09 河海大学 Merge the label recommendation method that multiple information sources Coupling Tensor is decomposed
CN108205682A (en) * 2016-12-19 2018-06-26 同济大学 It is a kind of for the fusion content of personalized recommendation and the collaborative filtering method of behavior
US20190114937A1 (en) * 2017-10-12 2019-04-18 Pearson Education, Inc. Grouping users by problematic objectives
CN110162711A (en) * 2019-05-28 2019-08-23 湖北大学 A kind of resource intelligent recommended method and system based on internet startup disk method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
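侯崇岳 (Hou Chongyue): "大数据在高校图书馆文献推荐中的应用" [Application of big data to literature recommendation in university libraries], 《宁波教育学院学报》 [Journal of Ningbo Institute of Education] *
熊回香 等 (Xiong Huixiang et al.): "基于LDA主题模型的标签混合推荐研究" [Research on hybrid tag recommendation based on the LDA topic model], 《图书情报工作》 [Library and Information Service] *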

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112925902A (en) * 2021-02-22 2021-06-08 新智认知数据服务有限公司 Method and system for intelligently extracting text abstract in case text and electronic equipment
CN112925902B (en) * 2021-02-22 2024-01-30 新智认知数据服务有限公司 Method, system and electronic equipment for intelligently extracting text abstract from case text
CN113377971A (en) * 2021-05-31 2021-09-10 北京达佳互联信息技术有限公司 Multimedia resource generation method and device, electronic equipment and storage medium
CN113377971B (en) * 2021-05-31 2024-02-27 北京达佳互联信息技术有限公司 Multimedia resource generation method and device, electronic equipment and storage medium
CN114372532A (en) * 2022-01-11 2022-04-19 腾讯科技(深圳)有限公司 Method, device, equipment, medium and product for determining label marking quality

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201113

RJ01 Rejection of invention patent application after publication