CN112417131A - Information recommendation method and device - Google Patents

Information recommendation method and device

Info

Publication number
CN112417131A
CN112417131A
Authority
CN
China
Prior art keywords
tag
label
user
data set
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011336768.8A
Other languages
Chinese (zh)
Inventor
秦泓杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Chuangmi Technology Co ltd
Original Assignee
Shanghai Chuangmi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Chuangmi Technology Co ltd filed Critical Shanghai Chuangmi Technology Co ltd
Priority to CN202011336768.8A priority Critical patent/CN112417131A/en
Publication of CN112417131A publication Critical patent/CN112417131A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses an information recommendation method and device. The method comprises the following steps: obtaining raw data related to content of interest to a user; preprocessing the raw data to generate an original tag text data set, wherein the original tag text data set comprises a plurality of target tag texts; generating a tag vector for each target tag text through a tag vector model based on the original tag text data set; clustering the tag vectors of all target tag texts; determining the similarity of each tag vector in each cluster to the other tag vectors; and pushing information to the user based on the similarity. The information recommendation method further comprises: dynamically updating the tag vector model based on a newly added tag text data set. The method and device simplify the calculation process of the recommendation algorithm, reduce software and hardware consumption, and achieve dynamic, rapid updating of the tag vector model.

Description

Information recommendation method and device
Technical Field
The disclosure relates to the technical field of big data information processing, in particular to an information recommendation method and device.
Background
With the rapid development of science and technology, particularly cloud computing and big data technologies, the amount of information has grown explosively. It is increasingly difficult for users to find the information they are interested in among such massive information, and accurately recommending the required information to users is a problem that those skilled in the art urgently need to solve.
Therefore, an information recommendation method and apparatus are needed.
Disclosure of Invention
The invention aims to provide an information recommendation method and device that can reduce system load, shorten model update time, and improve the real-time performance of a recommendation system.
To achieve the above object, one aspect of the present disclosure provides an information recommendation method, comprising: obtaining raw data related to content of interest to a user; preprocessing the raw data to generate an original tag text data set, wherein the original tag text data set comprises a plurality of target tag texts; generating a tag vector for each target tag text through a tag vector model based on the original tag text data set; clustering the tag vectors of all target tag texts; determining the similarity of each tag vector in each cluster to the other tag vectors; and pushing information to the user based on the similarity. The information recommendation method further comprises: dynamically updating the tag vector model based on a newly added tag text data set.
Optionally, the raw data comprises at least one of: the user's retrieval history data, browsing history data, download history data, collection history data, and chat history data.
Optionally, the plurality of target tag texts are separated by spaces or special characters, and the number of characters of each of the plurality of target tag texts does not exceed a predetermined threshold.
Optionally, the predetermined threshold is in the range of 20 to 50.
Optionally, dynamically updating the tag vector model based on the newly added tag text data set includes: acquiring new data related to the content in which the user is interested; preprocessing the newly added data to generate the newly added tag text data set; and iteratively updating the tag vector model based on the newly added tag text data set.
Another aspect of the present disclosure provides an information recommendation apparatus, comprising: a raw data acquisition unit configured to acquire raw data related to content of interest to a user; a preprocessing unit configured to preprocess the raw data to generate an original tag text data set comprising a plurality of target tag texts; a tag vector generation unit configured to generate a tag vector for each target tag text through a tag vector model based on the original tag text data set; a clustering unit configured to cluster the tag vectors of all target tag texts; a similarity determination unit configured to determine the similarity of each tag vector in each cluster to other tag vectors; and an information pushing unit configured to push information to the user based on the similarity. The information recommendation apparatus further comprises an updating unit configured to dynamically update the tag vector model based on a newly added tag text data set.
Optionally, the raw data comprises at least one of: the user's retrieval history data, browsing history data, download history data, collection history data, and chat history data.
Optionally, the plurality of target tag texts are separated by spaces or special characters, and the number of characters of each of the plurality of target tag texts does not exceed a predetermined threshold.
Optionally, the predetermined threshold is in the range of 20 to 50.
Optionally, dynamically updating the tag vector model based on the newly added tag text data set includes: acquiring new data related to the content in which the user is interested; preprocessing the newly added data to generate the newly added tag text data set; and iteratively updating the tag vector model based on the newly added tag text data set.
Yet another aspect of the present disclosure provides a computing device, comprising: at least one storage medium storing at least one set of instructions; and at least one processor communicatively coupled to the at least one storage medium, wherein the at least one processor executes the at least one set of instructions to perform the method.
The information recommendation method and device provided by one or more embodiments of the disclosure have one or more of the following advantages:
(1) Wide application scenarios. Using the tag vector model, the method can recommend various information resources to the user based on user logs, such as related search terms, related authors, related organizations, and related products. The method and device can be widely applied in the recommendation systems of e-commerce websites.
(2) Rich data sources. The present disclosure implements a recommendation system using a tag vector model, whose tag data can be drawn from many sources.
(3) Simple preprocessing. Unlike traditional algorithms, the method does not need to construct a utility matrix; it only needs to clean the tag data, extract and organize the corresponding tag text data as required, and remove stop words when necessary.
(4) Short model update time. For the tag vector model, the method provides a simple and effective dynamic incremental update method, which reduces model update time while preserving model quality.
(5) High real-time performance. By computing/updating offline and retrieving online, the method avoids the real-time computation of conventional recommendation systems and greatly reduces the recommendation system's online response time.
Drawings
The following drawings describe in detail exemplary embodiments disclosed in the present disclosure, in which like reference numerals represent similar structures throughout the several views. Those of ordinary skill in the art will understand that these embodiments are non-limiting, exemplary embodiments, and that the accompanying drawings are for illustration and description only and are not intended to limit the scope of the present disclosure, as other embodiments may equally fulfill the conceptual intent of the present disclosure. It should be understood that the drawings are not to scale. In the drawings:
fig. 1 is a flow diagram of an information recommendation method according to one or more embodiments of the present disclosure;
FIG. 2 is a flow diagram of updating the tag vector according to one or more embodiments of the present disclosure;
FIG. 3 is a schematic diagram of an information recommendation device according to one or more embodiments of the present disclosure;
FIG. 4 is a schematic diagram of a computing device in accordance with one or more embodiments of the present disclosure.
Detailed Description
The following description is presented to enable any person skilled in the art to make and use the present disclosure, and is provided in the context of a particular application and its requirements. Various local modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present disclosure. Thus, the present disclosure is not limited to the embodiments shown, but is to be accorded the widest scope consistent with the claims.
Those skilled in the art will appreciate that the terminology used in the present disclosure is for the purpose of describing particular example embodiments only and is not intended to be limiting. For example, as used herein, the singular forms "a", "an", and "the" may include the plural forms as well, unless the context clearly indicates otherwise. The terms "comprises," "comprising," "includes," "including," "has," "having," "contains," "equipped with," and/or "provided," when used in this disclosure, specify the presence of stated features, integers, steps, operations, elements, components, and/or groups, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Those skilled in the art will appreciate that specific terminology has been used to describe the embodiments of the disclosure. For example, "an embodiment," "one embodiment," "some embodiments," "embodiments," and/or "embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present disclosure. Therefore, it is emphasized and should be appreciated that two or more references to "an embodiment" or "an alternative embodiment" in various portions of this disclosure are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined as suitable in one or more embodiments of the disclosure.
It will be understood by those skilled in the art that, unless otherwise specified, the ordinal adjectives "first", "second", "third", etc., are used to describe a common object, merely indicate that different instances of like objects are being referred to, and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.
Those skilled in the art will understand that aspects of the present disclosure may be illustrated and described in any of a number of patentable categories or contexts, including any new and useful processes, machines, manufactures, or compositions of matter, or any new and useful improvements thereof. Accordingly, aspects of the present disclosure may be implemented entirely in hardware (circuits, chips, logic devices, etc.), entirely in software (including firmware, resident software, micro-code, etc.), or a combination of both, which may be referred to herein generally as "blocks," "modules," "engines," "units," "components," or "systems." Furthermore, aspects of the present disclosure may take the form of a computer program product embodied in one or more computer-readable media containing computer-readable program code embodied thereon.
Those skilled in the art will appreciate that an algorithm in the present disclosure is generally considered to be a self-consistent sequence of acts or operations leading to a desired result. These include physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, labels, characters, terms, numbers, or the like. It should be understood, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities.
Those skilled in the art will appreciate that discussions of "processing," "computing," "calculating," "determining," "creating," "analyzing," "checking," or the like, in the present disclosure may refer to the action and/or processes of a computer, computing platform, computing system, or other electronic computing device, that manipulate and/or transform data represented as physical, such as electronic, quantities within the computer's registers and/or memories into other data similarly represented as physical quantities within the computer's registers and/or memories or other information storage medium that may store instructions to perform operations and/or processes.
As one of the effective ways to promote the development of search engines, information recommendation devices and methods mainly have the tasks of analyzing user behavior from user logs, mining user requirements, and recommending information of interest to users.
In recommendation algorithm research, researchers typically assume that the two-dimensional relationship between users and interests forms a utility matrix, compute on the matrix using matrix decomposition or other methods, and then make recommendations or push information to users. However, in this type of study, most algorithms are static algorithms based on static data. When users or the amount of information increase, such an algorithm cannot be updated dynamically and the entire data set must be recalculated, which increases the system load and wastes computing resources. Clearly, these algorithm models can no longer meet practical requirements or keep pace with the rapid expansion of users and information volume.
As an efficient tool for representing words as real-valued vectors, Word2vec applies the deep learning idea to reduce the processing of text content to vector operations in a low-dimensional vector space, where similarity in the vector space can represent semantic similarity of text. The algorithm model is not only efficient but also has broad application scenarios, including recommendation systems. However, the algorithm is still static and cannot be updated dynamically as data grows.
To solve the above problems, the present application provides a dynamically updatable information recommendation method and apparatus.
Fig. 1 is a flowchart of an information recommendation method according to one or more embodiments of the present disclosure.
As shown in fig. 1, the information recommendation method may include step S102, step S104, step S106, step S108, step S110, step S112, and step S114.
Step S102: raw data relating to content of interest to a user is obtained.
The raw data related to content of interest to the user comprises at least one of: the user's retrieval history data (or log), browsing history data (or log), download history data (or log), collection history data (or log), and chat history data (or log). The user's retrieval history data can be keywords, retrieval types, and the like entered by the user in various online or offline search engines. The user's browsing history data can be the contents of articles, documents, web pages, newspapers, and/or magazines read by the user through various application software, and/or the contents of databases accessed by the user. The user's download history data can be articles, documents, web pages, newspapers, magazines, audio, videos, hyperlinks, and the like downloaded by the user through various application software. The user's collection history data can be content the user has collected in various social software or saved on a local terminal. The user's chat history data can be content the user has published or chatted about through various social software. All of the above historical data can be used to analyze and refine the user's points of interest and preferences so as to recommend information to the user accurately. For example, the data related to user requirements may be the sequence of search terms contained in each user session in the user log. Although the information recommendation method and device are described in this specification taking text data as an example, it should be understood that data related to the user also includes video, audio, picture, and other types of data. In some embodiments, the tag data may correspond to audio data, video data, picture data, or other entity data.
Step S104: preprocessing the raw data to generate a raw tag text data set, the raw tag text data set including a plurality of target tag texts.
The "tag" may be any of the aforementioned text, audio, video, pictures, or even hyperlinks. "tag text" may be a textual notation used to facilitate uniform identification of the tags when preprocessing raw data. For example, when the "tag" is a picture of Adidas shoes on a shopping website, its "tag text" may be the word "Adidas shoes"; when the "tag" is the audio of a song qilixiang on a music website, its "tag text" may be the word "song qilixiang"; when a "tag" is a professional document on a web, its "tag text" may be the title of the document. A "tag text data set" is a collection of tag text data, which may typically include multiple sets of tag text data, each set of tag text data may further include multiple tag texts.
In the tag data generated by the preprocessing, the plurality of target tag texts (e.g., a plurality of short texts) may be separated by special characters or character strings (e.g., space characters, tab characters). The number of characters in each target tag text does not exceed a predetermined threshold; for example, the predetermined threshold may range from 20 to 50. Such target tag texts may also be referred to as short texts. For example, in a search engine, an author name with fewer than 20 characters, a keyword with fewer than 30 characters, a search expression or search term with fewer than 40 characters, a document title with fewer than 50 characters, and the like may all be referred to as short texts.
In some embodiments, an original tag text data set (also referred to as a training data set) for the tag vector model may be generated based on the raw data. In some embodiments, generating the original tag text data set may include the following steps (see the sketch after this list):
(1) counting the occurrence frequency of the tags in the raw data, and constructing a tag text vocabulary based on the occurrence frequencies, wherein the tag text vocabulary includes the occurrence frequency and the index of each tag;
(2) defining context tags;
(3) setting low-frequency tag filtering;
(4) traversing the raw data line by line to generate tag pairs (also called positive training samples); and
(5) randomly shuffling the tag pairs (e.g., using a shuffle algorithm).
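As a concrete illustration, not prescribed by the patent, the following Python sketch performs steps (1) through (5); the function name, the window-based context definition, and the frequency cutoff are illustrative assumptions.

```python
import random
from collections import Counter

def build_dataset(lines, window=2, min_freq=2):
    """Build a tag text vocabulary and shuffled (center, context) index pairs.

    `lines` is an iterable of strings, each holding tag texts separated
    by spaces (hypothetical input format)."""
    # (1) Count tag occurrence frequencies.
    counts = Counter(tag for line in lines for tag in line.split())
    # (3) Filter out low-frequency tags, then index the remaining tags.
    kept = [tag for tag, c in counts.most_common() if c >= min_freq]
    vocab = {tag: i for i, tag in enumerate(kept)}
    # (4) Traverse the data line by line and emit tag (index) pairs, where
    # (2) the context of a center tag is any tag within `window` positions.
    pairs = []
    for line in lines:
        ids = [vocab[t] for t in line.split() if t in vocab]
        for i, center in enumerate(ids):
            for j in range(max(0, i - window), min(len(ids), i + window + 1)):
                if j != i:
                    pairs.append((center, ids[j]))
    # (5) Randomly shuffle the tag pairs.
    random.shuffle(pairs)
    return vocab, counts, pairs
```

With data in the style of "Adidas sneakers/361 sneakers/Sneakers", `pairs` would contain index pairs like those shown in Table 2 below.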
In some embodiments, the tag vocabulary may be as shown in Table 1.
TABLE 1

Tag | Tag text | Occurrence frequency | Index
Adidas sneakers (picture) | Adidas sneakers | 3 | 0
361 sneakers (text) | 361 sneakers | 3 | 1
Sneakers (text) | Sneakers | 6 | 2
In some embodiments, defining the context of a tag includes determining the set of tags adjacent to the center tag whose spacing from it is no more than a predetermined number of tags.
In some embodiments, some of the index pairs may be as shown in Table 2.

TABLE 2

Index pair | Tag pair
(1, 0) | (361 sneakers, Adidas sneakers)
(0, 1) | (Adidas sneakers, 361 sneakers)
(0, 2) | (Adidas sneakers, Sneakers)
(2, 0) | (Sneakers, Adidas sneakers)
(1, 2) | (361 sneakers, Sneakers)
(2, 1) | (Sneakers, 361 sneakers)
Step S106: generating a tag vector for each target tag text through a tag vector model based on the original tag text data set.
A "tag vector" is a vector representation of a tag or tag text. For example, the tag vector may be derived by training a tag vector model based on a tag text dataset. In particular, a label vector for each target label text in the label text vocabulary may be trained by traversing the label vector model through a label text dataset. For each label vector in the label vector model, update iteration can also be achieved by performing random gradient descent (SGD) on each positive sample and N negative samples obtained by random sampling.
In some embodiments, the tag vector model may be an unsupervised model. In some embodiments, the tag vector model may be a word vector model, such as a continuous bag-of-words (CBOW) model or a skip-gram model. For example, a training data set for the tag vector model may be constructed based on the plurality of target tag texts; the training data set may include all of the target tag texts or only a part of them. In some embodiments, each target tag text may be characterized as a 100-dimensional real vector.
The continuous bag-of-words model may include an input layer, a projection layer, and an output layer. The input to the continuous bag-of-words model is the training data set; where memory, disk overhead, and training duration are not a concern, the input may also be tag text pairs. The continuous bag-of-words model takes as input the vectors corresponding to the context of a center target tag text (i.e., one or more tags that precede and follow the current tag in the raw data) and outputs the vector corresponding to the center target tag text (i.e., the current tag). The output of the tag vector model may be an iteratively updated tag vector corresponding to each center tag. The training goal of the tag vector model is to make tags that are adjacent (or within a certain tag interval) "close" in the vector space.
The output of the continuous bag-of-words model is an iteratively updated tag vector; that is, training the continuous bag-of-words model is a self-iterative process. Taking the continuous bag-of-words model and the tag text data segment "Adidas sneakers/361 sneakers/Sneakers" as an example, the training flow may include the following steps (a sketch of the negative sampling follows the list):
(1) reading a (positive) sample from the training data set, e.g., (1, 0), indicating that the current center tag is 1 (361 sneakers) and the context tag is 0 (Adidas sneakers);
(2) randomly sampling, according to the occurrence frequency of each tag in the tag text vocabulary, N negative samples for the center tag (e.g., 361 sneakers), such as (1, 2); and
(3) iteratively updating the corresponding tag vectors based on each sample integer pair.
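The following is a minimal sketch of step (2), frequency-based negative sampling, assuming the `counts` and `vocab` structures from the preprocessing sketch above; raising frequencies to the 3/4 power is the word2vec convention and is an assumption here, since the patent only states that sampling follows tag occurrence frequency.

```python
import numpy as np

def make_sampler(counts, vocab, power=0.75, seed=0):
    """Return a function that draws negative tag indices by frequency."""
    # Sampling probability proportional to frequency^power, per word2vec.
    freqs = np.array([counts[tag] for tag in vocab], dtype=np.float64) ** power
    probs = freqs / freqs.sum()
    rng = np.random.default_rng(seed)

    def sample(positive_id, n):
        # Draw n negatives, dropping any accidental hits on the positive tag.
        draws = rng.choice(len(probs), size=n, p=probs)
        return [int(i) for i in draws if i != positive_id]

    return sample
```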
The skip-gram model may include an input layer, a projection layer, and an output layer. The skip-gram model takes the training data set as input and outputs the iteratively updated vectors corresponding to the context tags of the target tag text (i.e., one or more tags located before and after the current tag in the tag data).
For example, when training to generate tag vectors, the tag vector model may first be randomly initialized and then iteratively traverse the tag vector training data set to iteratively update the tag vector model.
In one or more embodiments, the tag vectors may be updated iteratively by optimizing an objective function L of the form:

L = \sum_{w \in C} \sum_{\tilde{w} \in Context(w)} \sum_{u \in \{w\} \cup NEG^{\tilde{w}}(w)} \log P(u \mid \tilde{w})

where C represents the tag training data set (or tag data), Context(w) represents the context tag set of tag w, {w} represents the set containing tag w, NEG(w) represents the negative-sample tag set sampled for tag w, \tilde{w} denotes a tag in the context tag set of tag w, NEG^{\tilde{w}}(w) denotes the negative-sample tag set obtained by sampling for \tilde{w}, P(\cdot) is a probability function, and u represents an element of the union of {w} and the negative-sample tag set.

The conditional probability in the objective function L is:

P(u \mid \tilde{w}) = \begin{cases} \sigma(X_w^{\top} v_u), & u = w \\ 1 - \sigma(X_w^{\top} v_u), & u \in NEG^{\tilde{w}}(w) \end{cases}

where X_w represents the projection vector corresponding to tag w, v_u is the vector representation of tag u, and \sigma(\cdot) is a preset activation function, such as the Sigmoid function.

The projection-layer function may be a summation function, a mean function, an identity function, or the like. In this embodiment, the projection-layer function of the continuous bag-of-words model may adopt the mean function, and that of the skip-gram model may adopt the identity function. With the mean function, X_w is specifically:

X_w = \frac{1}{|Context(w)|} \sum_{\tilde{w} \in Context(w)} v(\tilde{w})

where |Context(w)| represents the total number of tags in the context tag set of tag w, and v(\tilde{w}) denotes the vector corresponding to tag \tilde{w} (likewise, v_w denotes the vector corresponding to tag w).
Each tag vector in the continuous bag-of-words model and the skip-gram model can be iteratively updated by performing stochastic gradient descent (SGD) on each positive sample together with the N randomly sampled negative samples.
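A minimal sketch of one such SGD step for the continuous bag-of-words case with the mean projection function, implementing the gradients of the objective above; the learning rate and the array layout (one row per tag in the input matrix V and output matrix U) are illustrative assumptions.

```python
import numpy as np

def sgd_step(V, U, context_ids, center_id, neg_ids, lr=0.025):
    """One negative-sampling SGD step for CBOW.

    V, U: (vocab_size, dim) input/output tag vectors, updated in place."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    x_w = V[context_ids].mean(axis=0)           # projection X_w (mean function)
    grad_x = np.zeros_like(x_w)
    for u, label in [(center_id, 1.0)] + [(n, 0.0) for n in neg_ids]:
        # d(-log P)/d(score) = sigma(score) - label, with score = U[u] . X_w
        g = lr * (sigmoid(U[u] @ x_w) - label)
        grad_x += g * U[u]                      # accumulate gradient w.r.t. X_w
        U[u] -= g * x_w                         # update output vector v_u
    for c in context_ids:                       # propagate through the mean
        V[c] -= grad_x / len(context_ids)
    return V, U
```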
Step S108: dynamically updating the tag vector model based on the newly added tag text data set.
Fig. 2 is a flow diagram of updating the tag vector according to one or more embodiments of the present disclosure. As shown in fig. 2, updating the tag vector model based on the newly added tag text dataset may include substep S1082, substep S1084, and substep S1086.
Substep S1082: new data related to the content of interest to the user is obtained.
The additional data related to the content of interest to the user may include at least one of: retrieval history data (or log) of the user, browsing history data (or log) of the user, download history data (or log) of the user, collection history data (or log) of the user, and chat history data (or log) of the user.
Substep S1084: preprocessing the newly added data to generate a newly added tag text data set.
For example, the added tag text data set may include a plurality of added target tag texts.
The preprocessing of the newly added data is similar to the preprocessing of the raw data, and is therefore not described again here.
Substep S1086: iteratively updating the tag vector model based on the newly added tag text data set.
For example, iteratively updating the tag vector model based on the newly added tag text data set may include: integrating the newly added tag text data set with the original tag text data set to generate an updated training data set; and then iteratively updating the tag vector model based on the updated training data set using a dynamic incremental update method.
Specifically, when the tag vector model is trained for the first time, all input vectors may be randomly initialized; when training finishes, all vectors and the tag text vocabulary are output.
The dynamic incremental update method may include the following steps at each update of the tag vector model (see the sketch after this list):
loading the complete original tag text vocabulary;
traversing the newly added tag text data set (or a training data set constructed from the newly added tag texts) and updating the current tag text vocabulary based on the newly added tag text data set;
loading the original tag vector model and randomly initializing the vectors corresponding to the newly added target tag texts;
traversing the newly added tag text data set and iteratively updating the tag vectors using negative sampling and the SGD algorithm; and
outputting the updated tag vector model and tag text vocabulary after a preset number of iterations.
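Putting the listed steps together, a minimal sketch of the dynamic incremental update might look as follows, reusing the `make_sampler` and `sgd_step` sketches above; the defaults of 5 epochs and 15 negative samples match the example values discussed below, and all helper names are assumptions.

```python
import random
import numpy as np

def incremental_update(V, U, vocab, counts, new_lines,
                       epochs=5, n_neg=15, window=2, seed=0):
    """Train/update the tag vector model on only the newly added data."""
    dim = V.shape[1]
    # Update the loaded vocabulary with tags seen in the new data.
    for line in new_lines:
        for tag in line.split():
            counts[tag] = counts.get(tag, 0) + 1
            vocab.setdefault(tag, len(vocab))
    # Keep the original vectors; randomly initialize only the new tags'.
    n_new = len(vocab) - V.shape[0]
    rng = np.random.default_rng(seed)
    V = np.vstack([V, (rng.random((n_new, dim)) - 0.5) / dim])
    U = np.vstack([U, np.zeros((n_new, dim))])
    sample = make_sampler(counts, vocab)   # updated probability distribution
    # Traverse only the new data; old and new vectors both get updated.
    for _ in range(epochs):
        pairs = []
        for line in new_lines:
            ids = [vocab[t] for t in line.split()]
            for i, c in enumerate(ids):
                for j in range(max(0, i - window), min(len(ids), i + window + 1)):
                    if j != i:
                        pairs.append((c, ids[j]))
        random.shuffle(pairs)
        for center, ctx in pairs:
            V, U = sgd_step(V, U, [ctx], center, sample(center, n_neg))
    return V, U, vocab
```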
In some embodiments, the inputs to the dynamic incremental update include: (a) the previously output tag vector model (as initial values); (b) the updated tag text vocabulary (providing the probability distribution); and (c) the preprocessed newly added tag text data set (as the training data set).
In the dynamic incremental update, the primary role of the original tag text data set is to provide the distribution of the original tag text data (i.e., the original tag text vocabulary). The purpose of the updated tag text vocabulary is to provide the probability distribution of each tag text after integration, which is used to randomly sample negative samples. Generally, high-frequency words are sampled with higher probability; intuitively, the more frequently a text appears, the more common the semantics it carries.
The dynamic incremental update trains/updates the model with only the "new" data. Because the newly added data is typically at least an order of magnitude smaller than the original/historical data, training/updating is more convenient and faster.
In one or more embodiments of the present disclosure, when updating the tag vector model by the dynamic incremental update method, the preset number of iterations may be, for example, 5, but it should not be too large, to prevent over-training or semantic drift caused by the newly added training set. The preset number of negative samples may be, for example, 15, but it should not be too small, lest the original tag vectors be insufficiently updated.
In one or more embodiments of the present disclosure, when the tag vector model is updated by the dynamic incremental update method, because the current model is initialized with the original tag vector model and vocabulary, the relationship information between tags in the original training data set is retained. Meanwhile, by using negative sampling and the SGD algorithm, not only the newly added tag vectors but also the original tag vectors are iteratively updated.
Step S110: clustering the tag vectors of all target tag texts.
For example, all tag vectors may be cluster-analyzed according to a preset clustering algorithm and output by class according to the clustering results. In this embodiment, the clustering algorithm may be the K-Means algorithm.
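A minimal sketch of this clustering step using scikit-learn's K-Means; the number of clusters is an illustrative assumption, as the patent does not fix it.

```python
from sklearn.cluster import KMeans

def cluster_tags(V, n_clusters=50, seed=0):
    """Cluster all tag vectors; return {cluster_id: [tag indices]}."""
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(V)
    clusters = {}
    for idx, c in enumerate(labels):
        clusters.setdefault(int(c), []).append(idx)
    return clusters
```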
Step S112: the similarity of each label vector in each cluster to other label vectors is determined.
The similarity may be calculated using the cosine measure:

\cos(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{|\vec{a}|\,|\vec{b}|}

where |\vec{a}| represents the length (norm) of vector \vec{a}, and |\vec{b}| represents the length (norm) of vector \vec{b}.
By comparing the similarity between tag vectors, the similarity between tags can be determined. For example, within each cluster, the top K most similar vectors for each tag vector are calculated, where K is a preset positive integer.
One or more embodiments of the present disclosure measure tag similarity by the cosine of the angle between tag vectors. After all tag vectors are unitized, only floating-point additions and multiplications are involved, so the computation can be further accelerated using multi-threading or GPU computation.
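A minimal sketch of the unitize-then-dot-product computation, assuming the cluster membership produced by the clustering sketch above; after normalization, the top-K neighbors of every tag in a cluster come from a single matrix multiplication.

```python
import numpy as np

def top_k_similar(V, cluster_ids, k=10):
    """For each tag in a cluster, return its K most similar cluster mates."""
    X = V[cluster_ids]
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unitize: |v| = 1
    sims = X @ X.T                                    # cosine of every pair
    np.fill_diagonal(sims, -np.inf)                   # exclude self-similarity
    order = np.argsort(-sims, axis=1)[:, :k]          # top-K columns per row
    return {cluster_ids[i]: [cluster_ids[j] for j in order[i]]
            for i in range(len(cluster_ids))}
```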
Step S114: pushing information to the user based on the similarity.
For example, the relevant recommendation results may be calculated offline. Offline recommendation results may then be retrieved from a database based on user information or behavior and displayed on a relevant page (e.g., via a browser or application).
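A minimal sketch of this offline-compute/online-retrieve split, reusing the `top_k_similar` sketch above; the in-memory dictionary stands in for whatever database a real deployment would use.

```python
def precompute_recommendations(V, clusters, tag_names, k=10):
    """Offline: store each tag's top-K similar tags in a lookup table."""
    store = {}
    for ids in clusters.values():
        for tag_idx, similar in top_k_similar(V, ids, k).items():
            store[tag_names[tag_idx]] = [tag_names[j] for j in similar]
    return store

def recommend(store, user_tags, limit=10):
    """Online: serving is a table lookup, with no model evaluation."""
    results = []
    for tag in user_tags:
        results.extend(store.get(tag, []))
    return results[:limit]
```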
One or more embodiments of the present disclosure simplify the calculation flow of conventional recommendation algorithms by implementing the information recommendation method with a tag vector model; achieve rapid updating of the tag vector model through the dynamic incremental update method, overcoming to some extent the shortcomings of static algorithms; narrow the candidate recommendation result set through cluster analysis, improving recommendation quality while reducing the time and memory overhead of similarity calculation; and greatly reduce the online response time of the recommendation system through offline calculation/updating and online retrieval.
Fig. 3 is a schematic diagram of an information recommendation device according to one or more embodiments of the present disclosure. As shown in fig. 3, the information recommendation apparatus 300 may include an original data acquisition unit 310, a preprocessing unit 320, a tag vector generation unit 330, a clustering unit 340, a similarity determination unit 350, an information pushing unit 360, and an updating unit 370.
The raw data acquisition unit 310 may be configured to acquire raw data related to content of interest to the user. For example, data related to content of interest to the user may be obtained from an operation log of the user.
The pre-processing unit 320 may be configured to pre-process the raw data to generate a raw tag text data set, the raw tag text data set comprising a plurality of target tag texts.
The tag vector generating unit 330 may be configured to generate a tag vector for each target tag text through a tag vector model based on the original tag text data set. The tag vector model may be a continuous bag-of-words model or a skip-gram model.
The clustering unit 340 may be configured to cluster the tag vectors of all target tag texts. The clustering algorithm may be a K-Means algorithm.
The similarity determination unit 350 may be configured to determine the similarity of each tag vector in each cluster to other tag vectors. The similarity may be calculated using the cosine measure.
The information pushing unit 360 may be configured to push information to the user based on the similarity.
The updating unit 370 may be configured to dynamically update the tag vector model based on the newly added tag text data set. The specific principle of dynamically updating the tag vector model based on the newly added tag text data set is similar to the steps described in the foregoing embodiments of the information recommendation method and is not repeated here.
One or more embodiments of the present disclosure simplify the calculation flow of conventional recommendation algorithms by implementing the information recommendation apparatus with a tag vector model; achieve rapid updating of the tag vector model through the dynamic incremental update method, overcoming to some extent the shortcomings of static algorithms; narrow the candidate recommendation result set through cluster analysis, improving recommendation quality while reducing the time and memory overhead of similarity calculation; and greatly reduce the online response time of the recommendation system through offline calculation/updating and online retrieval.
FIG. 4 is a schematic diagram of a computing device in accordance with one or more embodiments of the present disclosure. Computing device 100 may include at least one storage medium having at least one set of instructions stored thereon; and at least one processor communicatively coupled to the at least one storage medium. When the at least one processor executes the at least one set of instructions, the at least one processor performs the aforementioned method.
Computing device 100 may be implemented using suitable hardware components and/or software components (e.g., processors, controllers, memory units, storage units, input units, output units, communication units, operating systems, applications, and the like).
In some example embodiments, computing device 100 may include, for example, a computing device, a mobile phone, a smart phone, a cellular phone, a notebook, a mobile computer, a laptop computer, a notebook computer, a desktop computer, a handheld device, a PDA device, a handheld PDA device, a wireless communication device, a PDA device incorporating a wireless communication device, and the like.
In some example embodiments, computing device 100 may include, for example, one or more of a processor 191, an input unit 192, an output unit 193, a storage medium 194, and/or logic 195. Computing device 100 may optionally include other suitable hardware components and/or software components. In some example embodiments, some or all of the components of one or more of computing devices 100 may be enclosed in a common housing or packaging, and may be interconnected or operatively associated using one or more wired or wireless links. In other embodiments, one or more components of computing device 100 may be distributed in multiple or separate devices.
In some example embodiments, the processor 191 may comprise, for example, a Central Processing Unit (CPU), a Digital Signal Processor (DSP), one or more processor cores, a single-core processor, a dual-core processor, a multi-core processor, a microprocessor, a host processor, a controller, a plurality of processors or controllers, a chip, a microchip, one or more circuits, circuitry, a logic unit, an Integrated Circuit (IC), an application specific IC (asic), or any other suitable multi-functional or special purpose processor or controller. Processor 191 may execute instructions of an Operating System (OS) and/or one or more suitable applications of computing device 100, for example.
In some exemplary embodiments, the input unit 192 may include, for example, a keyboard, keypad, mouse, touch screen, touch pad, trackball, stylus, microphone, or other suitable pointing or input device. The output unit 193 may include, for example, a monitor, a screen, a touch screen, a flat panel display, a Light Emitting Diode (LED) display unit, a Liquid Crystal Display (LCD) display unit, a plasma display unit, one or more speakers or headphones, or other suitable output device.
In some exemplary embodiments, storage medium 194 may include, for example, Random Access Memory (RAM), read-only memory (ROM), Dynamic RAM (DRAM), synchronous DRAM (SD-RAM), flash memory, volatile memory, non-volatile memory, cache memory, buffers, short-term storage units, long-term storage units, hard disk drives, floppy disk drives, Compact Disk (CD) drives, CD-ROM drives, DVD drives, or other suitable removable or non-removable storage units. Storage media 194 may store, for example, data processed by computing device 100.
In some example embodiments, the storage medium 194 may store logic 195, and the logic 195 may include instructions, data, and/or code that, when executed by a machine, may cause the machine to perform methods, processes, and/or operations as described herein. The machine may include, for example, any suitable processing platform, computing device, processing device, computing system, processing system, computer, processor, or the like, and may be implemented using any suitable combination of hardware, software, firmware, or the like. Logic 195 may include or may be implemented as software, a software module, an application, a program, a subroutine, instructions, an instruction set, computing code, words, values, tokens, and the like. The instructions may include any suitable type of code (such as source code, compiled code, interpreted code, executable code, static code, dynamic code, and the like). The instructions may be implemented according to a predefined computer language, manner, or syntax, for instructing a processor to perform a certain function. The instructions may be implemented using any suitable high-level, low-level, object-oriented, visual, compiled, and/or interpreted programming language, such as C, C++, Java, BASIC, Python, Matlab, Pascal, Visual BASIC, assembly language, machine code, and the like.
In some example embodiments, computing device 100 may be configured to communicate with one or more other devices via a wireless and/or wired network. The network may include a wired network, a Local Area Network (LAN), a wireless LAN (wlan) network, a radio network, a cellular network, a wireless fidelity (WiFi) network, an IR network, a Bluetooth (BT) network, and the like.
In some example embodiments, computing device 100 may allow one or more users to interact with one or more processes, applications, and/or modules of computing device 100, e.g., as described herein.
In some example embodiments, computing device 100 may be configured to perform and/or carry out one or more operations, modules, processes, procedures, and/or the like.
In conclusion, upon reading the present detailed disclosure, those skilled in the art will appreciate that the foregoing detailed disclosure can be presented by way of example only, and not limitation. Those skilled in the art will appreciate that the present disclosure is intended to encompass reasonable variations, improvements, and modifications to the embodiments, even though not explicitly stated herein. Such alterations, improvements, and modifications are intended to be suggested by this disclosure, and are within the spirit and scope of the exemplary embodiments of this disclosure.

Claims (10)

1. An information recommendation method, comprising:
obtaining raw data related to content of interest to a user;
preprocessing the raw data to generate an original tag text data set, wherein the original tag text data set comprises a plurality of target tag texts;
generating a tag vector for each target tag text through a tag vector model based on the original tag text data set;
clustering the tag vectors of all target tag texts;
determining the similarity of each tag vector in each cluster to other tag vectors; and
pushing information to the user based on the similarity,
wherein the information recommendation method further comprises: dynamically updating the tag vector model based on a newly added tag text data set.
2. The information recommendation method of claim 1, wherein the raw data comprises at least one of: the user's retrieval history data, browsing history data, download history data, collection history data, and chat history data.
3. The information recommendation method of claim 1, wherein the plurality of target tag texts are separated by spaces or special characters, and the number of characters of each of the plurality of target tag texts does not exceed a predetermined threshold.
4. The information recommendation method of claim 3, wherein the predetermined threshold value ranges from 20 to 50.
5. The information recommendation method of claim 1, wherein dynamically updating the tag vector model based on the newly added tag text data set comprises:
acquiring newly added data related to the content in which the user is interested;
preprocessing the newly added data to generate the newly added tag text data set; and
iteratively updating the tag vector model based on the newly added tag text data set.
6. An information recommendation apparatus, comprising:
a raw data acquisition unit configured to acquire raw data related to content of interest to a user;
a preprocessing unit configured to preprocess the raw data to generate an original tag text data set, the original tag text data set comprising a plurality of target tag texts;
a tag vector generation unit configured to generate a tag vector for each target tag text through a tag vector model based on the original tag text data set;
a clustering unit configured to cluster the tag vectors of all target tag texts;
a similarity determination unit configured to determine the similarity of each tag vector in each cluster to other tag vectors; and
an information pushing unit configured to push information to the user based on the similarity,
wherein the information recommendation apparatus further comprises an updating unit configured to: dynamically update the tag vector model based on a newly added tag text data set.
7. The information recommendation device of claim 6, wherein the raw data comprises at least one of: the user's retrieval history data, browsing history data, download history data, collection history data, and chat history data.
8. The information recommendation device of claim 6, wherein the plurality of target tag texts are separated by spaces or special characters, and a number of characters of each of the plurality of target tag texts does not exceed a predetermined threshold.
9. The information recommendation device of claim 8, wherein the predetermined threshold ranges from 20 to 50.
10. The information recommendation device of claim 6, wherein dynamically updating the tag vector model based on the newly added tag text data set comprises:
acquiring newly added data related to the content in which the user is interested;
preprocessing the newly added data to generate the newly added tag text data set; and
iteratively updating the tag vector model based on the newly added tag text data set.
CN202011336768.8A 2020-11-25 2020-11-25 Information recommendation method and device Pending CN112417131A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011336768.8A CN112417131A (en) 2020-11-25 2020-11-25 Information recommendation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011336768.8A CN112417131A (en) 2020-11-25 2020-11-25 Information recommendation method and device

Publications (1)

Publication Number Publication Date
CN112417131A true CN112417131A (en) 2021-02-26

Family

ID=74843767

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011336768.8A Pending CN112417131A (en) 2020-11-25 2020-11-25 Information recommendation method and device

Country Status (1)

Country Link
CN (1) CN112417131A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113222697A (en) * 2021-05-11 2021-08-06 湖北三赫智能科技有限公司 Commodity information pushing method, commodity information pushing device, computer equipment and readable storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111125495A (en) * 2019-12-19 2020-05-08 京东方科技集团股份有限公司 Information recommendation method, equipment and storage medium

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111125495A (en) * 2019-12-19 2020-05-08 京东方科技集团股份有限公司 Information recommendation method, equipment and storage medium

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113222697A (en) * 2021-05-11 2021-08-06 湖北三赫智能科技有限公司 Commodity information pushing method, commodity information pushing device, computer equipment and readable storage medium

Similar Documents

Publication Publication Date Title
US11544474B2 (en) Generation of text from structured data
US10762283B2 (en) Multimedia document summarization
CN108319627B (en) Keyword extraction method and keyword extraction device
US20180158078A1 (en) Computer device and method for predicting market demand of commodities
CN111797214A (en) FAQ database-based problem screening method and device, computer equipment and medium
CN108334951B (en) Pre-statistics of data for nodes of a decision tree
CN111046221A (en) Song recommendation method and device, terminal equipment and storage medium
CN113434636B (en) Semantic-based approximate text searching method, semantic-based approximate text searching device, computer equipment and medium
CN111950279A (en) Entity relationship processing method, device, equipment and computer readable storage medium
Lee et al. Efficient image retrieval using advanced SURF and DCD on mobile platform
CN113688310A (en) Content recommendation method, device, equipment and storage medium
CN112417133A (en) Training method and device of ranking model
TW202001621A (en) Corpus generating method and apparatus, and human-machine interaction processing method and apparatus
Su et al. Hybrid recommender system based on deep learning model
Lee et al. Extraction and prioritization of product attributes using an explainable neural network
Zhai et al. Text classification of Chinese news based on multi-scale CNN and LSTM hybrid model
CN109241238B (en) Article searching method and device and electronic equipment
CN111191011B (en) Text label searching and matching method, device, equipment and storage medium
CN112417131A (en) Information recommendation method and device
CN112417154B (en) Method and device for determining similarity of documents
CN115964474A (en) Policy keyword extraction method and device, storage medium and electronic equipment
CN107622129B (en) Method and device for organizing knowledge base and computer storage medium
Tu et al. A domain-independent text segmentation method for educational course content
CN113761213A (en) Data query system and method based on knowledge graph and terminal equipment
Li et al. Word-Graph2vec: An efficient word embedding approach on word co-occurrence graph using random walk technique

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 001a, 11 / F, building 1, 588 Zixing Road, Minhang District, Shanghai, 200241

Applicant after: Shanghai chuangmi Shulian Intelligent Technology Development Co.,Ltd.

Address before: Room 001a, 11 / F, building 1, 588 Zixing Road, Minhang District, Shanghai, 200241

Applicant before: SHANGHAI CHUANGMI TECHNOLOGY Co.,Ltd.

CB02 Change of applicant information