CN112417131A - Information recommendation method and device - Google Patents
- Publication number: CN112417131A
- Application number: CN202011336768.8A
- Authority
- CN
- China
- Prior art keywords
- tag
- label
- user
- data set
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval of unstructured textual data
- G06F16/33—Querying
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
- G06F16/35—Clustering; Classification
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
Abstract
The present application discloses an information recommendation method and device. The method comprises: obtaining raw data related to content a user is interested in; preprocessing the raw data to generate an original tag text data set comprising a plurality of target tag texts; generating a tag vector for each target tag text through a tag vector model based on the original tag text data set; clustering the tag vectors of all target tag texts; determining the similarity between each tag vector in each cluster and the other tag vectors; and pushing information to the user based on the similarity. The method further comprises dynamically updating the tag vector model based on a newly added tag text data set. The method and device simplify the computation of the recommendation algorithm, reduce software and hardware consumption, and enable dynamic, rapid updating of the tag vector model.
Description
Technical Field
The present disclosure relates to the technical field of big data information processing, and in particular to an information recommendation method and device.
Background
With the rapid development of science and technology, particularly cloud computing and big data, the amount of information has grown explosively. It is increasingly difficult for users to find the information they are interested in among such massive amounts of information, and accurately recommending the information a user needs is a problem that those skilled in the art urgently need to solve.
Therefore, an information recommendation method and apparatus are needed.
Disclosure of Invention
The present invention aims to provide an information recommendation method and an information recommendation device that can reduce system load, shorten model update time, and improve the real-time performance of a recommendation system.
To achieve the above object, one aspect of the present disclosure provides an information recommendation method, including: obtaining raw data related to content a user is interested in; preprocessing the raw data to generate an original tag text data set comprising a plurality of target tag texts; generating a tag vector for each target tag text through a tag vector model based on the original tag text data set; clustering the tag vectors of all target tag texts; determining the similarity between each tag vector in each cluster and the other tag vectors; and pushing information to the user based on the similarity. The method further includes dynamically updating the tag vector model based on a newly added tag text data set.
Optionally, the raw data comprises at least one of: the user's retrieval history data, browsing history data, download history data, collection history data, and chat history data.
Optionally, the plurality of target tag texts are separated by spaces or special characters, and the number of characters of each of the plurality of target tag texts does not exceed a predetermined threshold.
Optionally, the predetermined threshold is in the range of 20 to 50.
Optionally, dynamically updating the tag vector model based on the newly added tag text data set includes: acquiring new data related to the content in which the user is interested; preprocessing the newly added data to generate the newly added tag text data set; and iteratively updating the tag vector model based on the newly added tag text data set.
Another aspect of the present disclosure provides an information recommendation apparatus, including: a raw data acquisition unit configured to acquire raw data related to content a user is interested in; a preprocessing unit configured to preprocess the raw data to generate an original tag text data set comprising a plurality of target tag texts; a tag vector generation unit configured to generate a tag vector for each target tag text through a tag vector model based on the original tag text data set; a clustering unit configured to cluster the tag vectors of all target tag texts; a similarity determination unit configured to determine the similarity between each tag vector in each cluster and the other tag vectors; and an information pushing unit configured to push information to the user based on the similarity. The apparatus further includes an updating unit configured to dynamically update the tag vector model based on a newly added tag text data set.
Optionally, the raw data comprises at least one of: the user's retrieval history data, browsing history data, download history data, collection history data, and chat history data.
Optionally, the plurality of target tag texts are separated by spaces or special characters, and the number of characters of each of the plurality of target tag texts does not exceed a predetermined threshold.
Optionally, the predetermined threshold is in the range of 20 to 50.
Optionally, dynamically updating the tag vector model based on the newly added tag text data set includes: acquiring new data related to the content in which the user is interested; preprocessing the newly added data to generate the newly added tag text data set; and iteratively updating the tag vector model based on the newly added tag text data set.
Yet another aspect of the present disclosure provides a computing device, comprising: at least one storage medium storing at least one set of instructions; and at least one processor communicatively coupled to the at least one storage medium, wherein the at least one processor executes the at least one set of instructions to perform the method.
The information recommendation method and device provided by one or more embodiments of the present disclosure have one or more of the following advantages:
(1) Application scenarios are wide. Using the tag vector model, the method can recommend various information resources to the user according to the user log, such as related search terms, related authors, related organizations, and related products. It can be widely applied to the recommendation systems of e-commerce websites.
(2) Data sources are rich. The recommendation system implemented with the tag vector model can draw its tag data from many sources, including text, audio, video, and pictures.
(3) Preprocessing is simple. Unlike traditional algorithms, no utility matrix needs to be constructed; the tag data only needs to be cleaned, the corresponding tag text data extracted and organized as required, and stop words removed where necessary.
(4) Model update time is short. For the tag vector model, the present disclosure provides a simple and effective dynamic incremental updating method that reduces model update time while preserving model quality.
(5) Real-time performance is high. By computing and updating offline and retrieving online, the present disclosure avoids the real-time computation of conventional recommendation systems and greatly reduces the online response time of the recommendation system.
Drawings
The following drawings describe in detail exemplary embodiments disclosed in the present disclosure. Wherein like reference numerals represent similar structures throughout the several views of the drawings. Those of ordinary skill in the art will understand that the present embodiments are non-limiting, exemplary embodiments, and that the accompanying drawings are for illustrative and descriptive purposes only and are not intended to limit the scope of the present disclosure, as other embodiments may equally fulfill the conceptual intent of the present disclosure. It should be understood that the drawings are not to scale. Wherein:
fig. 1 is a flow diagram of an information recommendation method according to one or more embodiments of the present disclosure;
FIG. 2 is a flow diagram of updating the tag vector according to one or more embodiments of the present disclosure;
FIG. 3 is a schematic diagram of an information recommendation device according to one or more embodiments of the present disclosure;
FIG. 4 is a schematic diagram of a computing device in accordance with one or more embodiments of the present disclosure.
Detailed Description
The following description is presented to enable any person skilled in the art to make and use the present disclosure, and is provided in the context of a particular application and its requirements. Various local modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present disclosure. Thus, the present disclosure is not limited to the embodiments shown, but is to be accorded the widest scope consistent with the claims.
Those skilled in the art will appreciate that the terminology used in the present disclosure is for the purpose of describing particular example embodiments only and is not intended to be limiting. For example, as used herein, the singular forms "a", "an", and "the" may include the plural forms as well, unless the context clearly indicates otherwise. The terms "comprises," "comprising," "includes," "including," "has," "having," "contains," "equipped with," and/or "provided," when used in this disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Those skilled in the art will appreciate that specific terminology has been used to describe the embodiments of the disclosure. For example, "an embodiment," "one embodiment," "some embodiments," "embodiments," and/or "embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present disclosure. Therefore, it is emphasized and should be appreciated that two or more references to "an embodiment" or "an alternative embodiment" in various portions of this disclosure are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined as suitable in one or more embodiments of the disclosure.
It will be understood by those skilled in the art that, unless otherwise specified, the ordinal adjectives "first", "second", "third", etc., are used to describe a common object, merely indicate that different instances of like objects are being referred to, and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.
Those skilled in the art will understand that aspects of the present disclosure may be illustrated and described in any of a number of patentable categories or contexts, including any new and useful processes, machines, manufacture, or compositions of matter, or any new and useful improvements thereof. Accordingly, aspects of the present disclosure may be implemented entirely in hardware (circuits, chips, logic devices, etc.), entirely in software (including firmware, resident software, micro-code, etc.) or a combination of both, which may be referred to herein generally as "blocks," modules, "" engines, "" units, "" components, "or" systems. Furthermore, aspects of the present disclosure may take the form of a computer program product embodied in one or more computer-readable media containing computer-readable program code embodied thereon.
Those skilled in the art will appreciate that an algorithm in the present disclosure is generally considered to be a self-consistent sequence of acts or operations leading to a desired result. These include physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, labels, characters, terms, numbers, or the like. It should be understood, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities.
Those skilled in the art will appreciate that discussions of "processing," "computing," "calculating," "determining," "creating," "analyzing," "checking," or the like, in the present disclosure may refer to the action and/or processes of a computer, computing platform, computing system, or other electronic computing device, that manipulate and/or transform data represented as physical, such as electronic, quantities within the computer's registers and/or memories into other data similarly represented as physical quantities within the computer's registers and/or memories or other information storage medium that may store instructions to perform operations and/or processes.
As one effective way to promote the development of search engines, information recommendation devices and methods mainly analyze user behavior from user logs, mine user requirements, and recommend information of interest to users.
In recommendation-algorithm research, researchers typically assume that the two-dimensional relationship between users and interests forms a utility matrix, compute on this matrix using matrix decomposition or other methods, and then make recommendations or push information to users. In this type of study, however, most algorithms are static algorithms based on static data. When the number of users or the amount of information grows, the algorithm cannot be updated dynamically and the entire data set must be recomputed. This increases system load and wastes computing resources. Such algorithmic models can no longer meet practical requirements or keep up with the rapid expansion of users and information.
As an efficient tool for representing words as real-valued vectors, Word2vec draws on ideas from deep learning to reduce the processing of text content to vector operations in a low-dimensional vector space, where similarity in the vector space can represent semantic similarity between texts. The model is not only efficient but also widely applicable, including to recommendation systems. However, the algorithm is still static and cannot be updated dynamically as data grows.
In order to solve the above problems, the present application provides a method and an apparatus for dynamically updating information recommendation.
Fig. 1 is a flowchart of an information recommendation method according to one or more embodiments of the present disclosure.
As shown in fig. 1, the information recommendation method may include step S102, step S104, step S106, step S108, step S110, step S112, and step S114.
Step S102: raw data relating to content of interest to a user is obtained.
The raw data relating to content of interest to the user comprises at least one of: retrieval history data (or log) of the user, browsing history data (or log) of the user, download history data (or log) of the user, collection history data (or log) of the user, and chat history data (or log) of the user. The retrieval history data of the user can be keywords, retrieval types and the like input by the user in various online or offline search engines. The browsing history data of the user can be the contents of articles, documents, web pages, newspapers and/or magazines read by the user through various application software and/or the contents of databases accessed by the user. The download history data of the user can be articles, documents, web pages, newspapers, magazines, audio, videos, hyperlinks and the like downloaded by the user through various application software. The collection history data of the user can be contents collected by the user in various social software or stored on a local terminal. The chat history data of the user can be the content published by the user through various social software or the chat content. The above various historical data can be used for analyzing and refining the attention points and the preferences of the users so as to accurately recommend information to the users. For example, the data related to the user requirement may be a sequence of search words included in each user session in the user log. Although the information recommendation method and device are described in the specification by taking text data as an example, it should be understood that the data related to the user also includes video, audio, picture and other types of data. In some embodiments, the tag data may correspond to audio data, video data, picture data, or other entity data.
Step S104: preprocessing the raw data to generate a raw tag text data set, the raw tag text data set including a plurality of target tag texts.
The "tag" may be any of the aforementioned text, audio, video, pictures, or even hyperlinks. The "tag text" is a textual notation used to identify tags uniformly when preprocessing the raw data. For example, when the "tag" is a picture of Adidas sneakers on a shopping website, its "tag text" may be the phrase "Adidas sneakers"; when the "tag" is the audio of the song "Qi Li Xiang" on a music website, its "tag text" may be the phrase "song Qi Li Xiang"; when the "tag" is a professional document on the web, its "tag text" may be the title of the document. A "tag text data set" is a collection of tag text data, which may typically include multiple groups of tag text data, each of which may in turn include multiple tag texts.
In the tag data generated by the preprocessing, a plurality of target tag texts (e.g., a plurality of short texts) may be separated by special characters or character strings (e.g., space characters, tab characters). The number of characters per target tag text does not exceed a predetermined threshold. For example, the predetermined threshold may range from 20 to 50. Such target tag text may also be referred to as short text. For example, in a search engine, an author name with a number of characters less than 20, a keyword with a number of characters less than 30, a search formula or search term with a number of characters less than 40, a document title with a number of characters less than 50, and the like may all be referred to as short text.
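The short-text rule above can be sketched roughly as follows; the separator set and the threshold of 30 characters are assumptions chosen inside the stated 20 to 50 range, not values fixed by the disclosure:

```python
MAX_CHARS = 30  # assumed threshold, picked from the 20-50 range above

def extract_short_texts(line, max_chars=MAX_CHARS):
    """Split a raw line on spaces/tabs and keep only 'short text' tags
    whose character count does not exceed the threshold."""
    candidates = line.replace("\t", " ").split(" ")
    return [t for t in (c.strip() for c in candidates)
            if t and len(t) <= max_chars]

tags = extract_short_texts(
    "adidas_sneakers\t361_sneakers a_title_far_too_long_to_count_as_short_text_here")
```

Here the over-long third token is dropped, so `tags` keeps only the two short tag texts.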
In some embodiments, an original label text data set (also referred to as a training data set) for the label vector model may be generated based on the original data. In some embodiments, generating the original tag text data set may include the steps of:
(1) counting the occurrence frequency of the tags in the original data, and constructing a tag text vocabulary table based on the occurrence frequency, wherein the tag text vocabulary table comprises the occurrence frequency of each tag and the index of each tag;
(2) defining a context tag;
(3) setting low-frequency label filtering;
(4) traversing the original data line by line to generate label pairs (also called training data positive samples); and
(5) the tag pairs are shuffled randomly by an algorithm (e.g., a shuffling algorithm).
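The five steps above can be sketched as follows. The function names and data layout are hypothetical (the disclosure does not fix an API); low-frequency filtering is shown as a simple minimum-frequency cutoff and the window size is illustrative:

```python
import random
from collections import Counter

def build_vocab(lines, min_freq=1):
    """Steps (1)+(3): count tag frequency, filter low-frequency tags,
    and assign each surviving tag an integer index."""
    freq = Counter(tag for line in lines for tag in line.split())
    vocab = {}
    for tag, n in freq.most_common():
        if n >= min_freq:
            vocab[tag] = {"freq": n, "index": len(vocab)}
    return vocab

def make_pairs(lines, vocab, window=2, shuffle=True):
    """Steps (2)+(4)+(5): emit (center, context) index pairs for tags no
    more than `window` positions apart, then shuffle the positive samples."""
    pairs = []
    for line in lines:
        ids = [vocab[t]["index"] for t in line.split() if t in vocab]
        for i, center in enumerate(ids):
            for j in range(max(0, i - window), min(len(ids), i + window + 1)):
                if j != i:
                    pairs.append((center, ids[j]))
    if shuffle:
        random.shuffle(pairs)
    return pairs

vocab = build_vocab(["adi_sneakers 361_sneakers sneakers",
                     "sneakers 361_sneakers sneakers"])
pairs = make_pairs(["adi_sneakers 361_sneakers sneakers"], vocab, window=2)
```

With these two lines, "sneakers" is most frequent and gets index 0, and the single-line pair generation yields both orderings of every tag pair within the window, matching the symmetric index pairs shown in Table 2.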
In some embodiments, the tag vocabulary may be as shown in Table 1.
TABLE 1
| Tag | Tag text | Frequency | Index |
| --- | --- | --- | --- |
| Adidas sneakers (picture) | Adidas sneakers | 3 | 0 |
| 361 sneakers (text) | 361 sneakers | 3 | 1 |
| Sneakers (text) | Sneakers | 6 | 2 |
In some embodiments, defining the context of the tag includes determining a labelset that is adjacent to the center tag with a spacing of no more than a predetermined number of tags.
In some embodiments, the partial index pairs may be as shown in table 2.
TABLE 2
| Index pair | Tag pair |
| --- | --- |
| (1, 0) | (361 sneakers, Adidas sneakers) |
| (0, 1) | (Adidas sneakers, 361 sneakers) |
| (0, 2) | (Adidas sneakers, sneakers) |
| (2, 0) | (sneakers, Adidas sneakers) |
| (1, 2) | (361 sneakers, sneakers) |
| (2, 1) | (sneakers, 361 sneakers) |
Step S106: and generating a label vector of each target label text through a label vector model based on the original label text data set.
A "tag vector" is a vector representation of a tag or tag text. For example, tag vectors may be derived by training a tag vector model on a tag text data set. Specifically, a tag vector for each target tag text in the tag text vocabulary can be trained by having the tag vector model traverse the tag text data set. Each tag vector in the model can also be updated iteratively by performing stochastic gradient descent (SGD) on each positive sample together with N randomly sampled negative samples.
In some embodiments, the tag vector model may be an unsupervised model. In some embodiments, the tag vector model may be a word vector model, such as a continuous bag of words (CBOW) model or a Skip-gram (Skip-gram) model. For example, a training data set for the label vector model may be constructed based on the plurality of target label texts. For example, the training data set of the label vector model may include all of the plurality of target label texts, or may include only a part of the plurality of target label texts. In some embodiments, each target tag text may be characterized as a real vector of 100 dimensions.
The continuous bag-of-words model may include an input layer, a projection layer, and an output layer. Its input is the training data set; when memory, disk overhead, and training duration are not a concern, the input may also be tag text pairs. The model takes as input the vectors corresponding to the context of a central target tag text (i.e., one or more tags before and after the current tag in the raw data) and outputs the vector corresponding to the central target tag text (i.e., the current tag). The output of the tag vector model may be the iteratively updated tag vector corresponding to each central tag. The training goal of the tag vector model is to make two tags that are adjacent (or within a certain tag interval) "close" in vector space.
The output of the continuous bag-of-words model is an iteratively updated tag vector; in other words, training the model is a self-iterative process. Taking the continuous bag-of-words model and the tag text data set segment "Adidas sneakers / 361 sneakers / sneakers" as an example, the training flow may include the following steps:
(1) reading a (positive) sample from the training data set, e.g., (1, 0), indicating that the current center tag is 1 (361 sneakers) and the context tag is 0 (Adidas sneakers);
(2) randomly sampling N negative samples for the center tag (here, 361 sneakers) according to the occurrence frequency of each tag in the tag text vocabulary, e.g., (1, 2); and
(3) iteratively updating the corresponding tag vectors based on each sample index pair.
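The frequency-based negative sampling in step (2) might look like the sketch below. The 0.75 smoothing exponent is borrowed from common word2vec practice and is an assumption, not something the disclosure specifies:

```python
import random

def sample_negatives(freqs, positive_id, n, power=0.75, rng=random):
    """Draw n negative tag indices, weighted by occurrence frequency
    raised to `power`, never returning the positive (center) tag itself."""
    weights = [f ** power for f in freqs]
    weights[positive_id] = 0.0  # exclude the center tag from negatives
    return rng.choices(range(len(freqs)), weights=weights, k=n)

# Frequencies from Table 1: Adidas sneakers=3, 361 sneakers=3, sneakers=6;
# the center tag is index 1 (361 sneakers).
negs = sample_negatives([3, 3, 6], positive_id=1, n=5)
```

Because "sneakers" occurs twice as often, it is roughly twice as likely to be drawn as a negative sample as "Adidas sneakers".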
The skip-gram model may include an input layer, a projection layer, and an output layer. The skip-gram model takes the training data set as input and outputs the iteratively updated vectors corresponding to the context tags of the target tag text (i.e., one or more tags located before and after the current tag in the tag data).
For example, in training to generate a label vector, the label vector model may first be randomly initialized and then iteratively traversed through the label vector training data set to iteratively update the label vector model.
In one or more embodiments, the tag vectors may be updated iteratively by optimizing the following objective function L:

L = \sum_{w \in C} \log \prod_{u \in \{w\} \cup NEG(w)} P(u \mid Context(w))

where C represents the tag training data set, Context(w) represents the context tag set of tag w, {w} represents the set containing only the tag w, NEG(w) represents the negative-sample tag set drawn for tag w, P(\cdot) is a probability function, and u ranges over the union of {w} and NEG(w).

The conditional probability in the objective function L is:

P(u \mid Context(w)) = \sigma(X_w \cdot v_u)^{L^w(u)} \cdot \big(1 - \sigma(X_w \cdot v_u)\big)^{1 - L^w(u)}

where X_w represents the projection vector corresponding to tag w, v_u is the vector representation of tag u, \sigma(\cdot) is a preset activation function such as the Sigmoid function, and L^w(u) is an indicator that equals 1 when u = w (positive sample) and 0 otherwise (negative sample).

The projection-layer function may be a summation function, a mean function, or an identity function. In this embodiment, the projection-layer function of the continuous bag-of-words model may be a mean function and that of the skip-gram model an identity function, so that X_w is:

X_w = \frac{1}{|Context(w)|} \sum_{c \in Context(w)} v_c

where |Context(w)| represents the total number of tags in the context tag set of tag w and v_c is the vector representation of context tag c.
Each tag vector in the continuous bag-of-words model and the skip-gram model can be updated iteratively by performing stochastic gradient descent (SGD) on each positive sample together with N randomly sampled negative samples.
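One CBOW negative-sampling SGD step, consistent with the mean projection and Sigmoid above, can be sketched numerically as follows. The two-matrix layout (input vectors v and output vectors theta) and the learning rate are conventional word2vec-style choices assumed here, not mandated by the disclosure:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cbow_sgd_step(v, theta, context_ids, center_id, neg_ids, lr=0.025):
    """One update: X_w is the mean of the context vectors; push
    sigmoid(X_w . theta_u) toward 1 for the positive tag (u = center)
    and toward 0 for each negative tag."""
    x = v[context_ids].mean(axis=0)                 # projection X_w (mean)
    grad_x = np.zeros_like(x)
    for u, label in [(center_id, 1.0)] + [(u, 0.0) for u in neg_ids]:
        g = lr * (label - sigmoid(x @ theta[u]))    # shared gradient factor
        grad_x += g * theta[u]
        theta[u] += g * x
    v[context_ids] += grad_x / len(context_ids)     # chain rule through mean

rng = np.random.default_rng(0)
v = rng.normal(0.0, 0.1, size=(3, 8))               # input tag vectors
theta = np.zeros((3, 8))                            # output tag vectors
# Context tag 0 (Adidas sneakers), center tag 1 (361 sneakers),
# negative tag 2 (sneakers), repeated as in the iterative training loop.
for _ in range(200):
    cbow_sgd_step(v, theta, context_ids=[0], center_id=1, neg_ids=[2])
x = v[[0]].mean(axis=0)
```

After a few hundred steps the positive pair scores above 0.5 under the Sigmoid while the negative pair scores below it, which is exactly the "adjacent tags become close in vector space" training goal.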
Step S108: and dynamically updating the label vector model based on the newly added label text data set.
Fig. 2 is a flow diagram of updating the tag vector according to one or more embodiments of the present disclosure. As shown in fig. 2, updating the tag vector model based on the newly added tag text dataset may include substep S1082, substep S1084, and substep S1086.
Substep S1082: new data related to the content of interest to the user is obtained.
The newly added data related to the content of interest to the user may include at least one of: retrieval history data (or log) of the user, browsing history data (or log) of the user, download history data (or log) of the user, collection history data (or log) of the user, and chat history data (or log) of the user.
Substep S1084: and preprocessing the new data to generate a new tag text data set.
For example, the added tag text data set may include a plurality of added target tag texts.
The preprocessing of the newly added data is similar to the preprocessing of the original data, and therefore, the description thereof is omitted.
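As an illustration, the preprocessing step (splitting raw log text into tag texts separated by spaces or special characters, and dropping tags whose character count exceeds a threshold) might look like the sketch below. The separator set and the 30-character threshold are illustrative assumptions; the claims only state a threshold in the 20–50 range:

```python
import re

def preprocess(raw_lines, max_chars=30):
    """Split raw log lines into tag texts and drop over-long ones.

    A minimal sketch of the preprocessing described above; separators
    and the threshold value are assumptions, not the patent's exact rules.
    """
    dataset = []
    for line in raw_lines:
        # Tags are separated by spaces or special characters.
        tags = [t for t in re.split(r"[\s,;|/]+", line) if t]
        # Keep only tags whose character count is within the threshold.
        dataset.append([t for t in tags if len(t) <= max_chars])
    return dataset
```

Each element of the returned data set is then a list of target tag texts ready for vocabulary building and training.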
Substep S1086: iteratively updating the label vector model based on the newly added label text data set.
For example, iteratively updating the label vector model based on the newly added label text data set may include: integrating the newly added label text data set with the original label text data set to generate an updated training data set, and then iteratively updating the label vector model using a dynamic incremental updating method based on the newly added training data set.
Specifically, when the label vector model is trained for the first time, all input vectors may be initialized randomly; when training finishes, all vectors and a label text vocabulary are output.
The dynamic incremental updating method may comprise the following steps, performed at each update of the label vector model:
loading all original label text vocabularies;
traversing the newly added label text data set, or a training data set constructed from the newly added label texts, and updating the current label text vocabulary accordingly;
loading the original label vector model, and randomly initializing the vectors corresponding to the newly added target label texts;
traversing the newly added label text data set, and iteratively updating the label vectors using a negative sampling method and the SGD algorithm; and
after the preset number of iterations, outputting the updated label vector model and label text vocabulary.
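The steps above can be sketched as follows. This simplified Python sketch shows only the vocabulary merge and the random initialization of newly seen tags; the negative-sampling SGD pass over the new data is elided, and all names are illustrative:

```python
import numpy as np

def incremental_update(model_vectors, vocab, new_dataset, dim=100, seed=0):
    """Sketch of the dynamic incremental update described above.

    model_vectors: dict tag -> vector, from the previously output model.
    vocab: dict tag -> frequency (the original label text vocabulary).
    new_dataset: the preprocessed newly added tag text data set.
    """
    rng = np.random.default_rng(seed)
    counts = dict(vocab)                      # start from the original vocabulary
    for tags in new_dataset:                  # merge in the new data's counts
        for t in tags:
            counts[t] = counts.get(t, 0) + 1
    for t in counts:
        if t not in model_vectors:            # new tag: random initialization
            model_vectors[t] = rng.normal(scale=0.01, size=dim)
    # ... then run negative-sampling SGD over new_dataset for a small,
    # fixed number of epochs (e.g. 5), updating old and new vectors alike.
    return model_vectors, counts
```

Because only the new data is traversed, the original vectors serve as the initialization and are refined rather than recomputed from scratch.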
In some embodiments, the inputs to the dynamic incremental update include (a) the previously output tag vector model (as initial values); (b) the updated tag text vocabulary (a probability distribution); and (c) the preprocessed newly added tag text data set (the training data set).
In the dynamic incremental update, the primary role of the original tag text data set is to provide the distribution of the original tag text data (i.e., the original tag text vocabulary). The purpose of the updated label text vocabulary is to provide the probability distribution of each label text after integration, which is used to randomly sample the negative examples. Generally, high-frequency words are more likely to be drawn during sampling. Intuitively, the more frequently a text appears, the more common the semantics it carries.
The dynamic incremental update trains/updates the model with only the "new" data. Because the newly added data tends to be at least an order of magnitude smaller than the original/historical data, training/updating is more convenient and faster.
In one or more embodiments of the present disclosure, when updating the tag vector model by the dynamic incremental updating method, a specific example of the preset number of iterations is 5; it should not be too large, to prevent over-training on, or semantic drift toward, the newly added text training set. A specific example of the preset number of negative samples is 15; it should not be too small, lest the original tag vectors are not sufficiently updated.
In one or more embodiments of the present disclosure, in the process of updating the tag vector model by the dynamic incremental updating method, because the current model is initialized with the original tag vector model and vocabulary, the relationship information between tags in the original text training data set is retained. Meanwhile, by using the negative sampling method and the SGD algorithm, not only the newly added label vectors but also the original label vectors are iteratively updated.
Step S110: and clustering the label vectors of all target label texts.
For example, all the label vectors may be subjected to clustering analysis according to a preset clustering algorithm, and classified and output according to clustering results. In this embodiment, the clustering algorithm may be a K-Means algorithm.
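A minimal from-scratch K-Means over the tag vectors might look like the sketch below. In practice a library implementation (e.g. scikit-learn's `KMeans`) would typically be used; the function and parameter names here are illustrative:

```python
import numpy as np

def kmeans(vectors, k, iters=20, seed=0):
    """Minimal K-Means clustering of stacked tag vectors (rows)."""
    rng = np.random.default_rng(seed)
    # Initialize centers as k randomly chosen distinct vectors.
    centers = vectors[rng.choice(len(vectors), size=k, replace=False)]
    for _ in range(iters):
        # Assign each vector to its nearest center (Euclidean distance).
        d = np.linalg.norm(vectors[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each center as the mean of its assigned vectors.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = vectors[labels == j].mean(axis=0)
    return labels, centers
```

The cluster labels then partition the tag vectors so that similarity search in the next step only needs to compare vectors within the same cluster.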
Step S112: the similarity of each label vector in each cluster to other label vectors is determined.
The similarity may be calculated as a cosine measure:

$$\cos(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\|\,\|\vec{b}\|}$$

where $\|\vec{a}\|$ represents the length (norm) of vector $\vec{a}$, and $\|\vec{b}\|$ represents the length (norm) of vector $\vec{b}$.
By comparing the similarity between the label vectors, the similarity between the labels can be known. For example, in each cluster, the first K most similar vectors of each label vector are calculated, K being a preset positive integer.
One or more embodiments of the present disclosure measure tag similarity by the cosine of the angle between tag vectors. After all tag vectors are unitized, only floating-point additions and multiplications are involved, so the computation can be further accelerated using multi-threading or GPU computation.
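The unitize-then-dot-product computation, together with the top-K neighbour selection per cluster, can be sketched as follows (a NumPy sketch; it assumes the vectors of one cluster are stacked as the rows of a matrix, and the names are illustrative):

```python
import numpy as np

def top_k_similar(vectors, k):
    """Return the indices of the k most cosine-similar rows for each row.

    Unitizing the vectors first reduces cosine similarity to a plain
    dot product, so the whole similarity matrix is one matrix multiply.
    """
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T                  # cosine similarity matrix
    np.fill_diagonal(sims, -np.inf)       # exclude each vector itself
    # Sort each row in descending similarity and keep the first k columns.
    return np.argsort(-sims, axis=1)[:, :k]
```

Because the hot loop is a single matrix multiply on unitized vectors, it maps directly onto multi-threaded BLAS or GPU kernels, as the paragraph above notes.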
Step S114: and pushing information to the user based on the similarity.
For example, the relevant recommendation results may be calculated offline. Based on user information or behavior, the offline recommendation results may then be retrieved from a database and displayed on a relevant page (e.g., via a browser or application).
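The offline-compute / online-lookup split can be sketched as follows. The names are illustrative, and a real system would store the offline results in a database rather than an in-memory dict:

```python
def build_offline_results(top_k_index, tag_names):
    """Offline path: map each tag to the names of its precomputed
    top-K most similar tags (top_k_index rows hold neighbour indices)."""
    return {tag_names[i]: [tag_names[j] for j in row]
            for i, row in enumerate(top_k_index)}

def recommend(offline_results, user_tags, limit=3):
    """Online path: a plain lookup that merges the precomputed lists
    for the user's tags, skipping tags the user already has."""
    seen, results = set(user_tags), []
    for t in user_tags:
        for r in offline_results.get(t, []):
            if r not in seen:
                seen.add(r)
                results.append(r)
    return results[:limit]
```

The online request therefore touches no model or similarity computation at all, which is what keeps the response time low.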
One or more embodiments of the present disclosure simplify the computation flow of conventional recommendation algorithms by implementing the information recommendation method with a tag vector model; achieve rapid updating of the tag vector model through the dynamic incremental updating method, overcoming, to a certain extent, the drawbacks of the original static algorithm; reduce the candidate recommendation result set through cluster analysis, improving recommendation quality and reducing the time and memory overhead of similarity computation; and greatly reduce the online response time of the recommendation system through an offline compute/update and online retrieval mode.
Fig. 3 is a schematic diagram of an information recommendation device according to one or more embodiments of the present disclosure. As shown in fig. 3, the information recommendation apparatus 300 may include an original data acquisition unit 310, a preprocessing unit 320, a tag vector generation unit 330, a clustering unit 340, a similarity determination unit 350, an information pushing unit 360, and an updating unit 370.
The raw data acquisition unit 310 may be configured to acquire raw data related to content of interest to the user. For example, data related to content of interest to the user may be obtained from an operation log of the user.
The pre-processing unit 320 may be configured to pre-process the raw data to generate a raw tag text data set, the raw tag text data set comprising a plurality of target tag texts.
The tag vector generating unit 330 may be configured to generate a tag vector for each target tag text through a tag vector model based on the original tag text data set. The tag vector model may be a continuous bag-of-words model or a skip-gram model.
The clustering unit 340 may be configured to cluster the tag vectors of all target tag texts. The clustering algorithm may be a K-Means algorithm.
The similarity determination unit 350 may be configured to determine the similarity of each tag vector in each cluster to the other tag vectors. The similarity may be calculated as a cosine measure.
The information pushing unit 360 may be configured to push information to the user based on the similarity.
The updating unit 370 may be configured to dynamically update the tag vector model based on the newly added tag text dataset. The specific principle of dynamically updating the tag vector model based on the newly added tag text data set is similar to the steps mentioned in the foregoing embodiments for the information recommendation method, and is not described herein again.
One or more embodiments of the present disclosure simplify the computation flow of conventional recommendation algorithms by implementing an information recommendation apparatus using a tag vector model; achieve rapid updating of the tag vector model through the dynamic incremental updating method, overcoming, to a certain extent, the drawbacks of the original static algorithm; reduce the candidate recommendation result set through cluster analysis, improving recommendation quality and reducing the time and memory overhead of similarity computation; and greatly reduce the online response time of the recommendation system through an offline compute/update and online retrieval mode.
FIG. 4 is a schematic diagram of a computing device in accordance with one or more embodiments of the present disclosure. Computing device 100 may include at least one storage medium having at least one set of instructions stored thereon; and at least one processor communicatively coupled to the at least one storage medium. When the at least one processor executes the at least one set of instructions, the at least one processor performs the aforementioned method.
In some example embodiments, computing device 100 may include, for example, a computing device, a mobile phone, a smart phone, a cellular phone, a notebook, a mobile computer, a laptop computer, a notebook computer, a desktop computer, a handheld device, a PDA device, a handheld PDA device, a wireless communication device, a PDA device incorporating a wireless communication device, and the like.
In some example embodiments, computing device 100 may include, for example, one or more of a processor 191, an input unit 192, an output unit 193, a storage unit 194, and/or a storage unit 195. Computing device 100 may optionally include other suitable hardware components and/or software components. In some example embodiments, some or all of the components of one or more of computing devices 100 may be enclosed in a common housing or packaging, and may be interconnected or operatively associated using one or more wired or wireless links. In other embodiments, one or more components of computing device 100 may be distributed in multiple or separate devices.
In some example embodiments, the processor 191 may comprise, for example, a Central Processing Unit (CPU), a Digital Signal Processor (DSP), one or more processor cores, a single-core processor, a dual-core processor, a multi-core processor, a microprocessor, a host processor, a controller, a plurality of processors or controllers, a chip, a microchip, one or more circuits, circuitry, a logic unit, an Integrated Circuit (IC), an application-specific IC (ASIC), or any other suitable multi-functional or special purpose processor or controller. Processor 191 may execute instructions of an Operating System (OS) and/or one or more suitable applications of computing device 100, for example.
In some exemplary embodiments, the input unit 192 may include, for example, a keyboard, keypad, mouse, touch screen, touch pad, trackball, stylus, microphone, or other suitable pointing or input device. The output unit 193 may include, for example, a monitor, a screen, a touch screen, a flat panel display, a Light Emitting Diode (LED) display unit, a Liquid Crystal Display (LCD) display unit, a plasma display unit, one or more speakers or headphones, or other suitable output device.
In some exemplary embodiments, storage medium 194 may include, for example, Random Access Memory (RAM), Read-Only Memory (ROM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), flash memory, volatile memory, non-volatile memory, cache memory, buffers, short-term storage units, long-term storage units, hard disk drives, floppy disk drives, Compact Disk (CD) drives, CD-ROM drives, DVD drives, or other suitable removable or non-removable storage units. Storage medium 194 may store, for example, data processed by computing device 100.
In some example embodiments, the storage medium 194 may store logic 195, and the logic 195 may include instructions, data, and/or code that, when executed by a machine, may cause the machine to perform methods, processes, and/or operations as described herein. The machine may include, for example, any suitable processing platform, computing device, processing device, computing system, processing system, computer, processor, or the like, and may be implemented using any suitable combination of hardware, software, firmware, or the like. Logic 195 may include or may be implemented as software, a software module, an application, a program, a subroutine, instructions, an instruction set, computing code, words, values, tokens, and the like. The instructions may include any suitable type of code (such as source code, compiled code, interpreted code, executable code, static code, dynamic code, and the like). The instructions may be implemented according to a predefined computer language, manner or syntax, for instructing a processor to perform a certain function. The instructions may be implemented using any suitable high-level, low-level, object-oriented, visual, compiled and/or interpreted programming language, such as C, C++, Java, BASIC, Python, MATLAB, Pascal, Visual Basic, assembly language, machine code, and the like.
In some example embodiments, computing device 100 may be configured to communicate with one or more other devices via a wireless and/or wired network. The network may include a wired network, a Local Area Network (LAN), a wireless LAN (wlan) network, a radio network, a cellular network, a wireless fidelity (WiFi) network, an IR network, a Bluetooth (BT) network, and the like.
In some example embodiments, computing device 100 may allow one or more users to interact with one or more processes, applications, and/or modules of computing device 100, e.g., as described herein.
In some example embodiments, computing device 100 may be configured to perform and/or carry out one or more operations, modules, processes, procedures, and/or the like.
In conclusion, upon reading the present detailed disclosure, those skilled in the art will appreciate that the foregoing detailed disclosure can be presented by way of example only, and not limitation. Those skilled in the art will appreciate that the present disclosure is intended to encompass reasonable variations, improvements, and modifications to the embodiments, even though not explicitly stated herein. Such alterations, improvements, and modifications are intended to be suggested by this disclosure, and are within the spirit and scope of the exemplary embodiments of this disclosure.
Claims (10)
1. An information recommendation method, comprising:
obtaining raw data related to content of interest to a user;
preprocessing the original data to generate an original tag text data set, wherein the original tag text data set comprises a plurality of target tag texts;
generating a label vector of each target label text through a label vector model based on the original label text data set;
clustering label vectors of all target label texts;
determining the similarity of each label vector in each cluster and other label vectors; and
pushing information to the user based on the similarity,
the information recommendation method further comprises the following steps: and dynamically updating the label vector model based on the newly added label text data set.
2. The information recommendation method of claim 1, wherein the raw data comprises at least one of: the retrieval history data of the user, the browsing history data of the user, the downloading history data of the user, the collection history data of the user and the chatting history data of the user.
3. The information recommendation method of claim 1, wherein the plurality of target tag texts are separated by spaces or special characters, and the number of characters of each of the plurality of target tag texts does not exceed a predetermined threshold.
4. The information recommendation method of claim 3, wherein the predetermined threshold value ranges from 20 to 50.
5. The information recommendation method of claim 1, wherein dynamically updating the tag vector model based on the newly added tag text dataset comprises:
acquiring new data related to the content in which the user is interested;
preprocessing the newly added data to generate the newly added tag text data set; and
and iteratively updating the label vector model based on the newly added label text data set.
6. An information recommendation apparatus, comprising:
a raw data acquisition unit configured to acquire raw data related to a content of interest to a user;
a preprocessing unit configured to preprocess the raw data to generate a raw tag text data set, the raw tag text data set including a plurality of target tag texts;
a tag vector generation unit configured to generate a tag vector of each target tag text through a tag vector model based on the original tag text data set;
a clustering unit configured to cluster the tag vectors of all target tag texts;
a similarity determination unit configured to determine a similarity of each label vector in each cluster to other label vectors; and
an information pushing unit configured to push information to the user based on the similarity,
wherein the information recommendation apparatus further comprises an updating unit configured to: and dynamically updating the label vector model based on the newly added label text data set.
7. The information recommendation device of claim 6, wherein the raw data comprises at least one of: the retrieval history data of the user, the browsing history data of the user, the downloading history data of the user, the collection history data of the user and the chatting history data of the user.
8. The information recommendation device of claim 6, wherein the plurality of target tag texts are separated by spaces or special characters, and a number of characters of each of the plurality of target tag texts does not exceed a predetermined threshold.
9. The information recommendation device of claim 8, wherein the predetermined threshold ranges from 20 to 50.
10. The information recommendation device of claim 6, wherein dynamically updating the tag vector model based on the newly added tag text dataset comprises:
acquiring new data related to the content in which the user is interested;
preprocessing the newly added data to generate the newly added tag text data set;
and iteratively updating the label vector model based on the newly added label text data set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011336768.8A CN112417131A (en) | 2020-11-25 | 2020-11-25 | Information recommendation method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112417131A true CN112417131A (en) | 2021-02-26 |
Family
ID=74843767
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113222697A (en) * | 2021-05-11 | 2021-08-06 | 湖北三赫智能科技有限公司 | Commodity information pushing method, commodity information pushing device, computer equipment and readable storage medium |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111125495A (en) * | 2019-12-19 | 2020-05-08 | 京东方科技集团股份有限公司 | Information recommendation method, equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11544474B2 (en) | Generation of text from structured data | |
US10762283B2 (en) | Multimedia document summarization | |
CN108319627B (en) | Keyword extraction method and keyword extraction device | |
US20180158078A1 (en) | Computer device and method for predicting market demand of commodities | |
CN111797214A (en) | FAQ database-based problem screening method and device, computer equipment and medium | |
CN108334951B (en) | Pre-statistics of data for nodes of a decision tree | |
CN111046221A (en) | Song recommendation method and device, terminal equipment and storage medium | |
CN113434636B (en) | Semantic-based approximate text searching method, semantic-based approximate text searching device, computer equipment and medium | |
CN111950279A (en) | Entity relationship processing method, device, equipment and computer readable storage medium | |
Lee et al. | Efficient image retrieval using advanced SURF and DCD on mobile platform | |
CN113688310A (en) | Content recommendation method, device, equipment and storage medium | |
CN112417133A (en) | Training method and device of ranking model | |
TW202001621A (en) | Corpus generating method and apparatus, and human-machine interaction processing method and apparatus | |
Su et al. | Hybrid recommender system based on deep learning model | |
Lee et al. | Extraction and prioritization of product attributes using an explainable neural network | |
Zhai et al. | Text classification of Chinese news based on multi-scale CNN and LSTM hybrid model | |
CN109241238B (en) | Article searching method and device and electronic equipment | |
CN111191011B (en) | Text label searching and matching method, device, equipment and storage medium | |
CN112417131A (en) | Information recommendation method and device | |
CN112417154B (en) | Method and device for determining similarity of documents | |
CN115964474A (en) | Policy keyword extraction method and device, storage medium and electronic equipment | |
CN107622129B (en) | Method and device for organizing knowledge base and computer storage medium | |
Tu et al. | A domain-independent text segmentation method for educational course content | |
CN113761213A (en) | Data query system and method based on knowledge graph and terminal equipment | |
Li et al. | Word-Graph2vec: An efficient word embedding approach on word co-occurrence graph using random walk technique |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| CB02 | Change of applicant information | Address after: Room 001a, 11/F, building 1, 588 Zixing Road, Minhang District, Shanghai, 200241; Applicant after: Shanghai chuangmi Shulian Intelligent Technology Development Co.,Ltd. Address before: Room 001a, 11/F, building 1, 588 Zixing Road, Minhang District, Shanghai, 200241; Applicant before: SHANGHAI CHUANGMI TECHNOLOGY Co.,Ltd. |