CN115017293A - Document recommendation method based on LDA topic model - Google Patents

Document recommendation method based on LDA topic model

Info

Publication number
CN115017293A
CN115017293A
Authority
CN
China
Prior art keywords
user
document
theme
topic
matrix
Prior art date
Legal status
Pending
Application number
CN202210566870.XA
Other languages
Chinese (zh)
Inventor
范昕煜
杨雨婷
王又辰
田宗凯
栾真
Current Assignee
Beijing Institute of Computer Technology and Applications
Original Assignee
Beijing Institute of Computer Technology and Applications
Priority date: 2022-05-23
Filing date: 2022-05-23
Publication date: 2022-09-06
Application filed by Beijing Institute of Computer Technology and Applications filed Critical Beijing Institute of Computer Technology and Applications
Priority to CN202210566870.XA priority Critical patent/CN115017293A/en
Publication of CN115017293A publication Critical patent/CN115017293A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a document recommendation method based on an LDA topic model, and belongs to the field of information technology. The invention uses an LDA topic model to vectorize each document, outputs the topic probability of the document, and combines the topic probabilities of all documents in the recommendation system to obtain a document-topic matrix. On the other hand, each new user is initialized with a topic probability whose dimension is consistent with that of the document topic probability, and all user topic probabilities are combined to obtain a user-topic matrix. Finally, through the two topic probability matrices of users and documents, the interest value of each user in each document is calculated and the corresponding documents are recommended to the user. The recommendation method can be widely applied to document recommendation systems and is suitable for various kinds of documents.

Description

Document recommendation method based on LDA topic model
Technical Field
The invention belongs to the field of information technology, and particularly relates to a document recommendation method based on an LDA topic model.
Background
With the rapid development of information technology and the continuous growth of information resources, information is increasing explosively. Faced with massive information resources, how to obtain the information that meets a user's needs is a major problem in the current big-data era. The task of document recommendation technology is to establish a connection between users and recommended items; however, a recommendation system often faces the cold-start problem. For new or inactive users, and for new items or items with few impressions, accurate recommendations cannot be made because the relationship between users and items cannot be established due to the lack of relevant data. Therefore, designing a method that makes full use of the features of existing users and documents to establish this connection is of great significance for solving the cold-start problem of recommendation systems.
Disclosure of Invention
Technical problem to be solved
The technical problem to be solved by the invention is as follows: how to solve the cold-start problem when text enters a recommendation system.
(II) technical scheme
In order to solve the technical problem, the invention provides a document recommendation method based on an LDA topic model, which comprises the following steps:
a first step of taking text data as input and outputting a trained topic model, and storing the computed user-topic distribution matrix, document-topic distribution matrix and user-document score matrix in memory;
and a second step of recommending documents by content based on the first step.
Preferably, the first step is specifically:
(1) training text data
Adopting LDA, taking text data as input, and outputting a trained topic model;
(2) computing document-topic distributions
When a document is imported, its topic distribution, namely the document-topic distribution, is calculated through the topic model; each row in the document-topic distribution matrix represents a document, and each column represents a topic;
(3) computing user-topic distributions
When a user logs in, the user-topic distribution is read from a database; if the user is a new user, the user is assumed to have the same degree of interest in all topics, and a topic distribution with equal values is initialized; each row in the user-topic distribution matrix represents a user, and each column represents a topic;
(4) computing user-document scores
The user-document score matrix is calculated from the user-topic distribution matrix and the topic-document distribution matrix:
user-document score matrix = user-topic distribution matrix × topic-document distribution matrix
Each row in the user-document score matrix represents the scores of one user for each document, each column represents the scores of one document for each user, and the topic-document distribution matrix is the transpose of the document-topic distribution matrix;
a user-document score is an element of the user-document score matrix; three factors, namely user interest, browsing history and document heat, are considered when calculating the user-document score; user interest reflects the user's attention to different topics, browsing history records the documents the user has browsed, and document heat reflects the popularity of a document in the recommendation system; Sigmoid(heat value) is added to each user-document score in the user-document score matrix, where Sigmoid(heat value) is the value obtained by Sigmoid-normalizing the heat value;
finally, according to the similarity between user interest and document topics, and comprehensively considering browsing history and document heat, all users and documents are traversed and the user-document scores are calculated;
(5) storing the results in a memory
The user-topic distribution matrix, the document-topic distribution matrix and the finally calculated user-document score matrix are stored in memory.
Preferably, the second step is specifically:
(1) calculating topic distribution of search content
Acquiring the search content entered by the user in the search box, feeding it as a document into the topic model trained in the first step, and calculating the corresponding topic distribution;
(2) updating a user-topic distribution matrix
when the user submits a search, it is judged that the user is interested in the topic of this search, and the user interest is updated: the calculated topic distribution of the search content is added into the user-topic distribution matrix to update it, so as to record the user behavior and recommend documents the user is more interested in at the next search;
(3) finding similar vectors by vector search tool Faiss
Introducing Faiss, reading the calculated document-topic distribution matrix from memory and putting it into Faiss as an index; after the index containing the document-topic distributions is created, the topic distribution of the user's search content is put into Faiss as a search vector to look up similar vectors, and the similar vectors returned by the Faiss search are taken as target vectors;
(4) weighted random sampling to obtain the document to be recommended
Obtaining the distance d between the search vector and each target vector, together with the index of the target vector, through Faiss; the closer the distance, the more similar the search vector is judged to be to the target vector; the distances of all target vectors are normalized by Softmax and converted into a set of values in (0,1) that sum to 1, which are taken as the recommendation weight w of each target vector:
w=Softmax(-d)
assuming that n documents are to be extracted for recommendation, the target vectors are first sorted by distance in ascending order, then the top 10n documents are taken as the candidate sample set, and finally n of the 10n documents are obtained by weighted random sampling without replacement.
Preferably, the method further comprises a third step of actively recommending documents based on the first step.
Preferably, the third step is specifically:
(1) obtaining recommended documents through a user-document matrix
When a user logs into the system, the current user id is acquired from the back end, the documents and user-document scores of the current user are extracted from the user-document score matrix stored in memory, the documents are sorted by score in descending order, and the top 10n documents are taken as candidate recommended documents;
(2) weighted random sampling
the score of each document is normalized to obtain the weight corresponding to each document, and then n of the 10n candidate recommended documents are obtained as the final recommended documents by weighted random sampling without replacement.
Preferably, when calculating the user-document score, for each document, browsing it adds 1 to its heat value, favoriting it adds 3, and downloading it adds 2.
Preferably, when calculating the user-document scores, Sigmoid(heat value) is added to each user-document score, and for documents the user has already browsed, 10% of the score is finally deducted.
The invention also provides a document recommendation system realized by the method.
The invention also provides application of the method in a document recommendation system.
The invention also provides an application of the method in information technology.
(III) advantageous effects
The invention uses an LDA topic model to vectorize each document, outputs the topic probability of the document, and combines the topic probabilities of all documents in the recommendation system to obtain a document-topic matrix. On the other hand, each new user is initialized with a topic probability whose dimension is consistent with that of the document topic probability, and all user topic probabilities are combined to obtain a user-topic matrix. Finally, through the two topic probability matrices of users and documents, the interest value of each user in each document is calculated and the corresponding documents are recommended to the user. The recommendation method can be widely applied to document recommendation systems and is suitable for various kinds of documents.
Drawings
FIG. 1 is a flow chart of model initialization according to the present invention;
FIG. 2 is a flow chart of content recommendation according to the present invention;
FIG. 3 is a flow chart of active recommendation according to the present invention.
Detailed Description
In order to make the objects, contents, and advantages of the present invention clearer, the following detailed description of the embodiments of the present invention will be made in conjunction with the accompanying drawings and examples.
By studying the characteristics of users and texts and combining them with an LDA (Latent Dirichlet Allocation) topic model, the invention constructs a relatively general text model and applies it to a document recommendation system, thereby solving the cold-start problem of text entering the recommendation system and meeting users' needs for personalized services.
The invention aims to provide a method for modeling and recommending new users and new documents, so that the recommendation system can still make relatively accurate recommendations under conditions such as sparse data and inactive users. For most documents, the textual content can be represented by topic probabilities; that is, the textual content is vectorized and its characteristics are obtained. The invention uses an LDA topic model to vectorize each document, outputs the topic probability of the document, and combines the topic probabilities of all documents in the recommendation system to obtain a document-topic matrix. On the other hand, each new user is initialized with a topic probability whose dimension is consistent with that of the document topic probability, and all user topic probabilities are combined to obtain a user-topic matrix. Finally, through the two topic probability matrices of users and documents, the interest value of each user in each document is calculated and the corresponding documents are recommended to the user. The recommendation method can be widely applied to document recommendation systems and is suitable for various kinds of documents.
In the invention, the text is first preprocessed, the text features are analyzed, and an LDA topic model is used to construct a text topic model. Then the topic probabilities of all documents and users are calculated, and a score matrix is obtained by matrix multiplication. On this basis, the invention provides a document recommendation method that fuses user topic interest and user behavior, and establishes a content-based recommendation method and an active recommendation method on top of the constructed user topic model. The specific steps are as follows:
1. model initialization
The invention recommends related documents based on information such as the topics selected by the user, browsing records and document popularity.
Referring to fig. 1, the present steps specifically include:
(1) training text data
The document recommendation function requires a topic model to be trained in advance; the specific algorithm adopts LDA, takes text data as input, and outputs the trained topic model.
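The patent does not name a specific LDA implementation; the following minimal sketch assumes the gensim library and a pre-tokenized corpus. The tiny example corpus, the variable names and the hyperparameters are purely illustrative:

```python
from gensim import corpora
from gensim.models import LdaModel

# tokenized_docs: one token list per document in the training corpus
# (word segmentation and stop-word removal are assumed to be done already).
tokenized_docs = [
    ["topic", "model", "document", "probability"],
    ["user", "document", "recommendation", "interest"],
]

dictionary = corpora.Dictionary(tokenized_docs)                   # word <-> id mapping
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]  # bag-of-words corpus

# Train the LDA topic model; num_topics and passes are assumed hyperparameters.
lda = LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=5, passes=10)
```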
(2) Computing document-topic distributions
When a document is imported, its topic distribution, namely the document-topic distribution, is obtained through the topic model; each row in the document-topic distribution matrix represents a document, and each column represents a topic.
(3) Computing user-topic distributions
When a user logs in, the user-topic distribution is read from a database. If the user is a new user, the user is assumed to have the same degree of interest in all topics, and a topic distribution with equal values is initialized. Each row in the user-topic distribution matrix represents a user and each column represents a topic.
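Continuing the training sketch above, building the document-topic matrix and initializing a new user's row with a uniform distribution might look as follows. This is an assumed illustration on top of gensim and NumPy, not the patent's own code:

```python
import numpy as np

NUM_TOPICS = 5  # must match the num_topics used when training the LDA model

def doc_topic_vector(lda, dictionary, tokens):
    """Infer a dense topic-probability vector for one document."""
    bow = dictionary.doc2bow(tokens)
    vec = np.zeros(NUM_TOPICS)
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        vec[topic_id] = prob
    return vec

# Document-topic matrix: one row per document, one column per topic.
doc_topic = np.vstack([doc_topic_vector(lda, dictionary, d) for d in tokenized_docs])

# A new user is assumed to be equally interested in every topic, so the row is a
# uniform distribution with the same dimension as the document rows.
user_topic = np.full((1, NUM_TOPICS), 1.0 / NUM_TOPICS)   # user-topic matrix, one row per user
```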
(4) Computing user-document scores
The user-document score matrix is calculated from the user-topic distribution matrix and the topic-document distribution matrix (the transpose of the document-topic distribution matrix):
user-document score matrix = user-topic distribution matrix × topic-document distribution matrix
Each row in the user-document score matrix represents the scores of one user for each document, and each column represents the scores of one document for each user. Three factors, namely user interest, browsing history and document heat, need to be considered comprehensively when calculating a user-document score. User interest reflects the user's attention to different topics, browsing history records the documents the user has browsed, and document heat reflects the popularity of a document in the system. For each document, browsing it adds 1 to its heat value, favoriting it adds 3, and downloading it adds 2. A user-document score is an element of the user-document score matrix. Sigmoid(heat value) is added to each user-document score in the matrix; Sigmoid(heat value) means that the heat value is Sigmoid-normalized to a number between 0 and 1. For documents the user has already browsed, 10% of the score is finally deducted, so as to reduce the probability of repeatedly recommending them. Finally, according to the similarity between user interest and document topics, and comprehensively considering browsing history and document heat, all users and documents are traversed and the user-document scores are calculated.
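A possible NumPy sketch of this scoring step, under the assumption that the per-document heat values and the browsed flags are maintained elsewhere; the concrete numbers are illustrative only:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Base score: similarity between user interests and document topics, obtained by
# multiplying the user-topic matrix with the transposed document-topic matrix.
scores = user_topic @ doc_topic.T                  # shape: (num_users, num_docs)

# heat: per-document heat value accumulated from browsing (+1), favoriting (+3)
# and downloading (+2); the values below are placeholders.
heat = np.array([4.0, 1.0])
scores = scores + sigmoid(heat)                    # add the Sigmoid-normalized heat to every row

# browsed[u, d] is True if user u has already viewed document d;
# deduct 10% of the score to reduce repeated recommendations.
browsed = np.zeros_like(scores, dtype=bool)
scores = np.where(browsed, 0.9 * scores, scores)
```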
(5) Storing the results in a memory
The user-topic distribution matrix, the document-topic distribution matrix and the finally calculated user-document score matrix are stored in memory.
2. Content-based recommendation
Referring to fig. 2, the present step specifically includes:
(1) calculating topic distribution of search content
The search content entered by the user in the search box is acquired, fed as a document into the topic model trained in step 1, and the corresponding topic distribution is calculated.
(2) Updating a user-topic distribution matrix
When the user submits a search, it is determined that the user is interested in the topic of this search, and the user interest needs to be updated. The recommendation system adds the calculated topic distribution of the search content into the user-topic distribution matrix to update it, so as to record the user behavior and recommend documents the user is more interested in at the next search.
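The patent does not specify exactly how the query's topic distribution is "added into" the user-topic matrix; one simple, assumed update rule is a convex blend, as in the sketch below (the blending weight alpha is an assumption, not part of the disclosure):

```python
def update_user_interest(user_topic, user_idx, query_topics, alpha=0.3):
    """Blend the topic distribution of the current search into the user's row.

    The convex blend and the weight alpha are illustrative assumptions; the patent
    only states that the query's topic distribution updates the user-topic matrix.
    """
    row = (1.0 - alpha) * user_topic[user_idx] + alpha * query_topics
    user_topic[user_idx] = row / row.sum()      # keep the row a probability distribution
    return user_topic
```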
(3) Finding similar vectors by vector search tool Faiss
Faiss is introduced, and the calculated document-topic distribution matrix is read from memory and put into Faiss as an index. After the index containing the document-topic distributions is created, the topic distribution of the user's search content can be used as a search vector and put into Faiss to search for similar vectors.
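A minimal sketch of this indexing and search step using the Faiss Python bindings; the index type, the placeholder matrices and the candidate count 10n are illustrative choices rather than requirements of the patent:

```python
import faiss
import numpy as np

# Placeholders: in the real system doc_topic comes from step 1 and query_topics is
# the topic distribution of the search content computed in step (1) of this section.
doc_topic = np.random.dirichlet(np.ones(5), size=100).astype("float32")
query_topics = np.random.dirichlet(np.ones(5)).astype("float32")

index = faiss.IndexFlatL2(doc_topic.shape[1])      # flat L2 index over the topic vectors
index.add(doc_topic)                               # document-topic matrix read from memory

n = 5                                              # number of documents to recommend
k = 10 * n                                         # retrieve 10n candidate target vectors
distances, indices = index.search(query_topics.reshape(1, -1), k)
```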
(4) Weighted random sampling to obtain the document to be recommended
The distance d between the search vector and each target vector (i.e. a similar vector found by the Faiss search), together with the index of the target vector, can be obtained from Faiss. The closer the distance, the more similar the search vector is judged to be to the target vector. The distances of all target vectors are normalized by Softmax and converted into a set of values in (0,1) that sum to 1, which are treated as the recommendation weight w of each target vector:
w=Softmax(-d)
Assuming that n documents are to be extracted for recommendation, the target vectors first need to be sorted by distance in ascending order, then the top 10n documents are taken as the candidate sample set, and finally n of the 10n documents are obtained by weighted random sampling without replacement.
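An illustrative NumPy sketch of the Softmax weighting and the weighted random sampling without replacement, reusing the distances and indices returned by the Faiss search in the previous sketch:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

n = 5
d = distances[0]              # distances of the 10n candidates; Faiss returns them sorted ascending
candidate_ids = indices[0]    # indices of the corresponding target vectors

w = softmax(-d)               # w = Softmax(-d): closer vectors get larger weights, and w sums to 1

# Weighted random sampling without replacement: draw n of the 10n candidates.
rng = np.random.default_rng()
recommended = rng.choice(candidate_ids, size=n, replace=False, p=w)
```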
3. Active recommendation
Referring to fig. 3, the present step specifically includes:
(1) obtaining recommended documents through a user-document matrix
When a user logs into the system, the current user id is obtained from the back end, the documents and score information of the current user are extracted from the user-document score matrix stored in memory, the documents are then sorted by score in descending order, and the top 10n documents are taken as candidate recommended documents.
(2) Weighted random sampling
The score of each document is normalized to obtain its corresponding weight, and then n of the 10n candidate recommended documents are obtained as the final recommended documents by weighted random sampling without replacement.
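A compact sketch of the whole active-recommendation step, assuming the user-document score matrix from step 1 is available in memory; function and variable names are illustrative:

```python
import numpy as np

def active_recommend(scores, user_idx, n=5, rng=None):
    """Sketch of the active-recommendation step for a logged-in user."""
    rng = rng or np.random.default_rng()
    user_scores = scores[user_idx]

    # Candidate set: the 10n documents with the highest user-document scores.
    candidates = np.argsort(user_scores)[::-1][: 10 * n]
    cand_scores = user_scores[candidates]

    # Normalize the candidate scores into weights and draw n documents
    # by weighted random sampling without replacement.
    weights = cand_scores / cand_scores.sum()
    return rng.choice(candidates, size=min(n, len(candidates)), replace=False, p=weights)

# Example usage with a small random score matrix (placeholder for the matrix from step 1).
scores = np.random.rand(3, 200)
print(active_recommend(scores, user_idx=0, n=5))
```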
The above description is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make several modifications and variations without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A document recommendation method based on an LDA topic model is characterized by comprising the following steps:
a first step of taking text data as input and outputting a trained topic model, and storing the computed user-topic distribution matrix, document-topic distribution matrix and user-document score matrix in memory;
and a second step of recommending documents by content based on the first step.
2. The method according to claim 1, characterized in that the first step is in particular:
(1) training text data
Adopting LDA, taking text data as input, and outputting a trained topic model;
(2) computing document-topic distributions
When a document is imported, its topic distribution, namely the document-topic distribution, is calculated through the topic model; each row in the document-topic distribution matrix represents a document, and each column represents a topic;
(3) computing user-topic distributions
When a user logs in, the user-topic distribution is read from a database; if the user is a new user, the user is assumed to have the same degree of interest in all topics, and a topic distribution with equal values is initialized; each row in the user-topic distribution matrix represents a user, and each column represents a topic;
(4) computing user-document scores
The user-document score matrix is calculated from the user-topic distribution matrix and the topic-document distribution matrix:
user-document score matrix = user-topic distribution matrix × topic-document distribution matrix
Each row in the user-document score matrix represents the scores of one user for each document, each column represents the scores of one document for each user, and the topic-document distribution matrix is the transpose of the document-topic distribution matrix;
a user-document score is an element of the user-document score matrix; three factors, namely user interest, browsing history and document heat, are considered when calculating the user-document score; user interest reflects the user's attention to different topics, browsing history records the documents the user has browsed, and document heat reflects the popularity of a document in the recommendation system; Sigmoid(heat value) is added to each user-document score in the user-document score matrix, where Sigmoid(heat value) is the value obtained by Sigmoid-normalizing the heat value;
finally, according to the similarity between user interest and document topics, and comprehensively considering browsing history and document heat, all users and documents are traversed and the user-document scores are calculated;
(5) storing the results in a memory
The user-topic distribution matrix, the document-topic distribution matrix and the finally calculated user-document score matrix are stored in memory.
3. The method according to claim 2, characterized in that the second step is embodied as:
(1) calculating topic distribution of search content
Acquiring the search content entered by the user in the search box, feeding it as a document into the topic model trained in the first step, and calculating the corresponding topic distribution;
(2) updating a user-topic distribution matrix
when the user submits a search, it is judged that the user is interested in the topic of this search, and the user interest is updated: the calculated topic distribution of the search content is added into the user-topic distribution matrix to update it, so as to record the user behavior and recommend documents the user is more interested in at the next search;
(3) finding similar vectors by vector search tool Faiss
Introducing Faiss, reading the calculated document-topic distribution matrix from memory and putting it into Faiss as an index; after the index containing the document-topic distributions is created, the topic distribution of the user's search content is put into Faiss as a search vector to look up similar vectors, and the similar vectors returned by the Faiss search are taken as target vectors;
(4) weighted random sampling to obtain the document to be recommended
Obtaining the distance d between the search vector and each target vector, together with the index of the target vector, through Faiss; the closer the distance, the more similar the search vector is judged to be to the target vector; the distances of all target vectors are normalized by Softmax and converted into a set of values in (0,1) that sum to 1, which are taken as the recommendation weight w of each target vector:
w=Softmax(-d)
assuming that n documents are to be extracted for recommendation, the target vectors are ranked by distance in ascending order, then the top 10n documents are taken as the candidate sample set, and finally n of the 10n documents are obtained by weighted random sampling without replacement.
4. The method of claim 2, further comprising a third step of proactively recommending documents based on the first step.
5. The method according to claim 4, characterized in that the third step is in particular:
(1) obtaining recommended documents through a user-document matrix
When a user logs into the system, the current user id is acquired from the back end, the documents and user-document scores of the current user are extracted from the user-document score matrix stored in memory, the documents are sorted by score in descending order, and the top 10n documents are taken as candidate recommended documents;
(2) weighted random sampling
the score of each document is normalized to obtain its corresponding weight, and then n of the 10n candidate recommended documents are obtained as the final recommended documents by weighted random sampling without replacement.
6. The method according to claim 2, wherein, when calculating the user-document score, for each document, browsing it adds 1 to its heat value, favoriting it adds 3, and downloading it adds 2.
7. The method according to claim 2, wherein, when calculating the user-document scores, Sigmoid(heat value) is added to each user-document score, and for documents the user has already browsed, 10% of the score is finally deducted.
8. A document recommendation system implemented using the method of any of claims 1 to 7.
9. Use of the method of any one of claims 1 to 7 in a document recommendation system.
10. Use of the method according to any one of claims 1 to 7 in information technology.
CN202210566870.XA 2022-05-23 2022-05-23 Document recommendation method based on LDA topic model Pending CN115017293A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210566870.XA CN115017293A (en) 2022-05-23 2022-05-23 Document recommendation method based on LDA topic model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210566870.XA CN115017293A (en) 2022-05-23 2022-05-23 Document recommendation method based on LDA topic model

Publications (1)

Publication Number Publication Date
CN115017293A (en) 2022-09-06

Family

ID=83069438

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210566870.XA Pending CN115017293A (en) 2022-05-23 2022-05-23 Document recommendation method based on LDA topic model

Country Status (1)

Country Link
CN (1) CN115017293A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination