CN115017293A - Document recommendation method based on LDA topic model - Google Patents

Document recommendation method based on LDA topic model

Info

Publication number
CN115017293A
CN115017293A
Authority
CN
China
Prior art keywords
user
document
theme
topic
matrix
Prior art date
Legal status
Pending
Application number
CN202210566870.XA
Other languages
Chinese (zh)
Inventor
范昕煜
杨雨婷
王又辰
田宗凯
栾真
Current Assignee
Beijing Institute of Computer Technology and Applications
Original Assignee
Beijing Institute of Computer Technology and Applications
Priority date: 2022-05-23
Filing date: 2022-05-23
Publication date: 2022-09-06
Application filed by Beijing Institute of Computer Technology and Applications filed Critical Beijing Institute of Computer Technology and Applications
Priority to CN202210566870.XA priority Critical patent/CN115017293A/en
Publication of CN115017293A publication Critical patent/CN115017293A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a document recommendation method based on an LDA topic model, and belongs to the field of information technology. The invention uses an LDA topic model to vectorize each document, outputs the topic probability of the document, and combines the topic probabilities of all documents in the recommendation system to obtain a document-topic matrix. On the other hand, each new user is initialized with a topic probability whose dimension is consistent with that of the document topic probability, and all user topic probabilities are combined to obtain a user-topic matrix. Finally, through the two topic probability matrices of users and documents, the interest value of each user in each document is calculated and the corresponding documents are recommended to the user. The recommendation method can be widely applied to document recommendation systems and is suitable for various kinds of documents.

Description

Document recommendation method based on LDA topic model
Technical Field
The invention belongs to the field of information technology, and particularly relates to a document recommendation method based on an LDA topic model.
Background
With the rapid development of information technology and the continuous growth of information resources, information is increasing explosively. Faced with massive information resources, how to obtain the information that meets a user's needs is a major problem in the current big-data era. The task of document recommendation technology is to establish a connection between users and recommended items; however, a recommendation system often faces the cold-start problem. For new or inactive users, and for new items or items with few impressions, accurate recommendations cannot be made because the relationship between users and items cannot be established due to the lack of relevant data. Therefore, designing a method that makes full use of the features of existing users and documents to establish this connection is of great significance for solving the cold-start problem of recommendation systems.
Disclosure of Invention
Technical problem to be solved
The technical problem to be solved by the invention is as follows: how to solve the cold-start problem when text enters a recommendation system.
(II) technical scheme
In order to solve the technical problem, the invention provides a document recommendation method based on an LDA topic model, which comprises the following steps:
a first step of taking text data as input and outputting a trained topic model, and storing the computed user-topic distribution matrix, document-topic distribution matrix and user-document score matrix in memory;
and a second step of recommending documents by content based on the first step.
Preferably, the first step is specifically:
(1) training text data
Adopting LDA, taking text data as input, and outputting a trained topic model;
(2) computing document-topic distributions
When a document is imported, its topic distribution, namely the document-topic distribution, is calculated through the topic model; each row in the document-topic distribution matrix represents a document, and each column represents a topic;
(3) computing user-topic distributions
When a user logs in, the user-topic distribution is read from a database; if the user is a new user, the user is assumed to have the same degree of interest in all topics, and a topic distribution with equal values is initialized; each row in the user-topic distribution matrix represents a user, and each column represents a topic;
(4) computing user-document scores
The user-document score matrix is calculated from the user-topic distribution matrix and the topic-document distribution matrix:
user-document score matrix = user-topic distribution matrix × topic-document distribution matrix
Each row in the user-document score matrix represents the scores of one user for each document, each column represents the scores of one document for each user, and the topic-document distribution matrix is the transpose of the document-topic distribution matrix;
a user-document score is an element of the user-document score matrix; three factors, namely user interest, browsing history and document heat, are considered when calculating the user-document score; user interest reflects the user's attention to different topics, browsing history records the documents the user has browsed, and document heat reflects the popularity of a document in the recommendation system; Sigmoid(heat value) is added to each user-document score in the user-document score matrix, where Sigmoid(heat value) is the value obtained by Sigmoid-normalizing the heat value;
finally, according to the similarity between user interest and document topics, and comprehensively considering browsing history and document heat, all users and documents are traversed and the user-document scores are calculated;
(5) storing the results in a memory
The user-topic distribution matrix, the document-topic distribution matrix and the finally calculated user-document score matrix are stored in memory.
Preferably, the second step is specifically:
(1) calculating topic distribution of search content
Acquiring the search content entered by the user in the search box, feeding it as a document into the topic model trained in the first step, and calculating the corresponding topic distribution;
(2) updating a user-topic distribution matrix
when the user submits a search, it is judged that the user is interested in the topic of this search, and the user interest is updated: the calculated topic distribution of the search content is added into the user-topic distribution matrix to update it, so as to record the user behavior and recommend documents the user is more interested in at the next search;
(3) finding similar vectors by vector search tool Faiss
Introducing Faiss, reading the calculated document-topic distribution matrix from memory and putting it into Faiss as an index; after the index containing the document-topic distributions is created, the topic distribution of the user's search content is put into Faiss as a search vector to look up similar vectors, and the similar vectors returned by the Faiss search are taken as target vectors;
(4) weighted random sampling to obtain the document to be recommended
Obtaining the distance d between the search vector and each target vector, together with the index of the target vector, through Faiss; the closer the distance, the more similar the search vector is judged to be to the target vector; the distances of all target vectors are normalized by Softmax and converted into a set of values in (0,1) that sum to 1, which are taken as the recommendation weight w of each target vector:
w=Softmax(-d)
assuming that n documents are to be extracted for recommendation, the target vectors are first sorted by distance in ascending order, then the top 10n documents are taken as the candidate sample set, and finally n of the 10n documents are obtained by weighted random sampling without replacement.
Preferably, the method further comprises a third step of actively recommending documents based on the first step.
Preferably, the third step is specifically:
(1) obtaining recommended documents through a user-document matrix
When a user logs into the system, the current user id is acquired from the back end, the documents and user-document scores of the current user are extracted from the user-document score matrix stored in memory, the documents are sorted by score in descending order, and the top 10n documents are taken as candidate recommended documents;
(2) weighted random sampling
the score of each document is normalized to obtain the weight corresponding to each document, and then n of the 10n candidate recommended documents are obtained as the final recommended documents by weighted random sampling without replacement.
Preferably, when calculating the user-document score, for each document, browsing it adds 1 to its heat value, favoriting it adds 3, and downloading it adds 2.
Preferably, when calculating the user-document scores, Sigmoid(heat value) is added to each user-document score, and for documents the user has already browsed, 10% of the score is finally deducted.
The invention also provides a document recommendation system realized by the method.
The invention also provides application of the method in a document recommendation system.
The invention also provides an application of the method in information technology.
(III) advantageous effects
The invention uses an LDA topic model to vectorize each document, outputs the topic probability of the document, and combines the topic probabilities of all documents in the recommendation system to obtain a document-topic matrix. On the other hand, each new user is initialized with a topic probability whose dimension is consistent with that of the document topic probability, and all user topic probabilities are combined to obtain a user-topic matrix. Finally, through the two topic probability matrices of users and documents, the interest value of each user in each document is calculated and the corresponding documents are recommended to the user. The recommendation method can be widely applied to document recommendation systems and is suitable for various kinds of documents.
Drawings
FIG. 1 is a flow chart of model initialization according to the present invention;
FIG. 2 is a flow chart of content recommendation according to the present invention;
FIG. 3 is a flow chart of active recommendation according to the present invention.
Detailed Description
In order to make the objects, contents, and advantages of the present invention clearer, the following detailed description of the embodiments of the present invention will be made in conjunction with the accompanying drawings and examples.
By studying the characteristics of users and texts and combining them with an LDA (Latent Dirichlet Allocation) topic model, the invention constructs a relatively general text model and applies it to a document recommendation system, thereby solving the cold-start problem of text entering the recommendation system and meeting users' needs for personalized services.
The invention aims to provide a method for modeling and recommending new users and new documents, so that the recommendation system can still make relatively accurate recommendations under conditions such as sparse data and inactive users. For most documents, the textual content can be represented by topic probabilities; that is, the textual content is vectorized and its characteristics are obtained. The invention uses an LDA topic model to vectorize each document, outputs the topic probability of the document, and combines the topic probabilities of all documents in the recommendation system to obtain a document-topic matrix. On the other hand, each new user is initialized with a topic probability whose dimension is consistent with that of the document topic probability, and all user topic probabilities are combined to obtain a user-topic matrix. Finally, through the two topic probability matrices of users and documents, the interest value of each user in each document is calculated and the corresponding documents are recommended to the user. The recommendation method can be widely applied to document recommendation systems and is suitable for various kinds of documents.
In the invention, the text is first preprocessed, the text features are analyzed, and an LDA topic model is used to construct a text topic model. Then the topic probabilities of all documents and users are calculated, and a score matrix is obtained by matrix multiplication. On this basis, the invention provides a document recommendation method that fuses user topic interest and user behavior, and establishes a content-based recommendation method and an active recommendation method on top of the constructed user topic model. The specific steps are as follows:
1. model initialization
The invention recommends related documents based on information such as the topics selected by the user, browsing records and document popularity.
Referring to fig. 1, the present steps specifically include:
(1) training text data
The document recommendation function requires a topic model to be trained in advance; the specific algorithm adopts LDA, takes text data as input, and outputs the trained topic model.
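The patent does not name a specific LDA implementation; the following minimal sketch assumes the gensim library and a pre-tokenized corpus. The tiny example corpus, the variable names and the hyperparameters are purely illustrative:

```python
from gensim import corpora
from gensim.models import LdaModel

# tokenized_docs: one token list per document in the training corpus
# (word segmentation and stop-word removal are assumed to be done already).
tokenized_docs = [
    ["topic", "model", "document", "probability"],
    ["user", "document", "recommendation", "interest"],
]

dictionary = corpora.Dictionary(tokenized_docs)                   # word <-> id mapping
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]  # bag-of-words corpus

# Train the LDA topic model; num_topics and passes are assumed hyperparameters.
lda = LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=5, passes=10)
```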
(2) Computing document-topic distributions
When a document is imported, its topic distribution, namely the document-topic distribution, is obtained through the topic model; each row in the document-topic distribution matrix represents a document, and each column represents a topic.
(3) Computing user-topic distributions
When a user logs in, the user-topic distribution is read from a database. If the user is a new user, the user is assumed to have the same degree of interest in all topics, and a topic distribution with equal values is initialized. Each row in the user-topic distribution matrix represents a user and each column represents a topic.
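Continuing the training sketch above, building the document-topic matrix and initializing a new user's row with a uniform distribution might look as follows. This is an assumed illustration on top of gensim and NumPy, not the patent's own code:

```python
import numpy as np

NUM_TOPICS = 5  # must match the num_topics used when training the LDA model

def doc_topic_vector(lda, dictionary, tokens):
    """Infer a dense topic-probability vector for one document."""
    bow = dictionary.doc2bow(tokens)
    vec = np.zeros(NUM_TOPICS)
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        vec[topic_id] = prob
    return vec

# Document-topic matrix: one row per document, one column per topic.
doc_topic = np.vstack([doc_topic_vector(lda, dictionary, d) for d in tokenized_docs])

# A new user is assumed to be equally interested in every topic, so the row is a
# uniform distribution with the same dimension as the document rows.
user_topic = np.full((1, NUM_TOPICS), 1.0 / NUM_TOPICS)   # user-topic matrix, one row per user
```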
(4) Computing user-document scores
The user-document score matrix is calculated from the user-topic distribution matrix and the topic-document distribution matrix (the transpose of the document-topic distribution matrix):
user-document score matrix = user-topic distribution matrix × topic-document distribution matrix
Each row in the user-document score matrix represents the scores of one user for each document, and each column represents the scores of one document for each user. Three factors, namely user interest, browsing history and document heat, need to be considered comprehensively when calculating a user-document score. User interest reflects the user's attention to different topics, browsing history records the documents the user has browsed, and document heat reflects the popularity of a document in the system. For each document, browsing it adds 1 to its heat value, favoriting it adds 3, and downloading it adds 2. A user-document score is an element of the user-document score matrix. Sigmoid(heat value) is added to each user-document score in the matrix; Sigmoid(heat value) means that the heat value is Sigmoid-normalized to a number between 0 and 1. For documents the user has already browsed, 10% of the score is finally deducted, so as to reduce the probability of repeatedly recommending them. Finally, according to the similarity between user interest and document topics, and comprehensively considering browsing history and document heat, all users and documents are traversed and the user-document scores are calculated.
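A possible NumPy sketch of this scoring step, under the assumption that the per-document heat values and the browsed flags are maintained elsewhere; the concrete numbers are illustrative only:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Base score: similarity between user interests and document topics, obtained by
# multiplying the user-topic matrix with the transposed document-topic matrix.
scores = user_topic @ doc_topic.T                  # shape: (num_users, num_docs)

# heat: per-document heat value accumulated from browsing (+1), favoriting (+3)
# and downloading (+2); the values below are placeholders.
heat = np.array([4.0, 1.0])
scores = scores + sigmoid(heat)                    # add the Sigmoid-normalized heat to every row

# browsed[u, d] is True if user u has already viewed document d;
# deduct 10% of the score to reduce repeated recommendations.
browsed = np.zeros_like(scores, dtype=bool)
scores = np.where(browsed, 0.9 * scores, scores)
```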
(5) Storing the results in a memory
The user-topic distribution matrix, the document-topic distribution matrix and the finally calculated user-document score matrix are stored in memory.
2. Content-based recommendation
Referring to fig. 2, the present step specifically includes:
(1) calculating topic distribution of search content
The search content entered by the user in the search box is acquired, fed as a document into the topic model trained in step 1, and the corresponding topic distribution is calculated.
(2) Updating a user-topic distribution matrix
When the user submits a search, it is determined that the user is interested in the topic of this search, and the user interest needs to be updated. The recommendation system adds the calculated topic distribution of the search content into the user-topic distribution matrix to update it, so as to record the user behavior and recommend documents the user is more interested in at the next search.
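The patent does not specify exactly how the query's topic distribution is "added into" the user-topic matrix; one simple, assumed update rule is a convex blend, as in the sketch below (the blending weight alpha is an assumption, not part of the disclosure):

```python
def update_user_interest(user_topic, user_idx, query_topics, alpha=0.3):
    """Blend the topic distribution of the current search into the user's row.

    The convex blend and the weight alpha are illustrative assumptions; the patent
    only states that the query's topic distribution updates the user-topic matrix.
    """
    row = (1.0 - alpha) * user_topic[user_idx] + alpha * query_topics
    user_topic[user_idx] = row / row.sum()      # keep the row a probability distribution
    return user_topic
```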
(3) Finding similar vectors by vector search tool Faiss
Faiss is introduced, and the calculated document-topic distribution matrix is read from memory and put into Faiss as an index. After the index containing the document-topic distributions is created, the topic distribution of the user's search content can be used as a search vector and put into Faiss to search for similar vectors.
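A minimal sketch of this indexing and search step using the Faiss Python bindings; the index type, the placeholder matrices and the candidate count 10n are illustrative choices rather than requirements of the patent:

```python
import faiss
import numpy as np

# Placeholders: in the real system doc_topic comes from step 1 and query_topics is
# the topic distribution of the search content computed in step (1) of this section.
doc_topic = np.random.dirichlet(np.ones(5), size=100).astype("float32")
query_topics = np.random.dirichlet(np.ones(5)).astype("float32")

index = faiss.IndexFlatL2(doc_topic.shape[1])      # flat L2 index over the topic vectors
index.add(doc_topic)                               # document-topic matrix read from memory

n = 5                                              # number of documents to recommend
k = 10 * n                                         # retrieve 10n candidate target vectors
distances, indices = index.search(query_topics.reshape(1, -1), k)
```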
(4) Weighted random sampling to obtain the document to be recommended
The distance d between the search vector and each target vector (i.e. a similar vector found by the Faiss search), together with the index of the target vector, can be obtained from Faiss. The closer the distance, the more similar the search vector is judged to be to the target vector. The distances of all target vectors are normalized by Softmax and converted into a set of values in (0,1) that sum to 1, which are treated as the recommendation weight w of each target vector:
w=Softmax(-d)
Assuming that n documents are to be extracted for recommendation, the target vectors first need to be sorted by distance in ascending order, then the top 10n documents are taken as the candidate sample set, and finally n of the 10n documents are obtained by weighted random sampling without replacement.
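An illustrative NumPy sketch of the Softmax weighting and the weighted random sampling without replacement, reusing the distances and indices returned by the Faiss search in the previous sketch:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

n = 5
d = distances[0]              # distances of the 10n candidates; Faiss returns them sorted ascending
candidate_ids = indices[0]    # indices of the corresponding target vectors

w = softmax(-d)               # w = Softmax(-d): closer vectors get larger weights, and w sums to 1

# Weighted random sampling without replacement: draw n of the 10n candidates.
rng = np.random.default_rng()
recommended = rng.choice(candidate_ids, size=n, replace=False, p=w)
```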
3. Active recommendation
Referring to fig. 3, the present step specifically includes:
(1) obtaining recommended documents through a user-document matrix
When a user logs into the system, the current user id is obtained from the back end, the documents and score information of the current user are extracted from the user-document score matrix stored in memory, the documents are then sorted by score in descending order, and the top 10n documents are taken as candidate recommended documents.
(2) Weighted random sampling
The score of each document is normalized to obtain its corresponding weight, and then n of the 10n candidate recommended documents are obtained as the final recommended documents by weighted random sampling without replacement.
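A compact sketch of the whole active-recommendation step, assuming the user-document score matrix from step 1 is available in memory; function and variable names are illustrative:

```python
import numpy as np

def active_recommend(scores, user_idx, n=5, rng=None):
    """Sketch of the active-recommendation step for a logged-in user."""
    rng = rng or np.random.default_rng()
    user_scores = scores[user_idx]

    # Candidate set: the 10n documents with the highest user-document scores.
    candidates = np.argsort(user_scores)[::-1][: 10 * n]
    cand_scores = user_scores[candidates]

    # Normalize the candidate scores into weights and draw n documents
    # by weighted random sampling without replacement.
    weights = cand_scores / cand_scores.sum()
    return rng.choice(candidates, size=min(n, len(candidates)), replace=False, p=weights)

# Example usage with a small random score matrix (placeholder for the matrix from step 1).
scores = np.random.rand(3, 200)
print(active_recommend(scores, user_idx=0, n=5))
```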
The above description is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make several modifications and variations without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A document recommendation method based on an LDA topic model is characterized by comprising the following steps:
a first step of taking text data as input and outputting a trained topic model, and storing the computed user-topic distribution matrix, document-topic distribution matrix and user-document score matrix in memory;
and a second step of recommending documents by content based on the first step.
2. The method according to claim 1, characterized in that the first step is in particular:
(1) training text data
Adopting LDA, taking text data as input, and outputting a trained topic model;
(2) computing document-topic distributions
When a document is imported, its topic distribution, namely the document-topic distribution, is calculated through the topic model; each row in the document-topic distribution matrix represents a document, and each column represents a topic;
(3) computing user-topic distributions
When a user logs in, the user-topic distribution is read from a database; if the user is a new user, the user is assumed to have the same degree of interest in all topics, and a topic distribution with equal values is initialized; each row in the user-topic distribution matrix represents a user, and each column represents a topic;
(4) computing user-document scores
The user-document score matrix is calculated from the user-topic distribution matrix and the topic-document distribution matrix:
user-document score matrix = user-topic distribution matrix × topic-document distribution matrix
Each row in the user-document score matrix represents the scores of one user for each document, each column represents the scores of one document for each user, and the topic-document distribution matrix is the transpose of the document-topic distribution matrix;
a user-document score is an element of the user-document score matrix; three factors, namely user interest, browsing history and document heat, are considered when calculating the user-document score; user interest reflects the user's attention to different topics, browsing history records the documents the user has browsed, and document heat reflects the popularity of a document in the recommendation system; Sigmoid(heat value) is added to each user-document score in the user-document score matrix, where Sigmoid(heat value) is the value obtained by Sigmoid-normalizing the heat value;
finally, according to the similarity between user interest and document topics, and comprehensively considering browsing history and document heat, all users and documents are traversed and the user-document scores are calculated;
(5) storing the results in a memory
The user-topic distribution matrix, the document-topic distribution matrix and the finally calculated user-document score matrix are stored in memory.
3. The method according to claim 2, characterized in that the second step is embodied as:
(1) calculating topic distribution of search content
Acquiring the search content entered by the user in the search box, feeding it as a document into the topic model trained in the first step, and calculating the corresponding topic distribution;
(2) updating a user-topic distribution matrix
when the user submits a search, it is judged that the user is interested in the topic of this search, and the user interest is updated: the calculated topic distribution of the search content is added into the user-topic distribution matrix to update it, so as to record the user behavior and recommend documents the user is more interested in at the next search;
(3) finding similar vectors by vector search tool Faiss
Introducing Faiss, reading the calculated document-topic distribution matrix from memory and putting it into Faiss as an index; after the index containing the document-topic distributions is created, the topic distribution of the user's search content is put into Faiss as a search vector to look up similar vectors, and the similar vectors returned by the Faiss search are taken as target vectors;
(4) weighted random sampling to obtain the document to be recommended
Obtaining the distance d between the search vector and each target vector, together with the index of the target vector, through Faiss; the closer the distance, the more similar the search vector is judged to be to the target vector; the distances of all target vectors are normalized by Softmax and converted into a set of values in (0,1) that sum to 1, which are taken as the recommendation weight w of each target vector:
w=Softmax(-d)
assuming that n documents are to be extracted for recommendation, the target vectors are ranked by distance in ascending order, then the top 10n documents are taken as the candidate sample set, and finally n of the 10n documents are obtained by weighted random sampling without replacement.
4. The method of claim 2, further comprising a third step of proactively recommending documents based on the first step.
5. The method according to claim 4, characterized in that the third step is in particular:
(1) obtaining recommended documents through a user-document matrix
When a user logs into the system, the current user id is acquired from the back end, the documents and user-document scores of the current user are extracted from the user-document score matrix stored in memory, the documents are sorted by score in descending order, and the top 10n documents are taken as candidate recommended documents;
(2) weighted random sampling
the score of each document is normalized to obtain its corresponding weight, and then n of the 10n candidate recommended documents are obtained as the final recommended documents by weighted random sampling without replacement.
6. The method according to claim 2, wherein, when calculating the user-document score, for each document, browsing it adds 1 to its heat value, favoriting it adds 3, and downloading it adds 2.
7. The method according to claim 2, wherein, when calculating the user-document scores, Sigmoid(heat value) is added to each user-document score, and for documents the user has already browsed, 10% of the score is finally deducted.
8. A document recommendation system implemented using the method of any of claims 1 to 7.
9. Use of the method of any one of claims 1 to 7 in a document recommendation system.
10. Use of the method according to any one of claims 1 to 7 in information technology.
CN202210566870.XA 2022-05-23 2022-05-23 Document recommendation method based on LDA topic model Pending CN115017293A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210566870.XA CN115017293A (en) 2022-05-23 2022-05-23 Document recommendation method based on LDA topic model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210566870.XA CN115017293A (en) 2022-05-23 2022-05-23 Document recommendation method based on LDA topic model

Publications (1)

Publication Number Publication Date
CN115017293A (en) 2022-09-06

Family

ID=83069438

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210566870.XA Pending CN115017293A (en) 2022-05-23 2022-05-23 Document recommendation method based on LDA topic model

Country Status (1)

Country Link
CN (1) CN115017293A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination