CN109508421B - Word vector-based document recommendation method - Google Patents
- Publication number: CN109508421B
- Application number: CN201811415870.XA
- Authority: CN (China)
- Prior art keywords: document, user, reading, users, documents
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a word vector-based document recommendation method. A neural network language model extracts document feature vectors from users' document reading sequences; each user's global interest and reading context interest are derived from the user's complete reading sequence and most recent reading sub-sequence, respectively; and both the global interest and the current reading context interest are taken into account at recommendation time, so that the recommended documents meet the user's actual needs and preferences.
Description
Technical Field
The invention relates to a document recommendation method based on word vectors.
Background
Early document recommendation mainly used content-based algorithms: the low-level features of documents were analyzed through text labels, and documents with similar content were placed in a recommended reading list. For example, a document recommendation model based on MFCC and GMM was proposed to extract the text-label features of documents. However, extracting label data is very time-consuming, and documents are now updated rapidly, with tens of thousands of new documents released every day, so recommendation based on text-label features has gradually been phased out.
Since the Tapestry system adopted collaborative filtering to address information overload, collaborative filtering has quickly been applied to recommendation in other fields. A well-known foreign literature platform uses collaborative filtering: user behavior records are stored on a server, several "nearest neighbors" with similar interests and preferences are found from those records, and documents that the nearest neighbors like but the target user has not yet browsed are recommended to the target user. In recent research in China, Wangjun et al. proposed the concept of a hierarchical document recommendation system. On the one hand, collaborative-filtering recommendation is performed using the similarity of document preferences among users; on the other hand, document content similarity covers multiple dimensions such as theme, emotion, calligraphic method, and wording. The two aspects are combined so that each contributes its strengths, improving recommendation satisfaction.
Unlike recommendation in other fields, users read documents either out of personal interest or to support work and study, which makes document recommendation differ from mainstream e-commerce recommendation. In e-commerce and movie recommendation it is relatively easy to collect explicit ratings, because users have paid a higher financial or time cost (for example, buying an item for a fee or spending two hours watching a movie) and are therefore more willing to give the system active feedback. Documents cost little to consume, so users rarely take the trouble to rate them, and the system can only record user context information (such as the user's reading behavior and registration information).
Disclosure of Invention
The purpose of the invention is as follows: to address the shortcomings of traditional document recommendation systems, the invention provides a method that adds the context of the user's reading list, based on word vectors, to the document recommendation algorithm.
The technical scheme is as follows: the invention discloses a word vector-based document recommendation method, which comprises the following steps:
step 1, extracting features of the documents read by users and of their reading contexts based on a neural network language model;
step 2, calculating the user's global interest vector and reading context interest vector based on the features of the user's document sequences;
step 3, establishing a mathematical model to calculate the user similarity and the document interest index, thereby recommending documents for the user to read.
Step 1 comprises:
step 1-1, acquiring a complete document reading sequence of a user, wherein each record in the reading sequence comprises a document ID, reading time and document provenance;
step 1-2, grouping complete document reading sequences of users according to reading time and document provenance to obtain sub-sequences, setting a reading interval time threshold (for example, 8 hours), wherein records which do not exceed the interval time threshold and have the same document provenance are divided into the same sub-sequences, and records which exceed the interval time threshold or have different document provenance are divided into different sub-sequences;
step 1-3, processing the complete document reading sequences of all users with a Word2vec model (reference: Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013, January 17). Efficient Estimation of Word Representations in Vector Space. arXiv.org.), a neural network language model, to obtain a coarse-grained feature vector for each document, and processing the sub-sequences of all users with the Word2vec model to obtain a fine-grained feature vector for each document, wherein documents with similar reading contexts have similar feature vectors.
For the feature vectors obtained in step 1-3, the dimensionality is adjusted according to the requirements on efficiency and accuracy (set according to actual conditions): it is increased if more accurate recommendation results are needed and reduced if higher calculation efficiency is needed.
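The grouping rule of step 1-2 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `ReadRecord` type, its field names, and the function name are hypothetical, and the sketch assumes that the interval is measured between consecutive records after sorting by reading time, which the text does not state explicitly.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple


class ReadRecord(NamedTuple):
    """One entry of a reading sequence (field names are illustrative)."""
    doc_id: str
    read_time: datetime
    provenance: str


def split_into_subsequences(
    records: List[ReadRecord],
    threshold: timedelta = timedelta(hours=8),
) -> List[List[ReadRecord]]:
    """Group one user's records into sub-sequences.

    A record joins the current sub-sequence only if it has the same
    provenance as the previous record and the gap between the two does
    not exceed the threshold; otherwise it starts a new sub-sequence.
    """
    ordered = sorted(records, key=lambda r: r.read_time)
    subsequences: List[List[ReadRecord]] = []
    for rec in ordered:
        if subsequences:
            prev = subsequences[-1][-1]
            same_source = rec.provenance == prev.provenance
            within_gap = rec.read_time - prev.read_time <= threshold
            if same_source and within_gap:
                subsequences[-1].append(rec)
                continue
        subsequences.append([rec])
    return subsequences


# Made-up example data: d3 changes provenance, d4 comes 22 hours later.
example_records = [
    ReadRecord("d1", datetime(2018, 11, 1, 9, 0), "journal_a"),
    ReadRecord("d2", datetime(2018, 11, 1, 10, 0), "journal_a"),
    ReadRecord("d3", datetime(2018, 11, 1, 11, 0), "journal_b"),
    ReadRecord("d4", datetime(2018, 11, 2, 9, 0), "journal_b"),
]
example_groups = split_into_subsequences(example_records)
```

With the made-up records above, the first two records share a sub-sequence, while the provenance change and the long gap each open a new one.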
The step 2 comprises the following steps:
step 2-1, averaging all document coarse-grained feature vectors in a complete document reading sequence of a user to obtain a global interest vector of the user;
step 2-2, averaging all document fine-grained feature vectors in the user's document reading sub-sequence to obtain the user's reading context interest vector.
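Both parts of step 2 reduce to an element-wise mean over document vectors, as the sketch below shows. The function names and the tiny two-dimensional embeddings are hypothetical and serve only to make the arithmetic visible; real feature vectors would come from the Word2vec training in step 1-3.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Element-wise average of equally sized vectors."""
    if not vectors:
        raise ValueError("cannot average an empty list of vectors")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def interest_vector(doc_ids: Sequence[str], embeddings: Dict[str, Vector]) -> Vector:
    """Average the feature vectors of the documents in a reading (sub-)sequence."""
    return mean_vector([embeddings[d] for d in doc_ids])


# Illustrative embeddings only (not taken from the patent).
coarse = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [1.0, 1.0]}
fine = {"d2": [0.0, 2.0], "d3": [2.0, 0.0]}

# Global interest: coarse-grained vectors over the complete reading sequence.
global_interest = interest_vector(["d1", "d2", "d3"], coarse)
# Reading context interest: fine-grained vectors over one sub-sequence.
context_interest = interest_vector(["d2", "d3"], fine)
```

The same helper serves both cases; only the input sequence and the embedding table differ.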
The step 3 comprises the following steps:
step 3-1, calculating the similarity between users according to the users' global interest vectors and complete document reading sequences;
step 3-2, calculating the target user's degree of interest in each document;
step 3-3, ranking all documents using the calculation results of step 3-1 and step 3-2, and recommending the top N results to the target user.
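The final ranking in step 3-3 can be sketched as a top-N selection over per-document interest scores. The function name is hypothetical, the scores are made up, and skipping documents the target user has already read is an assumption of this sketch rather than something the step states.

```python
import heapq
from typing import Dict, List, Set


def recommend_top_n(scores: Dict[str, float], already_read: Set[str], n: int) -> List[str]:
    """Return the n highest-scoring documents, best first.

    Assumption: documents the target user has already read are skipped.
    """
    candidates = ((doc, s) for doc, s in scores.items() if doc not in already_read)
    best = heapq.nlargest(n, candidates, key=lambda pair: pair[1])
    return [doc for doc, _ in best]


# Made-up interest scores for one target user.
example_scores = {"d1": 0.9, "d2": 0.4, "d3": 0.7, "d4": 0.8}
example_top = recommend_top_n(example_scores, already_read={"d1"}, n=2)
```

`heapq.nlargest` avoids sorting the full score table when N is small compared with the number of documents.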
Step 3-1 comprises: calculating the similarity between the users according to the global interest vector of the users and the reading sequence of the complete documents, wherein the calculation formula is as follows:
wherein μ represents the target user, ν represents another user in the database, sim(μ, ν) represents the similarity between the target user μ and the other user ν, M_μ represents the set of documents read by user μ, M_ν represents the set of documents read by user ν, m represents a document in the intersection of M_μ and M_ν, the cosine term represents the cosine similarity of the two users' global interest vectors, and λ and θ are weighting coefficients (greater than 0).
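The similarity formula itself is not reproduced in this text, so the sketch below shows only the two ingredients named in its definitions: the set of documents that both users have read and the cosine similarity of two global interest vectors. How λ and θ combine them is left open here, and the function names are hypothetical.

```python
import math
from typing import Iterable, Sequence, Set


def shared_documents(read_by_mu: Iterable[str], read_by_nu: Iterable[str]) -> Set[str]:
    """Documents read by both users (the intersection of their document sets)."""
    return set(read_by_mu) & set(read_by_nu)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Returning 0.0 for a zero vector is a design choice of this sketch; it keeps users with no usable interest vector from raising a division error.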
Step 3-2 comprises: calculating the target user's interest in a document, wherein the calculation formula is as follows:
wherein p_i(μ, m) represents the interest of the target user μ in the document m, U_{μ,k} represents the set of the k users most similar to the target user μ, U_m represents the set of users who have read the document m, the remaining symbols denote the reading context interest vector of the user μ, the fine-grained feature vector of the document m, and the cosine similarity of those two vectors, and ω and one further coefficient are weight coefficients (greater than 0).
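The two user sets named in these definitions can be built as sketched below. The interest formula that consumes them is not reproduced in this text, so the sketch stops at constructing the sets; the function names and the example similarity values are hypothetical.

```python
import heapq
from typing import Dict, Iterable, Set


def k_most_similar_users(similarity_to_target: Dict[str, float], k: int) -> Set[str]:
    """The k users with the highest similarity to the target user."""
    top = heapq.nlargest(k, similarity_to_target.items(), key=lambda item: item[1])
    return {user for user, _ in top}


def readers_by_document(read_sets: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """Invert user -> documents into document -> users who have read it."""
    readers: Dict[str, Set[str]] = {}
    for user, docs in read_sets.items():
        for doc in docs:
            readers.setdefault(doc, set()).add(user)
    return readers


# Made-up similarities of other users to the target user.
example_neighbours = k_most_similar_users({"a": 0.9, "b": 0.2, "c": 0.5}, k=2)
example_readers = readers_by_document({"a": ["d1", "d2"], "b": ["d2"]})
```

Building the document-to-readers index once lets the reader set of any document be looked up directly instead of scanning every user's history.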
The invention is, for the first time, based on the idea of word vectors (Word2vec): it uses the neural network language model Skip-gram to obtain document features of different granularity from users' complete reading sequences and sub-sequences, expresses them as coarse-grained and fine-grained feature vectors respectively, and thereby provides a reliable solution to the problem of difficult document feature extraction. The user's global interest and reading context interest are obtained from the document feature vectors in the user's complete reading sequence and most recent reading sub-sequence, which offers a feasible approach to the problem that the user's reading context is difficult to extract and model. The proposed recommendation method considers the user's global interest and reading context interest together, so that the recommended documents better match the user's current preferences, the user's search cost is reduced, and user satisfaction is improved.
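To illustrate how Skip-gram can be applied to reading sequences, the sketch below generates (center, context) training pairs in which document IDs play the role of words and a reading sequence plays the role of a sentence. It covers only the pair generation, not the neural network training; the function name and the window size are assumptions of this sketch.

```python
from typing import List, Sequence, Tuple


def skipgram_pairs(sequence: Sequence[str], window: int = 2) -> List[Tuple[str, str]]:
    """(center, context) pairs for every document and its neighbours within the window."""
    pairs: List[Tuple[str, str]] = []
    for i, center in enumerate(sequence):
        lo = max(0, i - window)
        hi = min(len(sequence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sequence[j]))
    return pairs


# A made-up reading sequence of three documents.
example_pairs = skipgram_pairs(["d1", "d2", "d3"], window=1)
```

Feeding complete reading sequences to such a procedure would yield the pairs behind the coarse-grained vectors, while feeding sub-sequences would yield the pairs behind the fine-grained ones, since pairs never cross a sub-sequence boundary.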
Drawings
The foregoing and other advantages of the invention will become more apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings.
FIG. 1 is a schematic diagram of a recommendation system architecture of a document recommendation method based on a word vector model according to the present invention.
FIG. 2 is a schematic diagram of a user document preference prediction process of the document recommendation method based on the word vector model according to the present invention.
Detailed Description
The invention is further explained below with reference to the drawings and the embodiments.
As shown in fig. 1 and 2, the present invention specifically includes the following steps:
step 1, obtaining 25000 historical records from the complete document reading sequences of 1000 users, wherein each record in a reading sequence comprises a document ID, a reading time, and a document provenance;
step 2, setting the reading interval time threshold to 8 hours and grouping the users' complete document reading sequences according to reading time and document provenance to obtain 3300 sub-sequences, wherein records whose reading time interval is shorter than 8 hours and whose document provenance is the same are placed in the same sub-sequence, and records whose reading time interval is longer than 8 hours or whose document provenance differs are placed in different sub-sequences;
step 3, processing the complete document reading sequences of all users with a Word2vec model, a neural network language model, to obtain a coarse-grained feature vector for each document, and processing the sub-sequences of all users with the Word2vec model to obtain a fine-grained feature vector for each document, wherein documents with similar reading contexts have similar feature vectors;
step 4, setting the dimensionality of the feature vectors to 16 according to the requirements on efficiency and accuracy;
step 5, averaging all document coarse-grained feature vectors in the user's complete document reading sequence to obtain the user's global interest vector;
step 6, averaging all document fine-grained feature vectors in the user's most recent document reading sub-sequence to obtain the user's reading context interest vector;
step 7, calculating the similarity between users according to the users' global interest vectors and complete document reading sequences, wherein the calculation formula is as follows:
wherein μ represents the target user, ν represents another user in the database, M_μ represents the set of documents read by user μ, M_ν represents the set of documents read by user ν, the cosine term represents the cosine similarity of the two users' global interest vectors, and λ and θ are weight coefficients, both set to 1;
step 8, calculating the target user's interest in each document, wherein the calculation formula is as follows:
wherein μ denotes the target user, U_{μ,k} represents the set of the k users most similar to μ, U_m represents the set of users who have read the document m, the remaining symbols denote the reading context interest vector of the user μ, the fine-grained feature vector of the document m, and the cosine similarity of the two, and ω and one further coefficient are weight coefficients, both set to 1;
step 9, for the target user, ranking all documents by interest degree using the results of step 8, and recommending the 3 to 5 documents with the highest interest degree to the target user during reading, thereby providing extended reading and personalized recommendation as part of the target user's reading experience.
The present invention provides a word vector-based document recommendation method, and there are many methods and approaches for implementing this technical solution. The above description is only a preferred embodiment of the invention. It should be noted that those skilled in the art can make a number of modifications and refinements without departing from the principle of the invention, and these should also be regarded as falling within the protection scope of the invention. All components not specified in this embodiment can be realized by the prior art.
Claims (1)
1. A document recommendation method based on word vectors is characterized by comprising the following steps:
step 1, extracting characteristics of user reading documents and contexts based on a neural network language model;
step 2, calculating a global interest vector and a context interest vector read by the user based on the document sequence characteristics of the user;
step 3, establishing a mathematical model to calculate the similarity of the users and the literature interest index, and realizing the literature recommendation of the user reading;
the step 1 comprises the following steps:
step 1-1, acquiring a complete document reading sequence of a user, wherein each record in the reading sequence comprises a document ID, reading time and document provenance;
step 1-2, grouping complete document reading sequences of users according to reading time and document provenance to obtain sub-sequences, setting a reading interval time threshold, dividing records which do not exceed the interval time threshold and have the same document provenance into the same sub-sequence, and dividing records which exceed the interval time threshold or have different document provenance into different sub-sequences;
step 1-3, processing complete document reading sequences of all users by using a Word2vec model in a neural network language model to obtain a coarse-grained feature vector of each document, and processing subsequences of all users by using the Word2vec language model to obtain a fine-grained feature vector of each document, wherein documents with similar reading contexts have similar feature vectors;
for the feature vectors obtained in the step 1-3, adjusting the dimensionality of the feature vectors according to the requirements on efficiency and accuracy, increasing the dimensionality of the feature vectors if more accurate recommendation results are needed, and reducing the dimensionality of the feature vectors if higher calculation efficiency is needed;
the step 2 comprises the following steps:
step 2-1, averaging all document coarse-grained feature vectors in a complete document reading sequence of a user to obtain a global interest vector of the user;
step 2-2, averaging all document fine-grained feature vectors in the user document reading subsequence to obtain a reading context interest vector of the user;
the step 3 comprises the following steps:
step 3-1, calculating the similarity between users according to the global interest vector of the users and the reading sequence of the complete literature;
step 3-2, calculating the target user's degree of interest in each document;
step 3-3, sorting all the documents by using the calculation results of the step 3-1 and the step 3-2, and recommending the first N results to a target user;
step 3-1 comprises: calculating the similarity between the users according to the global interest vector of the users and the reading sequence of the complete documents, wherein the calculation formula is as follows:
wherein μ represents the target user, ν represents another user in the database, sim(μ, ν) represents the similarity between the target user μ and the other user ν, M_μ represents the set of documents read by user μ, M_ν represents the set of documents read by user ν, m represents a document in the intersection of M_μ and M_ν, the cosine term represents the cosine similarity of the two users' global interest vectors, and λ and θ are weight coefficients;
step 3-2 comprises: calculating the target user's interest in a document, wherein the calculation formula is as follows:
wherein p_i(μ, m) represents the interest of the target user μ in the document m, U_{μ,k} represents the set of the k users most similar to the target user μ, U_m represents the set of users who have read the document m, the remaining symbols denote the reading context interest vector of the user μ, the fine-grained feature vector of the document m, and the cosine similarity of those two vectors, and ω and one further coefficient are weight coefficients.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811415870.XA CN109508421B (en) | 2018-11-26 | 2018-11-26 | Word vector-based document recommendation method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109508421A | 2019-03-22 |
CN109508421B | 2020-11-13 |
Family
ID=65750530
Family Applications (1)
Application Number | Status | Priority Date | Filing Date | Title |
---|---|---|---|---|
CN201811415870.XA (CN109508421B) | Active | 2018-11-26 | 2018-11-26 | Word vector-based document recommendation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109508421B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110929209B (en) * | 2019-12-06 | 2023-06-20 | 北京百度网讯科技有限公司 | Method and device for transmitting information |
CN114281961B (en) * | 2021-11-15 | 2024-07-26 | 北京智谱华章科技有限公司 | Scientific literature interest evaluation method and device based on biological dynamics model |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102929928A (en) * | 2012-09-21 | 2013-02-13 | 北京格致璞科技有限公司 | Multidimensional-similarity-based personalized news recommendation method |
CN105279288A (en) * | 2015-12-04 | 2016-01-27 | 深圳大学 | Online content recommending method based on deep neural network |
CN107357793A (en) * | 2016-05-10 | 2017-11-17 | 腾讯科技(深圳)有限公司 | Information recommendation method and device |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8429106B2 (en) * | 2008-12-12 | 2013-04-23 | Atigeo Llc | Providing recommendations using information determined for domains of interest |
Also Published As
Publication number | Publication date |
---|---|
CN109508421A (en) | 2019-03-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107102989B (en) | Entity disambiguation method based on word vector and convolutional neural network | |
CN107944913B (en) | High-potential user purchase intention prediction method based on big data user behavior analysis | |
Wei et al. | Collaborative filtering and deep learning based recommendation system for cold start items | |
CN107562742B (en) | Image data processing method and device | |
CN107133277B (en) | A kind of tourist attractions recommended method based on Dynamic Theme model and matrix decomposition | |
CN103886067A (en) | Method for recommending books through label implied topic | |
CN104935963A (en) | Video recommendation method based on timing sequence data mining | |
Ullah et al. | Image-based service recommendation system: A JPEG-coefficient RFs approach | |
CN111159341B (en) | Information recommendation method and device based on user investment and financial management preference | |
CN108804577B (en) | Method for estimating interest degree of information tag | |
CN109508421B (en) | Word vector-based document recommendation method | |
CN112434533A (en) | Entity disambiguation method, apparatus, electronic device, and computer-readable storage medium | |
JP5481295B2 (en) | Object recommendation device, object recommendation method, object recommendation program, and object recommendation system | |
CN115867919A (en) | Graph structure aware incremental learning for recommendation systems | |
CN111538846A (en) | Third-party library recommendation method based on mixed collaborative filtering | |
Yamamoto et al. | PBG at the NTCIR-13 Lifelog-2 LAT, LSAT, and LEST Tasks. | |
CN116521906A (en) | Meta description generation method, device, equipment and medium thereof | |
Chang et al. | An interactive approach to integrating external textual knowledge for multimodal lifelog retrieval | |
CN113095883B (en) | Video payment user prediction method and system based on deep cross attention network | |
CN112052388B (en) | Method and system for recommending food stores | |
CN117956232A (en) | Video recommendation method and device | |
CN114880572B (en) | Intelligent news client recommendation system | |
CN115203532B (en) | Project recommendation method and device, electronic equipment and storage medium | |
CN113688281B (en) | Video recommendation method and system based on deep learning behavior sequence | |
CN115905682A (en) | Interest point recommendation method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information | Address after: No.1 Lingshan South Road, Qixia District, Nanjing, Jiangsu Province, 210000; Applicant after: THE 28TH RESEARCH INSTITUTE OF CHINA ELECTRONICS TECHNOLOGY Group Corp.; Address before: 210007 No. 1 East Street, alfalfa garden, Jiangsu, Nanjing; Applicant before: THE 28TH RESEARCH INSTITUTE OF CHINA ELECTRONICS TECHNOLOGY Group Corp. |
GR01 | Patent grant | ||