CN109508421B - Word vector-based document recommendation method - Google Patents

Word vector-based document recommendation method

Info

Publication number
CN109508421B
Authority
CN
China
Prior art keywords
document
user
reading
users
documents
Prior art date
Legal status
Active
Application number
CN201811415870.XA
Other languages
Chinese (zh)
Other versions
CN109508421A (en)
Inventor
后弘毅
杨权
梁栋
Current Assignee
CETC 28 Research Institute
Original Assignee
CETC 28 Research Institute
Priority date
Filing date
Publication date
Application filed by CETC 28 Research Institute
Priority to CN201811415870.XA
Publication of CN109508421A
Application granted
Publication of CN109508421B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a word vector-based document recommendation method. A neural network language model extracts text feature vectors from users' document reading sequences; the user's global interest is derived from the complete reading sequence, and the user's reading-context interest from recent reading subsequences. At recommendation time, the global interest and the current reading-context interest are considered together, so that the recommended documents meet the user's actual needs and preferences.

Description

Word vector-based document recommendation method
Technical Field
The invention relates to a document recommendation method based on word vectors.
Background
Early document recommendation mainly adopted content-based algorithms: low-level features of documents are analyzed through text labels, and documents with similar content are placed into a recommended reading list. For example, a document recommendation model based on MFCC and GMM was proposed to extract documents' text label features, but extracting the label data is very time-consuming; today documents are updated rapidly, with tens of thousands of new documents released every day, so document recommendation based on text label features has gradually been abandoned.
Ever since the Tapestry system adopted collaborative filtering to deal with information overload, collaborative filtering has been rapidly applied to recommendation in other fields. Well-known foreign document platforms adopt collaborative filtering: user behavior records are stored on a server, a number of 'nearest neighbors' with similar interests and preferences are identified from these records, and documents that the nearest neighbors like but the target user has not yet browsed are recommended to the target user. In recent domestic research, Wang Jun et al. proposed the concept of a hierarchical document recommendation system: on one hand, document preference similarity between users is used for collaborative-filtering document recommendation; on the other hand, the similarity of document content covers multiple dimensions such as topic, sentiment, writing style and wording. The two aspects are linked so that the advantages of both are fully exploited, improving recommendation satisfaction.
Unlike recommendation in other fields, users read documents out of personal interest or to support work and study, which distinguishes document recommendation from mainstream e-commerce recommendation. In e-commerce and movie recommendation, explicit ratings are relatively easy to collect, because users are more willing to give active feedback to the system after paying a higher financial or time cost (e.g., buying an item, spending two hours watching a movie). Documents are low-cost products: users will not go out of their way to rate them, and the system can only record user context information (such as the user's reading behavior and registration information).
Disclosure of Invention
Purpose of the invention: to address the shortcomings of traditional document recommendation systems, the invention provides a document recommendation algorithm that, on the basis of word vectors, incorporates the context of the user's reading list.
Technical scheme: the invention discloses a word vector-based document recommendation method, which comprises the following steps:
step 1, extracting features of the documents read by the user and of their reading contexts based on a neural network language model;
step 2, calculating the user's global interest vector and reading-context interest vector based on the user's document sequence features;
step 3, establishing a mathematical model to calculate user similarity and document interest scores, and producing document recommendations for the user's reading.
Step 1 comprises:
step 1-1, acquiring a complete document reading sequence of a user, wherein each record in the reading sequence comprises a document ID, reading time and document provenance;
step 1-2, grouping the user's complete document reading sequence by reading time and document provenance to obtain subsequences: a reading-interval time threshold is set (for example, 8 hours); records whose interval does not exceed the threshold and whose document provenance is the same are placed in the same subsequence, and records whose interval exceeds the threshold or whose document provenance differs are placed in different subsequences;
step 1-3, processing the complete document reading sequences of all users with the Word2vec model (reference: Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781.) of the neural network language model family to obtain a coarse-grained feature vector for each document, and processing the subsequences of all users with the same Word2vec model to obtain a fine-grained feature vector for each document; documents with similar reading contexts have similar feature vectors.
For the feature vectors obtained in step 1-3, the dimensionality is adjusted according to the requirements on efficiency and accuracy: increase the dimensionality if more accurate recommendation results are needed, and reduce it (set according to the actual situation) if higher computational efficiency is needed. A minimal code sketch of steps 1-1 to 1-3 follows.
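The sketch below illustrates one way to realize steps 1-1 to 1-3: each document ID is treated as a "word" and each reading sequence as a "sentence", and a Skip-gram Word2vec model is trained once on the complete sequences and once on the subsequences. It is a minimal sketch under stated assumptions: the toy record layout, the variable names and the gensim 4.x API are illustrative choices, not part of the patent.

```python
from gensim.models import Word2Vec

# Hypothetical toy data: each user's complete reading history as an ordered list
# of document IDs (real records would also carry reading time and provenance).
complete_sequences = [
    ["doc_101", "doc_205", "doc_333", "doc_101"],   # user A
    ["doc_205", "doc_412", "doc_333"],               # user B
]
# Subsequences produced by the time/provenance splitting rule of step 1-2.
sub_sequences = [
    ["doc_101", "doc_205"],
    ["doc_333", "doc_101"],
    ["doc_205", "doc_412", "doc_333"],
]

# Coarse-grained document vectors: Skip-gram (sg=1) over the complete sequences.
coarse_model = Word2Vec(complete_sequences, vector_size=16, window=5,
                        min_count=1, sg=1, epochs=50, seed=42)
# Fine-grained document vectors: the same model family over the subsequences.
fine_model = Word2Vec(sub_sequences, vector_size=16, window=5,
                      min_count=1, sg=1, epochs=50, seed=42)

coarse_vec = coarse_model.wv["doc_101"]   # 16-dimensional coarse-grained vector
fine_vec = fine_model.wv["doc_101"]       # 16-dimensional fine-grained vector
print(coarse_vec.shape, fine_vec.shape)
```

Documents that co-occur in similar reading contexts end up with nearby vectors, which is the property the later steps rely on.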
Step 2 comprises:
step 2-1, averaging the coarse-grained feature vectors of all documents in the user's complete document reading sequence to obtain the user's global interest vector;
step 2-2, averaging the fine-grained feature vectors of all documents in the user's document reading subsequence to obtain the user's reading-context interest vector (a minimal sketch of this averaging is given below).
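A minimal, self-contained sketch of the averaging in steps 2-1 and 2-2; the toy document IDs and the 2-dimensional vectors merely stand in for the Word2vec outputs of step 1 and are not from the patent.

```python
import numpy as np

def mean_vector(doc_ids, doc_vectors):
    """Average the feature vectors of the given documents, skipping unknown IDs."""
    vecs = [doc_vectors[d] for d in doc_ids if d in doc_vectors]
    return np.mean(vecs, axis=0) if vecs else None

# Toy vectors standing in for the coarse- and fine-grained Word2vec outputs.
coarse = {"doc_101": np.array([0.2, 0.1]),
          "doc_205": np.array([0.4, 0.3]),
          "doc_333": np.array([0.0, 0.5])}
fine = {"doc_101": np.array([0.6, 0.1]),
        "doc_333": np.array([0.2, 0.9])}

# Step 2-1: global interest vector = mean coarse-grained vector over the complete sequence.
global_interest = mean_vector(["doc_101", "doc_205", "doc_333"], coarse)
# Step 2-2: reading-context interest vector = mean fine-grained vector over the recent subsequence.
context_interest = mean_vector(["doc_333", "doc_101"], fine)
print(global_interest, context_interest)
```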
Step 3 comprises:
step 3-1, calculating the similarity between users according to their global interest vectors and their complete document reading sequences;
step 3-2, calculating the target user's degree of interest in each document;
step 3-3, ranking all documents using the results of step 3-1 and step 3-2, and recommending the top N results to the target user.
Step 3-1 comprises: calculating the similarity between users according to their global interest vectors and their complete document reading sequences. The formula for sim(μ, ν) appears only as an image in the original publication and is not reproduced here; in it, μ denotes the target user, ν denotes another user in the database, sim(μ, ν) denotes the similarity between the target user μ and the other user ν, M_μ denotes the set of documents read by user μ, M_ν denotes the set of documents read by user ν, m denotes a document in the intersection of M_μ and M_ν, the cosine similarity of the two users' global interest vectors also enters the formula, and λ and θ are weight coefficients (greater than 0).
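Because the similarity formula is published only as an image, the LaTeX sketch below gives one plausible form that is consistent with the variables just listed: an overlap term over M_μ and M_ν combined with the cosine similarity of the global interest vectors, weighted by λ and θ. The exact combination and the symbols g_μ, g_ν are assumptions for illustration, not the patented formula.

```latex
% Hypothetical reconstruction -- the published formula is available only as an image.
\[
\operatorname{sim}(\mu,\nu)
  \;=\; \lambda \cdot
        \frac{\lvert M_{\mu} \cap M_{\nu} \rvert}
             {\sqrt{\lvert M_{\mu} \rvert \cdot \lvert M_{\nu} \rvert}}
  \;+\; \theta \cdot \cos\!\bigl(\vec{g}_{\mu}, \vec{g}_{\nu}\bigr)
\]
% where \vec{g}_{\mu} and \vec{g}_{\nu} denote the global interest vectors of users
% \mu and \nu (symbol names assumed for this sketch).
```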
Step 3-2 comprises: calculating the target user's interest in a document. The formula for p_i(μ, m) likewise appears only as an image in the original publication; in it, p_i(μ, m) denotes the interest of the target user μ in document m, U_{μ,k} denotes the set of the k users most similar to the target user μ, U_m denotes the set of users who have read document m, the reading-context interest vector of user μ and the fine-grained feature vector of document m enter the formula through their cosine similarity, and ω together with a second coefficient shown only in the image are weight coefficients (greater than 0).
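Since the scoring formula is available only as an image, the sketch below implements one plausible reading of steps 3-2 and 3-3: candidate documents read by the k most similar users are scored by a weighted sum of those users' similarities plus the cosine similarity between the target user's reading-context interest vector and the document's fine-grained vector, and the top N are returned. The combination rule, the parameter names omega and rho, and the data-structure layout are assumptions for illustration, not the patented formula.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity with a zero-vector guard."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

def recommend(target_user, top_k_users, user_similarity, readers_of,
              context_interest, fine_vectors, already_read, n=5,
              omega=1.0, rho=1.0):
    """Score and rank candidate documents for one user, return the top N (step 3-3).

    user_similarity[(u, v)] -- sim(u, v) from step 3-1
    readers_of[doc_id]      -- set of users who have read the document (U_m)
    context_interest[u]     -- reading-context interest vector of user u
    fine_vectors[doc_id]    -- fine-grained feature vector of the document
    omega, rho              -- assumed names for the two weight coefficients
    """
    scores = {}
    for doc_id, readers in readers_of.items():
        if doc_id in already_read:
            continue
        # Evidence from the k most similar users who have read this document.
        neighbour_term = sum(user_similarity[(target_user, v)]
                             for v in top_k_users if v in readers)
        if neighbour_term == 0.0:
            continue
        # Evidence from the match between reading context and document content.
        context_term = cosine(context_interest[target_user], fine_vectors[doc_id])
        # Assumed combination: weighted sum of the two kinds of evidence.
        scores[doc_id] = omega * neighbour_term + rho * context_term
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]
```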
The invention is the first to apply the Word2vec word-vector idea to this problem: the Skip-gram neural network language model is used to extract document features at different granularities from a user's complete reading sequence and its subsequences, expressed respectively as coarse-grained and fine-grained feature vectors, which provides a reliable solution to the difficulty of extracting document features. The user's global interest and reading-context interest are obtained from the document feature vectors of the complete reading sequence and of the most recent reading subsequence, offering a feasible approach to the difficulty of extracting and modeling the user's reading context. A recommendation method that jointly considers the user's global interest and reading-context interest is provided, so that the recommended documents better match the user's current preferences, reducing the user's search cost and improving user satisfaction.
Drawings
The foregoing and other advantages of the invention will become more apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings.
FIG. 1 is a schematic diagram of a recommendation system architecture of a document recommendation method based on a word vector model according to the present invention.
FIG. 2 is a schematic diagram of a user document preference prediction process of the document recommendation method based on the word vector model according to the present invention.
Detailed Description
The invention is further explained below with reference to the drawings and the embodiments.
As shown in fig. 1 and 2, the present invention specifically includes the following steps:
step 1, 25,000 historical records forming the complete document reading sequences of 1,000 users are obtained; each record in a reading sequence comprises a document ID, a reading time and a document provenance;
step 2, a reading-interval time threshold of 8 hours is set, and the users' complete document reading sequences are grouped by reading time and document provenance into 3,300 subsequences: records whose reading interval is shorter than 8 hours and whose document provenance is the same are placed in the same subsequence, while records whose interval exceeds 8 hours or whose provenance differs are placed in different subsequences (see the sketch below);
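A minimal sketch of this grouping rule (an 8-hour gap or a change of provenance starts a new subsequence); the record layout and the example values are assumptions for illustration.

```python
from datetime import datetime, timedelta

GAP = timedelta(hours=8)  # reading-interval time threshold used in the embodiment

def split_into_subsequences(records):
    """records: list of (doc_id, read_time: datetime, provenance), sorted by read_time."""
    subsequences, current = [], []
    for doc_id, read_time, provenance in records:
        if current:
            _, prev_time, prev_prov = current[-1]
            # Start a new subsequence if the gap exceeds 8 hours or the provenance changes.
            if read_time - prev_time > GAP or provenance != prev_prov:
                subsequences.append([d for d, _, _ in current])
                current = []
        current.append((doc_id, read_time, provenance))
    if current:
        subsequences.append([d for d, _, _ in current])
    return subsequences

# Example: two reads of the same journal within an hour, then one read a day later.
records = [
    ("doc_101", datetime(2018, 11, 26, 9, 0), "journal_A"),
    ("doc_205", datetime(2018, 11, 26, 9, 40), "journal_A"),
    ("doc_333", datetime(2018, 11, 27, 10, 0), "journal_A"),
]
print(split_into_subsequences(records))  # [['doc_101', 'doc_205'], ['doc_333']]
```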
step 3, the complete document reading sequences of all users are processed with the Word2vec model of the neural network language model family to obtain a coarse-grained feature vector for each document, and the subsequences of all users are processed with the same Word2vec model to obtain a fine-grained feature vector for each document; documents with similar reading contexts have similar feature vectors;
and 4, converting the dimensionality of the feature vector into 16 dimensions according to the requirements on efficiency and accuracy.
Step 5, averaging all document coarse-grained feature vectors in the complete document reading sequence of the user to obtain a global interest vector of the user;
and 6, averaging all the fine-grained feature vectors of the documents in the recent document reading subsequence of the user to obtain a reading context interest vector of the user.
step 7, calculating the similarity between users according to their global interest vectors and their complete document reading sequences, using the formula of step 3-1 (published only as an image), where μ denotes the target user, ν denotes another user in the database, M_μ denotes the set of documents read by user μ, M_ν denotes the set of documents read by user ν, the cosine similarity of the users' global interest vectors enters the formula, and the weight coefficients λ and θ are both set to 1;
step 8, calculating the target user's degree of interest in each document, using the formula of step 3-2 (published only as an image), where μ denotes the target user, U_{μ,k} denotes the set of the k users most similar to μ, U_m denotes the set of users who have read document m, the cosine similarity between the reading-context interest vector of user μ and the fine-grained feature vector of document m enters the formula, and both weight coefficients (ω and the second coefficient shown only in the image) are set to 1;
and 9, aiming at the target user, sequencing the user interest degrees of all the documents by using the calculation result in the step 8, recommending the first 3-5 results with the maximum interest degrees to the target user in the reading process of the user, and realizing the functions of extended reading and personalized recommendation in the reading experience of the target user.
The present invention provides a word vector-based document recommendation method, and there are many specific ways to implement this technical solution; the above description is only a preferred embodiment of the invention. It should be noted that those skilled in the art can make a number of improvements and refinements without departing from the principle of the invention, and such improvements and refinements should also be regarded as falling within the protection scope of the invention. All components not specified in this embodiment can be implemented with the prior art.

Claims (1)

1. A document recommendation method based on word vectors is characterized by comprising the following steps:
step 1, extracting characteristics of user reading documents and contexts based on a neural network language model;
step 2, calculating a global interest vector and a context interest vector read by the user based on the document sequence characteristics of the user;
step 3, establishing a mathematical model to calculate user similarity and document interest scores, and realizing document recommendation for the user's reading;
the step 1 comprises the following steps:
step 1-1, acquiring a complete document reading sequence of a user, wherein each record in the reading sequence comprises a document ID, reading time and document provenance;
step 1-2, grouping complete document reading sequences of users according to reading time and document provenance to obtain sub-sequences, setting a reading interval time threshold, dividing records which do not exceed the interval time threshold and have the same document provenance into the same sub-sequence, and dividing records which exceed the interval time threshold or have different document provenance into different sub-sequences;
step 1-3, processing the complete document reading sequences of all users by using a Word2vec model in a neural network language model to obtain a coarse-grained feature vector of each document, and processing the subsequences of all users by using the Word2vec language model to obtain a fine-grained feature vector of each document, wherein documents with similar reading contexts have similar feature vectors;
for the feature vectors obtained in the step 1-3, adjusting the dimensionality of the feature vectors according to the requirements on efficiency and accuracy, increasing the dimensionality of the feature vectors if more accurate recommendation results are needed, and reducing the dimensionality of the feature vectors if higher calculation efficiency is needed;
the step 2 comprises the following steps:
step 2-1, averaging all document coarse-grained feature vectors in a complete document reading sequence of a user to obtain a global interest vector of the user;
step 2-2, averaging all document fine-grained feature vectors in the user document reading subsequence to obtain a reading context interest vector of the user;
the step 3 comprises the following steps:
step 3-1, calculating the similarity between users according to the global interest vector of the users and the reading sequence of the complete literature;
step 3-2, calculating the target user's degree of interest in the documents;
step 3-3, ranking all the documents by using the calculation results of step 3-1 and step 3-2, and recommending the top N results to a target user;
step 3-1 comprises: calculating the similarity between users according to their global interest vectors and their complete document reading sequences; the formula for sim(μ, ν) appears only as an image in the original publication, wherein μ denotes the target user, ν denotes another user in the database, sim(μ, ν) denotes the similarity between the target user μ and the other user ν, M_μ denotes the set of documents read by user μ, M_ν denotes the set of documents read by user ν, m denotes a document in the intersection of M_μ and M_ν, the cosine similarity of the users' global interest vectors enters the formula, and λ and θ are weight coefficients;
step 3-2 comprises: calculating the target user's interest in a document; the formula for p_i(μ, m) appears only as an image in the original publication, wherein p_i(μ, m) denotes the interest of the target user μ in document m, U_{μ,k} denotes the set of the k users most similar to the target user μ, U_m denotes the set of users who have read document m, the reading-context interest vector of user μ and the fine-grained feature vector of document m enter the formula through their cosine similarity, and ω together with a second coefficient shown only in the image are weight coefficients.
CN201811415870.XA 2018-11-26 2018-11-26 Word vector-based document recommendation method Active CN109508421B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811415870.XA CN109508421B (en) 2018-11-26 2018-11-26 Word vector-based document recommendation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811415870.XA CN109508421B (en) 2018-11-26 2018-11-26 Word vector-based document recommendation method

Publications (2)

Publication Number Publication Date
CN109508421A CN109508421A (en) 2019-03-22
CN109508421B (en) 2020-11-13

Family

ID=65750530

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811415870.XA Active CN109508421B (en) 2018-11-26 2018-11-26 Word vector-based document recommendation method

Country Status (1)

Country Link
CN (1) CN109508421B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110929209B (en) * 2019-12-06 2023-06-20 北京百度网讯科技有限公司 Method and device for transmitting information
CN114281961B (en) * 2021-11-15 2024-07-26 北京智谱华章科技有限公司 Scientific literature interest evaluation method and device based on biological dynamics model

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102929928A (en) * 2012-09-21 2013-02-13 北京格致璞科技有限公司 Multidimensional-similarity-based personalized news recommendation method
CN105279288A (en) * 2015-12-04 2016-01-27 深圳大学 Online content recommending method based on deep neural network
CN107357793A (en) * 2016-05-10 2017-11-17 腾讯科技(深圳)有限公司 Information recommendation method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8429106B2 (en) * 2008-12-12 2013-04-23 Atigeo Llc Providing recommendations using information determined for domains of interest

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102929928A (en) * 2012-09-21 2013-02-13 北京格致璞科技有限公司 Multidimensional-similarity-based personalized news recommendation method
CN105279288A (en) * 2015-12-04 2016-01-27 深圳大学 Online content recommending method based on deep neural network
CN107357793A (en) * 2016-05-10 2017-11-17 腾讯科技(深圳)有限公司 Information recommendation method and device

Also Published As

Publication number Publication date
CN109508421A (en) 2019-03-22

Similar Documents

Publication Publication Date Title
CN107102989B (en) Entity disambiguation method based on word vector and convolutional neural network
CN107944913B (en) High-potential user purchase intention prediction method based on big data user behavior analysis
Wei et al. Collaborative filtering and deep learning based recommendation system for cold start items
CN107562742B (en) Image data processing method and device
CN107133277B (en) A kind of tourist attractions recommended method based on Dynamic Theme model and matrix decomposition
CN103886067A (en) Method for recommending books through label implied topic
CN104935963A (en) Video recommendation method based on timing sequence data mining
Ullah et al. Image-based service recommendation system: A JPEG-coefficient RFs approach
CN111159341B (en) Information recommendation method and device based on user investment and financial management preference
CN108804577B (en) Method for estimating interest degree of information tag
CN109508421B (en) Word vector-based document recommendation method
CN112434533A (en) Entity disambiguation method, apparatus, electronic device, and computer-readable storage medium
JP5481295B2 (en) Object recommendation device, object recommendation method, object recommendation program, and object recommendation system
CN115867919A (en) Graph structure aware incremental learning for recommendation systems
CN111538846A (en) Third-party library recommendation method based on mixed collaborative filtering
Yamamoto et al. PBG at the NTCIR-13 Lifelog-2 LAT, LSAT, and LEST Tasks.
CN116521906A (en) Meta description generation method, device, equipment and medium thereof
Chang et al. An interactive approach to integrating external textual knowledge for multimodal lifelog retrieval
CN113095883B (en) Video payment user prediction method and system based on deep cross attention network
CN112052388B (en) Method and system for recommending food stores
CN117956232A (en) Video recommendation method and device
CN114880572B (en) Intelligent news client recommendation system
CN115203532B (en) Project recommendation method and device, electronic equipment and storage medium
CN113688281B (en) Video recommendation method and system based on deep learning behavior sequence
CN115905682A (en) Interest point recommendation method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: No.1 Lingshan South Road, Qixia District, Nanjing, Jiangsu Province, 210000

Applicant after: THE 28TH RESEARCH INSTITUTE OF CHINA ELECTRONICS TECHNOLOGY Group Corp.

Address before: No. 1 Muxuyuan East Street, Nanjing, Jiangsu, 210007

Applicant before: THE 28TH RESEARCH INSTITUTE OF CHINA ELECTRONICS TECHNOLOGY Group Corp.

GR01 Patent grant