CN109508421B - Word vector-based document recommendation method - Google Patents

Word vector-based document recommendation method

Info

Publication number
CN109508421B
Authority
CN
China
Prior art keywords
document
user
reading
users
documents
Prior art date
Legal status
Active
Application number
CN201811415870.XA
Other languages
Chinese (zh)
Other versions
CN109508421A (en)
Inventor
后弘毅
杨权
梁栋
Current Assignee
CETC 28 Research Institute
Original Assignee
CETC 28 Research Institute
Priority date
Filing date
Publication date
Application filed by CETC 28 Research Institute
Priority to CN201811415870.XA
Publication of CN109508421A
Application granted
Publication of CN109508421B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a word vector-based document recommendation method. A neural network language model extracts text feature vectors from users' document reading sequences; the user's global interest is derived from the complete reading sequence, and the user's reading-context interest from recent reading subsequences. At recommendation time, the global interest and the current reading-context interest are considered together, so that the recommended documents meet the user's actual needs and preferences.

Description

Word vector-based document recommendation method
Technical Field
The invention relates to a document recommendation method based on word vectors.
Background
Early document recommendation mainly adopted content-based algorithms: low-level features of documents are analyzed through text labels, and documents with similar content are placed into a recommended reading list. For example, a document recommendation model based on MFCC and GMM was proposed to extract documents' text label features, but extracting the label data is very time-consuming; today documents are updated rapidly, with tens of thousands of new documents released every day, so document recommendation based on text label features has gradually been abandoned.
Ever since the Tapestry system adopted collaborative filtering to deal with information overload, collaborative filtering has been rapidly applied to recommendation in other fields. Well-known foreign document platforms adopt collaborative filtering: user behavior records are stored on a server, a number of 'nearest neighbors' with similar interests and preferences are identified from these records, and documents that the nearest neighbors like but the target user has not yet browsed are recommended to the target user. In recent domestic research, Wang Jun et al. proposed the concept of a hierarchical document recommendation system: on one hand, document preference similarity between users is used for collaborative-filtering document recommendation; on the other hand, the similarity of document content covers multiple dimensions such as topic, sentiment, writing style and wording. The two aspects are linked so that the advantages of both are fully exploited, improving recommendation satisfaction.
Unlike recommendation in other fields, users read documents out of personal interest or to support work and study, which distinguishes document recommendation from mainstream e-commerce recommendation. In e-commerce and movie recommendation, explicit ratings are relatively easy to collect, because users are more willing to give active feedback to the system after paying a higher financial or time cost (e.g., buying an item, spending two hours watching a movie). Documents are low-cost products: users will not go out of their way to rate them, and the system can only record user context information (such as the user's reading behavior and registration information).
Disclosure of Invention
Purpose of the invention: to address the shortcomings of traditional document recommendation systems, the invention provides a document recommendation algorithm that, on the basis of word vectors, incorporates the context of the user's reading list.
Technical scheme: the invention discloses a word vector-based document recommendation method, which comprises the following steps:
step 1, extracting features of the documents read by the user and of their reading contexts based on a neural network language model;
step 2, calculating the user's global interest vector and reading-context interest vector based on the user's document sequence features;
step 3, establishing a mathematical model to calculate user similarity and document interest scores, and producing document recommendations for the user's reading.
Step 1 comprises:
step 1-1, acquiring a complete document reading sequence of a user, wherein each record in the reading sequence comprises a document ID, reading time and document provenance;
step 1-2, grouping the user's complete document reading sequence by reading time and document provenance to obtain subsequences: a reading-interval time threshold is set (for example, 8 hours); records whose interval does not exceed the threshold and whose document provenance is the same are placed in the same subsequence, and records whose interval exceeds the threshold or whose document provenance differs are placed in different subsequences;
step 1-3, processing the complete document reading sequences of all users with the Word2vec model (reference: Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781.) of the neural network language model family to obtain a coarse-grained feature vector for each document, and processing the subsequences of all users with the same Word2vec model to obtain a fine-grained feature vector for each document; documents with similar reading contexts have similar feature vectors.
For the feature vectors obtained in step 1-3, the dimensionality is adjusted according to the requirements on efficiency and accuracy: increase the dimensionality if more accurate recommendation results are needed, and reduce it (set according to the actual situation) if higher computational efficiency is needed. A minimal code sketch of steps 1-1 to 1-3 follows.
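The sketch below illustrates one way to realize steps 1-1 to 1-3: each document ID is treated as a "word" and each reading sequence as a "sentence", and a Skip-gram Word2vec model is trained once on the complete sequences and once on the subsequences. It is a minimal sketch under stated assumptions: the toy record layout, the variable names and the gensim 4.x API are illustrative choices, not part of the patent.

```python
from gensim.models import Word2Vec

# Hypothetical toy data: each user's complete reading history as an ordered list
# of document IDs (real records would also carry reading time and provenance).
complete_sequences = [
    ["doc_101", "doc_205", "doc_333", "doc_101"],   # user A
    ["doc_205", "doc_412", "doc_333"],               # user B
]
# Subsequences produced by the time/provenance splitting rule of step 1-2.
sub_sequences = [
    ["doc_101", "doc_205"],
    ["doc_333", "doc_101"],
    ["doc_205", "doc_412", "doc_333"],
]

# Coarse-grained document vectors: Skip-gram (sg=1) over the complete sequences.
coarse_model = Word2Vec(complete_sequences, vector_size=16, window=5,
                        min_count=1, sg=1, epochs=50, seed=42)
# Fine-grained document vectors: the same model family over the subsequences.
fine_model = Word2Vec(sub_sequences, vector_size=16, window=5,
                      min_count=1, sg=1, epochs=50, seed=42)

coarse_vec = coarse_model.wv["doc_101"]   # 16-dimensional coarse-grained vector
fine_vec = fine_model.wv["doc_101"]       # 16-dimensional fine-grained vector
print(coarse_vec.shape, fine_vec.shape)
```

Documents that co-occur in similar reading contexts end up with nearby vectors, which is the property the later steps rely on.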
Step 2 comprises:
step 2-1, averaging the coarse-grained feature vectors of all documents in the user's complete document reading sequence to obtain the user's global interest vector;
step 2-2, averaging the fine-grained feature vectors of all documents in the user's document reading subsequence to obtain the user's reading-context interest vector (a minimal sketch of this averaging is given below).
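A minimal, self-contained sketch of the averaging in steps 2-1 and 2-2; the toy document IDs and the 2-dimensional vectors merely stand in for the Word2vec outputs of step 1 and are not from the patent.

```python
import numpy as np

def mean_vector(doc_ids, doc_vectors):
    """Average the feature vectors of the given documents, skipping unknown IDs."""
    vecs = [doc_vectors[d] for d in doc_ids if d in doc_vectors]
    return np.mean(vecs, axis=0) if vecs else None

# Toy vectors standing in for the coarse- and fine-grained Word2vec outputs.
coarse = {"doc_101": np.array([0.2, 0.1]),
          "doc_205": np.array([0.4, 0.3]),
          "doc_333": np.array([0.0, 0.5])}
fine = {"doc_101": np.array([0.6, 0.1]),
        "doc_333": np.array([0.2, 0.9])}

# Step 2-1: global interest vector = mean coarse-grained vector over the complete sequence.
global_interest = mean_vector(["doc_101", "doc_205", "doc_333"], coarse)
# Step 2-2: reading-context interest vector = mean fine-grained vector over the recent subsequence.
context_interest = mean_vector(["doc_333", "doc_101"], fine)
print(global_interest, context_interest)
```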
Step 3 comprises:
step 3-1, calculating the similarity between users according to their global interest vectors and their complete document reading sequences;
step 3-2, calculating the target user's degree of interest in each document;
step 3-3, ranking all documents using the results of step 3-1 and step 3-2, and recommending the top N results to the target user.
Step 3-1 comprises: calculating the similarity between users according to their global interest vectors and their complete document reading sequences. The formula for sim(μ, ν) appears only as an image in the original publication and is not reproduced here; in it, μ denotes the target user, ν denotes another user in the database, sim(μ, ν) denotes the similarity between the target user μ and the other user ν, M_μ denotes the set of documents read by user μ, M_ν denotes the set of documents read by user ν, m denotes a document in the intersection of M_μ and M_ν, the cosine similarity of the two users' global interest vectors also enters the formula, and λ and θ are weight coefficients (greater than 0).
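Because the similarity formula is published only as an image, the LaTeX sketch below gives one plausible form that is consistent with the variables just listed: an overlap term over M_μ and M_ν combined with the cosine similarity of the global interest vectors, weighted by λ and θ. The exact combination and the symbols g_μ, g_ν are assumptions for illustration, not the patented formula.

```latex
% Hypothetical reconstruction -- the published formula is available only as an image.
\[
\operatorname{sim}(\mu,\nu)
  \;=\; \lambda \cdot
        \frac{\lvert M_{\mu} \cap M_{\nu} \rvert}
             {\sqrt{\lvert M_{\mu} \rvert \cdot \lvert M_{\nu} \rvert}}
  \;+\; \theta \cdot \cos\!\bigl(\vec{g}_{\mu}, \vec{g}_{\nu}\bigr)
\]
% where \vec{g}_{\mu} and \vec{g}_{\nu} denote the global interest vectors of users
% \mu and \nu (symbol names assumed for this sketch).
```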
Step 3-2 comprises: calculating the target user's interest in a document. The formula for p_i(μ, m) likewise appears only as an image in the original publication; in it, p_i(μ, m) denotes the interest of the target user μ in document m, U_{μ,k} denotes the set of the k users most similar to the target user μ, U_m denotes the set of users who have read document m, the reading-context interest vector of user μ and the fine-grained feature vector of document m enter the formula through their cosine similarity, and ω together with a second coefficient shown only in the image are weight coefficients (greater than 0).
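Since the scoring formula is available only as an image, the sketch below implements one plausible reading of steps 3-2 and 3-3: candidate documents read by the k most similar users are scored by a weighted sum of those users' similarities plus the cosine similarity between the target user's reading-context interest vector and the document's fine-grained vector, and the top N are returned. The combination rule, the parameter names omega and rho, and the data-structure layout are assumptions for illustration, not the patented formula.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity with a zero-vector guard."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

def recommend(target_user, top_k_users, user_similarity, readers_of,
              context_interest, fine_vectors, already_read, n=5,
              omega=1.0, rho=1.0):
    """Score and rank candidate documents for one user, return the top N (step 3-3).

    user_similarity[(u, v)] -- sim(u, v) from step 3-1
    readers_of[doc_id]      -- set of users who have read the document (U_m)
    context_interest[u]     -- reading-context interest vector of user u
    fine_vectors[doc_id]    -- fine-grained feature vector of the document
    omega, rho              -- assumed names for the two weight coefficients
    """
    scores = {}
    for doc_id, readers in readers_of.items():
        if doc_id in already_read:
            continue
        # Evidence from the k most similar users who have read this document.
        neighbour_term = sum(user_similarity[(target_user, v)]
                             for v in top_k_users if v in readers)
        if neighbour_term == 0.0:
            continue
        # Evidence from the match between reading context and document content.
        context_term = cosine(context_interest[target_user], fine_vectors[doc_id])
        # Assumed combination: weighted sum of the two kinds of evidence.
        scores[doc_id] = omega * neighbour_term + rho * context_term
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]
```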
The invention is the first to apply the Word2vec word-vector idea to this problem: the Skip-gram neural network language model is used to extract document features at different granularities from a user's complete reading sequence and its subsequences, expressed respectively as coarse-grained and fine-grained feature vectors, which provides a reliable solution to the difficulty of extracting document features. The user's global interest and reading-context interest are obtained from the document feature vectors of the complete reading sequence and of the most recent reading subsequence, offering a feasible approach to the difficulty of extracting and modeling the user's reading context. A recommendation method that jointly considers the user's global interest and reading-context interest is provided, so that the recommended documents better match the user's current preferences, reducing the user's search cost and improving user satisfaction.
Drawings
The foregoing and other advantages of the invention will become more apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings.
FIG. 1 is a schematic diagram of a recommendation system architecture of a document recommendation method based on a word vector model according to the present invention.
FIG. 2 is a schematic diagram of a user document preference prediction process of the document recommendation method based on the word vector model according to the present invention.
Detailed Description
The invention is further explained below with reference to the drawings and the embodiments.
As shown in fig. 1 and 2, the present invention specifically includes the following steps:
step 1, 25,000 historical records forming the complete document reading sequences of 1,000 users are obtained; each record in a reading sequence comprises a document ID, a reading time and a document provenance;
step 2, a reading-interval time threshold of 8 hours is set, and the users' complete document reading sequences are grouped by reading time and document provenance into 3,300 subsequences: records whose reading interval is shorter than 8 hours and whose document provenance is the same are placed in the same subsequence, while records whose interval exceeds 8 hours or whose provenance differs are placed in different subsequences (see the sketch below);
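A minimal sketch of this grouping rule (an 8-hour gap or a change of provenance starts a new subsequence); the record layout and the example values are assumptions for illustration.

```python
from datetime import datetime, timedelta

GAP = timedelta(hours=8)  # reading-interval time threshold used in the embodiment

def split_into_subsequences(records):
    """records: list of (doc_id, read_time: datetime, provenance), sorted by read_time."""
    subsequences, current = [], []
    for doc_id, read_time, provenance in records:
        if current:
            _, prev_time, prev_prov = current[-1]
            # Start a new subsequence if the gap exceeds 8 hours or the provenance changes.
            if read_time - prev_time > GAP or provenance != prev_prov:
                subsequences.append([d for d, _, _ in current])
                current = []
        current.append((doc_id, read_time, provenance))
    if current:
        subsequences.append([d for d, _, _ in current])
    return subsequences

# Example: two reads of the same journal within an hour, then one read a day later.
records = [
    ("doc_101", datetime(2018, 11, 26, 9, 0), "journal_A"),
    ("doc_205", datetime(2018, 11, 26, 9, 40), "journal_A"),
    ("doc_333", datetime(2018, 11, 27, 10, 0), "journal_A"),
]
print(split_into_subsequences(records))  # [['doc_101', 'doc_205'], ['doc_333']]
```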
step 3, the complete document reading sequences of all users are processed with the Word2vec model of the neural network language model family to obtain a coarse-grained feature vector for each document, and the subsequences of all users are processed with the same Word2vec model to obtain a fine-grained feature vector for each document; documents with similar reading contexts have similar feature vectors;
and 4, converting the dimensionality of the feature vector into 16 dimensions according to the requirements on efficiency and accuracy.
Step 5, averaging all document coarse-grained feature vectors in the complete document reading sequence of the user to obtain a global interest vector of the user;
and 6, averaging all the fine-grained feature vectors of the documents in the recent document reading subsequence of the user to obtain a reading context interest vector of the user.
step 7, calculating the similarity between users according to their global interest vectors and their complete document reading sequences, using the formula of step 3-1 (published only as an image), where μ denotes the target user, ν denotes another user in the database, M_μ denotes the set of documents read by user μ, M_ν denotes the set of documents read by user ν, the cosine similarity of the users' global interest vectors enters the formula, and the weight coefficients λ and θ are both set to 1;
step 8, calculating the target user's degree of interest in each document, using the formula of step 3-2 (published only as an image), where μ denotes the target user, U_{μ,k} denotes the set of the k users most similar to μ, U_m denotes the set of users who have read document m, the cosine similarity between the reading-context interest vector of user μ and the fine-grained feature vector of document m enters the formula, and both weight coefficients (ω and the second coefficient shown only in the image) are set to 1;
and 9, aiming at the target user, sequencing the user interest degrees of all the documents by using the calculation result in the step 8, recommending the first 3-5 results with the maximum interest degrees to the target user in the reading process of the user, and realizing the functions of extended reading and personalized recommendation in the reading experience of the target user.
The present invention provides a word vector-based document recommendation method, and there are many specific ways to implement this technical solution; the above description is only a preferred embodiment of the invention. It should be noted that those skilled in the art can make a number of improvements and refinements without departing from the principle of the invention, and such improvements and refinements should also be regarded as falling within the protection scope of the invention. All components not specified in this embodiment can be implemented with the prior art.

Claims (1)

1. A document recommendation method based on word vectors is characterized by comprising the following steps:
step 1, extracting characteristics of user reading documents and contexts based on a neural network language model;
step 2, calculating a global interest vector and a context interest vector read by the user based on the document sequence characteristics of the user;
step 3, establishing a mathematical model to calculate user similarity and document interest scores, and realizing document recommendation for the user's reading;
the step 1 comprises the following steps:
step 1-1, acquiring a complete document reading sequence of a user, wherein each record in the reading sequence comprises a document ID, reading time and document provenance;
step 1-2, grouping complete document reading sequences of users according to reading time and document provenance to obtain sub-sequences, setting a reading interval time threshold, dividing records which do not exceed the interval time threshold and have the same document provenance into the same sub-sequence, and dividing records which exceed the interval time threshold or have different document provenance into different sub-sequences;
step 1-3, processing the complete document reading sequences of all users by using a Word2vec model in a neural network language model to obtain a coarse-grained feature vector of each document, and processing the subsequences of all users by using the Word2vec language model to obtain a fine-grained feature vector of each document, wherein documents with similar reading contexts have similar feature vectors;
for the feature vectors obtained in the step 1-3, adjusting the dimensionality of the feature vectors according to the requirements on efficiency and accuracy, increasing the dimensionality of the feature vectors if more accurate recommendation results are needed, and reducing the dimensionality of the feature vectors if higher calculation efficiency is needed;
the step 2 comprises the following steps:
step 2-1, averaging all document coarse-grained feature vectors in a complete document reading sequence of a user to obtain a global interest vector of the user;
step 2-2, averaging all document fine-grained feature vectors in the user document reading subsequence to obtain a reading context interest vector of the user;
the step 3 comprises the following steps:
step 3-1, calculating the similarity between users according to the global interest vector of the users and the reading sequence of the complete literature;
step 3-2, calculating the target user's degree of interest in the documents;
step 3-3, ranking all the documents by using the calculation results of step 3-1 and step 3-2, and recommending the top N results to a target user;
step 3-1 comprises: calculating the similarity between users according to their global interest vectors and their complete document reading sequences; the formula for sim(μ, ν) appears only as an image in the original publication, wherein μ denotes the target user, ν denotes another user in the database, sim(μ, ν) denotes the similarity between the target user μ and the other user ν, M_μ denotes the set of documents read by user μ, M_ν denotes the set of documents read by user ν, m denotes a document in the intersection of M_μ and M_ν, the cosine similarity of the users' global interest vectors enters the formula, and λ and θ are weight coefficients;
step 3-2 comprises: calculating the target user's interest in a document; the formula for p_i(μ, m) appears only as an image in the original publication, wherein p_i(μ, m) denotes the interest of the target user μ in document m, U_{μ,k} denotes the set of the k users most similar to the target user μ, U_m denotes the set of users who have read document m, the reading-context interest vector of user μ and the fine-grained feature vector of document m enter the formula through their cosine similarity, and ω together with a second coefficient shown only in the image are weight coefficients.
CN201811415870.XA 2018-11-26 2018-11-26 Word vector-based document recommendation method Active CN109508421B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811415870.XA CN109508421B (en) 2018-11-26 2018-11-26 Word vector-based document recommendation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811415870.XA CN109508421B (en) 2018-11-26 2018-11-26 Word vector-based document recommendation method

Publications (2)

Publication Number Publication Date
CN109508421A CN109508421A (en) 2019-03-22
CN109508421B (en) 2020-11-13

Family

ID=65750530

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811415870.XA Active CN109508421B (en) 2018-11-26 2018-11-26 Word vector-based document recommendation method

Country Status (1)

Country Link
CN (1) CN109508421B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110929209B (en) * 2019-12-06 2023-06-20 北京百度网讯科技有限公司 Method and device for transmitting information
CN114281961B (en) * 2021-11-15 2024-07-26 北京智谱华章科技有限公司 Scientific literature interest evaluation method and device based on biological dynamics model

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102929928A (en) * 2012-09-21 2013-02-13 北京格致璞科技有限公司 Multidimensional-similarity-based personalized news recommendation method
CN105279288A (en) * 2015-12-04 2016-01-27 深圳大学 Online content recommending method based on deep neural network
CN107357793A (en) * 2016-05-10 2017-11-17 腾讯科技(深圳)有限公司 Information recommendation method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8429106B2 (en) * 2008-12-12 2013-04-23 Atigeo Llc Providing recommendations using information determined for domains of interest

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102929928A (en) * 2012-09-21 2013-02-13 北京格致璞科技有限公司 Multidimensional-similarity-based personalized news recommendation method
CN105279288A (en) * 2015-12-04 2016-01-27 深圳大学 Online content recommending method based on deep neural network
CN107357793A (en) * 2016-05-10 2017-11-17 腾讯科技(深圳)有限公司 Information recommendation method and device

Also Published As

Publication number Publication date
CN109508421A (en) 2019-03-22

Similar Documents

Publication Publication Date Title
CN107102989B (en) Entity disambiguation method based on word vector and convolutional neural network
CN107944913B (en) High-potential user purchase intention prediction method based on big data user behavior analysis
Wei et al. Collaborative filtering and deep learning based recommendation system for cold start items
CN107562742B (en) Image data processing method and device
CN107133277B (en) A kind of tourist attractions recommended method based on Dynamic Theme model and matrix decomposition
CN103886067A (en) Method for recommending books through label implied topic
CN104935963A (en) Video recommendation method based on timing sequence data mining
Ullah et al. Image-based service recommendation system: A JPEG-coefficient RFs approach
CN111159341B (en) Information recommendation method and device based on user investment and financial management preference
CN108804577B (en) Method for estimating interest degree of information tag
CN109508421B (en) Word vector-based document recommendation method
CN112434533A (en) Entity disambiguation method, apparatus, electronic device, and computer-readable storage medium
JP5481295B2 (en) Object recommendation device, object recommendation method, object recommendation program, and object recommendation system
CN115867919A (en) Graph structure aware incremental learning for recommendation systems
CN111538846A (en) Third-party library recommendation method based on mixed collaborative filtering
Yamamoto et al. PBG at the NTCIR-13 Lifelog-2 LAT, LSAT, and LEST Tasks.
CN116521906A (en) Meta description generation method, device, equipment and medium thereof
Chang et al. An interactive approach to integrating external textual knowledge for multimodal lifelog retrieval
CN113095883B (en) Video payment user prediction method and system based on deep cross attention network
CN112052388B (en) Method and system for recommending food stores
CN117956232A (en) Video recommendation method and device
CN114880572B (en) Intelligent news client recommendation system
CN115203532B (en) Project recommendation method and device, electronic equipment and storage medium
CN113688281B (en) Video recommendation method and system based on deep learning behavior sequence
CN115905682A (en) Interest point recommendation method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: No.1 Lingshan South Road, Qixia District, Nanjing, Jiangsu Province, 210000

Applicant after: THE 28TH RESEARCH INSTITUTE OF CHINA ELECTRONICS TECHNOLOGY Group Corp.

Address before: No. 1 Muxuyuan East Street, Nanjing, Jiangsu, 210007

Applicant before: THE 28TH RESEARCH INSTITUTE OF CHINA ELECTRONICS TECHNOLOGY Group Corp.

GR01 Patent grant