CN109508421A - A document recommendation method based on word vectors - Google Patents
A document recommendation method based on word vectors
- Publication number
- CN109508421A (application CN201811415870.XA)
- Authority
- CN
- China
- Prior art keywords
- user
- document
- interest
- vector
- sequence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a document recommendation method based on word vectors. A neural network language model is used to extract document feature vectors from users' reading sequences. From each user's complete reading sequence and most recent subsequence, the method then extracts the user's global interest and current reading-context interest. When recommending, both the global interest and the current reading context are considered, so that the recommended documents better match the user's needs and preferences.
Description
Technical field
The present invention relates to a document recommendation method based on word vectors.
Background art
Early document recommendation mainly used content-based recommendation algorithms, which analyze low-level document features through text labels and then place documents with similar content into the recommendation list. For example, one prior work proposed a personalized recommendation model based on MFCC and GMM that extracts text-label features from documents. However, label extraction is very time-consuming, and documents are now updated rapidly, with tens of thousands of new documents released every day, so recommendation based purely on text-label features has gradually been abandoned.
Since the Tapestry system first used collaborative filtering to address information overload, collaborative filtering has been rapidly adopted for recommendation in other fields. A well-known foreign document platform applies collaborative filtering by recording user behavior on the server, finding several "nearest neighbors" with similar interest preferences from those records, and recommending to the target user the documents that the nearest neighbors liked but the target user has not yet browsed. In domestic research, Wang Jun et al. proposed the concept of a hierarchical document recommendation system: on one hand, it performs collaborative-filtering recommendation using document-preference similarity between users; on the other hand, it measures document-content similarity across multiple dimensions such as topic, sentiment, writing style, and wording. Connecting the two aspects exploits the advantages of both and improves recommendation satisfaction.
Unlike recommendation in other fields, users may read documents out of personal interest or to support their work, which distinguishes document recommendation from mainstream e-commerce recommendation. In e-commerce and movie recommendation, users pay a larger economic or time cost (for example, paying to buy an item, or spending two hours watching a movie), so they are more willing to give the system explicit feedback, and collecting explicit ratings is relatively easy. Documents themselves cost users comparatively little, and users rarely rate them explicitly; the system can only record the user's contextual information, such as reading behavior and registration information.
Summary of the invention
Objective of the invention: aiming at the shortcomings of traditional document recommender systems, the present invention proposes a document recommendation method that incorporates the context of the user's reading list into word vectors.
Technical solution: the invention discloses a document recommendation method based on word vectors, comprising the following steps:
Step 1: based on a neural network language model, perform feature extraction on the documents the user reads and their context;
Step 2: based on the user's document sequence features, compute the user's global interest vector and context interest vector;
Step 3: build mathematical models to compute user similarity and a document interest index, and recommend documents for the user to read.
Step 1 includes:
Step 1-1: obtain the user's entire document reading sequence, where each record in the sequence includes the document ID, the reading time, and the document source;
Step 1-2: group the user's entire reading sequence into subsequences according to reading time and document source. A reading-interval time threshold is set (for example, 8 hours): records whose interval does not exceed the threshold and whose document source is identical are assigned to the same subsequence, while records whose interval exceeds the threshold or whose document source differs are assigned to different subsequences;
Step 1-3: use the Word2vec neural network language model (reference: Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv.org) to process the entire reading sequences of all users, obtaining a coarse-grained feature vector for each document; then use the same Word2vec model to process the subsequences of all users, obtaining a fine-grained feature vector for each document, so that documents with similar reading contexts have similar feature vectors.
For the feature vectors obtained in step 1-3, the dimension is adjusted according to the required trade-off between efficiency and accuracy: increase the dimension for more accurate recommendation results, or decrease it for higher computational efficiency (set according to the actual situation).
Step 2 includes:
Step 2-1: average the coarse-grained feature vectors of all documents in the user's entire reading sequence to obtain the user's global interest vector;
Step 2-2: average the fine-grained feature vectors of all documents in the user's reading subsequence to obtain the user's reading-context interest vector.
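Steps 2-1 and 2-2 are plain vector averages. A minimal sketch with toy 2-dimensional vectors (in practice the vectors would come from the Word2vec models above):

```python
import numpy as np

def interest_vector(doc_vectors):
    """Average a list of document feature vectors into one interest vector."""
    return np.mean(np.stack(doc_vectors), axis=0)

# Toy coarse-grained vectors for the documents a user has read:
coarse_vecs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
global_interest = interest_vector(coarse_vecs)  # -> array([0.5, 0.5])
```

The same function applied to the fine-grained vectors of the most recent subsequence yields the reading-context interest vector.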
Step 3 includes:
Step 3-1: compute the similarity between users from their global interest vectors and entire reading sequences;
Step 3-2: compute the target user's interest index for each document;
Step 3-3: rank all documents by the results of steps 3-1 and 3-2, and recommend the top-N results to the target user.
Step 3-1 includes: computing the similarity between users from their global interest vectors and entire reading sequences, by a formula in which:
μ denotes the target user; ν denotes another user in the database; sim(μ, ν) denotes the similarity between the target user μ and the other user ν; M_μ denotes the set of documents read by user μ; M_ν denotes the set of documents read by user ν; m denotes a document in the intersection of M_μ and M_ν; the cosine similarity of the two users' global interest vectors appears as a term; and λ and θ are weight coefficients (both greater than 0).
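The similarity formula itself does not survive in this text (the original image is missing), so the sketch below only combines the ingredients its definitions name: the overlap of the two users' reading sets and the cosine of their global interest vectors, weighted by λ and θ. The exact combination in the patent may differ.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity of two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def user_similarity(M_u, M_v, g_u, g_v, lam=1.0, theta=1.0):
    """Hypothetical sim(u, v): weighted reading-set overlap plus the cosine
    of the two global interest vectors. The combination is an assumption;
    the patent's own formula is not reproduced in the source text."""
    union = M_u | M_v
    overlap = len(M_u & M_v) / len(union) if union else 0.0
    return lam * overlap + theta * cosine(g_u, g_v)
```

With λ = θ = 1 (the values used in the embodiment), two users who read one document in common out of three and have identical global interest vectors score 1/3 + 1.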
Step 3-2 includes: computing the target user's interest in a document, by a formula in which:
p_i(μ, m) denotes the interest of the target user μ in document m; U_{μ,k} denotes the set of the k users most similar to the target user μ; U_m denotes the set of users who have read document m; the cosine similarity between the reading-context interest vector of user μ and the fine-grained feature vector of document m appears as a term; and ω and θ are weight coefficients (both greater than 0).
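As with step 3-1, the formula image for the interest index is missing, so this sketch combines the named ingredients: the similarities of the k nearest neighbours of μ who have read document m, and the cosine between μ's reading-context interest vector and m's fine-grained vector. The combination and the function name are assumptions.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity of two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def interest_index(neighbour_sims, ctx_vec, doc_vec, omega=1.0, theta=1.0):
    """Hypothetical p_i(u, m).

    neighbour_sims: sim(u, v) for each v in the intersection of U_{u,k}
    (u's k nearest neighbours) and U_m (users who read document m).
    ctx_vec: u's reading-context interest vector.
    doc_vec: document m's fine-grained feature vector.
    """
    return omega * sum(neighbour_sims) + theta * cosine(ctx_vec, doc_vec)
```

Documents read by many close neighbours, or close in context space to what the user is currently reading, score higher.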
The present invention is, for the first time, based on the Word2vec word-vector idea: the Skip-gram neural network language model is used to obtain document features at different granularities from users' complete reading sequences and subsequences, expressed as coarse-grained and fine-grained feature vectors, providing a reliable solution to the difficult problem of document feature extraction. From the document feature vectors in a user's complete reading sequence and most recent subsequence, the user's global interest and reading-context interest are obtained, offering a feasible approach to the difficult problem of extracting and modeling a user's reading context. A recommendation method is proposed that jointly considers the user's global interest and reading-context interest, so that the recommended documents better match the user's current preferences, reducing the user's search cost and improving user satisfaction.
Brief description of the drawings
The present invention is further illustrated below with reference to the drawings and specific embodiments; the above and other advantages of the invention will become more apparent.
Fig. 1 is a diagram of the recommender system architecture of the word-vector-based document recommendation method of the invention.
Fig. 2 is a flow chart of user document-preference prediction in the word-vector-based document recommendation method of the invention.
Specific embodiments
The present invention is further described below with reference to the drawings and embodiments.
As shown in Fig. 1 and Fig. 2, the present invention specifically includes the following steps:
Step 1: obtain the historical entire reading sequences of 1000 users, comprising 25000 records, where each record in a sequence includes the document ID, the reading time, and the document source;
Step 2: according to reading time and document source, set the reading-interval time threshold to 8 hours and group each user's entire reading sequence, obtaining 3300 subsequences. Records whose reading-time interval is shorter than 8 hours and whose document source is identical are assigned to the same subsequence; records whose interval is longer than 8 hours or whose document source differs are assigned to different subsequences;
Step 3: use the Word2vec neural network language model to process the entire reading sequences of all users, obtaining a coarse-grained feature vector for each document, and use the same Word2vec model to process the subsequences of all users, obtaining a fine-grained feature vector for each document, so that documents with similar reading contexts have similar feature vectors;
Step 4: according to the required trade-off between efficiency and accuracy, set the dimension of the feature vectors to 16.
Step 5: average the coarse-grained feature vectors of all documents in the user's entire reading sequence to obtain the user's global interest vector;
Step 6: average the fine-grained feature vectors of all documents in the user's most recent reading subsequence to obtain the user's reading-context interest vector.
Step 7: compute the similarity between users from their global interest vectors and entire reading sequences, by a formula in which μ denotes the target user, ν denotes another user in the database, M_μ denotes the set of documents read by user μ, M_ν denotes the set of documents read by user ν, the cosine similarity of the two users' global interest vectors appears as a term, and λ and θ are weight coefficients, both set to 1 here;
Step 8: compute the target user's interest in each document, by a formula in which μ denotes the target user, U_{μ,k} denotes the set of the k users most similar to μ, U_m denotes the set of users who have read document m, the cosine similarity between the reading-context interest vector of user μ and the fine-grained feature vector of document m appears as a term, and ω and θ are weight coefficients, both set to 1 here;
Step 9: for the target user, rank all documents by user-interest degree using the results of step 8, and recommend the 3-5 documents with the highest interest degree to the target user during reading, realizing extended reading and personalized recommendation in the target user's reading experience.
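The ranking of step 9 can be sketched as follows; `score` stands in for any callable implementing the step-8 interest computation, and the toy scores are illustrative.

```python
def recommend(candidates, score, n=5):
    """Return the n highest-scoring candidate documents, best first."""
    return sorted(candidates, key=score, reverse=True)[:n]

# Toy interest values standing in for the step-8 computation:
scores = {"d1": 0.2, "d2": 0.9, "d3": 0.5, "d4": 0.1}
top = recommend(scores, scores.get, n=3)  # -> ["d2", "d3", "d1"]
```

In the embodiment, n would be 3 to 5; only the relative order of the interest values matters for the final list.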
The present invention provides a document recommendation method based on word vectors. There are many ways to implement this technical solution, and the above is only a preferred embodiment of the invention. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the invention, and such improvements and modifications should also be regarded as falling within the protection scope of the invention. All components not specified in this embodiment can be implemented with the prior art.
Claims (7)
1. A document recommendation method based on word vectors, characterized by comprising the following steps:
Step 1: based on a neural network language model, perform feature extraction on the documents the user reads and their context;
Step 2: based on the user's document sequence features, compute the user's global interest vector and context interest vector;
Step 3: build mathematical models to compute user similarity and a document interest index, and recommend documents for the user to read.
2. The method according to claim 1, characterized in that step 1 includes:
Step 1-1: obtain the user's entire document reading sequence, where each record in the sequence includes the document ID, the reading time, and the document source;
Step 1-2: group the user's entire reading sequence into subsequences according to reading time and document source; a reading-interval time threshold is set, records whose interval does not exceed the threshold and whose document source is identical are assigned to the same subsequence, and records whose interval exceeds the threshold or whose document source differs are assigned to different subsequences;
Step 1-3: use the Word2vec neural network language model to process the entire reading sequences of all users, obtaining a coarse-grained feature vector for each document, and use the same Word2vec model to process the subsequences of all users, obtaining a fine-grained feature vector for each document, so that documents with similar reading contexts have similar feature vectors.
3. The method according to claim 2, characterized in that, for the feature vectors obtained in step 1-3, the dimension is adjusted according to the required trade-off between efficiency and accuracy: the dimension is increased for more accurate recommendation results and decreased for higher computational efficiency.
4. The method according to claim 3, characterized in that step 2 includes:
Step 2-1: average the coarse-grained feature vectors of all documents in the user's entire reading sequence to obtain the user's global interest vector;
Step 2-2: average the fine-grained feature vectors of all documents in the user's reading subsequence to obtain the user's reading-context interest vector.
5. The method according to claim 4, characterized in that step 3 includes:
Step 3-1: compute the similarity between users from their global interest vectors and entire reading sequences;
Step 3-2: compute the target user's interest index for each document;
Step 3-3: rank all documents by the results of steps 3-1 and 3-2, and recommend the top-N results to the target user.
6. The method according to claim 5, characterized in that step 3-1 includes computing the similarity between users from their global interest vectors and entire reading sequences, by a formula in which: μ denotes the target user; ν denotes another user in the database; sim(μ, ν) denotes the similarity between the target user μ and the other user ν; M_μ denotes the set of documents read by user μ; M_ν denotes the set of documents read by user ν; m denotes a document in the intersection of M_μ and M_ν; the cosine similarity of the two users' global interest vectors appears as a term; and λ and θ are weight coefficients.
7. The method according to claim 6, characterized in that step 3-2 includes computing the target user's interest in a document, by a formula in which: p_i(μ, m) denotes the interest of the target user μ in document m; U_{μ,k} denotes the set of the k users most similar to the target user μ; U_m denotes the set of users who have read document m; the cosine similarity between the reading-context interest vector of user μ and the fine-grained feature vector of document m appears as a term; and ω and θ are weight coefficients.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811415870.XA CN109508421B (en) | 2018-11-26 | 2018-11-26 | Word vector-based document recommendation method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811415870.XA CN109508421B (en) | 2018-11-26 | 2018-11-26 | Word vector-based document recommendation method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109508421A true CN109508421A (en) | 2019-03-22 |
CN109508421B CN109508421B (en) | 2020-11-13 |
Family
ID=65750530
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811415870.XA Active CN109508421B (en) | 2018-11-26 | 2018-11-26 | Word vector-based document recommendation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109508421B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110929209A (en) * | 2019-12-06 | 2020-03-27 | 北京百度网讯科技有限公司 | Method and device for sending information |
CN114281961A (en) * | 2021-11-15 | 2022-04-05 | 北京智谱华章科技有限公司 | Scientific and technological literature interest assessment method and device based on biodynamics model |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102929928A (en) * | 2012-09-21 | 2013-02-13 | 北京格致璞科技有限公司 | Multidimensional-similarity-based personalized news recommendation method |
US20130325769A1 (en) * | 2008-12-12 | 2013-12-05 | Atigeo Llc | Providing recommendations using information determined for domains of interest |
CN105279288A (en) * | 2015-12-04 | 2016-01-27 | 深圳大学 | Online content recommending method based on deep neural network |
CN107357793A (en) * | 2016-05-10 | 2017-11-17 | 腾讯科技(深圳)有限公司 | Information recommendation method and device |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130325769A1 (en) * | 2008-12-12 | 2013-12-05 | Atigeo Llc | Providing recommendations using information determined for domains of interest |
CN102929928A (en) * | 2012-09-21 | 2013-02-13 | 北京格致璞科技有限公司 | Multidimensional-similarity-based personalized news recommendation method |
CN105279288A (en) * | 2015-12-04 | 2016-01-27 | 深圳大学 | Online content recommending method based on deep neural network |
CN107357793A (en) * | 2016-05-10 | 2017-11-17 | 腾讯科技(深圳)有限公司 | Information recommendation method and device |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110929209A (en) * | 2019-12-06 | 2020-03-27 | 北京百度网讯科技有限公司 | Method and device for sending information |
CN110929209B (en) * | 2019-12-06 | 2023-06-20 | 北京百度网讯科技有限公司 | Method and device for transmitting information |
CN114281961A (en) * | 2021-11-15 | 2022-04-05 | 北京智谱华章科技有限公司 | Scientific and technological literature interest assessment method and device based on biodynamics model |
Also Published As
Publication number | Publication date |
---|---|
CN109508421B (en) | 2020-11-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107944913B (en) | High-potential user purchase intention prediction method based on big data user behavior analysis | |
CN109064285B (en) | Commodity recommendation sequence and commodity recommendation method | |
WO2021119119A1 (en) | System and method for a personalized search and discovery engine | |
CN107180093B (en) | Information searching method and device and timeliness query word identification method and device | |
CN109241203B (en) | Clustering method for user preference and distance weighting by fusing time factors | |
EP3603092A1 (en) | Using machine learning to recommend live-stream content | |
CN108805598B (en) | Similarity information determination method, server and computer-readable storage medium | |
CN104935963A (en) | Video recommendation method based on timing sequence data mining | |
KR101755409B1 (en) | Contents recommendation system and contents recommendation method | |
JP2007122683A (en) | Information processing device, information processing method and program | |
CN109472286A (en) | Books in University Library recommended method based on interest-degree model Yu the type factor | |
CN111310038B (en) | Information recommendation method and device, electronic equipment and computer-readable storage medium | |
JP2008117222A (en) | Information processor, information processing method, and program | |
TW201104466A (en) | Digital data processing method for personalized information retrieval and computer readable storage medium and information retrieval system thereof | |
CN110377840A (en) | A kind of music list recommended method and system based on user's shot and long term preference | |
CN106599047B (en) | Information pushing method and device | |
JP6767342B2 (en) | Search device, search method and search program | |
CN107885852A (en) | A kind of APP based on APP usage records recommends method and system | |
KR101660463B1 (en) | Contents recommendation system and contents recommendation method | |
JP5481295B2 (en) | Object recommendation device, object recommendation method, object recommendation program, and object recommendation system | |
CN109508421A (en) | A kind of literature recommendation method based on term vector | |
JP6928044B2 (en) | Providing equipment, providing method and providing program | |
CN117056575B (en) | Method for data acquisition based on intelligent book recommendation system | |
CN114581165A (en) | Product recommendation method, device, computer storage medium and system | |
KR20110043369A (en) | Association analysis method for music recommendation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information | ||
Address after: No.1 Lingshan South Road, Qixia District, Nanjing, Jiangsu Province, 210000 Applicant after: THE 28TH RESEARCH INSTITUTE OF CHINA ELECTRONICS TECHNOLOGY Group Corp. Address before: 210007 No. 1 East Street, alfalfa garden, Jiangsu, Nanjing Applicant before: THE 28TH RESEARCH INSTITUTE OF CHINA ELECTRONICS TECHNOLOGY Group Corp. |
GR01 | Patent grant | ||