CN111061996A

CN111061996A - Recommendation algorithm combining Word2Vec word vectors and LSH locality-sensitive hashing

Info

Publication number: CN111061996A
Application number: CN201911249921.0A
Authority: CN
Inventors: 吴晟; 舒珏淋
Original assignee: Kunming University of Science and Technology
Current assignee: Kunming University of Science and Technology
Priority date: 2019-12-09
Filing date: 2019-12-09
Publication date: 2020-04-24

Abstract

The invention discloses a matrix decomposition recommendation algorithm that combines a Word2Vec word vector model with cosine-similarity-based locality-sensitive hashing (LSH). First, a Word2Vec model converts word similarity into vector similarity. An item similarity matrix is then computed quickly with cosine-similarity-based LSH and combined with the original ratings to produce pre-scores for the unrated items, which are filled into the training set. Finally, the training set is used as the input of the ALS matrix decomposition algorithm to obtain the recommendation result. Compared with the traditional collaborative filtering recommendation algorithm, the improved algorithm has a lower MAE value and better performance, and it effectively addresses the problem of untimely recommendation on large volumes of data.

Description

Recommendation algorithm combining Word2Vec word vectors and LSH locality-sensitive hashing

Technical Field

The invention provides a matrix decomposition recommendation algorithm based on a Word2Vec word vector model and cosine-similarity-based locality-sensitive hashing (LSH). It mainly targets the following problem: although the ALS matrix decomposition recommendation algorithm outperforms neighborhood-based collaborative filtering, it still suffers from data sparsity and from limited accuracy and timeliness of recommendation on large volumes of data, which results in a poor user experience. The invention belongs to the field of data mining based on a cloud computing platform.

Background

Machine learning and data mining continue to advance as big data platforms are repeatedly updated and refined, and the recommendation system is one representative example. The basic idea of a recommendation engine is to infer a user's preferences by exploring the associations between objects and to assist the user throughout the process. In this respect it is complementary to the search engine, which also makes predictions. Unlike a search engine, however, a recommendation engine may present content that the user has never seen before. Recommendation systems have been developed for more than 20 years and, because of the huge demand for them, have become a key research topic for many scholars. Collaborative filtering is currently one of the most successful and widely applied recommendation methods. Research on collaborative filtering recommendation algorithms has been growing in recent years, but many problems remain to be solved, for example:

1. Timeliness. A recommendation system must store information about users and items. As technology develops, demand for recommendation grows and the volume of information increases rapidly, so computation becomes very slow, response time lengthens, and the user experience suffers.

2. Data sparsity. Most collaborative filtering recommendation algorithms predict ratings after a similarity calculation, but the sparsity of the user-item rating matrix makes the similarity calculation inaccurate. This seriously affects the rating prediction step and directly reduces the accuracy of the recommendation system.

3. Data accuracy. Because the original user-item rating matrix is sparse, the similarities derived from it are inaccurate, so the recommendation system cannot give accurate recommendation results.

Disclosure of Invention

The invention aims to provide a matrix decomposition recommendation algorithm that combines a Word2Vec word vector model with cosine-similarity-based LSH. The algorithm uses the Word2Vec model to model the item information, converting each line of the item file (the description of a single item) into a word vector. It then uses cosine-based LSH to compute item similarity efficiently over massive data and so obtains an item similarity matrix. Finally, it combines this matrix with the original rating matrix to generate pre-scores for the unrated items, fills them into the original rating matrix, and feeds the resulting new input set into the ALS algorithm to obtain the recommendation result.

To achieve this technical purpose, the invention adopts the following technical scheme:

A matrix decomposition recommendation algorithm combining a Word2Vec word vector model with cosine-similarity-based locality-sensitive hashing (LSH). First, a Word2Vec model converts word similarity into vector similarity. An item similarity matrix is then computed quickly with cosine-similarity-based LSH and combined with the original ratings to produce pre-scores for the unrated items, which are filled into the training set. Finally, the training set is used as the input of the ALS matrix decomposition algorithm to obtain the recommendation result.

The specific flow of the algorithm is as follows:

Input: the preprocessed movie word file and the original rating file u.data.

Output: the training set trainRDD (user, movie, rate).

Step 1. Read the movie word file and convert it into an RDD. The RDD is then converted by reflection into a DataFrame with two fields, movieId and message, which is named wordDF.

Step 2. Build a Word2Vec word vector model with the message field as input and features as output. Train the Word2Vec model on wordDF to obtain the word vector of each movie, then convert the movie word vectors into an IndexedRowMatrix distributed matrix.

Step 3. Build the cosine-based LSH model, set the minimum similarity to 0.9 and the number of neighbors to 10, and feed the IndexedRowMatrix into the LSH model to obtain the similarity matrix similarityMatrix.

Step 4. Use Spark to read the original rating file u.data and convert it into an RDD containing userId, movieId and rating, with tuples of the form (movieId, (userId, rating)). Join this RDD with the RDD converted from similarityMatrix to obtain joinvalueRDD, whose tuples have the form ((item1, item2), (rate, sim)). The pre-score RDD is then obtained by aggregating joinvalueRDD across partitions.

Step 5. Perform a union of the pre-score RDD and the original rating RDD to obtain the training set trainRDD, and input trainRDD into the ALS algorithm.
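The data flow of steps 4 and 5 can be sketched in plain Python, without Spark, to make the tuple shapes concrete. This is only an illustration under stated assumptions: the function name `build_training_set` is hypothetical, the similarity triples are assumed to be listed in both directions, and the aggregation rule (a similarity-weighted mean of the neighbors' ratings) is an assumption, since the text does not spell out how the pre-score is aggregated.

```python
from collections import defaultdict


def build_training_set(ratings, sim_triples):
    """Fill pre-scores for unrated items and union them with the original ratings.

    ratings:     list of (user, movie, rate) triples, as in u.data
    sim_triples: list of (item1, item2, sim) triples, listed in both directions
    """
    # Step 4a: key the ratings by movie, i.e. (movieId, (userId, rating)).
    by_movie = defaultdict(list)
    for user, movie, rate in ratings:
        by_movie[movie].append((user, rate))
    already_rated = {(user, movie) for user, movie, _ in ratings}

    # Step 4b: join on item1, giving ((item1, item2), (rate, sim)) per user,
    # then aggregate per (user, item2).
    # ASSUMPTION: the aggregate is a similarity-weighted mean.
    weighted_sum = defaultdict(float)
    sim_sum = defaultdict(float)
    for item1, item2, sim in sim_triples:
        for user, rate in by_movie.get(item1, []):
            if (user, item2) in already_rated:
                continue  # only unrated (user, item) cells receive a pre-score
            weighted_sum[(user, item2)] += sim * rate
            sim_sum[(user, item2)] += sim

    pre_scores = [
        (user, item, weighted_sum[(user, item)] / sim_sum[(user, item)])
        for (user, item) in sorted(weighted_sum)
        if sim_sum[(user, item)] > 0
    ]

    # Step 5: union of the original ratings and the pre-scores.
    return list(ratings) + pre_scores


ratings = [(1, 10, 4.0), (1, 11, 2.0), (2, 10, 5.0)]
sims = [(10, 12, 0.9), (12, 10, 0.9), (11, 12, 0.9), (12, 11, 0.9)]
train = build_training_set(ratings, sims)
```

In this toy input, movie 12 has no ratings, so both users receive a pre-score for it derived from their ratings of its neighbors 10 and 11; the original three ratings pass through unchanged.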

The beneficial effects of the invention are as follows:

The invention provides a matrix decomposition recommendation algorithm combining a Word2Vec word vector model with cosine-similarity-based LSH. It addresses the poor user experience caused by the sparsity and accuracy problems of the traditional collaborative filtering recommendation algorithm and by untimely recommendation on large volumes of data. This is explained in detail below.

The algorithm uses a Word2Vec model to model the item information, quickly converting the description on each line of the item file into a word vector. Cosine-based LSH then computes item similarity efficiently over massive data, yielding an item similarity matrix. Finally, pre-scores for the unrated items are generated by combining this matrix with the original rating matrix and are filled into the original rating matrix, forming a new input set for the ALS algorithm. The algorithm addresses the problems of data sparsity, cold start and accuracy in recommendation, as well as the timeliness of recommendation on large volumes of data.

Drawings

FIG. 1 is a flow chart of the algorithm of the present invention.

FIG. 2 is a schematic diagram of the Word2Vec word vector model used in the present invention.

FIG. 3 is a comparison of the algorithm of the present invention and a collaborative filtering algorithm.

Detailed Description

The present invention will be further explained with reference to fig. 1 to 3.

The invention adopts a matrix decomposition recommendation algorithm based on a Word2Vec word vector model and cosine-similarity-based LSH, which addresses the poor user experience caused by the sparsity and accuracy problems of the traditional collaborative filtering recommendation algorithm and by untimely recommendation on large volumes of data.

The first step: file processing and item similarity calculation

The u.item file is used in this step. Each line of u.item describes the attributes of one movie, with the fields separated by a single vertical bar: field 1 is the index number, field 2 is the movie title, field 3 is the release date, field 4 is empty, field 5 is website information, and fields 6 to 24 are the movie genres encoded as a bitmap (if a movie belongs to a genre, the corresponding bit is 1, otherwise it is 0). This step can be divided into the following two sub-processes.

(1) Preprocess the movie description file. The information in the u.item file is converted into movie attribute information: the movie title and genre information are extracted, and interfering information is filtered out. In the resulting word file, each line holds the text information of one movie; for example, the information for the movie Toy Story is (Toy Story Jan Animation Children's Comedy).

(2) Calculate the item similarity. The main idea of locality-sensitive hashing is to use random hash functions so that similar (neighboring) data points collide with high probability after hashing. A cosine-based LSH algorithm is used here. For a data set S, every data point in S is an n-dimensional vector, denoted v:

$$v = \{v_1, v_2, \ldots, v_n\} \tag{1}$$

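For any two elements $v$ and $u$ in S, their similarity (distance) is defined as the cosine of the angle $\theta$ between them:

$$\mathrm{sim}(v, u) = \cos\theta = \frac{v \cdot u}{\|v\|\,\|u\|} \tag{2}$$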

Let $x = \{x_1, x_2, \ldots, x_n\}$ be a random vector chosen from the n-dimensional vector space. If a random hyperplane $p_0$ is generated with $x$ as its normal vector, then $p_0$ divides the original n-dimensional space into two parts. Given an unknown vector $v$, the part to which $v$ belongs can be determined from the angle between $v$ and $x$. The formula is defined as follows:

$$h_x(v) = \begin{cases} 1, & v \cdot x \ge 0 \\ 0, & v \cdot x < 0 \end{cases} \tag{3}$$

Given an arbitrary random vector $x$, if the angle between vectors $v$ and $u$ is $\theta$, then by the definition in equation (3) the probability that they do not lie on the same side of the hyperplane $p_0$ is

$$P[h_x(v) \ne h_x(u)] = \frac{\theta}{\pi}$$

Assuming that the similarity of the vectors $v$ and $u$ is $s$, then since $\theta = \arccos(s)$ the probability that the hash of equation (3) assigns them the same value is:

$$P[h_x(v) = h_x(u)] = 1 - \frac{\arccos(s)}{\pi} \tag{4}$$

Since the single-bit hash function above produces only two distinct hash values, which clearly does not meet practical requirements, the usual way to generate enough hash values is to use a family of locality-sensitive hash functions to produce a multidimensional hash value vector, as shown in equation (5):

$$H(v) = (h_1(v), h_2(v), \ldots, h_k(v)) \tag{5}$$

As can be seen from equation (5), the k-dimensional hash value vector can take $2^k$ distinct values, and the probability that the hash values of two data points with similarity $s$ collide is given by equation (6):

$$P[H(v) = H(u)] = \left(1 - \frac{\arccos(s)}{\pi}\right)^{k} \tag{6}$$
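The random-hyperplane construction described above can be illustrated with a short, self-contained Python sketch. It is not the Spark implementation used by the invention; the function names (`cosine`, `random_hyperplanes`, `signature`, `collision_probability`) are hypothetical, and drawing the hyperplane normals from a standard Gaussian is an assumption commonly made for this kind of hash.

```python
import math
import random


def cosine(v, u):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v, u))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_u = math.sqrt(sum(b * b for b in u))
    return dot / (norm_v * norm_u)


def random_hyperplanes(k, n, rng):
    """Draw k random normal vectors in n dimensions (assumed Gaussian)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(k)]


def signature(v, planes):
    """k-bit hash vector: one bit per hyperplane, 1 if v lies on its positive side."""
    return tuple(
        1 if sum(a * b for a, b in zip(v, x)) >= 0 else 0 for x in planes
    )


def collision_probability(s, k=1):
    """Probability that two vectors with cosine similarity s share all k bits."""
    s = max(-1.0, min(1.0, s))  # guard against rounding just outside [-1, 1]
    return (1.0 - math.acos(s) / math.pi) ** k


# Empirical check of the single-bit collision probability for orthogonal vectors.
rng = random.Random(0)
planes = random_hyperplanes(2000, 2, rng)
sig_v = signature([1.0, 0.0], planes)
sig_u = signature([0.0, 1.0], planes)
agreement = sum(a == b for a, b in zip(sig_v, sig_u)) / len(planes)
expected = collision_probability(cosine([1.0, 0.0], [0.0, 1.0]))
```

Here `agreement` is the observed fraction of hyperplanes on which the two vectors fall on the same side, and `expected` is the value predicted by the single-bit formula; with enough hyperplanes the two should be close. Raising `k` makes collisions rarer for dissimilar vectors, which is what allows candidate neighbors to be found without comparing all pairs.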

The second step: similarity post-processing

After the processed movie text has been converted into word vectors with Word2Vec, the cosine-based LSH computation yields the similarity matrix. To prepare for the subsequent steps, the processed similarity matrix is stored as triples of the form (item1, item2, sim). The blank ratings are then estimated from the existing ratings of each item's neighboring items, giving the item pre-scores.
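As a rough illustration of this post-processing, the following sketch turns a pairwise similarity mapping into (item1, item2, sim) triples. The threshold 0.9 and the limit of 10 neighbors are the values stated for the LSH model in the algorithm flow; the function name `to_triples` and the dictionary input format are assumptions made for this example.

```python
def to_triples(sim, min_sim=0.9, max_neighbors=10):
    """Convert {(item1, item2): similarity} into sorted (item1, item2, sim) triples.

    Keeps only pairs with similarity >= min_sim and, for each item1,
    at most max_neighbors of its most similar items.
    """
    neighbors = {}
    for (item1, item2), s in sim.items():
        if item1 != item2 and s >= min_sim:
            neighbors.setdefault(item1, []).append((s, item2))

    triples = []
    for item1 in sorted(neighbors):
        # Most similar first; ties broken by item id for a stable result.
        best = sorted(neighbors[item1], key=lambda p: (-p[0], p[1]))
        triples.extend((item1, item2, s) for s, item2 in best[:max_neighbors])
    return triples


similarities = {(1, 2): 0.95, (2, 1): 0.95, (1, 3): 0.50, (1, 4): 0.99}
triples = to_triples(similarities)
```

With the toy input above, the pair (1, 3) falls below the threshold and is dropped, while item 1 keeps items 4 and 2 as neighbors in decreasing order of similarity.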

The third step: pre-score computation and filling

The original rating matrix file is u.data, whose ratings take the form of triples (user, item, rate). The original rating matrix is combined with the high-similarity matrix to predict some of the missing ratings, and finally the pre-scores are filled into the original rating matrix.
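The text does not state the formula used to turn neighbor ratings into a pre-score. One common choice, shown here purely as an assumption, is a similarity-weighted average over the neighbors of item $j$ that user $u$ has already rated. The symbols $\hat{r}_{u,j}$ (pre-score), $N(j)$ (neighbors of item $j$) and $I_u$ (items rated by user $u$) are introduced for this illustration only.

```latex
\hat{r}_{u,j} \;=\;
\frac{\sum_{i \in N(j) \cap I_u} \mathrm{sim}(i, j)\, r_{u,i}}
     {\sum_{i \in N(j) \cap I_u} \mathrm{sim}(i, j)}
```

For instance, if a user has rated two neighbors of an unrated movie with 4 and 2, and their similarities to that movie are 0.95 and 0.90, this rule gives (0.95 × 4 + 0.90 × 2) / (0.95 + 0.90) = 5.6 / 1.85 ≈ 3.03.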

The fourth step: matrix decomposition and recommendation

Once the training set has been obtained, the ALS algorithm is used for solving and recommendation. The algorithm uses the mean absolute error (MAE) as the objective function for tuning its parameters, and the MAE value also serves as the basis for judging the performance of the algorithm. MAE is defined in equation (7), where $p_i$ is the predicted rating, $q_i$ is the true rating, and $N$ is the number of ratings in the test set.

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| p_i - q_i \right| \tag{7}$$

Recommendations are then made to users with the optimal model, which is selected according to the averaged MAE value.
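The MAE computation and the selection of the model with the lowest MAE can be sketched as follows. The helper names `mean_absolute_error` and `pick_best` are hypothetical, and representing each candidate model by its list of test-set predictions is a simplification made for this example.

```python
def mean_absolute_error(predicted, actual):
    """MAE = (1/N) * sum(|p_i - q_i|) over N test ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("predicted and actual must be non-empty and of equal length")
    return sum(abs(p - q) for p, q in zip(predicted, actual)) / len(predicted)


def pick_best(candidates, actual):
    """Return the name of the candidate whose predictions have the lowest MAE.

    candidates: dict mapping a model name to its list of predicted ratings
    """
    return min(
        candidates,
        key=lambda name: mean_absolute_error(candidates[name], actual),
    )


true_ratings = [5.0, 3.0, 3.0]
models = {"model_a": [4.0, 3.0, 5.0], "model_b": [5.0, 3.0, 4.0]}
best = pick_best(models, true_ratings)
```

In this toy case `model_a` has an MAE of 1.0 and `model_b` an MAE of about 0.33, so `model_b` is selected.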

Claims

1. A recommendation algorithm combined with a Word2Vec word vector model, characterized in that: first, a Word2Vec model is used to convert word similarity into vector similarity; an item similarity matrix is then computed quickly with cosine-similarity-based locality-sensitive hashing (LSH); the item similarity matrix is combined with the original ratings to output pre-scores for the unrated items, which are filled into the training set; and finally the training set is used as the input of the ALS matrix decomposition algorithm to obtain the recommendation result. The algorithm uses the Word2Vec model to model the item information, converting each line of the item file (the description of a single item) into a word vector. Cosine-based LSH then computes item similarity efficiently over massive data to obtain the item similarity matrix. Finally, pre-scores for the unrated items are generated by combining this matrix with the original rating matrix and are filled into the original rating matrix, forming a new input set that is fed into the ALS algorithm to obtain the recommendation result. The algorithm addresses the problems of data sparsity, cold start and accuracy in recommendation, as well as the timeliness of recommendation on large volumes of data.

Description of the algorithm process:

Experimental data used: the dataset used in the present invention is the MovieLens dataset provided by the GroupLens research group in the USA. The dataset includes 100,000 rating records, description information for 1,682 movies, and description information for 943 users. The specific steps are as follows:

$$v = \{v_1, v_2, \ldots, v_n\} \tag{1}$$

$$H(v) = (h_1(v), h_2(v), \ldots, h_k(v)) \tag{5}$$

The second step: similarity post-processing

The third step: pre-score computation and filling

The fourth step: matrix decomposition and recommendation