CN111061996A - Recommendation algorithm combining Word2vec Word vector and LSH locality sensitive hashing - Google Patents


Info

Publication number
CN111061996A
Authority
CN
China
Prior art keywords
similarity
algorithm
vector
matrix
project
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911249921.0A
Other languages
Chinese (zh)
Inventor
吴晟
舒珏淋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-12-09
Filing date
2019-12-09
Publication date
2020-04-24
Application filed by Kunming University of Science and Technology
Priority to CN201911249921.0A
Publication of CN111061996A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9536 Search customisation based on social or collaborative filtering

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Algebra (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a matrix factorization recommendation algorithm that combines a Word2Vec word vector model with cosine-similarity-based locality-sensitive hashing (LSH). First, the Word2Vec model converts word similarity into vector similarity. An item similarity matrix is then computed quickly with cosine-based LSH and combined with the original ratings to produce pre-scores for unrated items, which are filled into the training set. Finally, the training set is used as the input of the ALS matrix factorization algorithm to obtain the recommendation result. Compared with the traditional collaborative filtering recommendation algorithm, the improved algorithm has a lower MAE value and better performance, and can effectively mitigate the problem of untimely recommendation on large volumes of data.

Description

Recommendation algorithm combining Word2vec Word vector and LSH locality sensitive hashing
Technical Field
The invention provides a matrix factorization recommendation algorithm that combines a Word2Vec word vector model with cosine-similarity-based locality-sensitive hashing (LSH). It mainly addresses the fact that, although the ALS matrix factorization recommendation algorithm outperforms neighborhood-based collaborative filtering, it still suffers from data sparsity and from limited recommendation accuracy and timeliness on large volumes of data, which leads to a poor user experience. The invention belongs to the field of data mining on cloud computing platforms.
Background
Machine learning and data mining continue to advance as big data platforms are repeatedly updated and refined, and the recommendation system is one representative application. The basic idea of a recommendation engine is to infer the user's preferences, and to assist the overall process, by exploring the associations between objects. In this respect it complements the search engine, which also makes predictions; unlike a search engine, however, a recommendation engine may present content that the user has never seen before. Recommendation systems have been developed for more than 20 years and, because of the huge demand for them, have become a key research topic for many scholars. Collaborative filtering is today one of the most successful and widely applied recommendation methods. Research on collaborative filtering recommendation algorithms has grown in recent years, but many problems remain to be solved, for example:
1. Timeliness. The recommendation system must store information about users and items. As technology develops and demand for recommendation grows, this information increases rapidly, computation becomes very slow, response time lengthens, and the user experience suffers.
2. Data sparsity. Most collaborative filtering recommendation algorithms predict ratings after a similarity calculation, but the sparsity of the user-item rating matrix makes the similarity calculation inaccurate, which seriously affects the rating prediction step and directly reduces the accuracy of the recommendation system.
3. Data accuracy. Because the original user-item rating matrix is sparse, the computed similarities are inaccurate, so the recommendation system cannot give accurate recommendation results.
Disclosure of Invention
The invention aims to provide a matrix factorization recommendation algorithm that combines a Word2Vec word vector model with cosine-similarity-based locality-sensitive hashing (LSH). The algorithm uses the Word2Vec model to model item information, converting each line of the item file (the description of a single item) into a word vector. It then uses cosine-based LSH to compute item similarity efficiently over massive data, yielding an item similarity matrix. Finally, it combines this matrix with the original rating matrix to generate pre-scores for unrated items, fills them into the original rating matrix to form a new input set for the ALS algorithm, and obtains the recommendation result.
To achieve this technical purpose, the invention adopts the following specific technical scheme:
A matrix factorization recommendation algorithm combining a Word2Vec word vector model with cosine-similarity-based LSH. First, the Word2Vec model converts word similarity into vector similarity. An item similarity matrix is then computed quickly with cosine-based LSH and combined with the original ratings to produce pre-scores for unrated items, which are filled into the training set. Finally, the training set is used as the input of the ALS matrix factorization algorithm to obtain the recommendation result.
The specific flow of the algorithm is as follows:
Input: the word file of the preprocessed movies and the original rating file u.data.
Output: the training set trainRDD(user, movie, rate).
Step 1: Read the movie word file and convert it into an RDD. Then convert the RDD, by reflection, into a DataFrame with the two fields movieId and message, named wordDF.
Step 2: Build the Word2Vec word vector model, with the message field as input and features as output. Feed wordDF into the Word2Vec model for training to obtain the word vector of each movie, then convert the movie word vectors into an IndexedRowMatrix distributed matrix.
Step 3: Build the cosine-based LSH model, set the minimum similarity to 0.9 and the number of neighbors to 10, and feed the IndexedRowMatrix into the LSH to obtain the similarity matrix similarityMatrix.
Step 4: Use Spark to read the original rating file u.data and convert it into an RDD containing userId, movieId and rating, with tuples of the form (movieId, (userId, rating)). Join this RDD with the RDD converted from similarityMatrix to obtain joinvalueRDD, whose tuples have the form ((item1, item2), (rate, sim)). Aggregate joinvalueRDD across its partitions to obtain the pre-score RDD.
Step 5: Union the pre-score RDD with the original rating RDD to obtain the training set trainRDD, and feed trainRDD into the ALS algorithm.
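Steps 4 and 5 can be illustrated without Spark. The following is a minimal single-machine sketch, not the patented Spark implementation: the function name `build_training_set` is hypothetical, and the similarity-weighted average used to turn the (rate, sim) pairs into a pre-score is an assumption, since the text does not give the aggregation formula.

```python
from collections import defaultdict


def build_training_set(ratings, sim_triples):
    """Join ratings with item similarities, pre-score unrated items, union.

    ratings:     list of (user, item, rate) triples, as in u.data
    sim_triples: list of (item1, item2, sim) triples, one per similar pair
    """
    rated = {(user, item) for user, item, _ in ratings}

    # Equivalent of keying the rating RDD by movieId.
    by_item = defaultdict(list)
    for user, item, rate in ratings:
        by_item[item].append((user, rate))

    # "Join": each rating of one item votes for its similar, unrated item.
    weighted_sum = defaultdict(float)
    weight_total = defaultdict(float)
    for item1, item2, sim in sim_triples:
        for src, dst in ((item1, item2), (item2, item1)):
            for user, rate in by_item.get(src, []):
                if (user, dst) in rated:
                    continue  # keep the user's real rating
                weighted_sum[(user, dst)] += sim * rate
                weight_total[(user, dst)] += sim

    # "Aggregate": similarity-weighted average (an assumed formula).
    pre_scores = [
        (user, item, weighted_sum[(user, item)] / weight_total[(user, item)])
        for (user, item) in weighted_sum
        if weight_total[(user, item)] > 0
    ]

    # "Union": original ratings plus pre-scores form the training set.
    return list(ratings) + pre_scores


ratings = [("u1", 1, 4.0), ("u1", 2, 2.0), ("u2", 1, 5.0)]
sims = [(1, 3, 1.0), (2, 3, 0.5)]
for row in build_training_set(ratings, sims):
    print(row)
```

In this toy input nobody has rated item 3, so both users receive a pre-score for it while their original ratings pass through unchanged.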
The beneficial effects of the invention are as follows:
The invention provides a matrix factorization recommendation algorithm combining a Word2Vec word vector model with cosine-similarity-based LSH. It addresses the poor user experience caused by the sparsity and limited accuracy of traditional collaborative filtering recommendation algorithms and by untimely recommendation on large volumes of data. This is explained in detail below.
The algorithm uses the Word2Vec model to model item information, quickly converting the description on each line of the item file into a word vector. Cosine-based LSH then computes item similarity efficiently over massive data, yielding an item similarity matrix. Finally, pre-scores for unrated items are generated by combining this matrix with the original rating matrix and are filled into the original rating matrix, forming a new input set for the ALS algorithm. The algorithm addresses the data sparsity, cold start and accuracy problems of recommendation algorithms, as well as the timeliness of recommendation on large volumes of data.
Drawings
FIG. 1 is a flow chart of the algorithm of the present invention.
FIG. 2 is a Word2Vec Word vector model schematic diagram according to the present invention.
FIG. 3 is a comparison of the algorithm of the present invention and a collaborative filtering algorithm.
Detailed Description
The present invention will be further explained with reference to fig. 1 to 3.
The invention adopts a matrix factorization recommendation algorithm combining a Word2Vec word vector model with cosine-similarity-based LSH, which addresses the poor user experience caused by the sparsity and limited accuracy of traditional collaborative filtering recommendation algorithms and by untimely recommendation on large volumes of data.
The first step: file processing and item similarity calculation
The u.item file is used in this step. Each line of u.item describes the attributes of one movie, with the fields separated by a single vertical bar: field 1 is the index number, field 2 is the movie title, field 3 is the release date, field 4 is empty, field 5 is the website URL, and fields 6 to 24 are the movie genres encoded as a bitmap (a bit is 1 if the movie belongs to the corresponding genre and 0 otherwise). This step is divided into the following two sub-processes.
(1) Preprocess the movie description file. Convert the information in the u.item file into movie attribute information: extract the movie title and genre information and filter out interfering information. In the resulting word file, each line holds the text information of one movie; for example, the entry for the movie Toy Story is (Toy Story Jan Animation Children's Comedy).
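The preprocessing sub-process can be sketched as follows. This is an illustration under stated assumptions: the function name `item_line_to_words` is hypothetical, the genre order is taken from the MovieLens 100K `u.genre` file, and the sketch keeps only title words and genre names (the text does not say which other tokens, such as the `Jan` in the example entry, survive filtering). The URL in the sample line is a placeholder.

```python
import re

# Genre names in the column order of MovieLens 100K (fields 6..24 of u.item).
GENRES = [
    "unknown", "Action", "Adventure", "Animation", "Children's", "Comedy",
    "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror",
    "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western",
]


def item_line_to_words(line):
    """Convert one '|'-separated u.item line into (movie_id, words)."""
    fields = line.rstrip("\n").split("|")
    movie_id = int(fields[0])
    # Drop a trailing release year such as "(1995)" from the title.
    title = re.sub(r"\s*\(\d{4}\)\s*$", "", fields[1])
    # Fields 6..24 (1-based) are the genre bitmap.
    flags = fields[5:24]
    genres = [name for name, flag in zip(GENRES, flags) if flag == "1"]
    return movie_id, title.split() + genres


# A sample line in the u.item layout (placeholder URL, 19 genre flags).
sample_line = "|".join(
    ["1", "Toy Story (1995)", "01-Jan-1995", "", "http://example.invalid/"]
    + ["0", "0", "0", "1", "1", "1"]
    + ["0"] * 13
)
print(item_line_to_words(sample_line))
```

The returned word list is what would be written, one movie per line, into the word file consumed by the Word2Vec step.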
(2) Calculate the item similarity. The main idea of locality-sensitive hashing is to use random hash functions such that similar (neighboring) data points collide with high probability after hashing. A cosine-based LSH algorithm is used here. For a data set S, every data point in S is an n-dimensional vector, denoted v:
$$v = \{v_1, v_2, \ldots, v_n\} \tag{1}$$
For any two elements v and u in S, their similarity (distance) is defined as
$$\mathrm{sim}(v, u) = \cos\theta = \frac{v \cdot u}{\lVert v \rVert \, \lVert u \rVert} \tag{2}$$
Let $x = \{x_1, x_2, \ldots, x_n\}$ be a random vector chosen from the n-dimensional vector space. If a random hyperplane $p_0$ is generated with x as its normal vector, then $p_0$ divides the original n-dimensional space into two parts. Given an unknown vector v, the part to which v belongs can be determined from the angle between v and x, and the hash function is defined as
$$h_x(v) = \begin{cases} 1, & v \cdot x \ge 0 \\ 0, & v \cdot x < 0 \end{cases} \tag{3}$$
Given a random vector x, if the angle between the vectors v and u is $\theta$, then by the definition in equation (3) the probability that they do not lie on the same side of the hyperplane $p_0$ is
$$\Pr[h_x(v) \ne h_x(u)] = \frac{\theta}{\pi}$$
Assuming that the similarity of v and u is s, then with $\theta = \arccos(s)$ the probability that they receive the same hash value is
$$\Pr[h_x(v) = h_x(u)] = 1 - \frac{\arccos(s)}{\pi} \tag{4}$$
A single hash function of this kind produces only two distinct hash values, which obviously does not satisfy practical requirements. To generate enough hash values, the usual approach is to use a family of locality-sensitive hash functions to build a multidimensional hash value vector, as shown in equation (5):
$$H(v) = (h_1(v), h_2(v), \ldots, h_k(v)) \tag{5}$$
As equation (5) shows, the number of possible hash values of the k-dimensional hash vector becomes $2^k$, so the probability that the hash values of two data points collide is given by equation (6):
$$P = \left(1 - \frac{\arccos(s)}{\pi}\right)^{k} \tag{6}$$
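The random-hyperplane scheme described here can be sketched in plain Python. This is a minimal illustration, not the distributed implementation: the function names are hypothetical, the hyperplane normals are drawn from a standard Gaussian (a common choice that the text does not specify), and `collision_probability` assumes the standard sign-random-projection result $(1 - \arccos(s)/\pi)^k$.

```python
import math
import random


def cosine_similarity(v, u):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v, u))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_u = math.sqrt(sum(b * b for b in u))
    return dot / (norm_v * norm_u)


def random_hyperplanes(k, n, seed=0):
    """Draw k random normal vectors x in n dimensions (Gaussian entries)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(k)]


def signature(v, planes):
    """k-bit hash H(v): one bit per hyperplane, 1 if v . x >= 0 else 0."""
    return tuple(
        1 if sum(a * x for a, x in zip(v, plane)) >= 0 else 0
        for plane in planes
    )


def collision_probability(s, k):
    """Probability that two vectors with cosine similarity s share all k bits."""
    s = max(-1.0, min(1.0, s))  # guard against rounding outside [-1, 1]
    return (1.0 - math.acos(s) / math.pi) ** k


planes = random_hyperplanes(k=8, n=3, seed=42)
v = [1.0, 2.0, 3.0]
u = [1.1, 1.9, 3.2]
print(signature(v, planes), signature(u, planes))
print(collision_probability(cosine_similarity(v, u), k=8))
```

Larger k makes a collision between dissimilar vectors less likely, at the cost of also separating some similar pairs, which is why practical systems usually combine several such signatures.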
The second step: similarity post-processing
After the processed movie text has been converted by Word2Vec into word vectors, the cosine-based LSH calculation is carried out to obtain the similarity matrix. In preparation for the subsequent work, the processed similarity matrix is stored as triples of the form (item1, item2, sim). The missing rating of an item is then estimated from the existing ratings of its neighboring items, which gives the item pre-score.
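One way to turn the (item1, item2, sim) triples into a neighbor list is sketched below. The defaults mirror the minimum similarity of 0.9 and the 10 neighbors mentioned in the algorithm flow; the function name `top_neighbors` is hypothetical, and the sketch assumes each unordered pair of items appears only once in the triples.

```python
from collections import defaultdict


def top_neighbors(sim_triples, min_sim=0.9, num_neighbors=10):
    """Map each item to its most similar items, best first."""
    neighbors = defaultdict(list)
    for item1, item2, sim in sim_triples:
        if item1 == item2 or sim < min_sim:
            continue  # skip self-pairs and weak similarities
        # Similarity is symmetric, so record the pair in both directions.
        neighbors[item1].append((item2, sim))
        neighbors[item2].append((item1, sim))
    return {
        item: sorted(pairs, key=lambda p: p[1], reverse=True)[:num_neighbors]
        for item, pairs in neighbors.items()
    }


triples = [(1, 2, 0.95), (1, 3, 0.50), (2, 3, 0.92)]
print(top_neighbors(triples))
```

Only the pairs that pass the threshold contribute to a pre-score, so raising `min_sim` trades coverage of unrated items for more reliable estimates.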
The third step: pre-score calculation and filling
The original rating matrix file is u.data, whose ratings are stored as triples of the form (user, item, rate). The original rating matrix is combined with the high-similarity entries of the similarity matrix to predict some of the missing ratings, and the resulting pre-scores are finally filled into the original rating matrix.
The fourth step: matrix factorization and recommendation
Once the training set has been obtained, the ALS algorithm is used to solve the model and make recommendations. The algorithm uses the mean absolute error (MAE) as the objective for parameter tuning, and the MAE value serves as the basis for judging the algorithm's performance. MAE is defined in equation (7), where $p_i$ is the predicted rating, $q_i$ is the true rating, and N is the number of ratings in the test set.
$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \lvert p_i - q_i \rvert \tag{7}$$
Recommendations are then made to the user with the optimal model, selected according to its average MAE value.
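The MAE computation and the model selection it drives can be written in a few lines. This is a minimal sketch: the function names and the model labels `rank10` and `rank20` are hypothetical, and the MAE values in the usage lines are made-up placeholders rather than results from the patent.

```python
def mean_absolute_error(predicted, actual):
    """MAE = (1/N) * sum(|p_i - q_i|) over the N test-set ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(abs(p - q) for p, q in zip(predicted, actual))
    return total / len(predicted)


def best_model(mae_by_model):
    """Return the key of the candidate model with the lowest MAE."""
    return min(mae_by_model, key=mae_by_model.get)


print(mean_absolute_error([4.0, 3.0, 5.0], [5.0, 3.0, 3.0]))
# Placeholder MAE values for two hypothetical ALS configurations.
print(best_model({"rank10": 0.78, "rank20": 0.74}))
```

A lower MAE means the predicted ratings are closer to the true ratings, so the candidate with the smallest value is the one used for recommendation.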

Claims (1)

1. A recommendation algorithm combined with a Word2Vec word vector model, characterized in that: first, the Word2Vec model is used to convert word similarity into vector similarity; an item similarity matrix is then computed quickly with cosine-similarity-based locality-sensitive hashing (LSH) and combined with the original ratings to output pre-scores for unrated items, which are filled into the training set; finally, the training set is used as the input of the ALS matrix factorization algorithm to obtain the recommendation result. The algorithm uses the Word2Vec model to model item information, converting each line of the item file (the description of a single item) into a word vector; cosine-based LSH then computes item similarity efficiently over massive data to obtain the item similarity matrix; finally, pre-scores for unrated items are generated by combining this matrix with the original rating matrix and are filled into the original rating matrix, forming a new input set that is fed into the ALS algorithm to obtain the recommendation result. The algorithm addresses the data sparsity, cold start and accuracy problems of recommendation algorithms, as well as the timeliness of recommendation on large volumes of data.
Description of the algorithm process:
Experimental data: the data set used in the invention is the MovieLens data set provided by the GroupLens research group in the United States. It contains 100,000 rating records together with description files covering 1,682 movies and 943 users. The specific steps are as follows:
The first step: file processing and item similarity calculation
The u.item file is used in this step. Each line of u.item describes the attributes of one movie, with the fields separated by a single vertical bar: field 1 is the index number, field 2 is the movie title, field 3 is the release date, field 4 is empty, field 5 is the website URL, and fields 6 to 24 are the movie genres encoded as a bitmap (a bit is 1 if the movie belongs to the corresponding genre and 0 otherwise). This step is divided into the following two sub-processes.
(1) Preprocess the movie description file. Convert the information in the u.item file into movie attribute information: extract the movie title and genre information and filter out interfering information. In the resulting word file, each line holds the text information of one movie; for example, the entry for the movie Toy Story is (Toy Story Jan Animation Children's Comedy).
(2) Calculate the item similarity. The main idea of locality-sensitive hashing is to use random hash functions such that similar (neighboring) data points collide with high probability after hashing. A cosine-based LSH algorithm is used here. For a data set S, every data point in S is an n-dimensional vector, denoted v:
$$v = \{v_1, v_2, \ldots, v_n\} \tag{1}$$
For any two elements v and u in S, their similarity (distance) is defined as
$$\mathrm{sim}(v, u) = \cos\theta = \frac{v \cdot u}{\lVert v \rVert \, \lVert u \rVert} \tag{2}$$
Let $x = \{x_1, x_2, \ldots, x_n\}$ be a random vector chosen from the n-dimensional vector space. If a random hyperplane $p_0$ is generated with x as its normal vector, then $p_0$ divides the original n-dimensional space into two parts. Given an unknown vector v, the part to which v belongs can be determined from the angle between v and x, and the hash function is defined as
$$h_x(v) = \begin{cases} 1, & v \cdot x \ge 0 \\ 0, & v \cdot x < 0 \end{cases} \tag{3}$$
Given a random vector x, if the angle between the vectors v and u is $\theta$, then by the definition in equation (3) the probability that they do not lie on the same side of the hyperplane $p_0$ is
$$\Pr[h_x(v) \ne h_x(u)] = \frac{\theta}{\pi}$$
Assuming that the similarity of v and u is s, then with $\theta = \arccos(s)$ the probability that they receive the same hash value is
$$\Pr[h_x(v) = h_x(u)] = 1 - \frac{\arccos(s)}{\pi} \tag{4}$$
A single hash function of this kind produces only two distinct hash values, which obviously does not satisfy practical requirements. To generate enough hash values, the usual approach is to use a family of locality-sensitive hash functions to build a multidimensional hash value vector, as shown in equation (5):
$$H(v) = (h_1(v), h_2(v), \ldots, h_k(v)) \tag{5}$$
As equation (5) shows, the number of possible hash values of the k-dimensional hash vector becomes $2^k$, so the probability that the hash values of two data points collide is given by equation (6):
$$P = \left(1 - \frac{\arccos(s)}{\pi}\right)^{k} \tag{6}$$
The second step: similarity post-processing
After the processed movie text has been converted by Word2Vec into word vectors, the cosine-based LSH calculation is carried out to obtain the similarity matrix. In preparation for the subsequent work, the processed similarity matrix is stored as triples of the form (item1, item2, sim). The missing rating of an item is then estimated from the existing ratings of its neighboring items, which gives the item pre-score.
The third step: pre-score calculation and filling
The original rating matrix file is u.data, whose ratings are stored as triples of the form (user, item, rate). The original rating matrix is combined with the high-similarity entries of the similarity matrix to predict some of the missing ratings, and the resulting pre-scores are finally filled into the original rating matrix.
The fourth step: matrix factorization and recommendation
Once the training set has been obtained, the ALS algorithm is used to solve the model and make recommendations. The algorithm uses the mean absolute error (MAE) as the objective for parameter tuning, and the MAE value serves as the basis for judging the algorithm's performance. MAE is defined in equation (7), where $p_i$ is the predicted rating, $q_i$ is the true rating, and N is the number of ratings in the test set.
$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \lvert p_i - q_i \rvert \tag{7}$$
Recommendations are then made to the user with the optimal model, selected according to its average MAE value.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911249921.0A CN111061996A (en) 2019-12-09 2019-12-09 Recommendation algorithm combining Word2vec Word vector and LSH locality sensitive hashing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911249921.0A CN111061996A (en) 2019-12-09 2019-12-09 Recommendation algorithm combining Word2vec Word vector and LSH locality sensitive hashing

Publications (1)

Publication Number Publication Date
CN111061996A true CN111061996A (en) 2020-04-24

Family

ID=70300169

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911249921.0A Pending CN111061996A (en) 2019-12-09 2019-12-09 Recommendation algorithm combining Word2vec Word vector and LSH locality sensitive hashing

Country Status (1)

Country Link
CN (1) CN111061996A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111966900A (en) * 2020-08-17 2020-11-20 中国银行股份有限公司 User cold start product recommendation method and system based on locality sensitive hashing
CN112035755A (en) * 2020-07-14 2020-12-04 中国科学院信息工程研究所 User-centered personalized recommendation privacy protection method and system
CN112949778A (en) * 2021-04-17 2021-06-11 深圳前海移联科技有限公司 Intelligent contract classification method and system based on locality sensitive hashing and electronic equipment
CN113255834A (en) * 2021-06-28 2021-08-13 南京电研电力自动化股份有限公司 Transformer potential fault prediction method and system based on locality sensitive hashing algorithm
WO2022037446A1 (en) * 2020-08-20 2022-02-24 西南电子技术研究所(中国电子科技集团公司第十研究所) Front-page news prediction and classification method
CN114201669A (en) * 2021-11-19 2022-03-18 西安电子科技大学 API recommendation method based on word embedding and collaborative filtering technology
CN115984126A (en) * 2022-12-05 2023-04-18 北京拙河科技有限公司 Optical image correction method and device based on input instruction
CN115983499A (en) * 2023-03-03 2023-04-18 北京奇树有鱼文化传媒有限公司 Box office prediction method and device, electronic equipment and storage medium

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112035755A (en) * 2020-07-14 2020-12-04 中国科学院信息工程研究所 User-centered personalized recommendation privacy protection method and system
CN112035755B (en) * 2020-07-14 2023-04-07 中国科学院信息工程研究所 User-centered personalized recommendation privacy protection method and system
CN111966900A (en) * 2020-08-17 2020-11-20 中国银行股份有限公司 User cold start product recommendation method and system based on locality sensitive hashing
WO2022037446A1 (en) * 2020-08-20 2022-02-24 西南电子技术研究所(中国电子科技集团公司第十研究所) Front-page news prediction and classification method
CN112949778A (en) * 2021-04-17 2021-06-11 深圳前海移联科技有限公司 Intelligent contract classification method and system based on locality sensitive hashing and electronic equipment
CN113255834A (en) * 2021-06-28 2021-08-13 南京电研电力自动化股份有限公司 Transformer potential fault prediction method and system based on locality sensitive hashing algorithm
CN114201669A (en) * 2021-11-19 2022-03-18 西安电子科技大学 API recommendation method based on word embedding and collaborative filtering technology
CN114201669B (en) * 2021-11-19 2023-02-03 西安电子科技大学 API recommendation method based on word embedding and collaborative filtering technology
CN115984126A (en) * 2022-12-05 2023-04-18 北京拙河科技有限公司 Optical image correction method and device based on input instruction
CN115983499A (en) * 2023-03-03 2023-04-18 北京奇树有鱼文化传媒有限公司 Box office prediction method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111061996A (en) Recommendation algorithm combining Word2vec Word vector and LSH locality sensitive hashing
CN107133213B (en) Method and system for automatically extracting text abstract based on algorithm
US9183467B2 (en) Sketch segmentation
CN110188228B (en) Cross-modal retrieval method based on sketch retrieval three-dimensional model
Liu et al. Robust multi-view feature selection
CN108228867A (en) A kind of theme collaborative filtering recommending method based on viewpoint enhancing
CN109903138B (en) Personalized commodity recommendation method
US20180081992A1 (en) Determination of relationships between collections of disparate media types
CN110334290B (en) MF-Octree-based spatio-temporal data rapid retrieval method
CN108389113B (en) Collaborative filtering recommendation method and system
US20200364259A1 (en) Image retrieval
CN110598022A (en) Image retrieval system and method based on robust deep hash network
CN109766408A (en) The text key word weighing computation method of comprehensive word positional factor and word frequency factor
CN114547307A (en) Text vector model training method, text matching method, device and equipment
CN117035080A (en) Knowledge graph completion method and system based on triplet global information interaction
CN115410199A (en) Image content retrieval method, device, equipment and storage medium
CN115329101A (en) Electric power Internet of things standard knowledge graph construction method and device
CN117556067B (en) Data retrieval method, device, computer equipment and storage medium
CN114637846A (en) Video data processing method, video data processing device, computer equipment and storage medium
CN107391443B (en) Sparse data anomaly detection method and device
Yu et al. A novel multi-feature representation of images for heterogeneous IoTs
CN115905617B (en) Video scoring prediction method based on deep neural network and double regularization
CN111859192B (en) Searching method, searching device, electronic equipment and storage medium
CN112699271B (en) Recommendation method for improving retention time of user video website
CN113920291A (en) Error correction method and device based on picture recognition result, electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200424