CN113326433A - Personalized recommendation method based on ensemble learning - Google Patents


Info

Publication number
CN113326433A
CN113326433A (application CN202110629501.6A)
Authority
CN
China
Prior art keywords
data
user
score
personalized recommendation
test
Prior art date
Legal status
Granted
Application number
CN202110629501.6A
Other languages
Chinese (zh)
Other versions
CN113326433B (en)
Inventor
段勇
杨堃
Current Assignee
Shenyang University of Technology
Original Assignee
Shenyang University of Technology
Priority date
Filing date
Publication date
Application filed by Shenyang University of Technology filed Critical Shenyang University of Technology
Publication of CN113326433A publication Critical patent/CN113326433A/en
Application granted granted Critical
Publication of CN113326433B publication Critical patent/CN113326433B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155Bayesian classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/10Machine learning using kernel methods, e.g. support vector machines [SVM]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/20Ensemble learning

Abstract

The invention relates to the fields of machine learning and recommendation systems, in particular to an ensemble-learning-based personalized recommendation method. The data preprocessing module is mainly responsible for re-integrating the data features, solving the problem of difficult extraction of complex features by constructing new features and by manifold learning dimensionality reduction; the model establishing and optimizing module is mainly responsible for establishing a personalized ensemble learning prediction model on the fused data, and for applying Bayesian optimization on top of the prediction model to improve the accuracy of personalized recommendation; and the personalized recommendation module is mainly responsible for obtaining the prediction model's results, and for obtaining and verifying the personalized recommendation results with a Top N recommendation method. The method improves the accuracy of personalized recommendation through ensemble learning; in addition, it integrates manifold learning dimensionality reduction to fuse the data features, thereby solving the problem of difficult extraction of complex features.

Description

Personalized recommendation method based on ensemble learning
Technical Field
The invention relates to the fields of machine learning and recommendation systems, in particular to a personalized recommendation method based on the manifold learning algorithm LPP (Locality Preserving Projection) and the ensemble learning algorithm GBDT (Gradient Boosting Decision Tree).
Background
In recent years, with the continuous evolution of internet and computer technology, the internet has brought an enormous volume of information and has also aggravated the phenomenon of information overload. Although the range of information resources available to users has expanded, how to quickly and effectively screen out information useful to a user from such massive data has become a major problem in the development of the modern internet. Many existing web applications (e.g., web portals, search engines) are essentially ways of helping users filter information. However, these methods only satisfy users' mainstream needs; they do not consider personalization, and the problem of information overload remains poorly solved. Personalized recommendation is an important means of information filtering and an effective method for addressing the problem of information overload.
With the development of machine learning, applying machine learning methods to recommendation algorithms has become a trend. Personalized recommendation draws on many machine learning methods, such as support vector machines, decision trees, neural networks, deep learning, clustering, dimensionality reduction, regression prediction, and ensemble learning. Machine-learning-based personalized recommendation can effectively address problems such as monotonous similarity measures, high similarity-computation complexity, difficulty in mining users' latent interests, difficulty in exploiting user tag and demographic information, and difficulty in extracting item features. User tag information, demographic information, and item feature information are of limited help for the cold-start problem, yet they are necessary for capturing users' latent interests.
Disclosure of Invention
Object of the Invention
The invention provides a personalized recommendation method based on the Locality Preserving Projection algorithm and ensemble learning, aiming to alleviate the problem of information overload in recommendation systems and to improve the efficiency and precision of personalized recommendation.
Technical scheme
A personalized recommendation method based on ensemble learning is characterized by comprising the following steps:
step 1: analyzing the dimension attributes of the personalized recommendation data and dividing them into "user-item-score" data; performing data association on the related "user-item-score" dimensions;
step 2: after the processing is finished, analyzing the data type of each "user-item-score" dimension attribute and converting it into the data type required by ensemble learning;
step 3: generating feature attributes from the "score" attribute among the "user-item-score" dimension attributes;
step 4: normalizing all the obtained data as follows:
v' = (vv - min) / (max - min)   (1)
wherein vv represents the original value of the data, v' represents the value after normalization, min represents the minimum value of the column where vv is located, and max represents the maximum value of that column;
step 5: letting the "user-item-score" data set A in the original space have m sample points x_1, x_2, ..., x_m, each sample point x_i being an l-dimensional vector with i an integer from 1 to m, and X being the matrix whose columns are the m samples; reducing the dimension of the data set A with the manifold learning LPP method, the reduced data set B consisting of sample points y_1, y_2, ..., y_m, each sample point y_i being an n-dimensional vector with l > n, and Y being the matrix whose columns are the m reduced samples;
step 6: dividing the reduced data set B into a training set Train and a test set Test in the ratio 8:2, wherein Y' is the data matrix corresponding to the training set Train;
step 7: establishing a personalized recommendation model with the ensemble learning GBDT method;
step 8: optimizing the GBDT model parameters with a Bayesian method;
step 9: retraining the GBDT personalized recommendation model with the optimal hyper-parameter combination obtained through Bayesian optimization;
step 10: performing Top N recommendation and effect verification according to the prediction results of the final personalized recommendation model on the test set.
In step 3, the number of times each user has scored items is counted with the formula:
CountRating(b) = |R(b)|   (2)
wherein b represents the b-th user in the "user-item-score" data set A, which contains d users in total, R(b) is the set of scores user b has given to the items, and CountRating(b) denotes "the total number of times user b has reviewed items".
The step 5 specifically includes the following steps:
step 5.1: constructing a graph: for samples x_i and x_j in the "user-item-score" data set A, calculating the Euclidean distance between them and testing
||x_i - x_j|| < ε   (3)
wherein ε is a manually set threshold, generally taken as the mean pairwise distance over the m samples of the data set; if the Euclidean distance satisfies (3), the two samples are considered very close, and an edge is established between node i and node j of the graph;
step 5.2: determining the weights: if node i is connected with node j, the weight of the edge between them is calculated with the heat kernel function
ω_ij = exp(-||x_i - x_j||² / t)   (4)
wherein ω_ij represents the weight between nodes i and j, x_i and x_j are samples in the "user-item-score" data set A, and t is a manually set real number greater than 0;
step 5.3: calculating the projection matrix from
X L Xᵀ a = λ X D Xᵀ a   (5)
Let the solutions be a_0, a_1, ..., a_{l-1}, ordered by their corresponding eigenvalues λ from smallest to largest; the projection transformation matrix C = (a_0, a_1, ..., a_{n-1}) is formed from the eigenvectors of the n smallest eigenvalues, and the reduced sample point is y_i = Cᵀ x_i.
Here X is the matrix mentioned in step 5; the adjacency matrix W is formed from the weights ω_ij of step 5.2; the main diagonal of the diagonal matrix D holds the weighted degree of each vertex of the graph constructed in step 5.1, the weighted degree of node i being the sum of the weights of all edges incident to it, i.e. the sum of row i of the adjacency matrix W; and the Laplacian matrix L is defined as L = D - W.
The step 7 comprises the following steps:
step 7.1: the GBDT model is defined by the formula
f_K(Y') = Σ_{k=1}^{K} h_k(Y')   (6)
wherein Y' is the matrix mentioned in step 6, k is the round of the score prediction learner, and K is the total number of rounds; f_k(Y') is the score prediction learner of the k-th round, and h_k(Y') represents the k-th CART (Classification And Regression Tree) decision regression tree;
step 7.2: constructing a CART decision regression tree, namely h(Y') in step 7.1;
step 7.3: the score prediction learner adopts a forward stagewise algorithm; the model of step k is built from the model of step k-1, i.e. the score prediction learner of step k is closely related to that of the previous k-1 steps, with the formula
f_k(Y') = f_{k-1}(Y') + β_k   (7)
wherein f_k(Y') is the score prediction learner of the k-th round, f_{k-1}(Y') is that of the (k-1)-th round, and β_k represents the residual fitted in the k-th round;
step 7.4: continuing the iteration until it is completed, which completes the model building.
The step 7.2 comprises the following steps:
step 7.21: partitioning the preprocessed data set B into regions H_1, H_2, ..., H_o, whose output values are respectively p_1, p_2, ..., p_o;
step 7.22: recursively dividing each region into two sub-regions and determining the output value on each sub-region; selecting the optimal splitting variable q and split point s according to
min_{q,s} [ min_{p_1} Σ_{u_v ∈ H_1} (w_v - p_1)² + min_{p_2} Σ_{u_v ∈ H_2} (w_v - p_2)² ]   (8)
wherein p_1 and p_2 are the outputs of the regions H_1 and H_2 divided in step 7.21, u_v and w_v respectively represent the feature attributes and the score of the data in the corresponding region, and the maximum value of v is the number of samples in the divided region; traversing the variables q, scanning the split points s for each fixed splitting variable q, and selecting the pair (q, s) that minimizes the above formula; dividing the region with the selected pair (q, s) and determining the corresponding output values;
step 7.23: continuing to invoke steps 7.21 and 7.22 on the two sub-regions until a stop condition is met;
step 7.24: repartitioning the input space into o regions H'_1, H'_2, ..., H'_o and generating the score prediction CART decision regression tree with the formula
h(u) = Σ_{o=1}^{O} p_o · I(u ∈ H'_o)   (9)
wherein h(u) is the prediction CART decision regression tree, H'_o are the divided regions, o is the region index, and O is the total number of divided regions; p_o is the fixed output value of the region divided in step 7.21, and q' and s' are the optimal solutions iterated through steps 7.21 and 7.22.
The step 8 comprises the following steps:
step 8.1: initializing the data set D' = {(x'_1, y'_1), ..., (x'_n, y'_n)}, wherein y'_i = f'(x'_i); the objective function f'(x') is the mapping from the dimension attributes in the data to the scores;
step 8.2: the GBDT model is trained with the selected hyper-parameter combination x'_i to calculate f'(x'_i);
step 8.3: calculating the next hyper-parameter combination x'_{i+1} with an acquisition function;
step 8.4: repeating step 8.2 and step 8.3 for T' iterations;
step 8.5: outputting the hyper-parameter combination that optimizes the objective function f'(x').
The step 10 includes the steps of:
step 10.1: setting the value of N, i.e. the number of items recommended to each user, and defining the number of users as count;
step 10.2: for each user, recording the real recommendation list generated on the test set Test as T(all); performing score prediction on the test set Test with the Bayesian-optimized GBDT recommendation model, the obtained result being defined as the test score set;
step 10.3: sorting the test score set by score, recommending the first N items to each user, and recording the Top N recommendation list obtained for each user as T(test);
step 10.4: verifying the precision and recall results on the test score set;
step 10.5: calculating the length of T(test);
step 10.6: calculating the length of T(all);
step 10.7: calculating T(U), the intersection of each user's Top N recommendation list T(test) with T(all);
step 10.8: calculating the precision:
Precision = |T(U)| / |T(test)|   (10)
accumulating the precision generated for each user and dividing the sum by count to obtain the average precision;
step 10.9: calculating the recall:
Recall = |T(U)| / |T(all)|   (11)
accumulating the recall generated for each user and dividing the sum by count to obtain the average recall.
Advantages and effects
1. The invention uses related techniques from the field of machine learning to address the problem of information overload in today's society: manifold learning solves the problem of difficult extraction of complex features, reduces the dimensionality of the data feature attributes, shortens model training time, improves the learning ability of the model, and greatly improves recommendation efficiency.
2. Personalized recommendation is performed through ensemble learning, and the recommendation model is tuned through Bayesian optimization, which improves recommendation precision, allows useful information to be screened quickly and effectively from massive data, and improves the utilization efficiency of that information.
Drawings
FIG. 1 is a general flow diagram of the present invention;
FIG. 2 is a flow chart of data feature preprocessing;
fig. 3 is a flow chart of personalized recommendation.
Detailed Description
The embodiments of the present invention are described below with reference to the accompanying drawings so that those skilled in the art can better understand the invention.
The personalized recommendation method based on manifold learning LPP and ensemble learning GBDT can improve the accuracy of personalized recommendation through ensemble learning; in addition, the method integrates manifold learning dimensionality reduction to fuse the data features, thereby solving the problem of difficult extraction of complex features.
FIG. 1 is a general flow chart of the present invention, which includes the following 10 steps, wherein steps 1-6 are the recommended data preprocessing portion of FIG. 1; step 7 is constructing a personalized recommendation model part in the attached figure 1; step 8 and step 9 are the optimization model part in fig. 1; step 10 is the personalized recommendation part in fig. 1.
The data preprocessing module is mainly responsible for re-integrating the data features, solving the problem of difficult extraction of complex features by constructing new features and by manifold learning dimensionality reduction; the model establishing and optimizing module is mainly responsible for establishing a personalized ensemble learning prediction model on the fused data, and for applying Bayesian optimization on top of the prediction model to improve the accuracy of personalized recommendation; and the personalized recommendation module is mainly responsible for obtaining the prediction model's results, and for obtaining and verifying the personalized recommendation results with a Top N recommendation method.
The detailed steps are as follows:
a recommended data preprocessing part:
FIG. 2 is a flow chart of the characteristic data preprocessing of the present invention, and the specific implementation steps are as follows:
Step 1: analyzing the dimension attributes of the personalized recommendation data and dividing them into "user-item-score" data; and performing data association on the related "user-item-score" dimensions.
Step 2: after the processing is finished, analyzing the data type of each "user-item-score" dimension attribute and converting it into the data type required by ensemble learning.
Step 3: generating feature attributes from the "score" attribute among the "user-item-score" dimension attributes, with the formula:
CountRating(b) = |R(b)|   (2)
wherein b represents the b-th user in the "user-item-score" data set A, which contains d users in total, and R(b) is the set of scores user b has given to the items.
Step 4: normalizing all the obtained data as follows:
v' = (vv - min) / (max - min)   (1)
wherein vv represents the original value of the data, v' represents the value after normalization, min represents the minimum value of the column where vv is located, and max represents the maximum value of that column.
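As a minimal illustration of the min-max normalization of step 4 (the function and variable names are ours, not the patent's; NumPy is assumed):

```python
import numpy as np

def min_max_normalize(column):
    """Min-max normalization: v' = (v - min) / (max - min), applied to one column."""
    vmin, vmax = column.min(), column.max()
    return (column - vmin) / (vmax - vmin)

scores = np.array([1.0, 3.0, 5.0, 4.0])   # illustrative score column
normalized = min_max_normalize(scores)     # all values now lie in [0, 1]
```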
Step 5: let the "user-item-score" data set A in the original space have m sample points x_1, x_2, ..., x_m, each sample point x_i being an l-dimensional vector with i an integer from 1 to m, and let X be the matrix whose columns are the m samples. The data set A is reduced in dimension with the manifold learning LPP method; the reduced data set B consists of sample points y_1, y_2, ..., y_m, each y_i being an n-dimensional vector with l > n, and Y is the matrix whose columns are the m reduced samples. The specific steps are as follows:
Step 5.1: constructing a graph: for samples x_i and x_j in the "user-item-score" data set A, the Euclidean distance between them is calculated and tested against
||x_i - x_j|| < ε   (3)
wherein ε is a manually set threshold, generally taken as the mean pairwise distance over the m samples of the data set; if the distance satisfies (3), the two samples are considered very close, and an edge is established between node i and node j of the graph.
Step 5.2: determining the weight, if the node i is connected with the node j, the weight of the edge between the node i and the node j is calculated by the following formula of the nuclear thermal function:
Figure RE-GDA0003155284190000091
ωijrepresenting the weight, x, between i nodes and j nodesiAnd xjFor the samples in the "user-item-score" dataset a, t is an artificially set real number greater than 0.
Step 5.3: and calculating a projection matrix, wherein the formula for calculating the projection matrix is as follows.
XLXTa=λXDXTa
Suppose the solution in the formula is a0,a1,...,al-1And their corresponding eigenvalues λ are ordered from small to large, the projective transformation matrix is C ═ a0,a1,...,al-1) And then the reduced sample point yi=CTxi
Wherein the adjacency matrix W is determined by the weight ω in step twoijAnd (4) forming. The main diagonal of the diagonal matrix D is the weighted degree of each vertex of the graph constructed in step one, where the weighted degree of the node i is the sum of the weights of all the edges associated with the node, i.e. the sum of each row element of the adjacency matrix W. The placian matrix L is defined as L ═ D-W.
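Steps 5.1 through 5.3 can be sketched as a single routine, treating samples as the columns of X as in step 5. The default ε (mean pairwise distance), the tiny ridge term added for numerical stability, and all names are our assumptions, not part of the patent:

```python
import numpy as np
from scipy.linalg import eigh

def lpp(X, n_components=2, t=1.0, eps=None):
    """LPP per steps 5.1-5.3; X is the l x m matrix whose columns are samples."""
    m = X.shape[1]
    # step 5.1: pairwise squared Euclidean distances and the epsilon-neighbourhood graph
    sq = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    dist = np.sqrt(sq)
    if eps is None:
        eps = dist[np.triu_indices(m, 1)].mean()  # mean pairwise distance
    # step 5.2: heat-kernel weights exp(-||x_i - x_j||^2 / t) on connected pairs
    W = np.exp(-sq / t) * (dist < eps)
    np.fill_diagonal(W, 0.0)
    D = np.diag(W.sum(axis=1))        # weighted degrees on the diagonal
    L = D - W                         # graph Laplacian L = D - W
    # step 5.3: generalized eigenproblem X L X^T a = lambda X D X^T a
    A = X @ L @ X.T
    B = X @ D @ X.T + 1e-9 * np.eye(X.shape[0])  # tiny ridge for stability (our addition)
    _, vecs = eigh(A, B)              # eigenvalues come back in ascending order
    C = vecs[:, :n_components]        # eigenvectors of the n smallest eigenvalues
    return C.T @ X                    # Y = C^T X, the reduced samples

X = np.random.default_rng(0).normal(size=(5, 40))  # l = 5 dimensions, m = 40 samples
Y = lpp(X, n_components=2)
```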
Step 6: the reduced data set B is divided into a training set Train and a test set Test in the ratio 8:2, where Y' is the data matrix corresponding to the training set Train.
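The 8:2 split of step 6 might look like this with scikit-learn (the row layout of B, with samples as rows, is our assumption):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# B: reduced data set as rows; here filled with random values for illustration
B = np.random.default_rng(0).random((100, 4))
train, test = train_test_split(B, test_size=0.2, random_state=42)  # 8:2 split
```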
Constructing a personalized recommendation model part:
Step 7: a personalized recommendation model is established with the ensemble learning GBDT method; a schematic flow chart is shown in fig. 3. The specific steps are as follows:
Step 7.1: the GBDT model is defined by the formula
f_K(Y') = Σ_{k=1}^{K} h_k(Y')   (6)
wherein Y' is the matrix mentioned in step 6, k is the round of the score prediction learner, and K is the total number of iterations of the score prediction learner; f_k(Y') is the score prediction learner of the k-th round, and h_k(Y') represents the k-th CART decision regression tree.
Step 7.2: constructing a CART decision regression tree, namely h (Y') in the step 7.1, and specifically comprising the following steps:
step 7.21: partitioning the preprocessed data set B into H1,H2,...HoThe output value of each region is respectively as follows: p is a radical of1,p2,...,po
Step 7.22: recursively divides each region into two sub-regions and determines an output value on each sub-region. And selecting an optimal segmentation variable q and a segmentation point s according to the following formula.
Figure RE-GDA0003155284190000101
p1For the region H divided in step 7.211Output of p2For the region H divided in step 7.212Output of uvAnd wvRespectively expressed as the characteristic attribute and the score of the data in the corresponding region, wherein the maximum value of v is the number of samples in the divided region. The variable q is traversed, the fixed segmentation variable q is scanned for segmentation points s, and the pair (q, s) that makes the above formula reach the minimum value is selected. The selected pair (q, s) is used to divide the region and determine the corresponding output value.
Step 7.23: the steps 7.21 and 7.22 are continued to be invoked for both sub-areas until the stop condition is fulfilled.
Step 7.24: repartitioning of input space into o regions H'1,H′2,...H′oGenerating a score prediction CART decision regression tree, wherein the formula is as follows:
Figure RE-GDA0003155284190000102
h (u) is a predicted CART decision regression tree, H'vFor the divided regions, O is indicated as a divided region index, and O is indicated as the total number of divided regions. p is a radical ofoFor fixed output values of the region partitioned in step 7.21, q 'and s' are the optimal solutions iterated through step 7.21 and step 7.22.
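The exhaustive search of step 7.22 for the optimal splitting variable q and split point s (Eq. (8)) can be sketched as follows; all names are ours, and scanning only the unique feature values is an implementation choice, not something the patent specifies:

```python
import numpy as np

def best_split(U, w):
    """Scan every feature q and candidate split point s; return the (q, s)
    minimizing the sum of squared errors of the two resulting regions."""
    best_q, best_s, best_err = None, None, np.inf
    for q in range(U.shape[1]):
        for s in np.unique(U[:, q])[:-1]:          # candidate split points
            left, right = w[U[:, q] <= s], w[U[:, q] > s]
            # optimal p_1, p_2 for squared loss are the region means
            err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if err < best_err:
                best_q, best_s, best_err = q, s, err
    return best_q, best_s

U = np.array([[0.0], [1.0], [2.0], [3.0]])  # one feature attribute u_v
w = np.array([1.0, 1.0, 10.0, 10.0])        # scores w_v
q, s = best_split(U, w)                     # the split u <= 1 separates the scores exactly
```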
Step 7.3: the scoring prediction learner adopts a forward step-by-step algorithm. The model of the k step is formed by the model of the k-1 step, namely the k step of the score prediction learner is closely related to the score prediction learner of the previous k-1 step, and the formula is as follows:
fk(Y′)=fk-1(Y′)+βk
fk(Y') prediction learner for k-th round of scoring, fk-1(Y') prediction learner for k-1 st round of scoring, betakRepresenting the residual error produced by the k-th round.
Step 7.4: and continuing iteration until the iteration is completed, and completing model building.
And (3) optimizing a model part:
Step 8: the GBDT model parameters are optimized with a Bayesian method, with the following specific steps:
Step 8.1: initialize the data set D' = {(x'_1, y'_1), ..., (x'_n, y'_n)}, where y'_i = f'(x'_i); the objective function f'(x') is the mapping from the dimension attributes in the data to the scores.
Step 8.2: the GBDT model is trained with the selected hyper-parameter combination x'_i to calculate f'(x'_i).
Step 8.3: the next hyper-parameter combination x'_{i+1} is calculated with an acquisition function.
Step 8.4: steps 8.2 and 8.3 are repeated for T' iterations.
Step 8.5: the hyper-parameter combination that optimizes the objective function f'(x') is output.
And step 9: and selecting the optimal hyperparameter combination obtained through Bayesian optimization to retrain the GBDT personalized recommendation model.
The personalized recommendation part:
Step 10: Top N recommendation and effect verification are performed with the prediction results of the final personalized recommendation model on the test set Test, with the following specific steps:
Step 10.1: set the value of N, i.e. the number of items recommended to each user, and define the number of users as count.
Step 10.2: for each user, the real recommendation list generated on the test set Test is recorded as T(all); score prediction is performed on the test set Test with the Bayesian-optimized GBDT recommendation model, and the obtained result is defined as the test score set.
Step 10.3: the test score set is sorted by score, the first N items are recommended to each user, and the Top N recommendation list obtained for each user is recorded as T(test).
Step 10.4: the precision and recall results on the test score set are verified.
Step 10.5: the length of T(test) is calculated.
Step 10.6: the length of T(all) is calculated.
Step 10.7: T(U), the intersection of each user's Top N recommendation list T(test) with T(all), is calculated.
Step 10.8: the precision is calculated:
Precision = |T(U)| / |T(test)|   (10)
The precision generated for each user is accumulated, and the sum is divided by count to obtain the average precision.
Step 10.9: the recall is calculated:
Recall = |T(U)| / |T(all)|   (11)
The recall generated for each user is accumulated, and the sum is divided by count to obtain the average recall.
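The per-user precision and recall of steps 10.7 through 10.9 (Eqs. (10) and (11)) reduce to a few lines; the toy user data below is fabricated purely for illustration:

```python
def precision_recall_at_n(top_n, relevant):
    """Precision = |T(U)|/|T(test)|, Recall = |T(U)|/|T(all)| for one user."""
    hits = len(set(top_n) & set(relevant))   # T(U), the intersection
    return hits / len(top_n), hits / len(relevant)

# {user: (Top N list T(test), real list T(all))} -- fabricated example data
users = {"u1": (["a", "b", "c"], ["a", "c", "d", "e"]),
         "u2": (["x", "y", "z"], ["y"])}
precisions, recalls = zip(*(precision_recall_at_n(t, a) for t, a in users.values()))
avg_precision = sum(precisions) / len(users)  # divide the accumulated sum by count
avg_recall = sum(recalls) / len(users)
```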
The above technical features constitute an embodiment of the invention, which has strong adaptability and good implementation effect; unnecessary technical features can be added or removed according to actual needs to meet the requirements of different situations.

Claims (7)

1. A personalized recommendation method based on ensemble learning is characterized by comprising the following steps:
step 1: analyzing the dimension attributes of the personalized recommendation data and dividing them into "user-item-score" data; performing data association on the related "user-item-score" dimensions;
step 2: after the processing is finished, analyzing the data type of each "user-item-score" dimension attribute and converting it into the data type required by ensemble learning;
step 3: generating feature attributes from the "score" attribute among the "user-item-score" dimension attributes;
step 4: normalizing all the obtained data as follows:
v' = (vv - min) / (max - min)   (1)
wherein vv represents the original value of the data, v' represents the value after normalization, min represents the minimum value of the column where vv is located, and max represents the maximum value of that column;
step 5: letting the "user-item-score" data set A in the original space have m sample points x_1, x_2, ..., x_m, each sample point x_i being an l-dimensional vector with i an integer from 1 to m, and X being the matrix whose columns are the m samples; reducing the dimension of the data set A with the manifold learning Locality Preserving Projection algorithm, the reduced data set B consisting of sample points y_1, y_2, ..., y_m, each sample point y_i being an n-dimensional vector with l > n, and Y being the matrix whose columns are the m reduced samples;
step 6: dividing the reduced data set B into a training set Train and a test set Test in the ratio 8:2, wherein Y' is the data matrix corresponding to the training set Train;
step 7: establishing a personalized recommendation model with the ensemble learning gradient boosting decision tree method;
step 8: optimizing the gradient boosting decision tree model parameters with a Bayesian method;
step 9: retraining the gradient boosting decision tree personalized recommendation model with the optimal hyper-parameter combination obtained through Bayesian optimization;
step 10: performing Top N recommendation and effect verification according to the prediction results of the final personalized recommendation model on the test set.
2. The ensemble learning-based personalized recommendation method according to claim 1, wherein: in step 3, the number of times each user scores the item is counted, and the formula is as follows:
CountRating(b) = |R(b)|   (2)
wherein b represents the b-th user in the "user-item-score" data set A, which contains d users in total, R(b) is the set of scores user b has given to the items, and CountRating(b) denotes "the total number of times user b has reviewed items".
3. The ensemble learning-based personalized recommendation method according to claim 1, wherein: the step 5 specifically includes the following steps:
step 5.1: constructing a graph: for every pair of samples x_i and x_j in the "user-item-score" data set A, the Euclidean distance between them is calculated as follows:
‖x_i − x_j‖ = sqrt( Σ_{r=1}^{l} (x_{i,r} − x_{j,r})² )  (3)
wherein ε is a manually set threshold, taken as the average distance over all sample pairs, and m is the total number of samples in the data set; if the Euclidean distance between two samples is smaller than ε, the two samples are considered very close to each other, and an edge is established between node i and node j of the graph;
step 5.2: determining the weights: if node i is connected with node j, the weight of the edge between them is calculated by the following heat kernel function:
ω_ij = exp( −‖x_i − x_j‖² / t )  (4)
wherein ω_ij represents the weight between node i and node j, x_i and x_j are samples in the "user-item-score" data set A, and t is a manually set real number greater than 0;
step 5.3: calculating a projection matrix, wherein a formula for calculating the projection matrix is as follows:
XLX^T a = λ XDX^T a  (5)
suppose the solutions of formula (5) are a_0, a_1, …, a_{l−1}, sorted so that their corresponding eigenvalues λ increase from small to large; the projection transformation matrix is C = (a_0, a_1, …, a_{n−1}), formed by the eigenvectors corresponding to the n smallest eigenvalues, and the dimensionality-reduced sample point is y_i = C^T x_i;
wherein X is the matrix X mentioned in step 5; the adjacency matrix W is formed by the weights ω_ij determined in step 5.2; the main diagonal of the diagonal matrix D consists of the weighted degrees of the vertices of the graph constructed in step 5.1, wherein the weighted degree of node i is the sum of the weights of all edges associated with that node, i.e., the sum of the elements of the i-th row of the adjacency matrix W; and the Laplacian matrix L is defined as L = D − W.
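The LPP procedure of claim 3 (graph construction with a mean-distance threshold, heat kernel weights, and the generalized eigenproblem of formula (5)) can be sketched with NumPy as follows; this is an illustrative sketch only: the generalized eigenproblem is solved approximately via a pseudo-inverse, and the sample data are random stand-ins:

```python
import numpy as np

def lpp(X, n_components, t=1.0):
    """Minimal Locality Preserving Projection sketch.
    X: (l, m) matrix, one l-dimensional sample per column."""
    m = X.shape[1]
    # pairwise Euclidean distances between samples (columns of X)
    diff = X[:, :, None] - X[:, None, :]
    dist = np.sqrt((diff ** 2).sum(axis=0))
    eps = dist[np.triu_indices(m, k=1)].mean()   # threshold = mean pair distance
    # heat kernel weights on edges shorter than eps (no self-loops)
    W = np.where((dist < eps) & (dist > 0), np.exp(-dist ** 2 / t), 0.0)
    D = np.diag(W.sum(axis=1))                   # weighted degree matrix
    L = D - W                                    # graph Laplacian L = D - W
    A, B = X @ L @ X.T, X @ D @ X.T
    # approximate solve of X L X^T a = lambda X D X^T a
    vals, vecs = np.linalg.eig(np.linalg.pinv(B) @ A)
    order = np.argsort(vals.real)
    C = vecs[:, order[:n_components]].real       # n smallest-eigenvalue directions
    return C.T @ X                               # reduced samples, one per column

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))     # 20 hypothetical five-dimensional samples
Y = lpp(X, n_components=2)
print(Y.shape)                   # (2, 20)
```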
4. The ensemble learning-based personalized recommendation method according to claim 1, wherein: the step 7 comprises the following steps:
step 7.1: the gradient boosting decision tree model is defined, and the formula is as follows:
f_K(Y′) = Σ_{k=1}^{K} h_k(Y′)  (6)
wherein Y′ is the matrix Y′ mentioned in step 6, k is the round index of the score prediction learner, and K is the total number of rounds; f_k(Y′) is the score prediction learner of the k-th round, and h_k(Y′) represents the k-th classification regression decision tree;
step 7.2: constructing a classification regression decision tree, namely h (Y') in the step 7.1;
step 7.3: the score prediction learner adopts a forward stagewise algorithm; the model of the k-th step is built from the model of the (k−1)-th step, i.e., the k-th round of the score prediction learner depends on the learners of the previous k−1 rounds, according to the following formula:
f_k(Y′) = f_{k−1}(Y′) + β_k  (7)
wherein f_k(Y′) is the score prediction learner of the k-th round, f_{k−1}(Y′) is the score prediction learner of the (k−1)-th round, and β_k represents the term fitted to the residual generated in the k-th round;
step 7.4: and continuing iteration until the iteration is completed, and completing model building.
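The forward stagewise construction of steps 7.1-7.4 can be sketched as follows (illustrative only; one-dimensional regression stumps stand in for the classification regression decision trees, and the data are hypothetical):

```python
import numpy as np

def fit_stump(x, residual):
    """Best single-split regression stump on a 1-D feature (least squares)."""
    best = (np.inf, None)
    for s in np.unique(x):
        left, right = residual[x <= s], residual[x > s]
        if len(left) == 0 or len(right) == 0:
            continue
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if err < best[0]:
            best = (err, (s, left.mean(), right.mean()))
    s, pl, pr = best[1]
    return lambda q: np.where(q <= s, pl, pr)

def gbdt_fit(x, y, rounds=20):
    """Forward stagewise boosting: each round fits a stump h_k to the
    current residual, so f_k = f_{k-1} + h_k as in formula (7)."""
    pred = np.full_like(y, y.mean(), dtype=float)
    for _ in range(rounds):
        h = fit_stump(x, y - pred)      # h_k fitted to the round-k residual
        pred = pred + h(x)              # f_k = f_{k-1} + h_k
    return pred

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.0, 1.2, 0.9, 3.1, 3.0, 2.9])
pred = gbdt_fit(x, y)
print(np.round(pred, 2))                # close to y after 20 rounds
```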
5. The ensemble learning-based personalized recommendation method according to claim 4, wherein: in the step 7.2, the method comprises the following steps:
step 7.21: partitioning the preprocessed data set B into regions H_1, H_2, …, H_o, whose output values are p_1, p_2, …, p_o respectively;
Step 7.22: recursively dividing each region into two sub-regions and determining an output value on each sub-region; selecting an optimal segmentation variable q and a segmentation point s according to the following formula;
min_{q,s} [ min_{p_1} Σ_{u_v ∈ H_1(q,s)} (w_v − p_1)² + min_{p_2} Σ_{u_v ∈ H_2(q,s)} (w_v − p_2)² ]  (8)
wherein p_1 is the output of the region H_1 divided in step 7.21 and p_2 is the output of the region H_2 divided in step 7.21; u_v and w_v respectively represent the feature attributes and the score of the v-th data sample in the corresponding region, the maximum value of v being the number of samples in that region; traversing the splitting variable q and, for each fixed q, scanning the splitting point s, the pair (q, s) that minimizes the above formula is selected; the region is then divided by the selected pair (q, s) and the corresponding output values are determined;
step 7.23: continuing to call steps 7.21 and 7.22 for the two sub-regions until a stop condition is met;
step 7.24: partitioning the input space into o regions H′_1, H′_2, …, H′_o and generating the score prediction classification regression decision tree according to the following formula:
h(u) = Σ_{v=1}^{o} p_v · I( u ∈ H′_v )  (9)
wherein h(u) is the score prediction classification regression decision tree; H′_v is the v-th divided region and o is the total number of divided regions; I(·) equals 1 when u falls in region H′_v and 0 otherwise; p_v is the fixed output value of the corresponding region divided in step 7.21; and the regions are determined by the optimal pairs (q′, s′) obtained by iterating step 7.21 and step 7.22.
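The exhaustive (q, s) search of steps 7.21-7.22, minimizing the two-region squared error of formula (8), can be sketched as follows (illustrative only; the feature matrix and scores are hypothetical):

```python
import numpy as np

def best_split(U, w):
    """Search for the (q, s) pair minimising the two-region squared error.
    U: (n_samples, n_features) feature attributes; w: scores."""
    best = (np.inf, None, None)
    for q in range(U.shape[1]):                  # candidate splitting variable q
        for s in np.unique(U[:, q]):             # candidate splitting point s
            left, right = w[U[:, q] <= s], w[U[:, q] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            # inner minimisation: the optimal outputs p1, p2 are the region means
            err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if err < best[0]:
                best = (err, q, s)
    return best[1], best[2]

U = np.array([[1.0], [2.0], [10.0], [11.0]])     # one hypothetical feature
w = np.array([1.0, 1.0, 5.0, 5.0])               # hypothetical scores
print(best_split(U, w))                          # splits feature 0 at 2.0
```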
6. The ensemble learning-based personalized recommendation method according to claim 1, wherein: the step 8 comprises the following steps:
step 8.1: initializing the data set D′ = {(x′_1, y′_1), …, (x′_n, y′_n)}, wherein y′_i = f′(x′_i), and f′(x′) is the objective function mapping a hyperparameter combination to the model score;
step 8.2: training the gradient boosting decision tree model with the selected hyperparameter combination x′_i and calculating f′(x′_i);
step 8.3: calculating the next hyperparameter combination x′_{i+1} by means of an acquisition function;
Step 8.4: repeating the step 8.2 and the step 8.3, and iterating for T' times;
step 8.5: outputting the hyperparameter combination that optimizes the objective function f′(x′).
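The Bayesian optimization loop of claim 6 can be sketched as follows. This is an illustrative sketch only: a toy nearest-neighbour surrogate with a lower-confidence-bound rule stands in for the usual Gaussian-process surrogate and acquisition function, and the objective is a hypothetical stand-in for retraining the model with a given hyperparameter:

```python
import random

def objective(x):
    """Hypothetical validation loss for hyperparameter x (a stand-in for
    training the GBDT with, e.g., learning rate x and scoring it)."""
    return (x - 0.3) ** 2

def suggest(history, candidates, kappa=1.0):
    """Acquisition step: pick the candidate minimising a lower-confidence
    bound under a toy nearest-neighbour surrogate."""
    def lcb(c):
        nearest = min(history, key=lambda h: abs(h[0] - c))
        mean = nearest[1]                  # surrogate mean: nearest observed value
        sigma = abs(nearest[0] - c)        # crude uncertainty: distance to it
        return mean - kappa * sigma
    return min(candidates, key=lcb)

rng = random.Random(0)
history = [(x, objective(x)) for x in (0.05, 0.9)]    # initial evaluations
for _ in range(20):                                    # iterate T' times
    x_next = suggest(history, [rng.uniform(0, 1) for _ in range(50)])
    history.append((x_next, objective(x_next)))        # evaluate and record
best_x, best_y = min(history, key=lambda h: h[1])
print(round(best_x, 2))   # best evaluated hyperparameter (should approach 0.3)
```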
7. The ensemble learning-based personalized recommendation method according to claim 1, wherein: the step 10 includes the steps of:
step 10.1: setting the value of N, namely the number of items recommended to each user, and defining the number of users as count;
step 10.2: for each user, recording the real recommendation list generated on the Test set Test as T(all); performing score prediction on the Test set Test with the gradient boosting decision tree recommendation model obtained through Bayesian optimization, the obtained result being defined as the test score set;
step 10.3: sorting the test score set by score, recommending the top N items to each user, and recording the Top-N recommendation list obtained by each user as T(test);
step 10.4: verifying the precision and recall results on the test score set;
step 10.5: calculating the length (size) of T(test);
step 10.6: calculating the length (size) of T(all);
step 10.7: calculating the intersection T(U) of each user's Top-N recommendation list T(test) and the real list T(all);
step 10.8: calculating the precision:
Precision = |T(U)| / |T(test)|  (10)
accumulating the precision obtained for each user and dividing the sum by count to obtain the average precision;
step 10.9: calculating the recall ratio:
Recall = |T(U)| / |T(all)|  (11)
accumulating the recall obtained for each user and dividing the sum by count to obtain the average recall.
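The per-user precision and recall of steps 10.7-10.9 can be sketched as follows (illustrative only; the item identifiers and lists are hypothetical):

```python
def precision_recall(top_n, actual):
    """Top-N precision and recall for one user.
    top_n: recommended list T(test); actual: real list T(all)."""
    hit = len(set(top_n) & set(actual))     # |T(U)|, the intersection size
    return hit / len(top_n), hit / len(actual)

t_test = ["i1", "i2", "i3", "i4"]           # hypothetical Top-N list T(test)
t_all = ["i2", "i4", "i9"]                  # items the user actually liked, T(all)
p, r = precision_recall(t_test, t_all)
print(p, r)   # 0.5 and about 0.667
```

Averaging these per-user values over all count users gives the average precision and average recall described above.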
CN202110629501.6A 2021-03-26 2021-06-07 Personalized recommendation method based on ensemble learning Active CN113326433B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2021103231807 2021-03-26
CN202110323180 2021-03-26

Publications (2)

Publication Number Publication Date
CN113326433A true CN113326433A (en) 2021-08-31
CN113326433B CN113326433B (en) 2023-10-10

Family

ID=77419834

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110629501.6A Active CN113326433B (en) 2021-03-26 2021-06-07 Personalized recommendation method based on ensemble learning

Country Status (1)

Country Link
CN (1) CN113326433B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105843928A * 2016-03-28 2016-08-10 Xidian University Recommendation method based on double-layer matrix decomposition
CN108763362A * 2018-05-17 2018-11-06 Zhejiang University of Technology Top-N movie recommendation method based on random anchor point selection and local model weighted fusion
CN110109902A * 2019-03-18 2019-08-09 Guangdong University of Technology E-commerce platform recommendation system based on an ensemble learning method
CN110297978A * 2019-06-28 2019-10-01 Sichuan Jinmi Information Technology Co., Ltd. Personalized recommendation algorithm based on ensemble regression
CN110348580A * 2019-06-18 2019-10-18 4Paradigm (Beijing) Technology Co., Ltd. Method and apparatus for constructing a GBDT model, and prediction method and apparatus
WO2020233245A1 * 2019-05-20 2020-11-26 Shandong University of Science and Technology Method for bias tensor factorization with context feature auto-encoding based on regression tree
CN112183946A * 2020-09-07 2021-01-05 Tencent Music Entertainment Technology (Shenzhen) Co., Ltd. Multimedia content evaluation method, device and training method thereof

Non-Patent Citations (1)

Title
NIE Lisheng, "Personalized recommendation of learning resources based on behavior analysis", Computer Technology and Development, no. 07 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant