CN111191707A

CN111191707A - LFM training sample construction method fusing time attenuation factors

Info

Publication number: CN111191707A
Application number: CN201911356445.2A
Authority: CN
Inventors: 甘志刚; 饶屾; 蒋晓宁; 余长宏; 余斌霄
Original assignee: Zhejiang Gongshang University
Current assignee: Zhejiang Gongshang University
Priority date: 2019-12-25
Filing date: 2019-12-25
Publication date: 2020-05-22
Anticipated expiration: 2039-12-25
Also published as: CN111191707B

Abstract

The invention provides a method for constructing LFM training samples that incorporates a time attenuation factor, comprising the following steps: s1) obtaining positive samples; s2) calculating the popularity of the items in the entire training set; s3) evaluating the data diversity of the sample library; s4) reporting the recommendation accuracy and recall of the algorithm for sample libraries of different composition; s5) selecting samples at the optimal popularity ratio as negative samples; s6) combining the positive and negative samples into the training sample library of the user. The advantages of the invention are that the influence of item popularity and sample diversity on recommendation performance is considered jointly, a time attenuation factor is incorporated, the recommendation accuracy and recall for different sample libraries are obtained experimentally, the optimal popularity ratio is derived from that analysis, the optimal negative samples are obtained, and a better training result for the FC-LFM algorithm is achieved.

Description

LFM training sample construction method fusing time attenuation factors

Technical Field

The invention relates to the technical field of internet big data processing, and in particular to a method for constructing LFM training samples that incorporates a time attenuation factor.

Background

The LFM method fusing a time attenuation factor (FC-LFM) is a machine-learning-based latent factor (latent semantic) model. It generates a user feature matrix P and an item feature matrix Q by learning from training samples, so the method used to construct the training sample library is particularly important. In the traditional method of constructing LFM training samples for a user u, the items rated by user u are used as positive samples with their value set to 1, and a certain number of items not rated by user u are randomly drawn from the training set as negative samples with their value set to 0. The positive and negative samples together form the training samples of user u.
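The traditional construction described above can be sketched as follows. This is a minimal illustration, not the patent's code; the function name `build_traditional_samples` and its parameters are hypothetical.

```python
import random

def build_traditional_samples(rated_items, all_items, n_negative, seed=0):
    """Label rated items 1 and uniformly drawn unrated items 0."""
    rng = random.Random(seed)
    rated = set(rated_items)
    # Sort the unrated items so the draw is reproducible for a given seed.
    unrated = sorted(set(all_items) - rated)
    negatives = rng.sample(unrated, min(n_negative, len(unrated)))
    samples = {item: 1 for item in rated}
    samples.update({item: 0 for item in negatives})
    return samples

samples = build_traditional_samples(["a", "b"], ["a", "b", "c", "d", "e"], n_negative=2)
```

Every unrated item is equally likely to be drawn here, regardless of how well known it is.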

Negative samples are meant to represent items the user is not interested in. However, among the items a user has not rated, some are simply not popular enough for the user to know about them, and they are not necessarily of a type the user dislikes. Drawing negative samples completely at random from the unrated items can therefore reduce recommendation accuracy, because it ignores the case where an item is unrated only because the user is unaware of it. Conversely, if the negative samples are concentrated too heavily on highly popular items, the training sample library loses diversity, which also reduces recommendation accuracy. A balance point between sample popularity and sample diversity therefore has to be found experimentally and used as the basis for sample composition.

Disclosure of Invention

The invention provides an LFM training sample construction method that jointly considers the influence of item popularity and sample diversity on recommendation performance and incorporates a time attenuation factor.

To achieve this purpose, the invention adopts the following technical solution:

The LFM training sample construction method fusing a time attenuation factor comprises the following steps:

s1) obtaining the items evaluated by the user u from the training samples, wherein the number of the items is Sp, and the items are used as positive samples;

s2) calculating the popularity of the items in the entire training set:

where u_i denotes a user who has rated item i, T_r denotes the training set, and f_it denotes the time attenuation factor of item i at time t;

where t_now is the current time, and the time at which the user rated item i is measured in days;

s3) using the Simpson diversity index to evaluate the data diversity of the sample library, with the formula:

$$D = 1 - \sum_{i=1}^{S} P_i^{2} \qquad (3)$$

where S represents the entire sample set, and P_i represents the probability that a drawn sample falls in the i-th interval;

s4) sorting the items by popularity from high to low; taking the top 10%, 20%, 30%, ..., 100% most popular items in turn as sample libraries, and randomly drawing negative samples from each sample library to build a learning sample library; holding fixed the parameter α = 0.1, the regularization parameter λ = 0.01, the number of training iterations epochs = 10, the number of classes K = 30, and the positive-to-negative sample ratio 1:10; reporting the recommendation accuracy and recall of the algorithm for each sample library, and tabulating the results for comparison;

the accuracy describes the proportion of items in the recommendation list that the user has actually seen, and is calculated as:

$$\mathrm{Precision} = \frac{\sum_{u} |R(u) \cap T(u)|}{\sum_{u} |R(u)|} \qquad (4)$$

the recall describes the proportion of the items seen by the user in the test set T that appear in the recommendation list, and is calculated as:

$$\mathrm{Recall} = \frac{\sum_{u} |R(u) \cap T(u)|}{\sum_{u} |T(u)|} \qquad (5)$$

where T denotes the test set, R(u) denotes the list of items recommended to user u by the recommendation algorithm, and T(u) denotes the items actually rated by user u in the test set T;

s5) from the samples that user u has not evaluated, selecting, according to the formula

Popularity(i) ≤ PRatio (6)

the samples whose popularity ratio is R_best as negative samples, where Sample_u is the sample library of user u and PRatio is the popularity ratio of the sample library;

s6) combining the positive and negative samples into a sample set that serves as the training sample library of user u.
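Steps s5) and s6) can be sketched as below. This is a minimal illustration under stated assumptions: popularity scores are already computed and passed in as a dictionary, the popularity ratio is read as "the top fraction of items when ranked by popularity", and the 1:10 positive-to-negative ratio from step s4) is used as the default. The function name `build_training_samples` and its parameters are hypothetical.

```python
import random

def build_training_samples(rated_items, popularity, p_ratio, neg_per_pos=10, seed=0):
    """Draw negatives only from the top `p_ratio` fraction of items by popularity."""
    rng = random.Random(seed)
    # Rank items from most to least popular.
    ranked = sorted(popularity, key=popularity.get, reverse=True)
    # ASSUMPTION: the popularity ratio is the fraction of the ranking kept.
    cutoff = max(1, round(len(ranked) * p_ratio))
    pool = ranked[:cutoff]
    rated = set(rated_items)
    # Only items the user has not rated can become negative samples.
    candidates = [item for item in pool if item not in rated]
    n_neg = min(len(candidates), neg_per_pos * len(rated))
    negatives = rng.sample(candidates, n_neg)
    # Merge: positives are labelled 1, negatives 0.
    samples = {item: 1 for item in rated}
    samples.update({item: 0 for item in negatives})
    return samples

popularity = {"m1": 0.9, "m2": 0.7, "m3": 0.5, "m4": 0.2, "m5": 0.1}
samples = build_training_samples(["m1"], popularity, p_ratio=0.8)
```

In this toy call the least popular item, `m5`, falls outside the top 80% and can never be chosen as a negative sample.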

Compared with the prior art, the invention has the following advantages:

The LFM training sample construction method fusing a time attenuation factor jointly considers the influence of item popularity and sample diversity on recommendation performance and incorporates a time attenuation factor. The recommendation accuracy and recall of the algorithm for sample libraries of different composition are obtained experimentally, the optimal popularity ratio is derived from that analysis, the optimal negative samples are obtained, and a better training result for the FC-LFM algorithm is achieved.

Drawings

FIG. 1 is a schematic flow diagram of the present invention.

FIG. 2 is a graph illustrating the impact of the proportion of most-popular movies on the recommendation performance of the algorithm in an embodiment of the present invention.

FIG. 3 is a graph illustrating the effect of sample library data diversity on algorithm recommendation performance in an embodiment of the present invention.

Detailed Description

Embodiments of the present invention will be described in further detail below with reference to the accompanying drawings.

The data set used by the invention is MovieLens 100K (https://grouplens.org/datasets/movielens/100k/), provided by the GroupLens Research lab at the University of Minnesota, USA. It contains 100,000 ratings of 1,682 movies by 943 users, with a data sparsity of 93.7%. Each record includes the user's rating of the movie (1 to 5 points) and the time of the rating (accurate to the second). The data are randomly split in a 9:1 ratio into a training set and a test set, which are used to train and test the algorithm.
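The sparsity figure follows directly from the counts above, and the 9:1 split can be sketched in a few lines. The helper name `split_ratings` and the fixed seed are assumptions made for illustration; the source does not describe how the random split is implemented.

```python
import random

# Sparsity: share of the user-movie matrix that has no rating.
n_users, n_movies, n_ratings = 943, 1682, 100_000
sparsity = 1 - n_ratings / (n_users * n_movies)

def split_ratings(records, test_fraction=0.1, seed=0):
    """Randomly split rating records into (training set, test set)."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Stand-in records; real records would be (user, movie, rating, timestamp) tuples.
train, test = split_ratings(range(100))
```

Here `sparsity` evaluates to about 0.937, which matches the 93.7% stated in the text.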

Specifically, the construction method of the LFM training sample fused with the time attenuation factor comprises the following steps:

s1) obtaining movies evaluated by the user u from the training samples, wherein the number of the movies is Sp, and the movies are used as positive samples;

s2) the popularity of a movie is assessed with a popularity index, which represents the proportion of all users who have rated the item; the higher the popularity, the more users have rated the item. Because popularity changes over time, an item that was popular earlier is not necessarily popular recently, so a time attenuation factor is introduced into the popularity here, and the popularity of each movie in the entire training set is calculated as:

where u_i denotes a user who has rated movie i, T_r denotes the training set, and f_it denotes the time attenuation factor of movie i at time t;

where t_now is the current time, and the time at which the user rated movie i is measured in days;
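A time-weighted popularity of this kind can be sketched as follows. The exact popularity and attenuation formulas are not reproduced in this text, so the decay used here, 1 / (1 + decay_rate × elapsed days), is purely an assumption for illustration, as are the names `time_decay`, `decayed_popularity`, and `decay_rate` and the choice to normalize by the number of users.

```python
from collections import defaultdict

SECONDS_PER_DAY = 86_400

def time_decay(t_now, t_rating, decay_rate):
    """ASSUMED decay form; the source's own formula is not available."""
    days = (t_now - t_rating) / SECONDS_PER_DAY  # timestamps are in seconds
    return 1.0 / (1.0 + decay_rate * days)

def decayed_popularity(ratings, t_now, decay_rate):
    """ratings: iterable of (user, item, timestamp) tuples from the training set."""
    ratings = list(ratings)
    users = {user for user, _, _ in ratings}
    score = defaultdict(float)
    for _, item, timestamp in ratings:
        # Each rating counts less the older it is.
        score[item] += time_decay(t_now, timestamp, decay_rate)
    # Normalize by the number of users so undecayed ratings give a proportion.
    return {item: s / len(users) for item, s in score.items()}

t_now = 1_000_000_000
ten_days_ago = t_now - 10 * SECONDS_PER_DAY
ratings = [("u1", "m1", t_now), ("u2", "m1", ten_days_ago), ("u1", "m2", ten_days_ago)]
popularity = decayed_popularity(ratings, t_now, decay_rate=0.1)
```

With no decay, `m1` would score 1.0 because both users rated it; with the assumed decay, the ten-day-old rating counts only half.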

s3) to evaluate the influence of the data diversity of the sample library on the learning performance of the algorithm, the Simpson diversity index is used. It represents the probability that two randomly drawn individuals belong to different categories of data: the larger the index, the more dispersed the samples, and the smaller the index, the more concentrated the samples. The formula is:

$$D = 1 - \sum_{i=1}^{S} P_i^{2} \qquad (3)$$
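As a minimal sketch, the index can be computed as one minus the sum of squared category probabilities, which is the standard form matching the description above (the probability that two random draws fall in different categories). The function name `simpson_diversity` is hypothetical, and treating each sample's interval as a label is an assumption about how the intervals are represented.

```python
from collections import Counter

def simpson_diversity(labels):
    """Return 1 - sum(p_i^2), where p_i is the share of samples with label i."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one sample is required")
    return 1.0 - sum((count / total) ** 2 for count in counts.values())

concentrated = simpson_diversity(["top"] * 4)
balanced = simpson_diversity(["top", "top", "tail", "tail"])
```

A library drawn from a single interval scores 0, and the score rises as samples spread across more intervals.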

s4) sorting the movies by popularity from high to low; taking the top 10%, 20%, 30%, ..., 100% most popular movies in turn as sample libraries, and randomly drawing negative samples from each sample library to build a learning sample library; holding fixed the parameter α = 0.1, the regularization parameter λ = 0.01, the number of training iterations epochs = 10, the number of classes K = 30, and the positive-to-negative sample ratio 1:10; reporting the recommendation accuracy and recall of the algorithm for each sample library, and compiling Table 1 for comparison;

accuracy and recall reflect the effectiveness of the recommendation algorithm; the accuracy describes the proportion of movies in the recommendation list that the user has actually seen (with respect to the test set T), and is calculated as:

$$\mathrm{Precision} = \frac{\sum_{u} |R(u) \cap T(u)|}{\sum_{u} |R(u)|} \qquad (4)$$

the recall describes the proportion of the movies seen by the user in the test set T that appear in the recommendation list, and is calculated as:

$$\mathrm{Recall} = \frac{\sum_{u} |R(u) \cap T(u)|}{\sum_{u} |T(u)|} \qquad (5)$$

where T denotes the test set, R(u) denotes the list of movies recommended to user u by the recommendation algorithm, and T(u) denotes the movies actually rated by user u in the test set T;
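The two metrics can be sketched from the definitions of R(u) and T(u) above, using the common convention of summing hits over all users before dividing. The function name `precision_recall` and the dictionary layout are assumptions made for illustration.

```python
def precision_recall(recommended, relevant):
    """recommended[u] plays the role of R(u); relevant[u] plays the role of T(u)."""
    hits = n_recommended = n_relevant = 0
    for user, test_items in relevant.items():
        rec = set(recommended.get(user, []))
        truth = set(test_items)
        hits += len(rec & truth)       # |R(u) ∩ T(u)|
        n_recommended += len(rec)      # |R(u)|
        n_relevant += len(truth)       # |T(u)|
    precision = hits / n_recommended if n_recommended else 0.0
    recall = hits / n_relevant if n_relevant else 0.0
    return precision, recall

recommended = {"u1": ["a", "b", "c"], "u2": ["a", "d"]}
relevant = {"u1": ["a", "x"], "u2": ["d"]}
precision, recall = precision_recall(recommended, relevant)
```

In this toy case two of the five recommended movies are hits, and two of the three test-set movies are recovered.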

TABLE 1 algorithmic prediction error for different sample libraries

s5) to express how far down the popularity ranking the samples in the sample library extend, the popularity ratio (PRatio) of the sample library is used. The smaller the popularity ratio, the more the sample library is concentrated on the most popular samples; for example, a popularity ratio of 20% means that the sample library consists of the 20% most popular samples in the training sample set. The formula is:

Popularity(i) ≤ PRatio (6)

where Sample_u is the sample library of user u and PRatio is the popularity ratio of the sample library;

from the samples that user u has not evaluated, the samples whose popularity ratio is R_best are selected as negative samples according to equation (6);

Figs. 2 and 3 show how movie popularity and sample diversity affect the recommendation accuracy and recall of the algorithm. As can be seen from Table 1 and Figs. 2 and 3, when the sample diversity index is below 0.875, the recommendation accuracy and recall increase gradually as sample diversity increases; in this range the benefit of sample diversity outweighs the harm caused by the growing share of unpopular movies in the sample library. As more unpopular movies are added, sample diversity improves further but the recommendation accuracy and recall decline, because the effect of the unpopular movies on the prediction error now exceeds the benefit of the added diversity. Therefore, in subsequent experiments the sample library consists of the top 80% of movies in the training set by popularity (a popularity ratio of 80%).

The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make several modifications and improvements without departing from the concept of the invention, and such modifications and improvements shall also be regarded as falling within the scope of protection of the invention.

Claims

1. The LFM training sample construction method fused with the time attenuation factor is characterized by comprising the following steps:

s2) calculating the popularity of the items in the entire training set:

where t_now is the current time, and the time at which the user rated item i is measured in days;

s5) from samples that user u has not evaluated, according to the formula