CN111191707B

CN111191707B - LFM training sample construction method integrating time attenuation factors

Info

Publication number: CN111191707B
Application number: CN201911356445.2A
Authority: CN
Inventors: 甘志刚; 饶屾; 蒋晓宁; 余长宏; 余斌霄
Original assignee: Zhejiang Gongshang University
Current assignee: Zhejiang Gongshang University
Priority date: 2019-12-25
Filing date: 2019-12-25
Publication date: 2023-06-06
Anticipated expiration: 2039-12-25
Also published as: CN111191707A

Abstract

The invention provides a method for constructing LFM training samples that incorporates time attenuation factors, comprising the following steps: S1) obtaining positive samples; S2) calculating the popularity of the items in the whole training set; S3) evaluating the data diversity of the sample library; S4) reporting the recommendation accuracy and recall of the algorithm for different sample library compositions; S5) selecting samples at the optimal popularity ratio as negative samples; S6) combining the positive and negative samples into the training sample library of the user. The advantages of the invention are that it jointly considers the influence of item popularity and sample diversity on recommendation performance, incorporates time attenuation factors, obtains through experiments the recommendation accuracy and recall for different sample library compositions, and derives the optimal popularity ratio by analysis, thereby obtaining optimal negative samples and a good training result for the FC-LFM algorithm.

Description

LFM training sample construction method integrating time attenuation factors

Technical Field

The invention relates to the technical field of Internet big data processing, in particular to an LFM training sample construction method integrating time attenuation factors.

Background

The LFM method fusing time attenuation factors (FC-LFM) is a machine-learning-based latent factor model that generates a user feature matrix P and an item feature matrix Q by learning from training samples, so the method used to construct the training sample library is particularly important. The traditional LFM method for constructing the training samples of a user u takes the items evaluated by user u as positive samples with an evaluation value of 1, and randomly draws a certain number of items not evaluated by user u from the training set as negative samples with an evaluation value of 0. The positive and negative samples together form the training samples for user u.

A negative sample is meant to represent an item the user is not interested in. However, some of the items a user has not rated may simply not be popular enough for the user to know about them, and are not necessarily of a type the user dislikes. Drawing negative samples completely at random from the unrated items may therefore reduce recommendation accuracy, because it ignores the possibility that the user did not rate an item only because the user was unaware of it. Conversely, if the negative samples are concentrated too heavily on highly popular items, the training sample library loses diversity, which also reduces recommendation accuracy. A balance point between sample popularity and sample diversity therefore needs to be found through experiments and used as the basis for sample composition.
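The traditional construction described above can be sketched as follows. This is a minimal illustration, not the patented method; the function name `build_random_samples` and its parameters are hypothetical and chosen only for this example.

```python
import random

def build_random_samples(rated_items, all_items, n_negative, seed=None):
    """Traditional LFM sample construction for one user (illustrative sketch).

    rated_items: set of item ids the user has evaluated (positives, label 1).
    all_items:   iterable of every item id in the training set.
    n_negative:  number of unrated items to draw as negatives (label 0).
    """
    rng = random.Random(seed)
    # Candidates are the items the user has not evaluated.
    candidates = sorted(set(all_items) - set(rated_items))
    # Never request more negatives than there are candidates.
    k = min(n_negative, len(candidates))
    negatives = rng.sample(candidates, k)
    samples = {item: 1 for item in rated_items}
    samples.update({item: 0 for item in negatives})
    return samples
```

Because every unrated item is equally likely to be drawn, an obscure item the user has never encountered is as likely to become a negative sample as a well-known item the user chose not to rate, which is the weakness discussed above.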

Disclosure of Invention

The invention provides an LFM training sample construction method that jointly considers the influence of item popularity and sample diversity on recommendation performance and incorporates time attenuation factors.

In order to achieve the above purpose, the present invention is realized by the following technical scheme:

The LFM training sample construction method integrating time attenuation factors comprises the following steps:

S1) obtaining the items evaluated by user u from the training samples, the number of such items being Sp, and taking them as positive samples;

S2) calculating the popularity of the items in the whole training set:

where u_i denotes the users who have evaluated item i, T_r denotes the training set, and f_it denotes the time attenuation factor of item i at time t;

where t_now is the current time and t is the time at which the user evaluated item i, with time measured in days;

S3) evaluating the data diversity of the sample library using the Simpson diversity index (Simpson index), whose formula is:

where S denotes the whole sample set and P_i denotes the probability that a drawn sample falls within interval i;

S4) sorting the items by popularity from high to low, taking the top 10%, 20%, 30%, ..., 100% of items by popularity as sample libraries in turn, and randomly drawing negative samples from each sample library to construct a learning sample library, while keeping the parameter alpha = 0.1 and the regularization parameter lambda = 0.01 unchanged, with the number of training set iterations epochs = 10, the number of classes K = 30, and a positive-to-negative sample ratio of 1:10; the recommendation accuracy and recall of the algorithm for each sample library composition are then tabulated and compared;

the accuracy describes the proportion of items in the recommendation list that the user has actually seen, and is calculated as:

the recall describes the proportion of items seen by the user in the test set T that appear in the recommendation list, and is calculated as:

where T denotes the test set, R(u) denotes the list of items recommended to user u by the recommendation algorithm, and T(u) denotes the items actually evaluated by user u in the test set T;

S5) from the samples not evaluated by user u, according to the formula

Popularity(i) <= PRatio (6)

selecting the samples with a popularity ratio of R_best as negative samples, where Sample_u is the sample library for user u and PRatio is the popularity ratio of the sample library;

S6) combining the positive and negative samples into a sample set that serves as the training sample library of user u.
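Steps S1, S5 and S6 can be sketched as below. This is only an illustration under stated assumptions: formula (6) is read here as "keep the top PRatio share of items when ranked by popularity", and the function name `build_training_samples`, its parameters and the tie-breaking rule are hypothetical.

```python
import random

def build_training_samples(rated_items, popularity, p_ratio,
                           neg_per_pos=10, seed=None):
    """Sketch of steps S1, S5 and S6 for one user (assumed reading of formula 6).

    rated_items: set of item ids evaluated by the user (positives, label 1).
    popularity:  dict mapping item id -> popularity score from step S2.
    p_ratio:     fraction in (0, 1]; only the top p_ratio share of items,
                 ranked by popularity, may be drawn as negatives.
    neg_per_pos: negatives per positive (the text uses a 1:10 ratio).
    """
    rng = random.Random(seed)
    # Rank all items from most to least popular (ties broken by item id).
    ranked = sorted(popularity, key=lambda i: (-popularity[i], i))
    # Keep the top p_ratio share of the ranking as the sample library.
    cutoff = max(1, round(len(ranked) * p_ratio))
    library = ranked[:cutoff]
    # Negatives must be items the user has not evaluated.
    candidates = [i for i in library if i not in rated_items]
    k = min(len(candidates), neg_per_pos * len(rated_items))
    negatives = rng.sample(candidates, k)
    samples = {i: 1 for i in rated_items}
    samples.update({i: 0 for i in negatives})
    return samples
```

With `p_ratio=1.0` the sketch reduces to purely random negative sampling over all unrated items; smaller values concentrate the negatives on popular items.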

Compared with the prior art, the invention has the following advantages:

The LFM training sample construction method integrating time attenuation factors jointly considers the influence of item popularity and sample diversity on recommendation performance and incorporates time attenuation factors. The recommendation accuracy and recall of the algorithm for different sample library compositions are obtained through experiments, and the optimal popularity ratio is derived by analysis, so that optimal negative samples and a good FC-LFM training result are obtained.

Drawings

Fig. 1 is a schematic flow chart of the present invention.

Fig. 2 is a graph of the influence of the proportion of most popular movies on algorithm recommendation performance in an embodiment of the invention.

Fig. 3 is a graph of the influence of sample library data diversity on algorithm recommendation performance in an embodiment of the invention.

Detailed Description

Embodiments of the present invention are described in further detail below with reference to the accompanying drawings.

The data set used in the present invention is MovieLens (100k) (https://grouplens.org/datasets/movielens/100k/), provided by the GroupLens Research laboratory at the University of Minnesota, USA. It contains 100,000 rating records from 943 users on 1,682 movies, with a data sparsity of 93.7%. Each record includes the user's rating of the movie (1 to 5 points) and the time of the rating (accurate to the second). The data set is randomly split in a 9:1 ratio into a training set and a test set, which are used to train and test the algorithm.
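The 9:1 random split can be sketched as follows. The record layout `(user, item, rating, timestamp)`, the function name `split_ratings` and the fixed seed are assumptions made for illustration; the text does not specify how the split was implemented.

```python
import random

def split_ratings(records, train_share=0.9, seed=42):
    """Randomly split rating records into a training set and a test set."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    # Number of records assigned to the training set (9:1 by default).
    n_train = round(len(shuffled) * train_share)
    return shuffled[:n_train], shuffled[n_train:]
```

As a check on the sparsity quoted above, 100,000 ratings out of 943 x 1,682 possible user-movie pairs fill about 6.3% of the matrix, which leaves about 93.7% empty.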

Specifically, the method for constructing the LFM training sample by fusing the time attenuation factors comprises the following steps:

S1) obtaining the movies evaluated by user u from the training samples, the number of such movies being Sp, and taking them as positive samples;

S2) to evaluate how popular a movie is, the invention uses a popularity index. Popularity represents the proportion of all users who have evaluated the item; the higher the popularity, the more users have evaluated it. Because time affects popularity, and items that were popular earlier are not necessarily popular recently, a time attenuation factor is introduced into the popularity, and the popularity of each movie in the whole training set is calculated as:

where u_i denotes the users who have rated movie i, T_r denotes the training set, and f_it denotes the time attenuation factor of movie i at time t;

where t_now is the current time and t is the time at which the user evaluated movie i, with time measured in days;
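The popularity and attenuation formulas are not reproduced in this text, so the following is only a sketch of one possible time-attenuated popularity. The decay form 1 / (1 + decay_rate * age in days), the parameter `decay_rate`, the normalization by the number of users and both function names are assumptions, not the patent's formulas.

```python
from collections import defaultdict

SECONDS_PER_DAY = 86400

def time_decay(t_now, t, decay_rate=0.1):
    """Assumed decay form: 1 / (1 + decay_rate * age in days)."""
    age_days = max(0.0, (t_now - t) / SECONDS_PER_DAY)
    return 1.0 / (1.0 + decay_rate * age_days)

def decayed_popularity(records, t_now, decay_rate=0.1):
    """Time-attenuated popularity per item (illustrative assumption).

    records: iterable of (user, item, rating, timestamp_in_seconds),
             with each user rating a given item at most once.
    """
    users = set()
    weight = defaultdict(float)
    for user, item, _rating, ts in records:
        users.add(user)
        # Each evaluation counts less the longer ago it was made.
        weight[item] += time_decay(t_now, ts, decay_rate)
    n_users = len(users)
    # Normalize by the number of users so that, without decay, the value
    # equals the proportion of users who evaluated the item.
    return {item: w / n_users for item, w in weight.items()}
```

With `decay_rate=0` every evaluation has weight 1 and the result is the plain proportion of users who evaluated each item, matching the undecayed definition of popularity given in the text.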

S3) to evaluate the influence of the data diversity of the sample library on the learning performance of the algorithm, the data diversity of the sample library is evaluated using the Simpson diversity index (Simpson index). The Simpson diversity index represents the probability that two randomly sampled individuals belong to different categories: the larger the value, the more dispersed the samples; the smaller the value, the more concentrated they are. Its formula is:

where S denotes the whole sample set and P_i denotes the probability that a drawn sample falls within interval i;
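A minimal sketch of the diversity measure follows. It assumes the common form 1 - sum(P_i^2), which matches the description "probability that two randomly sampled individuals belong to different categories" when sampling with replacement; the function name `simpson_diversity` and the idea of passing one interval label per sample are illustrative choices.

```python
from collections import Counter

def simpson_diversity(labels):
    """Simpson diversity 1 - sum(p_i^2) over the intervals in `labels`.

    labels: one interval (category) label per drawn sample.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # p_i is the share of samples falling within interval i.
    return 1.0 - sum((c / total) ** 2 for c in counts.values())
```

For example, samples that all fall in one interval give 0, while samples spread evenly over two intervals give 0.5.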

S4) sorting the movies by popularity from high to low, taking the top 10%, 20%, 30%, ..., 100% of movies by popularity as sample libraries in turn, and randomly drawing negative samples from each sample library to construct a learning sample library, while keeping the parameter alpha = 0.1 and the regularization parameter lambda = 0.01 unchanged, with the number of training set iterations epochs = 10, the number of classes K = 30, and a positive-to-negative sample ratio of 1:10; the recommendation accuracy and recall of the algorithm for each sample library composition are compared in Table 1;

accuracy and recall reflect the effectiveness of the recommendation algorithm. Accuracy describes the proportion of movies in the recommendation list that the user actually watched (with respect to the test set T), and is calculated as:

recall describes the proportion of movies watched by the user in the test set T that appear in the recommendation list, and is calculated as:

where T denotes the test set, R(u) denotes the list of movies recommended to user u by the recommendation algorithm, and T(u) denotes the movies actually evaluated by user u in the test set T;
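The two metrics can be sketched as follows, using R(u) and T(u) as defined above. The formulas themselves are not reproduced in this text, so the pooling over users (total hits divided by total list length, or by total test items) is an assumption based on the usual top-N definitions; the function name `precision_recall` is hypothetical.

```python
def precision_recall(recommended, test_items):
    """Pooled accuracy (precision) and recall over all test users.

    recommended: dict user -> list of recommended movies, R(u).
    test_items:  dict user -> set of movies the user evaluated in T, T(u).
    """
    hits = n_recommended = n_relevant = 0
    for user, relevant in test_items.items():
        rec = set(recommended.get(user, []))
        hits += len(rec & set(relevant))   # |R(u) ∩ T(u)|
        n_recommended += len(rec)          # |R(u)|
        n_relevant += len(relevant)        # |T(u)|
    precision = hits / n_recommended if n_recommended else 0.0
    recall = hits / n_relevant if n_relevant else 0.0
    return precision, recall
```

Both values share the same numerator, so a longer recommendation list tends to raise recall while lowering accuracy.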

Table 1. Algorithm prediction error for different sample libraries

S5) to evaluate how much of the whole popularity-ranked sample set is covered by the samples in the sample library, the popularity ratio (PRatio) of the sample library is used. The smaller the popularity ratio, the more the samples in the sample library are concentrated on the most popular samples; a popularity ratio of 20%, for example, means that the top 20% of samples by popularity in the training set form the sample library. The formula is

Popularity(i) <= PRatio (6)

where Sample_u is the sample library for user u and PRatio is the popularity ratio of the sample library;

from the samples not evaluated by user u, the samples with a popularity ratio of R_best are selected according to formula (6) as negative samples;

Figs. 2 and 3 show how movie popularity and sample diversity affect the recommendation accuracy and recall of the algorithm. As can be seen from Table 1, Fig. 2 and Fig. 3, when the sample diversity index is below 0.875, the recommendation accuracy and recall of the algorithm increase gradually as sample diversity improves; in this range, the benefit that sample diversity brings to the algorithm outweighs the harm caused by the growing number of unpopular movies in the sample library. As more unpopular movies are added, sample diversity improves further but recommendation accuracy and recall decline, which shows that the effect of unpopular movies on the algorithm's prediction error has come to outweigh the benefit of sample diversity. In subsequent experiments, the sample library therefore consists of the top 80% of movies in the training set by popularity (popularity ratio = 80%).
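The selection of the best popularity ratio from such a sweep can be sketched as below. The callable `evaluate` is hypothetical: it stands for building the sample library at a given ratio, training the model and scoring the test set, none of which is shown here. Ranking by accuracy first and recall second is an assumed tie-breaking rule, not one stated in the text.

```python
def best_popularity_ratio(evaluate, ratios=None):
    """Return (best_ratio, results) for a sweep over candidate ratios.

    evaluate: hypothetical callable ratio -> (accuracy, recall).
    ratios:   candidate popularity ratios; defaults to 10%, 20%, ..., 100%.
    """
    if ratios is None:
        ratios = [k / 10 for k in range(1, 11)]
    results = {r: evaluate(r) for r in ratios}
    # Prefer the highest accuracy, breaking ties by recall.
    best = max(results, key=lambda r: results[r])
    return best, results
```

In the experiments described above, the ratio chosen this way was 80%.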

The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art may make modifications and adaptations without departing from the concept of the present invention, and such modifications and adaptations are intended to fall within the scope of protection of the present invention.

Claims

1. An LFM training sample construction method fusing time attenuation factors, characterized by comprising the following steps:

S2) calculating the popularity of the items in the whole training set:

where t_now is the current time and t is the time at which the user evaluated item i, with time measured in days;

S5) from the samples not evaluated by user u, according to the formula