CN111191707A - LFM training sample construction method fusing time attenuation factors - Google Patents


Info

Publication number
CN111191707A
Authority
CN
China
Prior art keywords
sample
user
popularity
training
items
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911356445.2A
Other languages
Chinese (zh)
Other versions
CN111191707B (en)
Inventor
甘志刚
饶屾
蒋晓宁
余长宏
余斌霄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Gongshang University
Original Assignee
Zhejiang Gongshang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Gongshang University
Priority to CN201911356445.2A
Publication of CN111191707A
Application granted
Publication of CN111191707B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0282 Rating or review of business operators or products
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/06 Buying, selling or leasing transactions
    • G06Q30/0601 Electronic shopping [e-shopping]
    • G06Q30/0631 Item recommendations

Abstract

The invention provides a method for constructing LFM training samples that fuses a time decay factor, comprising the following steps: S1) obtain the positive samples; S2) calculate the popularity of the items over the entire training set; S3) evaluate the data diversity of the sample library; S4) report the algorithm's recommendation accuracy and recall for different sample library compositions; S5) select the samples with the optimal popularity ratio as negative samples; S6) combine the positive and negative samples into the user's training sample library. The advantages of the invention are as follows: it jointly considers the influence of item popularity and sample diversity on recommendation performance and fuses a time decay factor; the recommendation accuracy and recall for different sample library compositions are obtained experimentally; and the optimal popularity ratio is derived by analysing these results and used to select the best negative samples, which yields a better training effect for the FC-LFM algorithm.

Description

LFM training sample construction method fusing time attenuation factors
Technical Field
The invention relates to the technical field of Internet big data processing, and in particular to a method for constructing LFM training samples that fuses time decay factors.
Background
The LFM fusing time decay factors (FC-LFM) is a machine-learning-based latent semantic model (latent factor model, LFM). It learns from training samples to generate a user feature matrix P and an item feature matrix Q, so the method used to construct the training sample library is particularly important. In the traditional method of constructing LFM training samples for a user u, the items that u has evaluated are used as positive samples, with the evaluation value set to 1; a certain number of items that u has not evaluated are drawn at random from the training set as negative samples, with the evaluation value set to 0. The positive and negative samples together form the training samples of user u.
Negative samples are meant to represent items the user is not interested in. Among the items a user has not rated, however, some are simply not popular enough for the user to have come across them, and they are not necessarily of a type the user dislikes. Drawing negative samples completely at random from the unrated items ignores this possibility and can therefore reduce recommendation accuracy. Conversely, if negative sampling concentrates too heavily on highly popular items, the training sample library loses diversity, which also reduces recommendation accuracy. A balance point between sample popularity and sample diversity therefore needs to be found experimentally to serve as the basis for sample composition.
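The traditional construction described above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function name `random_negative_samples`, the `ratio` and `seed` parameters, and the toy item ids are all assumptions, since the background only says that "a certain number" of unrated items are drawn at random.

```python
import random


def random_negative_samples(rated_items, all_items, ratio=10, seed=0):
    """Baseline sampler: rated items get label 1, random unrated items get label 0.

    `ratio` (negatives per positive) is an assumed parameter for this sketch.
    """
    rng = random.Random(seed)
    positives = sorted(set(rated_items))
    # Candidate negatives are simply every item the user has not rated.
    unrated = sorted(set(all_items) - set(positives))
    n_neg = min(len(unrated), ratio * len(positives))
    negatives = rng.sample(unrated, n_neg)
    samples = {item: 1 for item in positives}
    samples.update({item: 0 for item in negatives})
    return samples


# Toy usage: a user who rated items 1 and 2 out of a catalogue of 50 items.
samples = random_negative_samples([1, 2], range(1, 51), ratio=10)
```

Because every unrated item is equally likely to be drawn, an obscure item the user never encountered is treated exactly like a well-known item the user chose to ignore, which is the weakness the background points out.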
Disclosure of Invention
The invention provides an LFM training sample construction method that fuses time decay factors and jointly considers the influence of item popularity and sample diversity on recommendation performance.
To achieve this purpose, the invention adopts the following technical solution:
The LFM training sample construction method fusing time decay factors comprises the following steps:
S1) obtain from the training samples the items that user u has evaluated; their number is Sp, and they are used as positive samples;
S2) calculate the popularity of the items over the entire training set:
[Equation (1): rendered as an image in the original and not recoverable from this text]
where $u_i$ denotes a user who has evaluated item i, $T_r$ denotes the training set, and $f_{it}$ denotes the time decay factor of item i at time t;
[Equation (2): rendered as an image in the original and not recoverable from this text]
where $t_{now}$ is the current time, and the second time variable (an inline image in the original) is the time at which the user evaluated item i, measured in days;
S3) use the Simpson diversity index (Simpson index) to evaluate the data diversity of the sample library; the formula is:
$$\mathrm{Simpson} = 1 - \sum_{i \in S} P_i^{2} \qquad (3)$$
where S denotes the entire sample set and $P_i$ denotes the probability that a drawn sample falls in interval i;
S4) sort the items by popularity from high to low; take the top 10%, 20%, 30%, ..., 100% by popularity as sample libraries, and randomly draw negative samples from each sample library to build a learning sample library; keeping the parameter α = 0.1, the regularization parameter λ = 0.01, the number of training iterations epochs = 10, the number of classes K = 30, and the positive-to-negative sample ratio 1:10 fixed, report the algorithm's recommendation accuracy and recall for each sample library and tabulate them for comparison;
the accuracy rate describes what proportion of the items in the recommendation list were actually seen by the user, and is calculated as:
$$\mathrm{Precision} = \frac{\sum_{u} |R(u) \cap T(u)|}{\sum_{u} |R(u)|} \qquad (4)$$
the recall describes what proportion of the items seen by the user in the test set T appear in the recommendation list, and is calculated as:
$$\mathrm{Recall} = \frac{\sum_{u} |R(u) \cap T(u)|}{\sum_{u} |T(u)|} \qquad (5)$$
where T denotes the test set, R(u) denotes the list of items recommended to user u by the recommendation algorithm, and T(u) denotes the items that user u actually evaluated in the test set T;
S5) from the samples that user u has not evaluated, according to the formula
[left-hand part of equation (6): rendered as an image in the original and not recoverable from this text]
Popularity(i) ≤ PRatio (6)
select the samples whose popularity ratio is $R_{best}$ as negative samples, where $Sample_u$ is the sample library of user u and PRatio is the popularity ratio of the sample library;
S6) combine the positive and negative samples into one sample set, which serves as the training sample library of user u.
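As an illustration of steps S2 and S3, the following Python sketch computes a time-decayed popularity and a Simpson diversity index. Several things here are assumptions made for the example rather than the patent's own definitions: the decay form 1 / (1 + alpha * |t_now - t|), the normalisation by the number of distinct users, the `alpha` parameter (which is not necessarily the α of step S4), and all function names. The Simpson index is written in its common "1 minus the sum of squared proportions" form.

```python
from collections import defaultdict


def time_decay(t_now, t_eval, alpha=0.1):
    """ASSUMED decay form; times are in days, as in step S2."""
    return 1.0 / (1.0 + alpha * abs(t_now - t_eval))


def popularity(ratings, t_now, alpha=0.1):
    """ratings: iterable of (user, item, day). Returns {item: decayed popularity}.

    Each rating contributes its decay weight; the sum is divided by the number
    of distinct users (an assumed normalisation for this sketch).
    """
    ratings = list(ratings)
    n_users = len({user for user, _item, _day in ratings})
    weights = defaultdict(float)
    for _user, item, day in ratings:
        weights[item] += time_decay(t_now, day, alpha)
    return {item: w / n_users for item, w in weights.items()}


def simpson_index(counts):
    """Simpson diversity: 1 - sum(p_i ** 2) over the intervals of the sample set."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


# Toy usage: two users, two items, evaluation days 0 and 10, "today" is day 10.
toy_ratings = [("a", "x", 10), ("b", "x", 0), ("a", "y", 10)]
toy_popularity = popularity(toy_ratings, t_now=10)
toy_diversity = simpson_index([5, 3, 2])
```

Under this assumed decay, a rating made today counts fully and older ratings count progressively less, so an item that was rated often long ago can rank below an item rated less often but recently.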
Compared with the prior art, the invention has the following advantages:
The LFM training sample construction method fusing time decay factors jointly considers the influence of item popularity and sample diversity on recommendation performance and fuses a time decay factor. The recommendation accuracy and recall for different sample library compositions are obtained experimentally, the optimal popularity ratio is derived by analysing these results and used to select the best negative samples, and a better training effect is obtained for the FC-LFM algorithm.
Drawings
FIG. 1 is a schematic flow diagram of the present invention.
FIG. 2 is a graph illustrating the impact of the most popular movie proportions on the performance of algorithm recommendations in an embodiment of the present invention.
FIG. 3 is a graph illustrating the effect of sample library data diversity on algorithm recommendation performance in an embodiment of the present invention.
Detailed Description
Embodiments of the present invention will be described in further detail below with reference to the accompanying drawings.
The data set used by the invention is MovieLens 100K (https://grouplens.org/datasets/movielens/100k/), provided by the GroupLens Research lab at the University of Minnesota, USA. It contains 100,000 ratings of 1,682 movies by 943 users, with a data sparsity of 93.7%. Each record includes the user's rating of the movie (1 to 5 points) and the time of the rating (accurate to the second). Records are drawn at random from the data set in a 9:1 ratio to form the training set and the test set used to train and test the algorithm.
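The sparsity figure and the 9:1 split can be checked with a short Python sketch. The function names, the `seed` parameter and the shuffle-then-cut strategy are assumptions for illustration; the text only states that records are drawn at random in a 9:1 ratio.

```python
import random


def sparsity(n_ratings, n_users, n_items):
    """Fraction of the user-item matrix that holds no rating."""
    return 1.0 - n_ratings / (n_users * n_items)


def split_ratings(ratings, train_parts=9, test_parts=1, seed=0):
    """Shuffle the records and cut them into train/test at the given ratio."""
    rng = random.Random(seed)
    shuffled = list(ratings)
    rng.shuffle(shuffled)
    cut = len(shuffled) * train_parts // (train_parts + test_parts)
    return shuffled[:cut], shuffled[cut:]


print(round(sparsity(100_000, 943, 1682), 3))  # prints 0.937

# Toy usage with 100 placeholder records instead of the real data file.
train, test = split_ratings(range(100))
```

With 943 × 1,682 = 1,586,126 possible user-movie pairs and 100,000 ratings, about 6.3% of the matrix is filled, which matches the stated sparsity of 93.7%.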
Specifically, the construction method of the LFM training sample fused with the time attenuation factor comprises the following steps:
S1) obtain from the training samples the movies that user u has evaluated; their number is Sp, and they are used as positive samples;
S2) to measure how popular a movie is, the invention uses a popularity index, defined as the proportion of all users who have evaluated the item: the higher the popularity, the more users have evaluated it. Because popularity changes over time, an item that was popular earlier is not necessarily popular recently, so a time decay factor is introduced into the popularity, and the popularity of each movie over the entire training set is calculated as:
[Equation (1): rendered as an image in the original and not recoverable from this text]
where $u_i$ denotes a user who has evaluated movie i, $T_r$ denotes the training set, and $f_{it}$ denotes the time decay factor of movie i at time t;
[Equation (2): rendered as an image in the original and not recoverable from this text]
where $t_{now}$ is the current time, and the second time variable (an inline image in the original) is the time at which the user evaluated movie i, measured in days;
S3) to evaluate the influence of the data diversity of the sample library on the learning performance of the algorithm, the Simpson diversity index (Simpson index) is used to measure that diversity. It represents the probability that two randomly drawn individuals belong to different kinds of data: the larger the index, the more dispersed the samples; the smaller the index, the more concentrated the samples. The formula is:
$$\mathrm{Simpson} = 1 - \sum_{i \in S} P_i^{2} \qquad (3)$$
where S denotes the entire sample set and $P_i$ denotes the probability that a drawn sample falls in interval i;
S4) sort the movies by popularity from high to low; take the top 10%, 20%, 30%, ..., 100% by popularity as sample libraries, and randomly draw negative samples from each sample library to build a learning sample library; keeping the parameter α = 0.1, the regularization parameter λ = 0.01, the number of training iterations epochs = 10, the number of classes K = 30, and the positive-to-negative sample ratio 1:10 fixed, report the algorithm's recommendation accuracy and recall for each sample library, and compile Table 1 for comparison;
the accuracy rate describes what proportion of the movies in the recommendation list were actually seen by the user (with respect to the test set T); together with the recall, it reflects the effectiveness of the recommendation algorithm, and is calculated as:
$$\mathrm{Precision} = \frac{\sum_{u} |R(u) \cap T(u)|}{\sum_{u} |R(u)|} \qquad (4)$$
the recall describes what proportion of the movies seen by the user in the test set T appear in the recommendation list, and is calculated as:
$$\mathrm{Recall} = \frac{\sum_{u} |R(u) \cap T(u)|}{\sum_{u} |T(u)|} \qquad (5)$$
where T denotes the test set, R(u) denotes the list of movies recommended to user u by the recommendation algorithm, and T(u) denotes the movies that user u actually evaluated in the test set T;
Table 1. Algorithm prediction error for different sample libraries
[Table 1: rendered as an image in the original; its values are not reproduced in this text]
S5) to describe how much of the popularity ranking the sample library covers, the popularity ratio (PRatio) of the sample library is used. The smaller the popularity ratio, the more the sample library concentrates on the most popular samples; for example, a popularity ratio of 20% means that the 20% most popular samples in the training sample set make up the sample library. The formula is:
[left-hand part of equation (6): rendered as an image in the original and not recoverable from this text]
Popularity(i) ≤ PRatio (6)
where $Sample_u$ is the sample library of user u and PRatio is the popularity ratio of the sample library;
from the samples that user u has not evaluated, the samples whose popularity ratio is $R_{best}$ are selected according to equation (6) as negative samples;
S6) combine the positive and negative samples into one sample set, which serves as the training sample library of user u.
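Steps S5 and S6 can be sketched in Python as follows. This is one plausible reading, not the patent's exact procedure: it interprets the popularity ratio as "keep the top fraction of items when ranked by popularity", and the function names, the `seed` parameter and the toy popularity values are hypothetical. The 1:10 positive-to-negative ratio and the 80% popularity ratio are taken from the text.

```python
import random


def sample_pool(popularity, p_ratio):
    """Items in the top `p_ratio` fraction when ranked by popularity (high to low)."""
    ranked = sorted(popularity, key=popularity.get, reverse=True)
    keep = max(1, int(round(len(ranked) * p_ratio)))
    return ranked[:keep]


def build_training_samples(rated, popularity, p_ratio=0.8, neg_per_pos=10, seed=0):
    """Positives: rated items (label 1). Negatives: unrated items drawn from the pool (label 0)."""
    rng = random.Random(seed)
    rated = set(rated)
    # Only unrated items inside the popularity pool may become negatives.
    candidates = [i for i in sample_pool(popularity, p_ratio) if i not in rated]
    n_neg = min(len(candidates), neg_per_pos * len(rated))
    negatives = rng.sample(candidates, n_neg)
    samples = {item: 1 for item in rated}
    samples.update({item: 0 for item in negatives})
    return samples


# Toy usage: ten movies with made-up popularity scores, user rated "m0".
toy_popularity = {f"m{k}": 10 - k for k in range(10)}
toy_samples = build_training_samples(["m0"], toy_popularity, p_ratio=0.8, neg_per_pos=3)
```

Restricting the candidates to the pool keeps the least popular items, which the user most likely never encountered, out of the negative samples, while a large enough `p_ratio` preserves diversity.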
Figs. 2 and 3 show how movie popularity and sample diversity affect the algorithm's recommendation accuracy and recall. As Table 1 and Figs. 2 and 3 show, while the sample diversity index is below 0.875, accuracy and recall rise gradually as sample diversity increases; in this range the benefit of greater sample diversity outweighs the harm caused by the growing share of unpopular movies in the sample library. As more unpopular movies are added, sample diversity continues to rise but accuracy and recall fall, because the effect of unpopular movies on the prediction error now outweighs the benefit of sample diversity. In subsequent experiments, the sample library therefore consists of the top 80% of movies in the training set by popularity (popularity ratio 80%).
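The accuracy (precision) and recall compared in this discussion can be computed along the following lines. The sketch pools hits over all users before dividing, which is one common convention and an assumption here; the function name and the toy recommendation lists are hypothetical and are not results from the patent.

```python
def precision_recall(recommended, relevant):
    """recommended: {user: R(u)}, relevant: {user: T(u)}; returns (precision, recall)."""
    hits = n_rec = n_rel = 0
    for user in set(recommended) | set(relevant):
        rec = set(recommended.get(user, ()))
        rel = set(relevant.get(user, ()))
        hits += len(rec & rel)  # recommended items the user really evaluated
        n_rec += len(rec)
        n_rel += len(rel)
    precision = hits / n_rec if n_rec else 0.0
    recall = hits / n_rel if n_rel else 0.0
    return precision, recall


# Toy usage with made-up lists for two users.
R = {"u1": [1, 2, 3], "u2": [4, 5]}
T = {"u1": [2, 3, 9], "u2": [7]}
print(precision_recall(R, T))  # prints (0.4, 0.5)
```

Running this once per candidate sample library (top 10%, 20%, ..., 100%) would give the kind of comparison that Table 1 summarises.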
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and improvements can be made without departing from the spirit of the present invention, and these modifications and improvements should also be considered as within the scope of the present invention.

Claims (1)

1. An LFM training sample construction method fusing time decay factors, characterized by comprising the following steps:
S1) obtaining from the training samples the items that user u has evaluated, their number being Sp, and using them as positive samples;
S2) calculating the popularity of the items over the entire training set:
[Equation (1): rendered as an image in the original and not recoverable from this text]
where $u_i$ denotes a user who has evaluated item i, $T_r$ denotes the training set, and $f_{it}$ denotes the time decay factor of item i at time t;
[Equation (2): rendered as an image in the original and not recoverable from this text]
where $t_{now}$ is the current time, and the second time variable (an inline image in the original) is the time at which the user evaluated item i, measured in days;
S3) using the Simpson diversity index (Simpson index) to evaluate the data diversity of the sample library, the formula being:
$$\mathrm{Simpson} = 1 - \sum_{i \in S} P_i^{2} \qquad (3)$$
where S denotes the entire sample set and $P_i$ denotes the probability that a drawn sample falls in interval i;
S4) sorting the items by popularity from high to low; taking the top 10%, 20%, 30%, ..., 100% by popularity as sample libraries, and randomly drawing negative samples from each sample library to build a learning sample library; keeping the parameter α = 0.1, the regularization parameter λ = 0.01, the number of training iterations epochs = 10, the number of classes K = 30, and the positive-to-negative sample ratio 1:10 fixed, reporting the algorithm's recommendation accuracy and recall for each sample library and tabulating them for comparison;
the accuracy rate describes what proportion of the items in the recommendation list were actually seen by the user, and is calculated as:
$$\mathrm{Precision} = \frac{\sum_{u} |R(u) \cap T(u)|}{\sum_{u} |R(u)|} \qquad (4)$$
the recall describes what proportion of the items seen by the user in the test set T appear in the recommendation list, and is calculated as:
$$\mathrm{Recall} = \frac{\sum_{u} |R(u) \cap T(u)|}{\sum_{u} |T(u)|} \qquad (5)$$
where T denotes the test set, R(u) denotes the list of items recommended to user u by the recommendation algorithm, and T(u) denotes the items that user u actually evaluated in the test set T;
S5) from the samples that user u has not evaluated, according to the formula
[left-hand part of equation (6): rendered as an image in the original and not recoverable from this text]
Popularity(i) ≤ PRatio (6)
selecting the samples whose popularity ratio is $R_{best}$ as negative samples, where $Sample_u$ is the sample library of user u and PRatio is the popularity ratio of the sample library;
S6) combining the positive and negative samples into one sample set, which serves as the training sample library of user u.
CN201911356445.2A 2019-12-25 2019-12-25 LFM training sample construction method integrating time attenuation factors Active CN111191707B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911356445.2A CN111191707B (en) 2019-12-25 2019-12-25 LFM training sample construction method integrating time attenuation factors

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911356445.2A CN111191707B (en) 2019-12-25 2019-12-25 LFM training sample construction method integrating time attenuation factors

Publications (2)

Publication Number Publication Date
CN111191707A true CN111191707A (en) 2020-05-22
CN111191707B CN111191707B (en) 2023-06-06

Family

ID=70707538

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911356445.2A Active CN111191707B (en) 2019-12-25 2019-12-25 LFM training sample construction method integrating time attenuation factors

Country Status (1)

Country Link
CN (1) CN111191707B (en)


Also Published As

Publication number Publication date
CN111191707B (en) 2023-06-06


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant