CN111191707A - LFM training sample construction method fusing time attenuation factors - Google Patents
LFM training sample construction method fusing time attenuation factors
- Publication number
- CN111191707A (application CN201911356445.2A)
- Authority
- CN
- China
- Prior art keywords
- sample
- user
- popularity
- training
- items
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
- G06Q30/0282—Rating or review of business operators or products
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/06—Buying, selling or leasing transactions
- G06Q30/0601—Electronic shopping [e-shopping]
- G06Q30/0631—Item recommendations
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Business, Economics & Management (AREA)
- General Physics & Mathematics (AREA)
- Finance (AREA)
- Accounting & Taxation (AREA)
- General Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Artificial Intelligence (AREA)
- Strategic Management (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Development Economics (AREA)
- Marketing (AREA)
- Software Systems (AREA)
- Economics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- General Business, Economics & Management (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Game Theory and Decision Science (AREA)
- Medical Informatics (AREA)
- Entrepreneurship & Innovation (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a method for constructing LFM training samples that fuses a time attenuation factor, comprising the following steps: s1) obtaining positive samples; s2) calculating the popularity of the items in the entire training set; s3) evaluating the data diversity of the sample library; s4) obtaining the algorithm's recommendation accuracy and recall for different sample library compositions; s5) selecting the samples with the optimal popularity ratio as negative samples; s6) combining the positive and negative samples to form the user's training sample library. The advantages of the invention are that it jointly considers the influence of item popularity and sample diversity on recommendation performance and fuses a time attenuation factor; the recommendation accuracy and recall for different sample library compositions are obtained experimentally, the optimal popularity ratio is derived by analysis, and the optimal negative samples are thereby obtained, giving a better training result for the FC-LFM algorithm.
Description
Technical Field
The invention relates to the technical field of internet big data processing, in particular to a construction method of an LFM training sample with fusion of time attenuation factors.
Background
The LFM fused with time attenuation factors (FC-LFM) is a machine-learning-based latent factor model (also called a latent semantic model). It learns a user feature matrix P and an item feature matrix Q from training samples, so the way the training sample library is constructed is particularly important. In the traditional method of constructing LFM training samples for a user u, the items rated by u are taken as positive samples with their rating value set to 1, and a certain number of items that u has not rated are drawn at random from the training set as negative samples with their rating value set to 0. The positive and negative samples together form the training samples of user u.
Negative samples are meant to represent items the user is not interested in. Among the items a user has not rated, however, some are simply not popular enough for the user to have encountered them, and they are not necessarily of a type the user dislikes. Drawing negative samples completely at random from the unrated items ignores this possibility and may therefore reduce recommendation accuracy. Conversely, if the negative samples are concentrated too heavily on highly popular items, the training sample library loses diversity, which also reduces recommendation accuracy. A balance point between sample popularity and sample diversity therefore needs to be found experimentally and used as the basis for composing the sample library.
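The traditional construction described above can be illustrated with a minimal Python sketch. The function name, its parameters, and the toy item identifiers are hypothetical and serve only to show the baseline that the invention sets out to improve.

```python
import random

def traditional_samples(rated_items, all_items, n_negative, seed=0):
    """Baseline sketch: rated items get label 1, randomly drawn unrated items get label 0."""
    rng = random.Random(seed)
    rated = set(rated_items)
    # Sort the unrated items so the draw is reproducible for a given seed.
    unrated = sorted(set(all_items) - rated)
    k = min(n_negative, len(unrated))
    negatives = rng.sample(unrated, k)
    samples = {item: 1 for item in rated}
    samples.update({item: 0 for item in negatives})
    return samples

# Toy data: the user rated "a" and "b"; two negatives are drawn from "c", "d", "e".
samples = traditional_samples(["a", "b"], ["a", "b", "c", "d", "e"], n_negative=2)
```

Because every unrated item is equally likely to be drawn, an obscure item the user never encountered is as likely to be labelled 0 as a well-known item the user chose to ignore.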
Disclosure of Invention
The invention provides an LFM training sample construction method which comprehensively considers the influence of article popularity and sample diversity on recommendation performance and integrates time attenuation factors.
In order to achieve the purpose, the invention is realized by the following technical scheme:
the LFM training sample construction method fusing the time attenuation factors comprises the following steps:
s1) obtaining the items evaluated by the user u from the training samples, wherein the number of the items is Sp, and the items are used as positive samples;
s2) calculating the popularity of the items in the entire training set:
where $u_i$ denotes a user who has rated item $i$, $T_r$ denotes the training set, and $f_{it}$ denotes the time attenuation factor of item $i$ at time $t$;
where $t_{now}$ is the current time and the remaining time term is the time at which the user rated item $i$, both measured in days;
s3) using the Simpson diversity index (Simpson index) to evaluate the data diversity of the sample library, the formula being:
$$D = 1 - \sum_{i \in S} P_i^{2} \qquad (3)$$
where $S$ denotes the entire sample set and $P_i$ denotes the probability that a drawn sample falls in interval $i$;
s4) sorting the items by popularity from high to low, taking the top 10%, 20%, 30%, ..., 100% of items by popularity as candidate sample libraries, and randomly drawing negative samples from each library to construct a learning sample library; keeping the parameter α = 0.1, the regularization parameter λ = 0.01, the number of training iterations epochs = 10, the number of classes K = 30, and the positive-to-negative sample ratio 1:10 unchanged; obtaining the algorithm's recommendation accuracy and recall for each sample library composition; and tabulating the results for comparison;
the accuracy describes the proportion of the items in the recommendation list that the user has actually seen, and is calculated as:
$$\mathrm{Precision} = \frac{\sum_{u} |R(u) \cap T(u)|}{\sum_{u} |R(u)|} \qquad (4)$$
the recall describes the proportion of the items seen by the user in the test set T that appear in the recommendation list, and is calculated as:
$$\mathrm{Recall} = \frac{\sum_{u} |R(u) \cap T(u)|}{\sum_{u} |T(u)|} \qquad (5)$$
where $T$ denotes the test set, $R(u)$ denotes the list of items recommended to user u by the recommendation algorithm, and $T(u)$ denotes the items actually rated by user u in the test set T;
s5) from the samples that user u has not rated, selecting, according to the formula $Sample_u = \{\, i \mid \mathrm{Popularity}(i) \le \mathrm{PRatio} \,\}$ (6), the samples whose popularity ratio is $R_{best}$ as negative samples, where $Sample_u$ is the sample library of user u and PRatio is the popularity ratio of the sample library;
s6) the positive and negative samples are combined into a sample set to serve as a training sample library of the user u.
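As an illustration of step s2), the sketch below computes a time-decayed popularity score. The decay form 1 / (1 + decay_rate × age in days), the normalisation by the number of users, and all function and parameter names are assumptions made for illustration; they are not taken from the patent's own formulas.

```python
from collections import defaultdict

SECONDS_PER_DAY = 86400

def decay_factor(t_now, t_rating, decay_rate=0.1):
    """Hypothetical decay: 1 / (1 + decay_rate * age_in_days). Not the patent's formula."""
    age_days = max(0.0, (t_now - t_rating) / SECONDS_PER_DAY)
    return 1.0 / (1.0 + decay_rate * age_days)

def time_decayed_popularity(ratings, t_now, n_users, decay_rate=0.1):
    """ratings: iterable of (user, item, unix_timestamp) taken from the training set."""
    scores = defaultdict(float)
    for _user, item, timestamp in ratings:
        # Each rating contributes its decay factor instead of a flat count of 1.
        scores[item] += decay_factor(t_now, timestamp, decay_rate)
    # Assumed normalisation: divide by the total number of users.
    return {item: score / n_users for item, score in scores.items()}

now = 10 * SECONDS_PER_DAY
ratings = [
    ("u1", "old", 0),    # rated ten days ago
    ("u2", "old", 0),    # rated ten days ago
    ("u1", "new", now),  # rated just now
]
pop = time_decayed_popularity(ratings, t_now=now, n_users=2)
```

Under these assumptions, one fresh rating weighs as much as two ten-day-old ratings, which is the kind of effect the time attenuation factor is meant to capture.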
Compared with the prior art, the invention has the following advantages:
The LFM training sample construction method fusing time attenuation factors jointly considers the influence of item popularity and sample diversity on recommendation performance and fuses a time attenuation factor. The recommendation accuracy and recall for different sample library compositions are obtained experimentally, the optimal popularity ratio is derived by analysis, and the optimal negative samples are thereby obtained, giving a better training result for the FC-LFM algorithm.
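The sample diversity referred to above is measured with the Simpson diversity index of step s3). A minimal sketch follows, assuming the index takes the common form 1 − Σ P_i² (larger values mean more dispersed samples); the function name and the interval labels are hypothetical.

```python
from collections import Counter

def simpson_diversity(interval_labels):
    """Return 1 - sum(p_i^2), where p_i is the share of samples in interval i."""
    counts = Counter(interval_labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one sample is required")
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

# All samples fall in one interval: no diversity.
concentrated = simpson_diversity(["hot"] * 10)
# Samples spread evenly over four intervals: higher diversity.
spread = simpson_diversity(["hot", "warm", "cool", "cold"] * 5)
```

In this toy case the evenly spread library scores 0.75 and the fully concentrated one scores 0, matching the intuition that widening the popularity range of the sample library raises its diversity.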
Drawings
FIG. 1 is a schematic flow diagram of the present invention.
FIG. 2 is a graph illustrating the impact of the most popular movie proportions on the performance of algorithm recommendations in an embodiment of the present invention.
FIG. 3 is a graph illustrating the effect of sample library data diversity on algorithm recommendation performance in an embodiment of the present invention.
Detailed Description
Embodiments of the present invention will be described in further detail below with reference to the accompanying drawings.
The data set used in the invention is MovieLens 100K (https://grouplens.org/datasets/movielens/100k/), provided by the GroupLens Research lab at the University of Minnesota in the United States. It contains 100,000 ratings of 1,682 movies by 943 users, with a data sparsity of 93.7%. Each record includes the user's rating of the movie (1 to 5 points) and the time of the rating (accurate to the second). Data are drawn at random from the data set in a 9:1 ratio to form a training set and a test set, which are used to train and test the algorithm.
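The 9:1 random split and the quoted sparsity figure can be checked with a short sketch. The helper names, the fixed seed, and the toy rating list are assumptions for illustration; loading the actual MovieLens files is not shown.

```python
import random

def split_ratings(ratings, train_fraction=0.9, seed=42):
    """Shuffle the rating records and split them into training and test sets."""
    rng = random.Random(seed)
    shuffled = list(ratings)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def sparsity(n_ratings, n_users, n_items):
    """Fraction of the user-item matrix that has no rating."""
    return 1.0 - n_ratings / (n_users * n_items)

# Toy stand-in for the rating records: 100 (user, item) pairs.
toy_ratings = [(u, i) for u in range(10) for i in range(10)]
train, test = split_ratings(toy_ratings)

# Figures quoted for MovieLens 100K: 100,000 ratings, 943 users, 1,682 movies.
movielens_sparsity = sparsity(100_000, 943, 1682)
```

With 943 × 1,682 = 1,586,126 possible user-movie pairs and 100,000 ratings, about 6.3% of the matrix is filled, which agrees with the stated sparsity of 93.7%.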
Specifically, the construction method of the LFM training sample fused with the time attenuation factor comprises the following steps:
s1) obtaining movies evaluated by the user u from the training samples, wherein the number of the movies is Sp, and the movies are used as positive samples;
s2) the popularity of a movie is measured with a popularity index, which represents the proportion of all users who have rated the item: the higher the popularity, the more users have rated it. Because popularity changes over time, an item that was popular earlier is not necessarily popular recently, so a time decay factor is introduced into the popularity, and the popularity of each movie in the entire training set is calculated:
where $u_i$ denotes a user who has rated movie $i$, $T_r$ denotes the training set, and $f_{it}$ denotes the time decay factor of movie $i$ at time $t$;
where $t_{now}$ is the current time and the remaining time term is the time at which the user rated movie $i$, both measured in days;
s3) to evaluate the influence of the data diversity of the sample library on the learning performance of the algorithm, the Simpson diversity index (Simpson index) is used to measure the data diversity of the sample library. The index represents the probability that two randomly drawn individuals belong to different categories of data: the larger its value, the more dispersed the samples; the smaller its value, the more concentrated the samples. The formula is:
$$D = 1 - \sum_{i \in S} P_i^{2} \qquad (3)$$
where $S$ denotes the entire sample set and $P_i$ denotes the probability that a drawn sample falls in interval $i$;
s4) sorting the movies by popularity from high to low, taking the top 10%, 20%, 30%, ..., 100% of movies by popularity as candidate sample libraries, and randomly drawing negative samples from each library to construct a learning sample library; keeping the parameter α = 0.1, the regularization parameter λ = 0.01, the number of training iterations epochs = 10, the number of classes K = 30, and the positive-to-negative sample ratio 1:10 unchanged; obtaining the algorithm's recommendation accuracy and recall for each sample library composition; and compiling Table 1 for comparison;
the accuracy describes the proportion of the movies in the recommendation list that the user has actually seen (with respect to the test set T); it reflects the effectiveness of the recommendation algorithm and is calculated as:
$$\mathrm{Precision} = \frac{\sum_{u} |R(u) \cap T(u)|}{\sum_{u} |R(u)|} \qquad (4)$$
the recall describes the proportion of the movies seen by the user in the test set T that appear in the recommendation list, and is calculated as:
$$\mathrm{Recall} = \frac{\sum_{u} |R(u) \cap T(u)|}{\sum_{u} |T(u)|} \qquad (5)$$
where $T$ denotes the test set, $R(u)$ denotes the list of movies recommended to user u by the recommendation algorithm, and $T(u)$ denotes the movies actually rated by user u in the test set T;
TABLE 1 algorithmic prediction error for different sample libraries
s5) to measure how far down the popularity ranking the samples in the sample library extend, the popularity ratio (PRatio) of the sample library is used: the smaller the popularity ratio, the more the sample library is concentrated on the most popular samples. For example, a popularity ratio of 20% means that the 20% most popular samples in the training sample set make up the sample library. The formula is $Sample_u = \{\, i \mid \mathrm{Popularity}(i) \le \mathrm{PRatio} \,\}$ (6), where $Sample_u$ is the sample library of user u and PRatio is the popularity ratio of the sample library;
from the samples that user u has not rated, the samples whose popularity ratio is $R_{best}$ are selected according to formula (6) as negative samples;
s6) the positive and negative samples are combined into a sample set to serve as a training sample library of the user u.
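Steps s5) and s6) can be sketched as follows. The function name, its parameters, and the toy popularity scores are hypothetical; the 1:10 positive-to-negative ratio is the value given in step s4), and the popularity ratio is treated here as a cut-off on the popularity ranking.

```python
import random

def build_training_samples(rated, pop_scores, p_ratio=0.8, neg_per_pos=10, seed=0):
    """Return {item: label}; negatives are drawn from the top `p_ratio` most popular items."""
    rng = random.Random(seed)
    # Rank all items from most to least popular and keep the top share.
    ranked = sorted(pop_scores, key=pop_scores.get, reverse=True)
    cutoff = max(1, round(len(ranked) * p_ratio))
    rated = set(rated)
    # Candidate negatives: popular items the user has not rated.
    candidates = [item for item in ranked[:cutoff] if item not in rated]
    k = min(len(candidates), neg_per_pos * len(rated))
    negatives = rng.sample(candidates, k)
    samples = {item: 1 for item in rated}            # positive samples
    samples.update({item: 0 for item in negatives})  # negative samples
    return samples

# Toy popularity scores: m0 is the most popular movie, m99 the least popular.
pop_scores = {f"m{n}": 1.0 / (n + 1) for n in range(100)}
samples = build_training_samples(["m0", "m1"], pop_scores, p_ratio=0.8)
```

With two rated movies and a popularity ratio of 0.8, the sketch draws twenty negatives from the unrated movies among the eighty most popular ones, so the least popular fifth of the catalogue never appears as a negative sample.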
Figs. 2 and 3 show how movie popularity and sample diversity affect the recommendation accuracy and recall of the algorithm. As can be seen from Table 1 and Figs. 2 and 3, when the sample diversity index is less than 0.875, the recommendation accuracy and recall of the algorithm increase gradually as sample diversity increases; in this range, the benefit of sample diversity outweighs the harm caused by the growing number of unpopular movies in the sample library. As more unpopular movies are added, sample diversity improves further, but recommendation accuracy and recall decline, because the effect of unpopular movies on the prediction error now outweighs the benefit of sample diversity. Thus, in subsequent experiments, the sample library consists of the top 80% of movies by popularity in the training set (a popularity ratio of 80%).
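The accuracy and recall figures discussed above can be computed as in the following sketch. It assumes the usual aggregated definitions, in which the overlaps |R(u) ∩ T(u)| are summed over all users before dividing; the function name and the toy recommendation lists are hypothetical.

```python
def precision_recall(recommended, relevant):
    """recommended: user -> R(u), the recommended items; relevant: user -> T(u), test-set items."""
    hits = rec_total = rel_total = 0
    for user, truth in relevant.items():
        rec = set(recommended.get(user, ()))
        truth = set(truth)
        hits += len(rec & truth)   # recommended items the user actually rated
        rec_total += len(rec)      # size of the recommendation list
        rel_total += len(truth)    # number of test-set items for this user
    precision = hits / rec_total if rec_total else 0.0
    recall = hits / rel_total if rel_total else 0.0
    return precision, recall

R = {"u1": ["a", "b", "c"], "u2": ["d", "e"]}
T = {"u1": ["a", "x"], "u2": ["d", "e", "f", "g"]}
precision, recall = precision_recall(R, T)
```

In the toy data, three of the five recommended items are hits, and three of the six test-set items are recovered, giving an accuracy of 0.6 and a recall of 0.5.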
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and improvements can be made without departing from the spirit of the present invention, and these modifications and improvements should also be considered as within the scope of the present invention.
Claims (1)
1. The LFM training sample construction method fused with the time attenuation factor is characterized by comprising the following steps:
s1) obtaining the items evaluated by the user u from the training samples, wherein the number of the items is Sp, and the items are used as positive samples;
s2) calculating the popularity of the items in the entire training set:
where $u_i$ denotes a user who has rated item $i$, $T_r$ denotes the training set, and $f_{it}$ denotes the time attenuation factor of item $i$ at time $t$;
where $t_{now}$ is the current time and the remaining time term is the time at which the user rated item $i$, both measured in days;
s3) using the Simpson diversity index (Simpson index) to evaluate the data diversity of the sample library, the formula being:
$$D = 1 - \sum_{i \in S} P_i^{2} \qquad (3)$$
where $S$ denotes the entire sample set and $P_i$ denotes the probability that a drawn sample falls in interval $i$;
s4) sorting the items by popularity from high to low, taking the top 10%, 20%, 30%, ..., 100% of items by popularity as candidate sample libraries, and randomly drawing negative samples from each library to construct a learning sample library; keeping the parameter α = 0.1, the regularization parameter λ = 0.01, the number of training iterations epochs = 10, the number of classes K = 30, and the positive-to-negative sample ratio 1:10 unchanged; obtaining the algorithm's recommendation accuracy and recall for each sample library composition; and tabulating the results for comparison;
the accuracy describes the proportion of the items in the recommendation list that the user has actually seen, and is calculated as:
$$\mathrm{Precision} = \frac{\sum_{u} |R(u) \cap T(u)|}{\sum_{u} |R(u)|} \qquad (4)$$
the recall describes the proportion of the items seen by the user in the test set T that appear in the recommendation list, and is calculated as:
$$\mathrm{Recall} = \frac{\sum_{u} |R(u) \cap T(u)|}{\sum_{u} |T(u)|} \qquad (5)$$
where $T$ denotes the test set, $R(u)$ denotes the list of items recommended to user u by the recommendation algorithm, and $T(u)$ denotes the items actually rated by user u in the test set T;
s5) from the samples that user u has not rated, selecting, according to the formula $Sample_u = \{\, i \mid \mathrm{Popularity}(i) \le \mathrm{PRatio} \,\}$ (6), the samples whose popularity ratio is $R_{best}$ as negative samples, where $Sample_u$ is the sample library of user u and PRatio is the popularity ratio of the sample library;
s6) the positive and negative samples are combined into a sample set to serve as a training sample library of the user u.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911356445.2A CN111191707B (en) | 2019-12-25 | 2019-12-25 | LFM training sample construction method integrating time attenuation factors |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911356445.2A CN111191707B (en) | 2019-12-25 | 2019-12-25 | LFM training sample construction method integrating time attenuation factors |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111191707A (en) | 2020-05-22 |
CN111191707B (en) | 2023-06-06 |
Family
ID=70707538
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911356445.2A Active CN111191707B (en) | 2019-12-25 | 2019-12-25 | LFM training sample construction method integrating time attenuation factors |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111191707B (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2012022570A (en) * | 2010-07-15 | 2012-02-02 | Ntt Comware Corp | Object recommendation apparatus, object recommendation method, object recommendation program and object recommendation system |
US20170132230A1 (en) * | 2015-11-09 | 2017-05-11 | WP Company LLC d/b/a The Washington Post | Systems and methods for recommending temporally relevant news content using implicit feedback data |
CN106202151A (en) * | 2016-06-23 | 2016-12-07 | 长沙学院 | One is used for improving the multifarious method of personalized recommendation system |
CN109063052A (en) * | 2018-07-19 | 2018-12-21 | 北京物资学院 | A kind of personalized recommendation method and device based on time entropy |
CN109977299A (en) * | 2019-02-21 | 2019-07-05 | 西北大学 | A kind of proposed algorithm of convergence project temperature and expert's coefficient |
Non-Patent Citations (3)
Title |
---|
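LAI, C et al.: "A Social Recommendation Method Based on the Integration of Social Relationship and Product Popularity", 《INTERNATIONAL JOURNAL OF HUMAN-COMPUTER STUDIES》 * |
Liu Qiao: "Research on a collaborative filtering recommendation algorithm based on time weighting and rating prediction", 《China Master's Theses Full-text Database, Information Science and Technology》 * |
Niu Kangkang: "Analysis of the mechanism by which popularity influences user interest and its application in recommendation algorithms", 《China Master's Theses Full-text Database, Information Science and Technology》 * |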
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116610857A (en) * | 2023-04-10 | 2023-08-18 | 南京邮电大学 | Personalized post recommendation method based on user preference for post popularity |
CN116610857B (en) * | 2023-04-10 | 2024-05-03 | 南京邮电大学 | Personalized post recommendation method based on user preference for post popularity |
Also Published As
Publication number | Publication date |
---|---|
CN111191707B (en) | 2023-06-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111797321B (en) | Personalized knowledge recommendation method and system for different scenes | |
CN106909981B (en) | Model training method, sample balancing method, model training device, sample balancing device and personal credit scoring system | |
WO2016155493A1 (en) | Data processing method and apparatus | |
CN109740924B (en) | Article scoring prediction method integrating attribute information network and matrix decomposition | |
WO2019196208A1 (en) | Text sentiment analysis method, readable storage medium, terminal device, and apparatus | |
CN107256241B (en) | Movie recommendation method for improving multi-target genetic algorithm based on grid and difference replacement | |
CN112613552A (en) | Convolutional neural network emotion image classification method combining emotion category attention loss | |
CN110334356A (en) | Article matter method for determination of amount, article screening technique and corresponding device | |
CA2861898A1 (en) | Download resource recommendation method, system and storage medium | |
CN112612951B (en) | Unbiased learning sorting method for income improvement | |
CN109816015B (en) | Recommendation method and system based on material data | |
Abbas | Deposit subscribe prediction using data mining techniques based real marketing dataset | |
Hanif et al. | Resolving class imbalance and feature selection in customer churn dataset | |
CN115829683A (en) | Power integration commodity recommendation method and system based on inverse reward learning optimization | |
He et al. | Blending pruning criteria for convolutional neural networks | |
CN111191707A (en) | LFM training sample construction method fusing time attenuation factors | |
CN111079011A (en) | Deep learning-based information recommendation method | |
CN112464106B (en) | Object recommendation method and device | |
CN109615421A (en) | A kind of individual commodity recommendation method based on multi-objective Evolutionary Algorithm | |
CN117370932A (en) | Traffic information processing and sensing method based on multi-mode data fusion sensing | |
CN111199422A (en) | Improved LFM (Linear frequency modulation) collaborative filtering method fusing time attenuation factors | |
CN114510645B (en) | Method for solving long-tail recommendation problem based on extraction of effective multi-target groups | |
CN117688390A (en) | Content matching method, apparatus, computer device, storage medium, and program product | |
CN104881499A (en) | Collaborative filtering recommendation method based on attribute rating scaling | |
Liao et al. | Accumulative Time Based Ranking Method to Reputation Evaluation in Information Networks |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||