CN111191707B - LFM training sample construction method integrating time attenuation factors - Google Patents
LFM training sample construction method integrating time attenuation factors
- Publication number
- CN111191707B
- Authority
- CN
- China
- Prior art keywords
- sample
- user
- popularity
- training
- samples
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
- G06Q30/0282—Rating or review of business operators or products
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/06—Buying, selling or leasing transactions
- G06Q30/0601—Electronic shopping [e-shopping]
- G06Q30/0631—Item recommendations
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Business, Economics & Management (AREA)
- General Physics & Mathematics (AREA)
- Finance (AREA)
- Accounting & Taxation (AREA)
- General Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Artificial Intelligence (AREA)
- Strategic Management (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Development Economics (AREA)
- Marketing (AREA)
- Software Systems (AREA)
- Economics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- General Business, Economics & Management (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Game Theory and Decision Science (AREA)
- Medical Informatics (AREA)
- Entrepreneurship & Innovation (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a method for constructing LFM training samples that incorporates time attenuation factors, comprising the following steps: s1) acquiring positive samples; s2) calculating the popularity of the items in the whole training set; s3) evaluating the data diversity of the sample library; s4) reporting the recommendation accuracy and recall rate of the algorithm for different sample library compositions; s5) selecting samples at the optimal popularity ratio as negative samples; s6) combining the positive and negative samples into the training sample library of the user. The advantages of the invention are that the influence of item popularity and sample diversity on recommendation performance is considered jointly and time attenuation factors are incorporated; the recommendation accuracy and recall rate for different sample library compositions are obtained through experiments, and the optimal popularity ratio is derived by analysis, so that an optimal negative sample set is obtained and the FC-LFM algorithm trains well.
Description
Technical Field
The invention relates to the technical field of Internet big data processing, in particular to an LFM training sample construction method integrating time attenuation factors.
Background
The LFM method fusing time attenuation factors (FC-LFM) is a machine-learning-based latent semantic (latent factor) model. It generates a user feature matrix P and an item feature matrix Q by learning from each user's training samples, so the method used to construct the training sample library is particularly important. The traditional way for LFM to construct the training samples of user u is to take the items rated by user u as positive samples with a rating value of 1, and to randomly draw a certain number of items not rated by user u from the training set as negative samples with a rating value of 0. The positive and negative samples together form the training samples of user u.
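The traditional construction described above can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function name `build_traditional_samples`, the `n_negative` parameter and the toy item identifiers are assumptions introduced here.

```python
import random


def build_traditional_samples(user_items, all_items, n_negative, seed=0):
    """Label rated items 1 and randomly drawn unrated items 0.

    user_items: set of items the user has rated (positive samples).
    all_items:  iterable of every item in the training set.
    n_negative: number of negatives to draw (hypothetical parameter).
    """
    rng = random.Random(seed)
    samples = {item: 1 for item in user_items}
    # Candidates are the items the user has not rated; sorted for reproducibility.
    candidates = sorted(set(all_items) - set(user_items))
    k = min(n_negative, len(candidates))
    for item in rng.sample(candidates, k):
        samples[item] = 0
    return samples


# Toy usage: two rated items, two negatives drawn from three unrated ones.
samples = build_traditional_samples({"a", "b"}, ["a", "b", "c", "d", "e"], n_negative=2)
```

Every unrated item is equally likely to be drawn here, regardless of how popular it is.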
A negative sample is meant to indicate an item the user is not interested in. However, some items the user has not rated may simply be too unpopular for the user to know about, and are not necessarily of a type the user dislikes. Drawing negative samples completely at random from the unrated items ignores the case where the user did not rate an item because they were unaware of it, and can therefore reduce recommendation accuracy. On the other hand, if negative samples are concentrated too heavily on highly popular items, the training sample library loses diversity, which also reduces recommendation accuracy. A balance point between sample popularity and sample diversity therefore needs to be found experimentally and used as the basis for sample composition.
Disclosure of Invention
The invention provides an LFM training sample construction method that jointly considers the influence of item popularity and sample diversity on recommendation performance and incorporates time attenuation factors.
In order to achieve the above purpose, the present invention is realized by the following technical scheme:
the LFM training sample construction method integrating the time attenuation factors comprises the following steps:
s1) obtaining the items rated by user u from the training samples as positive samples, the number of such items being Sp;
s2) calculating the popularity of the items in the whole training set:
where $u_i$ denotes a user who has rated item i, $T_r$ denotes the training set, and $f_{it}$ denotes the time attenuation factor of item i at time t;
where $t_{now}$ is the current time, the rating time is the moment at which the user rated item i, and the time unit is the day;
s3) evaluating the data diversity of the sample library using the Simpson diversity index (Simpson index), whose formula is:
$$D = 1 - \sum_{i \in S} P_i^2$$
where S denotes the whole sample set and $P_i$ denotes the probability that a drawn sample falls within interval i;
s4) sorting the items by popularity from high to low and taking the top 10%, 20%, 30%, ..., 100% of items by popularity as sample libraries in turn; randomly drawing negative samples from each sample library to construct a learning sample library; keeping the parameter α = 0.1 and the regularization parameter λ = 0.01 unchanged, with training-set iteration count epochs = 10, number of latent classes K = 30, and a positive-to-negative sample ratio of 1:10; and reporting the recommendation accuracy and recall rate of the algorithm for each sample library composition in a table for comparison;
the accuracy (precision) describes the proportion of the items in the recommendation list that the user has actually viewed, and is calculated as:
$$Precision = \frac{\sum_{u} |R(u) \cap T(u)|}{\sum_{u} |R(u)|}$$
the recall rate describes the proportion of the items viewed by the user in the test set T that appear in the recommendation list, and is calculated as:
$$Recall = \frac{\sum_{u} |R(u) \cap T(u)|}{\sum_{u} |T(u)|}$$
where T denotes the test set, R(u) denotes the list of items recommended to user u by the recommendation algorithm, and T(u) denotes the items actually rated by user u in the test set T;
s5) from the samples not rated by user u, selecting the samples with popularity ratio $R_{best}$ as negative samples according to the formula $Popularity(i) \le PRatio$ (6), where $Sample_u$ is the sample library for user u and PRatio is the popularity ratio of the sample library;
s6) combining the positive and negative samples into a sample set that serves as the training sample library of user u.
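Steps s2), s5) and s6) can be sketched as below, under clearly stated assumptions. The decay form `1 / (1 + decay_rate * age_in_days)`, the normalization by the number of users, and the reading of the popularity ratio as "top fraction of items by popularity rank" are illustrative assumptions, since the source formulas are not reproduced here. The names `decayed_popularity`, `top_ratio_pool`, `build_samples` and `decay_rate` are hypothetical.

```python
import math
import random
from collections import defaultdict

SECONDS_PER_DAY = 86400.0


def decayed_popularity(ratings, t_now, decay_rate=0.01):
    """ratings: list of (user, item, timestamp_in_seconds).

    ASSUMPTION: each rating contributes 1 / (1 + decay_rate * age_in_days),
    and the sum is divided by the number of users.
    """
    users = {u for u, _, _ in ratings}
    score = defaultdict(float)
    for _, item, ts in ratings:
        age_days = max(0.0, (t_now - ts) / SECONDS_PER_DAY)
        score[item] += 1.0 / (1.0 + decay_rate * age_days)
    return {item: s / len(users) for item, s in score.items()}


def top_ratio_pool(popularity, p_ratio):
    """Return the top p_ratio fraction of items, most popular first."""
    ranked = sorted(popularity, key=lambda i: (-popularity[i], str(i)))
    keep = max(1, math.ceil(p_ratio * len(ranked)))
    return ranked[:keep]


def build_samples(user, ratings, popularity, p_ratio, n_negative, seed=0):
    """Positives get label 1; negatives are drawn from the popular pool."""
    rng = random.Random(seed)
    positives = {i for u, i, _ in ratings if u == user}
    pool = [i for i in top_ratio_pool(popularity, p_ratio) if i not in positives]
    negatives = rng.sample(pool, min(n_negative, len(pool)))
    samples = {i: 1 for i in positives}
    samples.update({i: 0 for i in negatives})
    return samples


# Toy usage with hypothetical users and movies.
day = 86400
ratings = [
    ("u1", "m1", 100 * day), ("u2", "m1", 100 * day),
    ("u1", "m2", 100 * day), ("u2", "m3", 10 * day),
    ("u3", "m4", 100 * day),
]
pop = decayed_popularity(ratings, t_now=100 * day)
samples = build_samples("u3", ratings, pop, p_ratio=0.5, n_negative=2)
```

In the toy data, `m3` was rated long ago and so ranks lowest; with a popularity ratio of 50% it never enters the pool from which the negatives for `u3` are drawn.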
Compared with the prior art, the invention has the following advantages:
The LFM training sample construction method integrating time attenuation factors jointly considers the influence of item popularity and sample diversity on recommendation performance and incorporates time attenuation factors. The recommendation accuracy and recall rate of the algorithm for different sample library compositions are obtained through experiments, and the optimal popularity ratio is derived by analysis, so that an optimal negative sample set is obtained and the FC-LFM algorithm trains well.
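As an illustration of the accuracy and recall rate used to compare sample libraries, the sketch below assumes the usual top-N definitions: hits are the recommended items that the user actually rated in the test set, divided by the total number of recommended items (accuracy) or by the total number of test-set items (recall). The function name `precision_recall` and the toy data are hypothetical.

```python
def precision_recall(recommended, test_items):
    """recommended: dict user -> list of recommended items, R(u).
    test_items:  dict user -> set of items rated in the test set, T(u).
    """
    hits = n_recommended = n_relevant = 0
    for user, relevant in test_items.items():
        rec = set(recommended.get(user, []))
        hits += len(rec & set(relevant))
        n_recommended += len(rec)
        n_relevant += len(relevant)
    precision = hits / n_recommended if n_recommended else 0.0
    recall = hits / n_relevant if n_relevant else 0.0
    return precision, recall


# Toy usage: 3 hits out of 5 recommendations and 5 test-set items.
recommended = {"u1": ["a", "b", "c"], "u2": ["d", "e"]}
test_items = {"u1": {"a", "x"}, "u2": {"d", "e", "y"}}
precision, recall = precision_recall(recommended, test_items)
```

Both values are aggregated over all users before dividing, so users with longer recommendation lists or more test ratings carry more weight.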
Drawings
Fig. 1 is a schematic flow chart of the present invention.
FIG. 2 is a graph of the influence of the proportion of most popular movies on algorithm recommendation performance in an embodiment of the invention.
FIG. 3 is a graph showing the influence of sample library data diversity on algorithm recommendation performance in an embodiment of the invention.
Detailed Description
Embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
The data set used in the present invention is MovieLens (100k) (https://grouplens.org/datasets/movielens/100k/), provided by the GroupLens Research laboratory at the University of Minnesota, USA. It contains 100,000 rating records from 943 users on 1,682 movies, with a data sparsity of 93.7%. Each record includes the user's rating of the movie (1-5 points) and the time of the rating (accurate to the second). The data set is randomly split at a ratio of 9:1 into a training set and a test set, which are used to train and test the algorithm.
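The sparsity figure and the 9:1 split can be checked with a short sketch. It assumes the tab-separated `user, item, rating, timestamp` layout of the MovieLens 100k `u.data` file; the helper names `sparsity`, `parse_line` and `split_records` are hypothetical, and the split is demonstrated on synthetic records rather than on the real file.

```python
import random


def sparsity(n_ratings, n_users, n_items):
    """Fraction of the user-item matrix that has no rating."""
    return 1.0 - n_ratings / (n_users * n_items)


def parse_line(line):
    """Parse one tab-separated record: user, item, rating, timestamp."""
    user, item, rating, ts = line.rstrip("\n").split("\t")
    return int(user), int(item), int(rating), int(ts)


def split_records(records, train_fraction=0.9, seed=0):
    """Shuffle the records and cut them into (train, test)."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# 1 - 100000 / (943 * 1682) is roughly 0.937, matching the stated 93.7%.
movielens_sparsity = sparsity(100_000, 943, 1682)

# Synthetic stand-in for the rating records: 10 users x 10 movies.
records = [(u, m, 3, 0) for u in range(10) for m in range(10)]
train, test = split_records(records)
```

A fixed `seed` keeps the split reproducible between runs, which matters when accuracy and recall are compared across sample libraries.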
Specifically, the method for constructing the LFM training sample by fusing the time attenuation factors comprises the following steps:
s1) obtaining the movies rated by user u from the training samples as positive samples, the number of such movies being Sp;
s2) to evaluate how popular a movie is, the present invention uses a popularity index, where popularity represents the proportion of all users who have rated the item; the higher the popularity, the more users have rated the item. Because time affects popularity, and items that were popular earlier are not necessarily popular recently, a time decay factor is introduced into the popularity, and the popularity of each movie in the whole training set is calculated:
where $u_i$ denotes a user who has rated movie i, $T_r$ denotes the training set, and $f_{it}$ denotes the time decay factor of movie i at time t;
where $t_{now}$ is the current time, the rating time is the moment at which the user rated movie i, and the time unit is the day;
s3) to evaluate the influence of the data diversity of the sample library on the learning performance of the algorithm, the data diversity of the sample library is evaluated using the Simpson diversity index (Simpson index). The Simpson diversity index represents the probability that two randomly drawn individuals belong to different categories: the larger the value, the more dispersed the samples; the smaller the value, the more concentrated they are. The formula is:
$$D = 1 - \sum_{i \in S} P_i^2$$
where S denotes the whole sample set and $P_i$ denotes the probability that a drawn sample falls within interval i;
s4) sorting the movies by popularity from high to low and taking the top 10%, 20%, 30%, ..., 100% of movies by popularity as sample libraries in turn; randomly drawing negative samples from each sample library to construct a learning sample library; keeping the parameter α = 0.1 and the regularization parameter λ = 0.01 unchanged, with training-set iteration count epochs = 10, number of latent classes K = 30, and a positive-to-negative sample ratio of 1:10; and comparing the recommendation accuracy and recall rate of the algorithm for each sample library composition, as listed in Table 1;
the accuracy and recall rate reflect the effectiveness of the recommendation algorithm. The accuracy (precision) describes the proportion of the movies in the recommendation list that the user has actually watched (with respect to the test set T), and is calculated as:
$$Precision = \frac{\sum_{u} |R(u) \cap T(u)|}{\sum_{u} |R(u)|}$$
the recall rate describes the proportion of the movies watched by the user in the test set T that appear in the recommendation list, and is calculated as:
$$Recall = \frac{\sum_{u} |R(u) \cap T(u)|}{\sum_{u} |T(u)|}$$
where T denotes the test set, R(u) denotes the list of movies recommended to user u by the recommendation algorithm, and T(u) denotes the movies actually rated by user u in the test set T;
Table 1. Algorithm prediction error for different sample libraries
s5) to evaluate the extent to which the samples in the sample library cover the whole sample set when ranked by popularity, the popularity ratio (PRatio) of the sample library is used. The smaller the popularity ratio, the more the samples in the sample library are concentrated on the most popular samples; a popularity ratio of 20%, for example, means that the sample library consists of the top 20% most popular samples in the training set. The formula is $Popularity(i) \le PRatio$ (6), where $Sample_u$ is the sample library for user u and PRatio is the popularity ratio of the sample library;
from the samples not rated by user u, the samples with popularity ratio $R_{best}$ are selected as negative samples according to formula (6);
s6) combining the positive and negative samples into a sample set that serves as the training sample library of user u.
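The hyperparameters held fixed in step s4) (α = 0.1, λ = 0.01, epochs = 10, K = 30) would plug into a latent factor model roughly as follows. The source does not give the FC-LFM update rule, so this sketch assumes a generic LFM trained by stochastic gradient descent on squared error with L2 regularization, treats α as the learning rate, and omits the time attenuation term entirely. `train_lfm` and `predict` are hypothetical names.

```python
import math
import random


def train_lfm(samples, n_factors=30, alpha=0.1, lam=0.01, epochs=10, seed=0):
    """samples: list of (user, item, label), label 1 (positive) or 0 (negative).

    ASSUMPTION: generic SGD matrix factorization, not the patent's FC-LFM.
    Returns the user factors P and item factors Q as dicts of lists.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(n_factors)

    def new_vector():
        # Small positive random initial values.
        return [rng.random() * scale for _ in range(n_factors)]

    P, Q = {}, {}
    for user, item, _ in samples:
        if user not in P:
            P[user] = new_vector()
        if item not in Q:
            Q[item] = new_vector()

    for _ in range(epochs):
        for user, item, label in samples:
            pu, qi = P[user], Q[item]
            err = label - sum(a * b for a, b in zip(pu, qi))
            for f in range(n_factors):
                puf, qif = pu[f], qi[f]
                # Gradient step with L2 shrinkage on both factor vectors.
                pu[f] += alpha * (err * qif - lam * puf)
                qi[f] += alpha * (err * puf - lam * qif)
    return P, Q


def predict(P, Q, user, item):
    """Predicted interest of user in item: dot product of the factors."""
    return sum(a * b for a, b in zip(P[user], Q[item]))


# Toy usage: item "a" is a positive for both users, "b" and "c" are negatives.
samples = [("u1", "a", 1), ("u1", "b", 0), ("u2", "a", 1), ("u2", "c", 0)]
P, Q = train_lfm(samples)
```

After training on the toy samples, the predicted score of `u1` for the positive item `a` is higher than for the negative item `b`, which is the behaviour the training sample library is meant to induce.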
Fig. 2 and Fig. 3 show trend graphs of the influence of movie popularity and sample diversity on algorithm recommendation accuracy and recall. As can be seen from Table 1, Fig. 2 and Fig. 3, when the sample diversity index is below 0.875, the recommendation accuracy and recall of the algorithm increase gradually as sample diversity improves; in this range the benefit of sample diversity outweighs the harm caused by the growing number of unpopular movies in the sample library. As more unpopular movies are added, sample diversity improves further but recommendation accuracy and recall decline instead, showing that the negative effect of unpopular movies on algorithm prediction error now exceeds the benefit of sample diversity. Therefore, in subsequent experiments the sample library consists of the top 80% of movies in the training set by popularity (popularity ratio = 80%).
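The diversity index discussed above can be computed with a few lines of code. The sketch assumes the common form of the Simpson diversity index, one minus the sum of squared category probabilities, and assumes that the "intervals" are ten equal-width popularity-rank bins; the source does not state how its intervals are defined. `simpson_diversity` and `popularity_decile` are hypothetical names.

```python
from collections import Counter


def simpson_diversity(labels):
    """Return 1 - sum(p_i ** 2) over the categories present in labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("labels must not be empty")
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def popularity_decile(rank, n_items):
    """Map a 0-based popularity rank to one of ten equal-width bins (ASSUMPTION)."""
    return min(9, rank * 10 // n_items)


# Samples spread evenly over the top 80 of 100 items fill eight bins equally.
labels = [popularity_decile(rank, 100) for rank in range(80)]
diversity = simpson_diversity(labels)
```

Under these assumptions, eight equally filled bins give 1 - 8 × (1/8)² = 0.875, which matches the threshold quoted above for a popularity ratio of 80%, although that agreement is only suggestive.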
The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art may make modifications and adaptations without departing from the concept of the present invention, and such modifications and adaptations are also regarded as falling within the scope of protection of the present invention.
Claims (1)
1. The LFM training sample construction method fusing the time attenuation factors is characterized by comprising the following steps of:
s1) obtaining the items rated by user u from the training samples as positive samples, the number of such items being Sp;
s2) calculating the popularity of the items in the whole training set:
where $u_i$ denotes a user who has rated item i, $T_r$ denotes the training set, and $f_{it}$ denotes the time attenuation factor of item i at time t;
where $t_{now}$ is the current time, the rating time is the moment at which the user rated item i, and the time unit is the day;
s3) evaluating the data diversity of the sample library using the Simpson diversity index (Simpson index), whose formula is:
$$D = 1 - \sum_{i \in S} P_i^2$$
where S denotes the whole sample set and $P_i$ denotes the probability that a drawn sample falls within interval i;
s4) sorting the items by popularity from high to low and taking the top 10%, 20%, 30%, ..., 100% of items by popularity as sample libraries in turn; randomly drawing negative samples from each sample library to construct a learning sample library; keeping the parameter α = 0.1 and the regularization parameter λ = 0.01 unchanged, with training-set iteration count epochs = 10, number of latent classes K = 30, and a positive-to-negative sample ratio of 1:10; and reporting the recommendation accuracy and recall rate of the algorithm for each sample library composition in a table for comparison;
the accuracy (precision) describes the proportion of the items in the recommendation list that the user has actually viewed, and is calculated as:
$$Precision = \frac{\sum_{u} |R(u) \cap T(u)|}{\sum_{u} |R(u)|}$$
the recall rate describes the proportion of the items viewed by the user in the test set T that appear in the recommendation list, and is calculated as:
$$Recall = \frac{\sum_{u} |R(u) \cap T(u)|}{\sum_{u} |T(u)|}$$
where T denotes the test set, R(u) denotes the list of items recommended to user u by the recommendation algorithm, and T(u) denotes the items actually rated by user u in the test set T;
s5) from the samples not rated by user u, selecting the samples with popularity ratio $R_{best}$ as negative samples according to the formula $Popularity(i) \le PRatio$ (6), where $Sample_u$ is the sample library for user u and PRatio is the popularity ratio of the sample library;
s6) combining the positive and negative samples into a sample set that serves as the training sample library of user u.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911356445.2A CN111191707B (en) | 2019-12-25 | 2019-12-25 | LFM training sample construction method integrating time attenuation factors |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911356445.2A CN111191707B (en) | 2019-12-25 | 2019-12-25 | LFM training sample construction method integrating time attenuation factors |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111191707A CN111191707A (en) | 2020-05-22 |
CN111191707B true CN111191707B (en) | 2023-06-06 |
Family
ID=70707538
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911356445.2A Active CN111191707B (en) | 2019-12-25 | 2019-12-25 | LFM training sample construction method integrating time attenuation factors |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111191707B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116610857B (en) * | 2023-04-10 | 2024-05-03 | 南京邮电大学 | Personalized post recommendation method based on user preference for post popularity |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106202151A (en) * | 2016-06-23 | 2016-12-07 | 长沙学院 | One is used for improving the multifarious method of personalized recommendation system |
CN109977299A (en) * | 2019-02-21 | 2019-07-05 | 西北大学 | A kind of proposed algorithm of convergence project temperature and expert's coefficient |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP5481295B2 (en) * | 2010-07-15 | 2014-04-23 | エヌ・ティ・ティ・コムウェア株式会社 | Object recommendation device, object recommendation method, object recommendation program, and object recommendation system |
US10977322B2 (en) * | 2015-11-09 | 2021-04-13 | WP Company, LLC | Systems and methods for recommending temporally relevant news content using implicit feedback data |
CN109063052B (en) * | 2018-07-19 | 2022-01-25 | 北京物资学院 | Personalized recommendation method and device based on time entropy |
Also Published As
Publication number | Publication date |
---|---|
CN111191707A (en) | 2020-05-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Wu et al. | User Modeling with Click Preference and Reading Satisfaction for News Recommendation. | |
CN111797321B (en) | Personalized knowledge recommendation method and system for different scenes | |
Hoiles et al. | Engagement and popularity dynamics of YouTube videos and sensitivity to meta-data | |
Christakopoulou et al. | Hoslim: Higher-order sparse linear method for top-n recommender systems | |
CN103260061B (en) | A kind of IPTV program commending method of context-aware | |
CN110334356B (en) | Article quality determining method, article screening method and corresponding device | |
US20060036640A1 (en) | Information processing apparatus, information processing method, and program | |
CN108563755A (en) | A kind of personalized recommendation system and method based on bidirectional circulating neural network | |
CN112800097A (en) | Special topic recommendation method and device based on deep interest network | |
CN110879864A (en) | Context recommendation method based on graph neural network and attention mechanism | |
CN105138653A (en) | Exercise recommendation method and device based on typical degree and difficulty | |
CN109471982B (en) | Web service recommendation method based on QoS (quality of service) perception of user and service clustering | |
CN109902823B (en) | Model training method and device based on generation countermeasure network | |
CN111488524B (en) | Attention-oriented semantic-sensitive label recommendation method | |
CN106599047B (en) | Information pushing method and device | |
CN112464100B (en) | Information recommendation model training method, information recommendation method, device and equipment | |
CN104766219B (en) | Based on the user's recommendation list generation method and system in units of list | |
CN109816015B (en) | Recommendation method and system based on material data | |
Babu et al. | An implementation of the user-based collaborative filtering algorithm | |
JP5481295B2 (en) | Object recommendation device, object recommendation method, object recommendation program, and object recommendation system | |
CN104899321A (en) | Collaborative filtering recommendation method based on item attribute score mean value | |
US9020863B2 (en) | Information processing device, information processing method, and program | |
He et al. | Blending pruning criteria for convolutional neural networks | |
CN111191707B (en) | LFM training sample construction method integrating time attenuation factors | |
CN107180028A (en) | A kind of recommended technology combined based on LDA with annealing algorithm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||