CN111191707B - LFM training sample construction method integrating time attenuation factors - Google Patents

LFM training sample construction method integrating time attenuation factors

Publication number
CN111191707B
Authority
CN
China
Prior art keywords
sample
user
popularity
training
samples
Prior art date
Legal status
Active
Application number
CN201911356445.2A
Other languages
Chinese (zh)
Other versions
CN111191707A (en)
Inventor
甘志刚
饶屾
蒋晓宁
余长宏
余斌霄
Current Assignee
Zhejiang Gongshang University
Original Assignee
Zhejiang Gongshang University
Priority date
2019-12-25
Filing date
2019-12-25
Publication date
2023-06-06
Application filed by Zhejiang Gongshang University
Priority to CN201911356445.2A
Publication of CN111191707A
Application granted
Publication of CN111191707B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0282 Rating or review of business operators or products
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/06 Buying, selling or leasing transactions
    • G06Q30/0601 Electronic shopping [e-shopping]
    • G06Q30/0631 Item recommendations

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Strategic Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Development Economics (AREA)
  • Marketing (AREA)
  • Software Systems (AREA)
  • Economics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • General Business, Economics & Management (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Game Theory and Decision Science (AREA)
  • Medical Informatics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method for constructing LFM training samples that integrates time attenuation factors, comprising the following steps: S1) acquiring positive samples; S2) calculating the popularity of the items in the whole training set; S3) evaluating the diversity of the sample library data; S4) measuring the recommendation accuracy and recall rate of the algorithm for different sample library compositions; S5) selecting samples at the optimal popularity ratio as negative samples; S6) combining the positive and negative samples into the user's training sample library. The advantages of the invention are that it jointly considers the influence of item popularity and sample diversity on recommendation performance, integrates time attenuation factors, measures the recommendation accuracy and recall rate for different sample library compositions through experiments, and derives the optimal popularity ratio by analysis, thereby obtaining optimal negative samples and a good training effect for the FC-LFM algorithm.

Description

LFM training sample construction method integrating time attenuation factors
Technical Field
The invention relates to the technical field of Internet big data processing, and in particular to an LFM training sample construction method integrating time attenuation factors.
Background
The LFM method fusing time attenuation factors (FC-LFM) is a machine-learning-based latent semantic model (latent factor model). It generates a user feature matrix P and an item feature matrix Q by learning from training samples, so the method used to construct the training sample library is particularly important. The traditional LFM approach to constructing the training samples of a user u takes the items rated by user u as positive samples with a rating value of 1, randomly draws a certain number of items not rated by user u from the training set as negative samples with a rating value of 0, and combines the positive and negative samples into the training samples for user u.
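The traditional construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function name `build_random_samples` and its parameters are hypothetical, and the number of negatives is left to the caller because the text only says "a certain number".

```python
import random


def build_random_samples(rated_items, all_items, n_negative, seed=None):
    """Baseline LFM sample construction (hypothetical helper).

    Items rated by the user are labelled 1; negatives are drawn
    uniformly at random from the items the user has not rated and
    are labelled 0.
    """
    rng = random.Random(seed)
    rated = set(rated_items)
    # Sort the candidate pool so that a fixed seed gives a reproducible draw.
    unrated = sorted(set(all_items) - rated)
    negatives = rng.sample(unrated, min(n_negative, len(unrated)))
    samples = {item: 1 for item in rated}
    samples.update({item: 0 for item in negatives})
    return samples
```

Because every unrated item is equally likely to be drawn, this baseline treats an obscure item the user never encountered the same way as a well-known item the user chose to ignore.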
A negative sample is meant to indicate an item the user is not interested in. However, some items that the user has not rated may simply not be popular enough for the user to know about them, and are not necessarily of a type the user dislikes. Drawing negative samples completely at random from the unrated items ignores the possibility that the user did not rate an item only because they were unaware of it, which can reduce recommendation accuracy. Conversely, if the negative samples are concentrated too heavily on highly popular items, the training sample library loses diversity, which also reduces recommendation accuracy. A balance point between sample popularity and sample diversity therefore needs to be found through experiments and used as the basis for composing the samples.
Disclosure of Invention
The invention provides an LFM training sample construction method that jointly considers the influence of item popularity and sample diversity on recommendation performance and integrates time attenuation factors.
To achieve the above purpose, the invention adopts the following technical scheme:
An LFM training sample construction method integrating time attenuation factors comprises the following steps:
S1) obtaining the items rated by a user u from the training samples, the number of such items being Sp, and taking them as positive samples;
S2) calculating the popularity of the items in the whole training set:
[Formula (1), the time-weighted popularity of item i, appears as an image in the original and is not reproduced here.]
where u_i denotes a user who has rated item i, T_r denotes the training set, and f_it denotes the time attenuation factor of item i at time t;
[Formula (2), the time attenuation factor f_it, appears as an image in the original and is not reproduced here.]
where t_now is the current time and the second time symbol (shown as an image in the original) is the time at which the user rated item i, with time measured in days;
S3) evaluating the data diversity of the sample library using the Simpson diversity index (Simpson index), with the formula:
[Formula (3), the Simpson diversity index, appears as an image in the original and is not reproduced here.]
where S denotes the whole sample set and P_i denotes the probability that a drawn sample falls within interval i;
S4) sorting the items by popularity from high to low; taking the top 10%, 20%, 30%, ..., 100% of items by popularity in turn as the sample library; randomly drawing negative samples from that library to construct the learning sample library; holding fixed the parameter alpha = 0.1, the regularization parameter lambda = 0.01, the number of training set iterations epochs = 10, the number of classes K = 30, and the positive-to-negative sample ratio 1:10; and tabulating and comparing the recommendation accuracy and recall rate of the algorithm for the different sample library compositions;
the accuracy describes the proportion of the items in the recommendation list that the user has actually viewed, and is calculated as:
[Formula (4), the accuracy, appears as an image in the original and is not reproduced here.]
the recall rate describes the proportion of the items viewed by the user in the test set T that appear in the recommendation list, and is calculated as:
[Formula (5), the recall rate, appears as an image in the original and is not reproduced here.]
where T denotes the test set, R(u) denotes the list of items recommended to user u by the recommendation algorithm, and T(u) denotes the items actually rated by user u in the test set T;
S5) from the samples not rated by user u, selecting as negative samples, according to formula (6), the samples whose popularity ratio is R_best:
[Formula (6) appears partly as an image in the original; only its trailing condition, Popularity(i) <= PRatio, survives as text.]
where Sample_u is the sample library for user u and PRatio is the popularity ratio of the sample library;
S6) combining the positive and negative samples into a sample set that serves as the training sample library of user u.
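As an illustration of the diversity measure in step S3, the sketch below computes a Simpson diversity index in Python. The exact formula in the patent is an image that is not reproduced in this text, so the form used here, one minus the sum of squared category probabilities, is an assumption based on the common definition of the index as the probability that two randomly drawn samples belong to different categories. The function name `simpson_diversity` and the use of arbitrary category labels for the intervals are also hypothetical.

```python
from collections import Counter


def simpson_diversity(labels):
    """Simpson diversity index, assumed form 1 - sum(P_i ** 2).

    `labels` holds one category (interval) label per sample. A larger
    value means the samples are more dispersed across categories.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("labels must not be empty")
    return 1.0 - sum((count / total) ** 2 for count in counts.values())
```

For example, a library whose samples all fall in one interval scores 0, while samples spread evenly over four intervals score 0.75.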
Compared with the prior art, the invention has the following advantages:
The LFM training sample construction method integrating time attenuation factors jointly considers the influence of item popularity and sample diversity on recommendation performance and integrates time attenuation factors. The recommendation accuracy and recall rate of the algorithm for different sample library compositions are measured through experiments, and the optimal popularity ratio is derived by analysis, so that optimal negative samples and a good training effect for the FC-LFM algorithm are obtained.
Drawings
FIG. 1 is a schematic flow chart of the present invention.
FIG. 2 is a graph of the influence of the proportion of most popular movies on algorithm recommendation performance in an embodiment of the invention.
FIG. 3 is a graph of the influence of sample library data diversity on algorithm recommendation performance in an embodiment of the invention.
Detailed Description
Embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
The data set used in the invention is MovieLens 100K (https://grouplens.org/datasets/movielens/100k/), provided by the GroupLens Research lab at the University of Minnesota, USA. It contains 100,000 rating records from 943 users on 1,682 movies, giving a data sparsity of 93.7%. Each record includes the user's rating of the movie (1 to 5 points) and the time of the rating (accurate to the second). The data set is randomly split in a 9:1 ratio into a training set and a test set, which are used to train and test the algorithm.
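The sparsity figure and the 9:1 split can be checked with a short Python sketch. The helper names `sparsity` and `split_records`, the fixed seed, and the shuffle-then-cut strategy are assumptions made for illustration; the patent states only that the split is random.

```python
import random


def sparsity(n_ratings, n_users, n_items):
    """Fraction of the user-item matrix that has no rating."""
    return 1.0 - n_ratings / (n_users * n_items)


def split_records(records, train_fraction=0.9, seed=0):
    """Randomly split rating records into training and test lists."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# 100,000 ratings over 943 users and 1,682 movies.
print(round(sparsity(100_000, 943, 1_682), 3))  # 0.937
```

With 943 x 1,682 = 1,586,126 possible user-movie pairs and 100,000 ratings, about 6.3% of the matrix is filled, which matches the stated sparsity of 93.7%.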
Specifically, the LFM training sample construction method integrating time attenuation factors comprises the following steps:
S1) obtaining the movies rated by a user u from the training samples, the number of such movies being Sp, and taking them as positive samples;
S2) to evaluate how popular a movie is, the invention uses a popularity index. Popularity represents the proportion of all users who have rated the item; the higher the popularity, the more users have rated it. Because time affects popularity, an item that was popular earlier is not necessarily popular now, so a time attenuation factor is introduced into the popularity, and the popularity of the movies in the whole training set is calculated as:
[Formula (1), the time-weighted popularity of movie i, appears as an image in the original and is not reproduced here.]
where u_i denotes a user who has rated movie i, T_r denotes the training set, and f_it denotes the time attenuation factor of movie i at time t;
[Formula (2), the time attenuation factor f_it, appears as an image in the original and is not reproduced here.]
where t_now is the current time and the second time symbol (shown as an image in the original) is the time at which the user rated movie i, with time measured in days;
S3) to evaluate the influence of the data diversity of the sample library on the learning performance of the algorithm, the data diversity of the sample library is evaluated using the Simpson diversity index (Simpson index). The Simpson diversity index represents the probability that two randomly sampled individuals belong to different categories of data; the larger the value, the more dispersed the samples, and the smaller the value, the more concentrated they are. The formula is:
[Formula (3), the Simpson diversity index, appears as an image in the original and is not reproduced here.]
where S denotes the whole sample set and P_i denotes the probability that a drawn sample falls within interval i;
S4) sorting the movies by popularity from high to low; taking the top 10%, 20%, 30%, ..., 100% of movies by popularity in turn as the sample library; randomly drawing negative samples from that library to construct the learning sample library; holding fixed the parameter alpha = 0.1, the regularization parameter lambda = 0.01, the number of training set iterations epochs = 10, the number of classes K = 30, and the positive-to-negative sample ratio 1:10; and comparing the recommendation accuracy and recall rate of the algorithm for the different sample library compositions, as compiled in Table 1;
the accuracy and recall rate reflect the effectiveness of the recommendation algorithm. The accuracy describes the proportion of the movies in the recommendation list that the user has actually watched (with respect to the test set T), and is calculated as:
[Formula (4), the accuracy, appears as an image in the original and is not reproduced here.]
the recall rate describes the proportion of the movies watched by the user in the test set T that appear in the recommendation list, and is calculated as:
[Formula (5), the recall rate, appears as an image in the original and is not reproduced here.]
where T denotes the test set, R(u) denotes the list of movies recommended to user u by the recommendation algorithm, and T(u) denotes the movies actually rated by user u in the test set T;
Table 1. Algorithm prediction error for different sample libraries
[Table 1 appears as an image in the original and is not reproduced here.]
S5) to evaluate how much of the whole popularity-ranked sample set is covered by the samples in the sample library, the popularity ratio (PRatio) of the sample library is used. The smaller the popularity ratio, the more the samples in the library are concentrated on the most popular samples; for example, a popularity ratio of 20% means that the sample library consists of the top 20% most popular samples in the training set. The formula is:
[Formula (6) appears partly as an image in the original; only its trailing condition, Popularity(i) <= PRatio, survives as text.]
where Sample_u is the sample library for user u and PRatio is the popularity ratio of the sample library;
from the samples not rated by user u, the samples whose popularity ratio is R_best are selected as negative samples according to formula (6);
S6) combining the positive and negative samples into a sample set that serves as the training sample library of user u.
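Steps S2, S5 and S6 can be sketched together in Python. Several parts of this sketch are assumptions rather than the patent's method: the formula images for the popularity and the time attenuation factor are not reproduced in this text, so the exponential decay, the `decay_rate` parameter, and the normalization by the number of users are hypothetical choices; the 1:10 ratio is read here as ten negatives per positive; and all function names are invented for the example.

```python
import math
import random
from collections import defaultdict

SECONDS_PER_DAY = 86400


def decayed_popularity(ratings, n_users, t_now, decay_rate=0.01):
    """Time-weighted popularity per item (assumed form).

    `ratings` is an iterable of (user, item, unix_timestamp). Each rating
    contributes exp(-decay_rate * age_in_days), an assumed decay shape,
    and the total is divided by the number of users.
    """
    totals = defaultdict(float)
    for _user, item, timestamp in ratings:
        age_days = max(0.0, (t_now - timestamp) / SECONDS_PER_DAY)
        totals[item] += math.exp(-decay_rate * age_days)
    return {item: total / n_users for item, total in totals.items()}


def sample_negatives(popularity, rated_items, p_ratio, n_negative, seed=None):
    """Draw negatives from the top `p_ratio` share of items by popularity."""
    rng = random.Random(seed)
    # Rank by popularity, highest first; ties are broken by item id.
    ranked = sorted(popularity, key=lambda item: (-popularity[item], item))
    keep = max(1, round(len(ranked) * p_ratio))
    rated = set(rated_items)
    pool = [item for item in ranked[:keep] if item not in rated]
    return rng.sample(pool, min(n_negative, len(pool)))


def build_training_library(popularity, rated_items, p_ratio=0.8,
                           neg_per_pos=10, seed=None):
    """Combine positives (label 1) and sampled negatives (label 0)."""
    positives = set(rated_items)
    negatives = sample_negatives(
        popularity, positives, p_ratio, neg_per_pos * len(positives), seed
    )
    library = {item: 1 for item in positives}
    library.update({item: 0 for item in negatives})
    return library
```

Restricting the candidate pool before sampling is what distinguishes this construction from uniform random sampling: items outside the top share by popularity can never become negatives, so a user is not penalized for ignoring movies they probably never saw.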
FIG. 2 and FIG. 3 show the trends of the influence of movie popularity and sample diversity on the recommendation accuracy and recall rate of the algorithm. As can be seen from Table 1, FIG. 2 and FIG. 3, when the sample diversity index is smaller than 0.875, the recommendation accuracy and recall rate of the algorithm gradually increase as sample diversity improves; in this range, the benefit that sample diversity brings to the algorithm outweighs the harm caused by the growing number of unpopular movies in the sample library. As more unpopular movies are added, sample diversity improves further, but recommendation accuracy and recall rate decline instead, which shows that the effect of unpopular movies on the algorithm's prediction error has come to outweigh the benefit of sample diversity. Thus, in subsequent experiments, the sample library consists of the top 80% of movies in the training set by popularity (popularity ratio = 80%).
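The two metrics used in this comparison can be computed as in the Python sketch below. The patent's formulas for accuracy and recall rate are images that are not reproduced in this text, so the aggregated form used here (total hits across users divided by total recommended items, or by total test-set items) is an assumption based on the usual top-N definitions; `precision_recall` and its dictionary inputs are hypothetical.

```python
def precision_recall(recommended, relevant):
    """Aggregate precision and recall over users (assumed form).

    `recommended` maps each user u to the recommendation list R(u);
    `relevant` maps each user u to the test-set items T(u). Only users
    present in `recommended` are counted.
    """
    hits = n_recommended = n_relevant = 0
    for user, rec_items in recommended.items():
        rec_set = set(rec_items)
        rel_set = set(relevant.get(user, ()))
        hits += len(rec_set & rel_set)
        n_recommended += len(rec_set)
        n_relevant += len(rel_set)
    precision = hits / n_recommended if n_recommended else 0.0
    recall = hits / n_relevant if n_relevant else 0.0
    return precision, recall
```

Running this once per candidate popularity ratio (10%, 20%, ..., 100%) with all other hyperparameters fixed would give the kind of comparison that Table 1 summarizes.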
The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art may make modifications and adaptations without departing from the concept of the present invention, and such modifications and adaptations are intended to fall within the scope of protection of the present invention.

Claims (1)

1. An LFM training sample construction method integrating time attenuation factors, characterized by comprising the following steps:
S1) obtaining the items rated by a user u from the training samples, the number of such items being Sp, and taking them as positive samples;
S2) calculating the popularity of the items in the whole training set:
[Formula (1), the time-weighted popularity of item i, appears as an image in the original and is not reproduced here.]
where u_i denotes a user who has rated item i, T_r denotes the training set, and f_it denotes the time attenuation factor of item i at time t;
[Formula (2), the time attenuation factor f_it, appears as an image in the original and is not reproduced here.]
where t_now is the current time and the second time symbol (shown as an image in the original) is the time at which the user rated item i, with time measured in days;
S3) evaluating the data diversity of the sample library using the Simpson diversity index (Simpson index), with the formula:
[Formula (3), the Simpson diversity index, appears as an image in the original and is not reproduced here.]
where S denotes the whole sample set and P_i denotes the probability that a drawn sample falls within interval i;
S4) sorting the items by popularity from high to low; taking the top 10%, 20%, 30%, ..., 100% of items by popularity in turn as the sample library; randomly drawing negative samples from that library to construct the learning sample library; holding fixed the parameter alpha = 0.1, the regularization parameter lambda = 0.01, the number of training set iterations epochs = 10, the number of classes K = 30, and the positive-to-negative sample ratio 1:10; and tabulating and comparing the recommendation accuracy and recall rate of the algorithm for the different sample library compositions;
the accuracy describes the proportion of the items in the recommendation list that the user has actually viewed, and is calculated as:
[Formula (4), the accuracy, appears as an image in the original and is not reproduced here.]
the recall rate describes the proportion of the items viewed by the user in the test set T that appear in the recommendation list, and is calculated as:
[Formula (5), the recall rate, appears as an image in the original and is not reproduced here.]
where T denotes the test set, R(u) denotes the list of items recommended to user u by the recommendation algorithm, and T(u) denotes the items actually rated by user u in the test set T;
S5) from the samples not rated by user u, selecting as negative samples, according to formula (6), the samples whose popularity ratio is R_best:
[Formula (6) appears partly as an image in the original; only its trailing condition, Popularity(i) <= PRatio, survives as text.]
where Sample_u is the sample library for user u and PRatio is the popularity ratio of the sample library;
S6) combining the positive and negative samples into a sample set that serves as the training sample library of user u.
CN201911356445.2A 2019-12-25 2019-12-25 LFM training sample construction method integrating time attenuation factors Active CN111191707B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911356445.2A CN111191707B (en) 2019-12-25 2019-12-25 LFM training sample construction method integrating time attenuation factors

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911356445.2A CN111191707B (en) 2019-12-25 2019-12-25 LFM training sample construction method integrating time attenuation factors

Publications (2)

Publication Number Publication Date
CN111191707A CN111191707A (en) 2020-05-22
CN111191707B CN111191707B (en) 2023-06-06

Family

ID=70707538

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911356445.2A Active CN111191707B (en) 2019-12-25 2019-12-25 LFM training sample construction method integrating time attenuation factors

Country Status (1)

Country Link
CN (1) CN111191707B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116610857B (en) * 2023-04-10 2024-05-03 南京邮电大学 Personalized post recommendation method based on user preference for post popularity

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202151A (en) * 2016-06-23 2016-12-07 长沙学院 One is used for improving the multifarious method of personalized recommendation system
CN109977299A (en) * 2019-02-21 2019-07-05 西北大学 A kind of proposed algorithm of convergence project temperature and expert's coefficient

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5481295B2 (en) * 2010-07-15 2014-04-23 エヌ・ティ・ティ・コムウェア株式会社 Object recommendation device, object recommendation method, object recommendation program, and object recommendation system
US10977322B2 (en) * 2015-11-09 2021-04-13 WP Company, LLC Systems and methods for recommending temporally relevant news content using implicit feedback data
CN109063052B (en) * 2018-07-19 2022-01-25 北京物资学院 Personalized recommendation method and device based on time entropy

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202151A (en) * 2016-06-23 2016-12-07 长沙学院 One is used for improving the multifarious method of personalized recommendation system
CN109977299A (en) * 2019-02-21 2019-07-05 西北大学 A kind of proposed algorithm of convergence project temperature and expert's coefficient

Also Published As

Publication number Publication date
CN111191707A (en) 2020-05-22

Similar Documents

Publication Publication Date Title
Wu et al. User Modeling with Click Preference and Reading Satisfaction for News Recommendation.
CN111797321B (en) Personalized knowledge recommendation method and system for different scenes
Hoiles et al. Engagement and popularity dynamics of YouTube videos and sensitivity to meta-data
Christakopoulou et al. Hoslim: Higher-order sparse linear method for top-n recommender systems
CN103260061B (en) A kind of IPTV program commending method of context-aware
CN110334356B (en) Article quality determining method, article screening method and corresponding device
US20060036640A1 (en) Information processing apparatus, information processing method, and program
CN108563755A (en) A kind of personalized recommendation system and method based on bidirectional circulating neural network
CN112800097A (en) Special topic recommendation method and device based on deep interest network
CN110879864A (en) Context recommendation method based on graph neural network and attention mechanism
CN105138653A (en) Exercise recommendation method and device based on typical degree and difficulty
CN109471982B (en) Web service recommendation method based on QoS (quality of service) perception of user and service clustering
CN109902823B (en) Model training method and device based on generation countermeasure network
CN111488524B (en) Attention-oriented semantic-sensitive label recommendation method
CN106599047B (en) Information pushing method and device
CN112464100B (en) Information recommendation model training method, information recommendation method, device and equipment
CN104766219B (en) Based on the user's recommendation list generation method and system in units of list
CN109816015B (en) Recommendation method and system based on material data
Babu et al. An implementation of the user-based collaborative filtering algorithm
JP5481295B2 (en) Object recommendation device, object recommendation method, object recommendation program, and object recommendation system
CN104899321A (en) Collaborative filtering recommendation method based on item attribute score mean value
US9020863B2 (en) Information processing device, information processing method, and program
He et al. Blending pruning criteria for convolutional neural networks
CN111191707B (en) LFM training sample construction method integrating time attenuation factors
CN107180028A (en) A kind of recommended technology combined based on LDA with annealing algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant