CN113343079A - Attack detection robust recommendation method based on random forest and target item identification - Google Patents

Publication number
CN113343079A
Authority
CN
China
Prior art keywords
attack
user
item
detection
random forest
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110511665.9A
Other languages
Chinese (zh)
Inventor
伊华伟
徐文倩
冯晗
李晓会
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Liaoning University of Technology
Original Assignee
Liaoning University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Liaoning University of Technology
Priority to CN202110511665.9A
Publication of CN113343079A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9536 Search customisation based on social or collaborative filtering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/004 Artificial life, i.e. computing arrangements simulating life
    • G06N3/006 Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/018 Certifying business or products
    • G06Q30/0185 Product, service or business identity fraud
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/06 Buying, selling or leasing transactions
    • G06Q30/0601 Electronic shopping [e-shopping]
    • G06Q30/0631 Item recommendations

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • Evolutionary Computation (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Economics (AREA)
  • Artificial Intelligence (AREA)
  • Development Economics (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an attack detection robust recommendation method based on random forest and target item identification, which comprises the following steps. S1: extract, from the rating data and based on chi-square statistics, effective features capable of distinguishing normal users from attack users. S2: train a random forest classifier on the effective features extracted in step S1, and use the trained classifier to perform first-stage detection on the set of users to be detected, obtaining a first-stage user profile detection result. S3: apply target item identification to the initial attack profile class obtained in step S2 to perform second-stage attack profile detection. S4: construct a robust recommendation algorithm from the attack profile detection result to realize robust recommendation based on attack detection. Compared with existing robust recommendation algorithms, the proposed algorithm improves robustness while preserving recommendation accuracy, so that the recommendation results of the collaborative filtering recommendation system are more accurate.

Description

Attack detection robust recommendation method based on random forest and target item identification
Technical Field
The invention relates to the technical field of computer-based personalized recommendation, and in particular to an attack detection robust recommendation method based on random forest and target item identification.
Background
Collaborative filtering recommendation systems are an important component of e-commerce and can proactively provide personalized recommendation services to users. To obtain users' preference data, a recommendation system is necessarily open to its users. However, some merchants exploit this openness to inject fake rating data into the system and change its recommendation results for their own benefit. This malicious behavior is called a "shilling attack", also known as a recommendation attack. Shilling attacks interfere with the recommendation process and bias the recommendation results, which easily causes dissatisfaction among users and merchants. How to make a recommendation system resistant to attack while ensuring the accuracy of its results has therefore become an urgent problem.
To address these problems, researchers have proposed robust recommendation algorithms based on machine learning, from both the supervised and the unsupervised perspective.
On the supervised side, Williams et al. extracted 13 features of attack users and, on that basis, detected attack profiles using SVM, KNN and C4.5. Wu et al. extracted effective attack detection metrics and classified users with naive Bayes and k-nearest neighbor classifiers. Li et al. used item popularity to extract features for different users and, based on these features, proposed a popularity-based attack profile detection algorithm using an improved ID3 algorithm. Zhou et al. proposed an SVM-TIA based attack profile detection algorithm to alleviate the class imbalance problem in classification. Zhou et al. also extracted features of AoP attacks using the TF-IDF text feature and proposed an SVM-based attack profile detection algorithm. Hao et al. proposed an automatic feature extraction method and an AdaBoost-based detection method to address class imbalance.
On the unsupervised side, Zhang et al. proposed an unsupervised shilling attack detection method based on a hidden Markov model and hierarchical clustering. Clever et al. proposed the LFAMR model to analyze the latent factors behind missing ratings. Zhouqiang et al. extracted general features of attack users using information entropy, covered the users with a bionic pattern recognition technique, and judged users outside the coverage to be attack users. Mobasher et al. proposed two recommendation algorithms, one based on k-means and the other on probabilistic latent semantic analysis, both of which are markedly more robust under attack than the traditional k-nearest neighbor method. Another work first introduced the concept of a risk factor, computed a risk value for user rating behavior, then used information entropy to compute the classification weight of each risk factor, and finally proposed a multi-dimensional risk factor attack profile detection method. Zhang et al. proposed an unsupervised attack detection algorithm based on user rating behavior.
Existing robust recommendation algorithms still have two shortcomings: first, genuine profiles are easily misjudged as attack profiles, which harms the accuracy of the algorithm; second, robustness is improved at the cost of accuracy.
Disclosure of Invention
In view of the above problems, the invention aims to provide a robust recommendation method based on random forest and target item identification that gives a collaborative filtering recommendation system stronger robustness while preserving recommendation accuracy.
To achieve this purpose, the invention adopts the following technical scheme:
The attack detection robust recommendation method based on random forest and target item identification is characterized by comprising the following steps,
S1: extract, from the rating data and based on chi-square statistics, effective features capable of distinguishing normal users from attack users;
S2: train a random forest classifier on the effective features extracted in step S1, and use the trained classifier to perform first-stage detection on the set of users to be detected, obtaining a first-stage user profile detection result;
S3: apply target item identification to the initial attack profile class obtained in step S2 to perform second-stage attack profile detection;
S4: construct a robust recommendation algorithm from the attack profile detection result to realize robust recommendation based on attack detection.
Further, the specific operation of step S1 includes the following steps,
S101: Let U = {u_1, u_2, ..., u_m} denote the set of all users and I = {i_1, i_2, ..., i_n} the set of all items. For an item i ∈ I and a user u ∈ U, if
Figure BDA0003060518240000021
item i is regarded as having been rated once, and the number of times item i is rated by all users in U is counted, namely
Figure BDA0003060518240000031
S102: Compute the item popularity of all items, T = {IPop_1, IPop_2, ..., IPop_n}, and sort the n items in descending order of popularity;
S103: According to the sorting result, divide the n items into two sets using 10-fold cross validation: a popular item set I_POPU and a non-popular item set I_UNPOPU;
S104: Treat the item and the user as two variables whose value ranges are {popular item, non-popular item} and {rated by user, not rated by user}, respectively, and compute the degree of association between non-popular items and user u, that is, the chi-square value of non-popular items,
Figure BDA0003060518240000032
where
Figure BDA0003060518240000033
denotes the number of items that belong to I_UNPOPU and are rated by user u,
Figure BDA0003060518240000034
denotes the number of items that do not belong to I_UNPOPU and are rated by user u,
Figure BDA0003060518240000035
denotes the number of items that belong to I_UNPOPU and are not rated by user u,
Figure BDA0003060518240000036
denotes the number of items that do not belong to I_UNPOPU and are not rated by user u, and N denotes the total number of items;
S105: Combine the chi-square value feature of non-popular items computed in step S104 with 13 existing detection features, including WDMA, RDMA, WDA, Length Variance, DegSim', FMV, FAC, FMD and PV, to form a feature matrix V of user feature vectors, which serves as the effective features for distinguishing normal users from attack users.
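Steps S101 to S103 can be sketched as follows. This is a minimal illustration, not the patent's Algorithm 1: the function names, the dictionary-based rating layout and the `popular_fraction` parameter are all assumptions, and the fixed fraction merely stands in for the cut point that the text says is chosen by 10-fold cross validation.

```python
from collections import Counter


def item_popularity(ratings):
    """Count how many users rated each item (S101).

    ratings maps user -> {item: rating}; only rated items appear.
    """
    counts = Counter()
    for user_ratings in ratings.values():
        counts.update(user_ratings.keys())
    return counts


def split_by_popularity(counts, all_items, popular_fraction):
    """Sort items by descending popularity (S102) and cut the list in two (S103).

    popular_fraction is a hypothetical stand-in for the cut point that the
    text selects by 10-fold cross validation.
    """
    ranked = sorted(all_items, key=lambda i: (-counts.get(i, 0), i))
    cut = max(1, int(len(ranked) * popular_fraction))
    return set(ranked[:cut]), set(ranked[cut:])


# Toy data: three users, five items, two of which nobody rated.
ratings = {
    "u1": {"a": 5, "b": 4},
    "u2": {"a": 3, "c": 2},
    "u3": {"a": 4, "b": 1},
}
items = ["a", "b", "c", "d", "e"]
popularity = item_popularity(ratings)
popular, non_popular = split_by_popularity(popularity, items, popular_fraction=0.2)
```

In this toy data, item `a` is rated by all three users and ends up alone in the popular set, while the remaining items form the non-popular set.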
Further, the specific operation of step S2 includes the following steps,
S201: Divide the original user rating data into two parts in a given proportion, one part serving as the training set for the random forest classifier and the other as the set of users to be detected;
S202: Compute the feature matrices of the training set and of the set of users to be detected according to the feature matrix V extracted in step S1;
S203: Construct and train a random forest classifier using the training set data;
S204: Use the trained random forest classifier to classify the set of users to be detected and output the classification prediction result, which gives the first-stage user profile detection result and preliminarily divides the profiles into an initial real profile class and an initial attack profile class.
Further, the specific operation of step S203 includes the following steps,
S2031: Suppose the training data set contains t samples. Using Bootstrap resampling, randomly draw k subsets from the data set and train one decision tree on each, where each sample in the training subsets has m attributes;
S2032: Whenever a node of a decision tree needs to be split, randomly select s attributes (s < m) from the m attributes, choose one of these s attributes as the splitting attribute of the node, and repeat this splitting process until the stop condition is met;
S2033: Train k decision tree models on the k Bootstrap sample sets in the manner of step S2032, and finally combine all generated decision trees into a random forest classifier {T_i, i = 1, 2, ..., k}.
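The two sources of randomness in steps S2031 and S2032 can be sketched as below. Tree construction itself is omitted, and the helper names, the toy data set and the values of k, m and s are illustrative assumptions rather than settings taken from the patent.

```python
import random


def bootstrap_samples(dataset, k, rng):
    """Draw k bootstrap sample sets, each of size t, with replacement (S2031)."""
    t = len(dataset)
    return [[dataset[rng.randrange(t)] for _ in range(t)] for _ in range(k)]


def candidate_attributes(m, s, rng):
    """Pick s distinct attribute indices out of m for one node split (S2032)."""
    return sorted(rng.sample(range(m), s))


rng = random.Random(0)
# Hypothetical toy data: (feature vector, label) pairs, label 1 = attack user.
dataset = [([0.1 * j, 1.0 - 0.1 * j], j % 2) for j in range(10)]
samples = bootstrap_samples(dataset, k=5, rng=rng)
attrs = candidate_attributes(m=14, s=4, rng=rng)
```

Each of the k sample sets would then be used to grow one decision tree, calling `candidate_attributes` anew at every node that needs to be split.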
Further, the specific operation of step S3 includes the following steps,
S301: Compute the mean rating of each item in the initial attack profile class and take the item with the largest mean as the target item;
S302: Check in turn the users in the user set of the initial attack profile class, find all users who gave the target item the highest rating, identify them as the final attack profiles, and complete the second-stage detection.
Further, the specific operation of step S4 includes the following steps,
S401: Based on the initial real profile class obtained in step S2 and the final attack profile class obtained in step S3, initialize the feature matrices with the PSO method to obtain an initial user feature matrix and an initial item feature matrix;
S402: Construct an indicator function I_S(u) from the final attack profile detection result of step S3,
Figure BDA0003060518240000051
where S is the set of users corresponding to the final attack profiles and U is the set of all users;
S403: Combine the indicator function I_S(u) with the update of the item feature vector q_i to obtain q_i ← q_i + I_S(u) γ (p_u e_ui - λ q_i);
S404: Use the formulas p_u ← p_u + γ (q_i e_ui - λ p_u) and q_i ← q_i + I_S(u) γ (p_u e_ui - λ q_i) to iteratively update the initial user feature matrix and item feature matrix until the algorithm converges, obtaining the optimal user feature matrix and the optimal item feature matrix;
S405: Generate recommendations for the target user from the optimal user feature matrix and item feature matrix.
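The indicator function of step S402 survives only as an image reference in this text. One reading that is consistent with the update rule of step S403, under the assumption that ratings from detected attack users should not move the item features, is the following; it is a reconstruction for illustration, not the patent's verbatim formula.

```latex
I_S(u) =
\begin{cases}
0, & u \in S \\
1, & u \in U \setminus S
\end{cases}
```

Under this reading, a rating by a user in S leaves q_i unchanged because the correction term is multiplied by 0, while ratings by all other users update q_i as in ordinary matrix factorization.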
The invention has the following beneficial effects:
The invention provides a robust recommendation method that fuses random forest and target item identification. First, a trained random forest classifier performs first-stage attack profile detection on the user profiles; then target item identification completes the second-stage detection and yields the final attack profile detection result. This result is combined with a matrix factorization model to give the robust recommendation algorithm RRA-RFTII. Compared with the existing matrix factorization method based on the M-estimator (MMF), the matrix factorization method based on the least trimmed squares estimator, and the robust recommendation algorithm based on incremental clustering and matrix factorization (KMR-M), the proposed algorithm is superior in both recommendation accuracy and robustness. In addition, obtaining the initial feature matrices through particle swarm optimization improves the ability of model training to reach the optimal solution, which preserves the recommendation accuracy of the algorithm and makes the recommendation results of the collaborative filtering recommendation system more accurate.
Drawings
FIG. 1 is a block diagram of a robust recommendation method of the present invention;
FIG. 2 is a block flow diagram of steps S1-S3 of the robust recommendation method of the present invention;
FIG. 3 compares the accuracy of four attack detection algorithms in the first embodiment of the present invention;
FIG. 4 compares the recall of four attack detection algorithms in the first embodiment of the present invention;
FIG. 5 shows the MAE values of four recommendation algorithms in the first embodiment of the present invention;
FIG. 6 shows the PS values of four recommendation algorithms in the first embodiment of the present invention.
Detailed Description
In order to make those skilled in the art better understand the technical solution of the present invention, the following further describes the technical solution of the present invention with reference to the drawings and the embodiments.
As shown in FIG. 1, the attack detection robust recommendation method based on random forest and target item identification comprises the following steps,
S1: extract, from the rating data and based on chi-square statistics, effective features capable of distinguishing normal users from attack users;
Specifically, S101: Let U = {u_1, u_2, ..., u_m} denote the set of all users and I = {i_1, i_2, ..., i_n} the set of all items. For an item i ∈ I and a user u ∈ U, if
Figure BDA0003060518240000061
item i is regarded as having been rated once, and the number of times item i is rated by all users in U is counted, namely
Figure BDA0003060518240000062
S102: Compute the item popularity of all items, T = {IPop_1, IPop_2, ..., IPop_n}, and sort the n items in descending order of popularity;
S103: According to the sorting result, divide the n items into two sets using 10-fold cross validation: a popular item set I_POPU and a non-popular item set I_UNPOPU. Normal users rate popular items often and non-popular items rarely, whereas attack users rate items at random, so their numbers of ratings for popular and non-popular items do not differ greatly.
S104: Treat the item and the user as two variables whose value ranges are {popular item, non-popular item} and {rated by user, not rated by user}, respectively, and compute the degree of association between non-popular items and user u, that is, the chi-square value of non-popular items,
Figure BDA0003060518240000071
where
Figure BDA0003060518240000072
denotes the number of items that belong to I_UNPOPU and are rated by user u,
Figure BDA0003060518240000073
denotes the number of items that do not belong to I_UNPOPU and are rated by user u,
Figure BDA0003060518240000074
denotes the number of items that belong to I_UNPOPU and are not rated by user u,
Figure BDA0003060518240000075
denotes the number of items that do not belong to I_UNPOPU and are not rated by user u, and N denotes the total number of items;
The larger the CSUI value, the stronger the association between user u and non-popular items, indicating that user u rates non-popular items more often; the smaller the CSUI value, the weaker the association, indicating that user u rates non-popular items less often; a CSUI value of 0 means that user u is independent of non-popular items and has rated non-popular items 0 times. Because normal users tend to rate popular items while attack users rate randomly selected items, attack users rate non-popular items more often than normal users do, so the CSUI value of a normal user is smaller than that of an attack user;
S105: Combine the chi-square value feature of non-popular items computed in step S104 with the detection features proposed in the prior art, namely in "Defending recommender systems: detection of profile injection attacks" [J] (Williams C A, Mobasher B, Burke R; Service Oriented Computing and Applications, 2007, 1(3): 157-) and in "[...]: a classification-based approach" [C] // Proceedings of the 8th Knowledge Discovery on the Web International Conference on Advances in Web Mining and Web Usage Analysis (Williams C A, Mobasher B, Burke R, et al.; Berlin: Springer, 2007. 167-186): WDMA, RDMA, WDA, Length Variance, DegSim', FMV (mean attack model), FAC (random attack model), FAC (popular attack model), FMD (random attack model), FMD (popular attack model), FMD (mean attack model) and PV (mean attack model). Together these constitute a feature matrix V of user feature vectors, which serves as the effective features for distinguishing normal users from attack users.
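The CSUI formula itself survives only as an image reference here. A plausible reading, given the four counts and N defined in step S104, is the standard chi-square statistic for a 2 x 2 contingency table, N(AD - BC)^2 / ((A + B)(C + D)(A + C)(B + D)). The sketch below assumes that form; the names `csui`, `a`, `b`, `c` and `d` are illustrative and do not come from the patent.

```python
def csui(rated_items, non_popular_items, all_items):
    """Chi-square association between 'item is non-popular' and 'user rated it'.

    Assumes the standard 2 x 2 contingency-table statistic; the patent's own
    formula is not recoverable from this text.
    """
    items = set(all_items)
    rated = set(rated_items) & items
    non_pop = set(non_popular_items) & items
    a = len(rated & non_pop)          # non-popular and rated by u
    b = len(rated - non_pop)          # popular and rated by u
    c = len(non_pop - rated)          # non-popular and not rated by u
    d = len(items - non_pop - rated)  # popular and not rated by u
    n = len(items)
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table, e.g. the user rated nothing
    return n * (a * d - b * c) ** 2 / denom


# Toy data: 10 items, the last 6 of which are treated as non-popular.
items = [f"i{j}" for j in range(10)]
non_popular = items[4:]
value = csui(["i0", "i1", "i4"], non_popular, items)
empty_value = csui([], non_popular, items)
```

For the toy user above the counts are a = 1, b = 2, c = 5, d = 2, so the statistic is 10 * (2 - 10)^2 / (3 * 7 * 6 * 4) = 640 / 504.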
The algorithm of step S1 is named Algorithm 1 (fusion feature extraction algorithm, FFEA), and is specified as follows:
Input: user-item rating matrix R, user set U and item set I;
Output: feature matrix V.
Figure BDA0003060518240000081
Figure BDA0003060518240000091
In Algorithm 1, lines 1-8 compute the popularity of each item, sort the items in descending order of popularity, and divide all items into a popular item set I_POPU and a non-popular item set I_UNPOPU; lines 9-14 compute the features of each user u using the 13 features from the prior art together with the CSUI formula; line 15 returns the feature matrix V composed of all user feature vectors.
Further, step S2: training a random forest classifier based on the effective features extracted in the step S1, and performing first-stage detection on a user set to be detected by using the trained random forest classifier to obtain a first-stage user profile detection result;
Specifically, S201: Divide the original user rating data into two parts in a given proportion, one part serving as the training set for the random forest classifier and the other as the set of users to be detected;
S202: Compute the feature matrices of the training set and of the set of users to be detected according to the feature matrix V extracted in step S1;
S203: Construct and train a random forest classifier using the training set data;
The specific steps for constructing and training the random forest classifier are as follows,
S2031: Suppose the training data set contains t samples. Using Bootstrap resampling, randomly draw k subsets from the data set and train one decision tree on each, where each sample in the training subsets has m attributes;
S2032: Whenever a node of a decision tree needs to be split, randomly select s attributes (s << m) from the m attributes, choose one of these s attributes as the splitting attribute of the node, and repeat this splitting process until the stop condition is met;
S2033: Train k decision tree models on the k Bootstrap sample sets in the manner of step S2032, and finally combine all generated decision trees into a random forest classifier {T_i, i = 1, 2, ..., k}.
S204: Use the trained random forest classifier to classify the set of users to be detected and output the classification prediction result, which gives the first-stage user profile detection result and preliminarily divides the profiles into an initial real profile class and an initial attack profile class.
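Step S204 can be sketched as below. The text does not say how the k trees are combined; majority voting, the conventional choice for random forests, is assumed here. The three lambda "trees" are hypothetical stand-ins for trained decision trees, and the two-dimensional feature vectors are toy values rather than the 14 features of Algorithm 1.

```python
from collections import Counter


def forest_predict(trees, x):
    """Return the label that most trees vote for (assumed aggregation rule)."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


def split_profiles(trees, features_by_user):
    """Divide users into an initial real class and an initial attack class."""
    real, attack = set(), set()
    for user, x in features_by_user.items():
        (attack if forest_predict(trees, x) == 1 else real).add(user)
    return real, attack


# Hypothetical stand-ins for trained trees: label 1 means "attack profile".
trees = [
    lambda x: 1 if x[0] > 0.5 else 0,
    lambda x: 1 if x[1] > 0.5 else 0,
    lambda x: 1 if x[0] + x[1] > 1.2 else 0,
]
features_by_user = {"u1": [0.9, 0.8], "u2": [0.1, 0.2], "u3": [0.9, 0.1]}
real, attack = split_profiles(trees, features_by_user)
```

Using an odd number of trees, as here, avoids tied votes in the two-class case.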
The algorithm of step S2 is named Algorithm 2 (attack profile detection algorithm APDA_RF), and is specified as follows:
Input: training set Train, set of users to be detected Test;
Output: user class label set T_result.
Figure BDA0003060518240000101
In Algorithm 2, lines 1-3 compute the feature matrices of the training set and the test set from the 14 features provided by Algorithm 1; line 4 trains the random forest classifier; lines 5-6 use the trained classifier to classify the feature matrix of the test set, obtain T_result, and finally return the prediction result T_result.
Further, step S3: identifying the initial attack profile class detected in the step S2 through target item identification to realize attack profile detection in the second stage;
The random forest based attack profile detection algorithm (Algorithm 2) preliminarily divides user profiles into two classes, but some normal users may be mistakenly detected as attack users in the process. To make the detection result more accurate, an Attack Profile Detection Algorithm based on Target Item Identification (APDA_TII) is further proposed. Specifically,
S301: Compute the mean rating of each item in the initial attack profile class and take the item with the largest mean as the target item;
S302: Check in turn the users in the user set of the initial attack profile class, find all users who gave the target item the highest rating, identify them as the final attack profiles, and complete the second-stage detection.
The algorithm of step S3 is named Algorithm 3 (attack profile detection algorithm APDA_TII), and is specified as follows:
Input: initial attack profile class T_sus, item set I_sus, highest rating max;
Output: attack user profile set C_attack.
Figure BDA0003060518240000111
Figure BDA0003060518240000121
In Algorithm 3, lines 1-7 determine the target item: the mean rating of each item is computed, and the item with the largest mean is taken as the target item; lines 8-13 find the attack users and remove them from the initial attack profile class; line 14 returns the attack user profile set C_attack.
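The two steps of APDA_TII can be sketched as follows. This is an illustration under stated assumptions, not the patent's Algorithm 3: the mean of an item is taken over the suspected users who rated it, ties between items are broken by dictionary order, and the function and variable names are hypothetical.

```python
def identify_target_item(suspects):
    """Return the item with the largest mean rating among suspected profiles (S301).

    suspects maps user -> {item: rating}. The mean is taken over the users
    who rated the item, which is an assumption about the patent's intent.
    """
    totals, counts = {}, {}
    for user_ratings in suspects.values():
        for item, rating in user_ratings.items():
            totals[item] = totals.get(item, 0) + rating
            counts[item] = counts.get(item, 0) + 1
    return max(totals, key=lambda i: totals[i] / counts[i])


def final_attack_profiles(suspects, max_rating=5):
    """Keep only suspects who gave the target item the highest rating (S302)."""
    target = identify_target_item(suspects)
    attackers = {u for u, rs in suspects.items() if rs.get(target) == max_rating}
    return target, attackers


# Toy initial attack profile class: n1 is a normal user flagged by mistake.
suspects = {
    "s1": {"x": 5, "a": 3},
    "s2": {"x": 5, "b": 2},
    "s3": {"x": 5, "a": 4},
    "n1": {"a": 4, "b": 3},
}
target, attackers = final_attack_profiles(suspects)
```

In the toy data, item `x` has the largest mean rating, and user `n1`, who never rated `x`, is dropped from the final attack profile set.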
Further, step S4: construct a robust recommendation algorithm from the attack profile detection result to realize robust recommendation based on attack detection.
To preserve the recommendation accuracy of the algorithm, a PSO-based feature matrix initialization algorithm (IFM_PSO) is adopted to improve the ability of model training to reach the optimal solution, and the robust recommendation algorithm RRA-RFTII is constructed by combining the attack profile detection result obtained in step S3.
A user's rating of an item is expressed as a linear model. Assuming that several implicit classification features exist, the rating of a user for an item is expressed as a linear combination of the degree to which the item belongs to each implicit classification feature and the user's degree of preference for each implicit classification feature. The linear model is expressed by the formula
Figure BDA0003060518240000122
where
Figure BDA0003060518240000123
denotes the n x m matrix of predicted ratings,
Figure BDA0003060518240000124
is the f x m user feature matrix, whose vector p_u (u = 1, 2, ..., m) denotes the degree of preference of user u for each implicit class;
Figure BDA0003060518240000125
is the f x n item feature matrix, whose vector q_i (i = 1, 2, ..., n) denotes the degree to which item i belongs to each implicit class. Solving the least squares problem
Figure BDA0003060518240000131
by gradient descent yields q_i and p_u, and thereby the user feature matrix and the item feature matrix.
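The formulas in this passage survive only as image references. Given the stated dimensions (an n x m prediction matrix, an f x m user matrix P and an f x n item matrix Q) and the regularization term and update rules given in step S404, one consistent reconstruction is the following. The symbol K for the set of observed (user, item) pairs is an assumption introduced here for illustration.

```latex
\begin{align}
\hat{R} &= Q^{\mathsf{T}} P, \qquad \hat{r}_{ui} = q_i^{\mathsf{T}} p_u \\
\min_{P,\,Q} \; & \sum_{(u,i) \in K} \left( r_{ui} - q_i^{\mathsf{T}} p_u \right)^2
  + \lambda \left( \lVert q_i \rVert^2 + \lVert p_u \rVert^2 \right)
\end{align}
```

With e_ui = r_ui - q_i^T p_u, the gradient of one summand with respect to p_u is -2 e_ui q_i + 2 λ p_u. A descent step with the factor 2 absorbed into the step size γ gives p_u ← p_u + γ (q_i e_ui - λ p_u), and the symmetric argument gives the update for q_i.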
Specifically, S401: based on the initial real profile class obtained in the step S2 and the final attack profile class obtained in the step S3, initializing a feature matrix by using a PSO method to obtain an initial user feature matrix and an item feature matrix;
s402: constructing an indication function I according to the detection result of the final attack profile in the step S3S(u),
Figure BDA0003060518240000132
In the formula, S is an attack user set, and U is a whole user set;
S403: combine the indicator function $I_S(u)$ with the item feature vector $q_i$ in step S102 to obtain $q_i \leftarrow q_i + I_S(u)\,\gamma\,(p_u e_{ui} - \lambda q_i)$;
S404: iteratively update the initialized item feature vector $q_i$ and user feature vector $p_u$ according to $p_u \leftarrow p_u + \gamma\,(q_i e_{ui} - \lambda p_u)$ and $q_i \leftarrow q_i + I_S(u)\,\gamma\,(p_u e_{ui} - \lambda q_i)$, to obtain the predicted rating of item i by user u
$$\hat{r}_{ui} = q_i^{T}p_u$$
$$\min_{p,q}\sum_{(u,i)}\left(r_{ui}-q_i^{T}p_u\right)^2+\lambda\left(\lVert q_i\rVert^2+\lVert p_u\rVert^2\right)$$
where $\lambda(\lVert q_i\rVert^2+\lVert p_u\rVert^2)$ is a regularization term added to avoid overfitting, λ is a constant, $e_{ui} = r_{ui} - \hat{r}_{ui}$ is the difference between the true rating and the predicted rating, $r_{ui}$ is the true rating of item i by user u, and γ is the step size of gradient descent. The update is repeated until the algorithm converges, yielding the optimal user feature matrix and item feature matrix;
S405: generate recommendations for the target user according to the optimal user feature matrix and item feature matrix.
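Steps S402 to S404 can be sketched as follows. This is a minimal illustration, not the patented Algorithm 4: it assumes that the indicator function is 0 for detected attack users and 1 for all other users, and it replaces the PSO initialization of S401 with a seeded random initialization. The function names, default hyperparameters and data layout are hypothetical.

```python
import random


def train_robust_mf(ratings, attackers, m, n, f=4, gamma=0.01, lam=0.05,
                    epochs=200, seed=0):
    """Gradient-descent matrix factorization with an attack-user mask.

    ratings   -- list of (u, i, r_ui) triples with 0-based indices
    attackers -- set of user indices flagged as attack profiles (the set S)
    Returns (P, Q): P[u] and Q[i] are feature vectors of length f.
    """
    rng = random.Random(seed)
    # Assumption: random initialization stands in for the PSO step of S401.
    P = [[rng.uniform(0.0, 0.5) for _ in range(f)] for _ in range(m)]
    Q = [[rng.uniform(0.0, 0.5) for _ in range(f)] for _ in range(n)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pu, qi = P[u], Q[i]
            e = r - sum(a * b for a, b in zip(qi, pu))  # e_ui
            ind = 0.0 if u in attackers else 1.0        # assumed I_S(u)
            for k in range(f):
                pk, qk = pu[k], qi[k]
                # p_u <- p_u + gamma * (q_i * e_ui - lambda * p_u)
                pu[k] = pk + gamma * (qk * e - lam * pk)
                # q_i <- q_i + I_S(u) * gamma * (p_u * e_ui - lambda * q_i)
                qi[k] = qk + ind * gamma * (pk * e - lam * qk)
    return P, Q


def predict(P, Q, u, i):
    """Predicted rating: inner product of q_i and p_u."""
    return sum(a * b for a, b in zip(Q[i], P[u]))
```

Under this assumption, ratings from flagged users still fit their own user vectors but never move the item vectors, so injected profiles cannot shift the predictions made for genuine users.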
The algorithm in step S4 is named Algorithm 4 (the robust recommendation algorithm RRA-RFTII). Algorithm 4 is specified as follows:
Input: the user matrix R to be detected, the training set Train, the test set Test, the attack user profiles $C_{attack}$, the number of users m, the number of items n, the number of particles t, and the number of implicit classification features f;
Output: the user feature matrix P and the item feature matrix Q.
Figure BDA0003060518240000141
Figure BDA0003060518240000151
In Algorithm 4, Part 1 is lines 1-2, which obtain the initial user feature matrix and item feature matrix; Part 2 is lines 3-21, which iteratively update the feature vectors $p_u$ and $q_i$ until the algorithm converges, yielding the optimal user feature matrix and item feature matrix.
Embodiment 1:
This embodiment uses the 1M rating data set of the MovieLens movie recommendation system, which contains 1,000,209 ratings of 3,952 movies by 6,040 users. Ratings are integers from 1 to 5, and a larger value indicates a stronger preference of the user for the rated movie.
Attack profiles are generated with the average attack (AverageAttack), popular attack (PopularAttack), random attack (RandomAttack) and Aop attack (AopAttack) models under different filler sizes: {1%, 3%, 5%, 10%, 25%, 50%} for the random, average and popular attacks, and {1%, 3%, 5%, 10%} for the Aop attack.
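For orientation, the generation of a push-attack profile under the random and average attack models can be sketched as below. The embodiment does not spell out the construction, so everything here is an assumption based on the usual definitions of these models: filler items are sampled uniformly, filler ratings are drawn around the global mean (random attack) or around each item's mean (average attack) with unit standard deviation, and the target item receives the maximum rating. The function name and parameters are hypothetical, and the popular and Aop attacks are not covered.

```python
import random


def make_attack_profile(model, item_means, global_mean, target, filler_size,
                        r_min=1, r_max=5, rng=None):
    """Return {item index: rating} for one injected push-attack profile.

    model       -- "random" or "average"
    item_means  -- mean rating of every item, indexed by item
    filler_size -- fraction of all items used as filler items, e.g. 0.05
    """
    if model not in ("random", "average"):
        raise ValueError("only the random and average models are sketched")
    rng = rng or random.Random(0)
    n = len(item_means)
    candidates = [i for i in range(n) if i != target]
    k = min(len(candidates), max(1, round(filler_size * n)))
    profile = {}
    for i in rng.sample(candidates, k):
        centre = global_mean if model == "random" else item_means[i]
        rating = round(rng.gauss(centre, 1.0))
        profile[i] = min(r_max, max(r_min, rating))  # clamp to the rating scale
    profile[target] = r_max  # push the target item with the highest rating
    return profile
```

With 100 items and a filler size of 5%, the returned profile holds 5 filler ratings plus the target item.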
Table 1 below shows the configuration of the experimental data in this embodiment, which comprises 1 training set and 7 test sets. The training set is used to train the random forest classifier and contains 600 real users, 120 attack users each for the random attack, average attack and popular attack, and 80 attack users each for the 20% Aop, 30% Aop and 40% Aop attacks. The 7 test sets are used to test the performance of the attack detection algorithm and the robustness of the recommendation algorithm. Groups 1 to 6 each contain 500 real users and 60 attack users of a single type: random attack (group 1), average attack (group 2), popular attack (group 3), 20% Aop attack (group 4), 30% Aop attack (group 5) and 40% Aop attack (group 6). Group 7 contains 500 real users and 150 mixed attack (hybrid attack) users.
Table 1 Experimental data
Figure BDA0003060518240000161
This embodiment uses the mean absolute error (MAE) and the prediction shift (PS) to evaluate the recommendation accuracy and the robustness of the algorithm.
The lower the MAE, the more accurate the algorithm. The MAE is calculated as
$$MAE=\frac{1}{N}\sum_{u,i}\left|r_{ui}-\hat{r}_{ui}\right|$$
where $r_{ui}$ is the true rating of item i by user u, $\hat{r}_{ui}$ is the predicted rating of item i by user u, and N is the total number of predictions.
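A minimal sketch of the MAE computation, assuming the true and predicted ratings are supplied as two aligned lists (the function name is hypothetical):

```python
def mean_absolute_error(true_ratings, predicted_ratings):
    """Average absolute difference between true and predicted ratings."""
    if not true_ratings or len(true_ratings) != len(predicted_ratings):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(r - p) for r, p in zip(true_ratings, predicted_ratings))
    return total / len(true_ratings)
```

For example, true ratings [4, 2, 5] and predictions [3.5, 2.5, 4.0] give absolute errors 0.5, 0.5 and 1.0, so the MAE is 2/3.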
PS is the average change in the predicted rating of the target item before and after the attack; the smaller the value, the stronger the algorithm's resistance to attack. PS is calculated as
Figure BDA0003060518240000164
In the formula,
Figure BDA0003060518240000165
and
Figure BDA0003060518240000166
represent the predicted ratings of the target item by user u before and after the attack, respectively, and N is the total number of predictions.
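A minimal sketch of the PS computation, under the assumption that the change for each prediction is the signed difference between the rating predicted after the attack and the rating predicted before it. The function name and the list-based inputs are hypothetical.

```python
def prediction_shift(before, after):
    """Mean change in the predicted target-item rating caused by the attack.

    before, after -- aligned lists of predicted ratings of the target item,
                     one entry per user, before and after the attack.
    """
    if not before or len(before) != len(after):
        raise ValueError("inputs must be non-empty and of equal length")
    # Assumption: signed difference (after - before), averaged over N predictions.
    return sum(a - b for b, a in zip(before, after)) / len(before)
```

A push attack that raises every prediction for the target item by one rating point yields a PS of 1.0 under this definition.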
Precision and recall are used to evaluate the shilling attack detection performance of the algorithm. They are calculated as
$$Precision=\frac{TP}{TP+FP},\qquad Recall=\frac{TP}{TP+FN}$$
where TP is the number of attack profiles correctly detected, FP is the number of real profiles misjudged as attack profiles, and FN is the number of attack profiles not detected.
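The two detection metrics can be sketched as follows, assuming the detector output and the ground truth are given as collections of user identifiers (the function name is hypothetical):

```python
def precision_recall(detected, actual_attackers):
    """Return (precision, recall) of an attack-profile detector."""
    detected, actual = set(detected), set(actual_attackers)
    tp = len(detected & actual)  # attack profiles correctly detected
    fp = len(detected - actual)  # real profiles misjudged as attacks
    fn = len(actual - detected)  # attack profiles not detected
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Returning 0.0 when a denominator is zero is a convention chosen for this sketch; it covers the case in which a detector flags no users at all.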
To evaluate the performance of the attack detection algorithm proposed in the present invention (TS_APDA), it is compared experimentally with 3 existing attack detection algorithms.
(1) SVM_APDA: uses the 13 detection features, such as RDMA and FMD, proposed in the literature "Defending recommender systems: detection of profile injection attacks" [J] (Williams C A, Mobasher B, Burke R; Service Oriented Computing and Applications, 2007, 1(3): 157-170) and "Detecting profile injection attacks in collaborative filtering: a classification-based approach" [C] // Proceedings of the 8th Knowledge Discovery on the Web International Conference on Advances in Web Mining and Web Usage Analysis (Williams C A, Mobasher B, Burke R, et al.; Berlin: Springer, 2007: 167-186), and trains an SVM classifier to perform attack detection on user profiles.
(2) KNN_APDA: uses the detection features proposed in the same literature and classifies user profiles with a KNN classifier.
(3) C4.5_APDA: uses the detection features proposed in the same literature and classifies user profiles with a C4.5 decision tree classifier.
To evaluate the recommendation accuracy and robustness of the recommendation algorithm RRA-RFTII proposed in the present invention, it is compared experimentally with the following existing recommendation algorithms.
(1) MMF: the matrix factorization method based on M-estimators proposed in the literature (Mehta B, Hofmann T, Nejdl W. Robust collaborative filtering [C]. Proceedings of the 2007 ACM Conference on Recommender Systems, RecSys, Minneapolis, MN, USA, 2007: 49-56.).
(2) LTSMF: the matrix factorization method based on the least trimmed squares estimator proposed in the literature (Cheng Z, Hurley N. Robust collaborative recommendation by least trimmed squares matrix factorization [C]. Proceedings of the 2010 22nd IEEE International Conference on Tools with Artificial Intelligence (ICTAI), Arras, France, 2010: 105-.).
(3) KMCQR-M: the robust recommendation algorithm based on incremental clustering and matrix factorization proposed in the literature (Xu Yu-chen, Liu Zhen, Zhang Fu-zhi. Robust recommendation on incremental clustering and matrix factorization [J]. Journal of Chinese Computer Systems, 2015, 36(04): 689-.).
The comparison results are as follows:
(1) Comparison of precision and recall
Figs. 3 and 4 show the precision and recall of the attack detection algorithms TS_APDA, SVM_APDA, KNN_APDA and C4.5_APDA when detecting attack profiles on the 7 test sets.
As can be seen from Fig. 3, the detection precision of the TS_APDA algorithm is close to 1 for every attack type. The precision of the SVM_APDA algorithm is between 0.5 and 0.65 for the random, average, popular and mixed attacks, and it fails to detect the Aop attack. The precision of the KNN_APDA algorithm is between 0.4 and 0.52 for the random, average, popular and mixed attacks, and it also fails to detect the Aop attack. The precision of the C4.5_APDA algorithm is between 0.02 and 0.26 for all attack types. Overall, the detection precision of TS_APDA on the 7 test sets is clearly higher than that of the other three algorithms. The main reason is that the first detection stage uses, in addition to the 13 detection features from the prior art, the chi-square value feature of unpopular items proposed on the basis of chi-square statistics, and the extracted features are used to train the random forest classifier, which improves the precision of the classifier to some extent. In addition, the second detection stage applies target item identification to the class containing attack profiles from the first stage, which makes the detection result more accurate.
As can be seen from Fig. 4, for the random, average and popular attacks the recall of TS_APDA reaches 1, and the recall of the other three detection algorithms is also close to 1. For the Aop attack and the mixed attack, the recall of TS_APDA exceeds 0.7 and 0.8 respectively, whereas the recall of the other three algorithms drops markedly; in particular, the recall of SVM_APDA and KNN_APDA on the Aop attack is 0. Overall, TS_APDA is the most capable of the four detection algorithms at identifying attack profiles.
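The chi-square value feature of unpopular items credited above for the first-stage precision can be sketched as follows. The sketch assumes the standard chi-square statistic of a 2 × 2 contingency table built from the four counts named in step S104 (items inside or outside the unpopular set, rated or not rated by user u); the function and variable names are hypothetical.

```python
def unpopular_chi_square(rated_items, unpopular_items, n_items):
    """Chi-square association between 'item is unpopular' and 'user rated it'."""
    rated, unpop = set(rated_items), set(unpopular_items)
    a = len(rated & unpop)     # unpopular and rated by the user
    b = len(rated - unpop)     # popular and rated by the user
    c = len(unpop - rated)     # unpopular and not rated by the user
    d = n_items - a - b - c    # popular and not rated by the user
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0             # degenerate table: no measurable association
    # Assumed form: N * (AD - BC)^2 / ((A+B)(C+D)(A+C)(B+D))
    return n_items * (a * d - b * c) ** 2 / denom
```

A profile whose ratings are concentrated on unpopular items produces a large value, while a profile that rates popular and unpopular items in proportion to their share of the catalogue produces a value near 0.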
(2) Comparison of MAE and PS
Figs. 5 and 6 show the MAE and PS of the four recommendation algorithms RRA-RFTII, MMF, LTSMF and KMCQR-M on the 7 test sets.
As can be seen from Fig. 5, the MAE values of the MMF, LTSMF and KMCQR-M algorithms are all above 0.7 under every attack type, whereas the MAE of the RRA-RFTII algorithm approaches 0.7 only under the mixed attack. Since a smaller MAE means higher recommendation accuracy, the RRA-RFTII algorithm proposed in the present invention is the most accurate of the four. The main reason is that RRA-RFTII does not generate the initial feature matrices randomly before gradient descent; instead it obtains the initial user feature matrix and item feature matrix by particle swarm optimization, which improves the ability of model training to reach the optimal solution and safeguards the recommendation accuracy.
As can be seen from Fig. 6, under the random, average and popular attacks the PS values of the RRA-RFTII algorithm are small, while those of the other three algorithms are larger. Under the Aop and mixed attacks the attack resistance of all four algorithms declines, but the PS value of RRA-RFTII remains lower than that of the other three. Since a smaller PS value means better robustness, RRA-RFTII is the most robust of the four algorithms. The main reason is that RRA-RFTII adopts a two-stage attack profile detection algorithm and can effectively detect attack profiles before recommendation.
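The second detection stage (target item identification, steps S301 and S302) can be sketched as follows. The sketch assumes that the mean rating of an item is taken over the suspected users who rated it and that "highest score" means the maximum of the rating scale; the function name and the dictionary layout are hypothetical.

```python
def second_stage_detection(suspect_profiles, r_max=5):
    """Refine the initial attack profile class by target item identification.

    suspect_profiles -- {user: {item: rating}} for the initial attack class
    Returns (target_item, final_attack_users).
    """
    totals, counts = {}, {}
    for ratings in suspect_profiles.values():
        for item, r in ratings.items():
            totals[item] = totals.get(item, 0.0) + r
            counts[item] = counts.get(item, 0) + 1
    if not totals:
        return None, set()
    # S301: the item with the largest mean rating is taken as the target item.
    target = max(totals, key=lambda i: totals[i] / counts[i])
    # S302: users who gave the target item the highest rating are kept.
    final = {u for u, ratings in suspect_profiles.items()
             if ratings.get(target) == r_max}
    return target, final
```

Users in the initial attack class who did not give the target item the highest rating are released again. If several items tie on the largest mean, this sketch keeps the first one encountered; a practical implementation might additionally require a minimum number of raters per item.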
The foregoing shows and describes the general principles, essential features and advantages of the invention. Those skilled in the art will understand that the present invention is not limited to the embodiments described above; the embodiments and the specification merely illustrate the principle of the invention. Various changes and modifications may be made without departing from the spirit and scope of the invention, and all such changes and modifications fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and their equivalents.

Claims (6)

1. An attack detection robust recommendation method based on random forest and target item identification, characterized by comprising the following steps:
S1: extracting, based on chi-square statistics theory, effective features capable of distinguishing normal users from attack users from the rating data;
S2: training a random forest classifier based on the effective features extracted in step S1, and performing first-stage detection on the user set to be detected with the trained random forest classifier to obtain a first-stage user profile detection result;
S3: identifying the initial attack profile class obtained in step S2 through target item identification, to realize the second-stage attack profile detection;
S4: constructing a robust recommendation algorithm according to the attack profile detection result, to realize attack detection robust recommendation.
2. The attack detection robust recommendation method based on random forest and target item identification as claimed in claim 1, wherein the specific operation of step S1 comprises the following steps:
S101: letting $U = \{u_1, u_2, ..., u_m\}$ denote the set of all users and $I = \{i_1, i_2, ..., i_n\}$ denote the set of all items; for an item i ∈ I and a user u ∈ U, if
Figure FDA0003060518230000011
item i is counted as rated once, and the number of ratings of item i by all users in U is counted, namely
Figure FDA0003060518230000012
S102: calculating the item popularity of all items
Figure FDA0003060518230000013
and sorting the n items in descending order of popularity;
S103: according to the sorting result, dividing the n items into two sets by a 10-fold cross validation method, one being the popular item set $I_{POPU}$ and the other the unpopular item set $I_{UNPOPU}$;
S104: treating the item and the user as two statistical variables whose value ranges are {popular item, unpopular item} and {rated by the user, not rated by the user} respectively, and calculating the degree of association between unpopular items and user u, namely the chi-square value of the unpopular items,
Figure FDA0003060518230000021
where
Figure FDA0003060518230000022
denotes the number of items that belong to the set $I_{UNPOPU}$ and are rated by user u,
Figure FDA0003060518230000023
denotes the number of items that do not belong to the set $I_{UNPOPU}$ and are rated by user u,
Figure FDA0003060518230000024
denotes the number of items that belong to the set $I_{UNPOPU}$ and are not rated by user u,
Figure FDA0003060518230000025
denotes the number of items that do not belong to the set $I_{UNPOPU}$ and are not rated by user u, and N denotes the total number of items;
S105: combining the chi-square value feature of unpopular items obtained in step S104 with the 13 detection features including WDMA, RDMA, WDA, LengthVariance, DegSim', FMV, FAC, FMD and PV to form a feature matrix V of user feature vectors, which serves as the effective features for distinguishing normal users from attack users.
3. The attack detection robust recommendation method based on random forest and target item identification as claimed in claim 2, wherein the specific operation of step S2 comprises the following steps:
S201: dividing the original user rating data into two parts in proportion, one part serving as the training set for training the random forest classifier and the other part serving as the user set to be detected;
S202: calculating the feature matrices of the training set and of the user set to be detected respectively, according to the feature matrix V extracted in step S1;
S203: constructing and training a random forest classifier with the training set data;
S204: detecting the user set to be detected with the trained random forest classifier and outputting the classification prediction result to obtain the first-stage user profile detection result, which is preliminarily divided into an initial real profile class and an initial attack profile class.
4. The attack detection robust recommendation method based on random forest and target item identification as claimed in claim 3, wherein the specific operation of step S203 comprises the following steps:
S2031: assuming that the data set of the training set contains t samples, randomly drawing k subsets from the data set with the Bootstrap resampling technique and training k decision trees respectively, wherein each sample in the training subsets contains m attributes;
S2032: whenever a node of a decision tree needs to be split, randomly selecting s attributes (s << m) from the m attributes, selecting one of the s attributes as the splitting attribute of the node, and repeating the splitting process until the stop condition is met;
S2033: training the k Bootstrap sample sets into k decision tree models in the manner of step S2032, and finally combining all generated decision trees into a random forest classifier $\{T_i,\ i = 1, 2, …, k\}$.
5. The attack detection robust recommendation method based on random forest and target item identification as claimed in claim 3, wherein the specific operation of step S3 comprises the following steps:
S301: calculating the mean rating of each item in the initial attack profile class, and confirming the item with the largest mean as the target item;
S302: checking in turn the users in the user set containing the initial attack profiles, finding all users who gave the target item the highest rating, identifying them as the final attack profiles, and completing the second-stage detection.
6. The attack detection robust recommendation method based on random forest and target item identification as claimed in claim 5, wherein the specific operation of step S4 comprises the following steps:
S401: based on the initial real profile class obtained in step S2 and the final attack profile class obtained in step S3, initializing the feature matrices with the PSO method to obtain an initial user feature matrix and an initial item feature matrix;
S402: constructing an indicator function $I_S(u)$ according to the detection result of the final attack profiles in step S3,
Figure FDA0003060518230000041
where S is the user set corresponding to the final attack profiles and U is the set of all users;
S403: combining the indicator function $I_S(u)$ with the item feature vector $q_i$ to obtain $q_i \leftarrow q_i + I_S(u)\,\gamma\,(p_u e_{ui} - \lambda q_i)$;
S404: iteratively updating the initial user feature matrix and item feature matrix with the formulas $p_u \leftarrow p_u + \gamma\,(q_i e_{ui} - \lambda p_u)$ and $q_i \leftarrow q_i + I_S(u)\,\gamma\,(p_u e_{ui} - \lambda q_i)$ until the algorithm converges, to obtain the optimal user feature matrix and item feature matrix;
S405: generating recommendations for the target user according to the optimal user feature matrix and item feature matrix.
CN202110511665.9A 2021-05-11 2021-05-11 Attack detection robust recommendation method based on random forest and target item identification Pending CN113343079A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110511665.9A CN113343079A (en) 2021-05-11 2021-05-11 Attack detection robust recommendation method based on random forest and target item identification

Publications (1)

Publication Number Publication Date
CN113343079A true CN113343079A (en) 2021-09-03

Family

ID=77470706

Country Status (1)

Country Link
CN (1) CN113343079A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114397244A (en) * 2022-01-14 2022-04-26 长春工业大学 Method for identifying defects of metal additive manufacturing part and related equipment
CN116796326A (en) * 2023-08-21 2023-09-22 北京遥感设备研究所 SQL injection detection method
CN116796326B (en) * 2023-08-21 2023-11-14 北京遥感设备研究所 SQL injection detection method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination