E-commerce water army identification method based on interval segmentation
Technical Field
The invention belongs to the field of computer network information technology and involves the precise analysis of big data. It relates to a method for evaluating user credit from the scoring characteristics of users in an electronic commerce system, and in particular to a method that evaluates and calculates user credit by analyzing a weighted user-commodity bipartite graph.
Background
With the development of the economy and of network technology, completing commodity transactions on the internet has become more convenient and more affordable, which has greatly promoted the development of electronic commerce. During a transaction, commodity scores are an important criterion for a user's choice, so a reliable scoring system needs to be established. Current reputation scoring systems have many problems. The most common is scoring deviation caused by random or malicious scoring by users; in particular, a water army may be hired to post large numbers of good or bad scores in order to raise or lower the score of a certain commodity, which can seriously mislead consumers. Water army groups are numerous, their scores are not based on objective facts, and they are well concealed among a large number of normal users, so they adversely affect commodity scores. This is detrimental to the development of e-commerce platforms and seriously disturbs the normal order of online commodity trading. It is therefore very important to construct a stable and reliable user and commodity reputation system. An algorithm that can eliminate malicious users and evaluate the real quality of a commodity would be very beneficial to the development of e-commerce and of society as a whole.
A user-commodity reputation system requires a large amount of user scoring data. The quality of a commodity is reflected by quantifying each user's influence on the commodity, and the user's reputation is computed accordingly. Although a water army is well concealed, a high-quality algorithm can analyze the difference between water army users and normal users from their scoring records and screen the water army out. There are two typical kinds of water army group: the random water army group and the extreme water army group. The random water army group scores commodities randomly, regardless of their quality, for reasons such as insufficient understanding of the commodities. The extreme water army group gives commodities the highest or lowest score in order to disrupt normal scoring. To screen out these two typical groups, a number of quality and reputation evaluation algorithms have emerged in recent years, supported by a large number of experiments.
Correlation-based methods: Laureti et al. proposed an iterative refinement (IR) method in which the reputation of a user is inversely proportional to the difference between his scores and the quality of the corresponding objects; user reputation and commodity quality are calculated iteratively until they become stable. Zhou et al. proposed a correlation-based ranking (CR) method that is robust against malicious user attacks, in which the reputation of a user is determined by the correlation coefficient between his scores and the estimated quality of the objects. Liao et al. further improved the CR method by introducing a reputation redistribution process and two penalty factors.
Group-based methods: Gao et al. proposed a group-based ranking (GR) method. In 2015, Gao et al. proposed an iterative group-based ranking (IGR) method, which adds an iterative procedure on top of GR. Iteration has been widely used in subsequent studies. Wu et al. also applied iteration in a method that eliminates the deviation of user scores (IBR); it divides user scoring deviation into three categories (negative, positive, and no influence) and corrects the commodity quality by the users' deviation. Although deviation-based algorithms work well against extreme user attacks, they bring no significant improvement against random users.
Methods based on a specific distribution assumption: Lee Daekyung et al. proposed a deviation-based random malicious user screening (DR) method, and Wuying-Ying et al. proposed the Bayesian ranking (BR) algorithm based on a Beta distribution assumption. The above algorithms perform poorly and lack robustness when the data volume is large and the data are sparse.
Disclosure of Invention
The invention aims to provide an e-commerce water army identification method based on interval segmentation, which identifies both extreme water army groups and random water army groups well, largely retains its identification capability on big data, and has strong robustness and strong expansibility.
In order to identify the e-commerce water army more accurately, the solution of the invention is as follows:
an e-commerce water army identification method based on interval segmentation comprises the following steps:
step 1, defining a triple data structure G = {i, α, r}, whose elements respectively represent the user, the commodity and the score;
step 2, assuming that the scores of the users obey normal distribution, and calculating the mean value and the variance of the historical scores of the commodities;
step 3, respectively calculating the accuracy of the user's scores, the sum of the distances from each score of the user to the corresponding score interval, and the range of the user's scores;
in the step 3, the method for calculating the accuracy of the user score includes:
step A31, standardizing the user score by using a Z-score method;
step A32, classifying the scores after the standardization treatment, and determining a score interval, wherein the scores in the score interval are considered to be correct scores, otherwise, the scores are considered to be incorrect scores;
step A33, counting the times of correct scoring and incorrect scoring of the user, thereby calculating the correct scoring rate of the user;
step 4, combining the accuracy and the range of the user's scores with the sum of the distances from each score of the user to the corresponding score interval to obtain the reputation of the user;
step 5, sorting the user reputations, and selecting the N users with the lowest reputations as the water army, wherein N is a set value.
In the above step 2, the mean $\mu_\alpha$ of the historical scores of commodity $\alpha$ is calculated according to the following formula:
$$\mu_\alpha = \frac{1}{|U_\alpha|} \sum_{i \in U_\alpha} r_{i\alpha}$$
wherein $U_\alpha$ represents the set of users who purchased commodity $\alpha$, $|U_\alpha|$ represents the number of users who purchased commodity $\alpha$, and $r_{i\alpha}$ represents the score given by user $i$ to commodity $\alpha$;
the variance $\sigma_\alpha^2$ of the historical scores of commodity $\alpha$ is calculated according to the following formula:
$$\sigma_\alpha^2 = \frac{1}{|U_\alpha|} \sum_{i \in U_\alpha} \left( r_{i\alpha} - \mu_\alpha \right)^2$$
wherein $U_\alpha$, $|U_\alpha|$ and $r_{i\alpha}$ are as defined above, and $\mu_\alpha$ represents the mean of the historical scores of commodity $\alpha$.
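The specific content of step A31 is:
the normalized score $r'_{i\alpha}$ of commodity $\alpha$ is calculated according to the following formula:
$$r'_{i\alpha} = \frac{r_{i\alpha} - \mu_\alpha}{\sigma_\alpha}$$
wherein $r_{i\alpha}$ represents the score given by user $i$ to commodity $\alpha$, $\mu_\alpha$ represents the mean of the historical scores of commodity $\alpha$, and $\sigma_\alpha$ represents the standard deviation of the historical scores of commodity $\alpha$, that is, the square root of the variance.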
The specific content of step A33 is: counting the number of correct scores $s_i$ and the number of incorrect scores $f_i$ of the user, and then calculating the accuracy of the user's scores using the following formula:
$$\eta_i = \frac{s_i}{s_i + f_i}$$
wherein $\eta_i$ represents the accuracy of the scores of user $i$.
In step 3, the formula for calculating the sum of the distances from each score of the user to the corresponding score interval is as follows:
$$d_i = \frac{\sum_{\alpha \in O_i} \left| r'_{i\alpha} - bd \right| + 0.001}{f_i + 1}$$
wherein bd represents the boundary of the scoring interval determined in step A32, i.e. -1 or 1: when $r'_{i\alpha} \le -1$, bd = -1; when $r'_{i\alpha} \ge 1$, bd = 1; otherwise bd = $r'_{i\alpha}$; $d_i$ represents the sum of the distances from each score of user $i$ to the corresponding score interval; $O_i$ represents the set of commodities scored by user $i$; and $f_i$ represents the number of incorrect scores of user $i$.
In step 3, the formula for calculating the range of the user's scores is as follows:
$$V_i = \max_r t_{ir} - \min_r t_{ir}$$
wherein $V_i$ represents the range of the scores of user $i$, and $t_{ir}$ represents the number of times user $i$ gave the score $r$.
In step 4, the user reputation calculation formula is:
wherein $R_i$ represents the reputation of user $i$, $d_i$ represents the sum of the distances from each score of user $i$ to the corresponding score interval, $\eta_i$ represents the accuracy of the scores of user $i$, and $V_i$ represents the range of the scores of user $i$.
In step 5, the user reputations are sorted from small to large by the bubble sort method; users with the same reputation are ordered by user number, with larger user numbers placed before smaller ones.
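The ordering in step 5 can be sketched as follows. This is an illustrative sketch, not the patented implementation: the function name `lowest_reputation_users` and the toy reputation values are hypothetical, Python's built-in sort stands in for bubble sort (both give the same ascending order), and the tie-break direction (larger user number first) is an assumed reading of the text.

```python
def lowest_reputation_users(reputation, n):
    """Return the n user numbers with the lowest reputation.

    Sorting key: reputation ascending, then user number descending
    (the tie-break direction is an assumption).
    """
    ranked = sorted(reputation.items(), key=lambda kv: (kv[1], -kv[0]))
    return [user for user, _rep in ranked[:n]]


# Hypothetical reputations keyed by user number.
reputation = {1: 0.4, 2: 0.1, 3: 0.4, 4: 0.9}
water_army = lowest_reputation_users(reputation, 3)
print(water_army)
```

A key-based sort is used here only for brevity; any sort that yields the same ordering is equivalent for the purpose of selecting the first N users.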
The recall rate is used to verify the effectiveness of the method:
a certain number of random-scoring water army users and extreme-scoring water army users are simulated and marked, the L users with the lowest reputation are obtained according to steps 1-5, and the proportion of marked users found among these L users, namely the recall rate, is calculated:
$$R_c(L) = \frac{d'(L)}{d}$$
wherein $R_c(L)$ represents the recall rate, $d'(L)$ represents the number of marked artificial water army users among the selected L users, and $d$ represents the number of artificial water army users that were set; the higher the value of $R_c(L)$, the more artificial water army users are recognized and the better the effect.
The AUC is used to verify the accuracy of the user reputations calculated in steps 1-5: the reputation of each artificial water army user is compared in turn with the reputation of each other user, the number of times N' that it is lower than the reputation of the normal user and the number of times N'' that it is equal to the reputation of the normal user are counted, and the AUC is calculated according to the following formula:
$$AUC = \frac{N' + 0.5\,N''}{N}$$
wherein $N = N_1 \times N_2$, $N_1$ is the number of artificial water army users, and $N_2$ is the number of other users;
when the reputations of the artificial water army users are all equal to those of the normal users, so that selection is effectively random, the AUC is 0.5; when the reputations of the artificial water army users are all lower than those of the normal users, the AUC is 1; the AUC therefore lies in the interval [0.5, 1]. The higher the AUC, the more often the reputation of an artificial water army user is the lower one, that is, the higher the degree of identification.
After adopting the scheme, the invention has the following characteristics:
(1) the algorithm adopted by the invention assumes that the scores of a commodity follow a normal distribution and standardizes them by Z-score, so that the scoring interval of the commodity is determined and the scoring accuracy and range of each user are calculated; finally, the sum of the distances from the user's scores to the scoring intervals of the corresponding commodities is calculated, and the reputation of the user is obtained by combining it with the scoring accuracy and the range. The larger the scoring distance of a user and the lower the scoring accuracy, the more likely the user is a member of a water army group;
(2) the method provided by the invention has been tested, and the results show that it performs well in user reputation calculation and malicious user filtering. In addition, the invention integrates a specific distribution assumption into the reputation calculation process. Even if some water army users already exist in the data set, the artificially generated random water army group can still be screened out well.
(3) The method does not relate to an iterative process, is low in time complexity and easy to expand, and can be applied to the fields of fraud detection, water army identification and the like.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is an exemplary illustration of the method of the present invention on small matrix data;
wherein (a) represents the original weighted bipartite graph, (b) represents the user-object scoring matrix, (c) represents the variance and mean of the scores in (b), (d) represents the new scoring matrix obtained by standardizing the matrix in (b) using Z-score, (e) represents the scoring rationality matrix, which indicates whether each score in the new scoring matrix is correct, (f) represents the rationality statistics matrix, (g) represents the user scoring accuracy matrix, (h) represents the user distance matrix, (i) represents the user score count matrix, (j) represents the range matrix, and (k) represents the user reputation matrix;
FIG. 3 is a plot of recall rate versus L value for different datasets in accordance with the present invention;
wherein (a) represents an extreme water army group and (b) represents a random water army group, the number of water army users is 100, and the abscissa represents the L selected users with the lowest reputation;
FIG. 4 shows the variation of recall rate with the value of L on different data sets for the invention and other reputation evaluation methods, with the number of water army users being 100 and L varying from 0 to 300;
wherein (a), (b) and (c) show the recall rate of each algorithm for an extreme water army group, and (d), (e) and (f) show the recall rate of each algorithm for a random water army group; the variation range of p (p represents the proportion of the water army among all users) is 0-50 percent;
FIG. 5 shows the variation of recall rate with the proportion of the water army group on different data sets for the invention and other reputation evaluation methods, wherein the value of L is by default equal to the number of water army users;
FIG. 6 shows the variation of AUC with the proportion of the water army group on the Netflix data set for the invention and other reputation evaluation methods;
wherein (a) represents an extreme water army group, and (b) represents a random water army group.
Detailed Description
The technical solution of the present invention will be described in detail below with reference to the accompanying drawings.
As shown in fig. 1, the invention provides an e-commerce water army identification method based on interval segmentation, and the idea is as follows:
(1) Data storage using triples. Because the user scoring matrix is sparsely distributed, a triple data structure G = {i, α, r} is defined, in which i, α and r represent the user, the commodity and the score in turn. Each triple represents one scoring action by a user, and this storage structure covers all scoring actions with a smaller amount of stored data;
(2) Calculating the mean and variance of the historical scores of each commodity. The scores of a commodity are assumed to follow a normal distribution, and the mean and variance are calculated to facilitate data standardization;
(3) Normalizing the scoring matrix by the Z-score method. The scoring matrix is normalized using the mean and variance of the historical scores, for further processing in the following steps;
(4) Classifying the user scores. The normalized scores follow the standard normal distribution, for which 68.268949% of the area under the density curve lies within one standard deviation of the mean; if a normalized score is in [-1, 1], it is a correct score, otherwise it is an incorrect score;
(6) Calculating the accuracy of the user's scores. According to the rationality of the user's scores, the number of correct scores and the number of incorrect scores of the user are counted, and the accuracy of the user's scores is then calculated;
(7) Calculating the sum of the user's scoring distances. Based on step (2), the sum of the distances from the user's scores to the correct scoring interval is calculated using the normalized scores. The larger the sum of the scoring distances, the less reliable the user's scores;
(8) Calculating the range of the user's scores. Based on step (2), the range of the user's scores is calculated according to the range formula; it is one of the evaluation indexes of user reputation;
(9) Calculating the user reputation. Because a user's scores may be unreasonable while the sum of the scoring distances is still small, the result needs to be corrected using the accuracy and range of the user's scores; the sum of the scoring distances is combined with the accuracy and the range to give the reputation of the user;
(10) Selecting the water army. The reputation values of all users obtained in step (9) are sorted from small to large. If the number of artificially generated water army users is N, the first N users are taken as the water army and compared with the actual water army to obtain the identification accuracy of the method.
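The 68.268949% figure quoted in item (4) can be checked numerically. The sketch below uses the standard identity P(|Z| <= k) = erf(k / sqrt(2)) for a standard normal variable Z; the helper name `prob_within` is hypothetical.

```python
from math import erf, sqrt


def prob_within(k):
    """Probability that a standard normal variable lies in [-k, k]."""
    return erf(k / sqrt(2))


# Share of normalized scores expected in the "correct" interval [-1, 1].
p_correct = prob_within(1.0)
print(f"{100 * p_correct:.6f}%")  # 68.268949%
```

Under the normality assumption, roughly one score in three is expected to fall outside [-1, 1] even for an honest user, which is why the method looks at accuracy over a user's whole scoring history rather than at single scores.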
Specifically, in the embodiment, the following steps are mainly included, and for convenience of understanding, the following steps will be described in conjunction with the small matrix test data in fig. 2:
Step 1, data are stored as triples. Because the user scoring matrix is sparsely distributed and storing it as a matrix occupies too much space, a triple data structure G = {i, α, r} is defined, whose elements represent the user, the commodity and the score in turn. Each triple represents one scoring action by a user, and this storage structure covers all scoring actions with a smaller amount of stored data. For convenience of illustration, however, fig. 2(b) uses a matrix to store the user scoring information.
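The storage saving of triples over a dense matrix can be illustrated with a small sketch. The `Rating` type and the toy records below are hypothetical and are not the data of fig. 2.

```python
from collections import namedtuple

# One record per scoring action: user i, commodity alpha, score r.
Rating = namedtuple("Rating", ["user", "item", "score"])

# Hypothetical toy data; the values are illustrative only.
ratings = [
    Rating(0, 0, 5), Rating(0, 2, 3),
    Rating(1, 0, 4), Rating(1, 1, 1),
    Rating(2, 2, 2),
]

num_users = len({r.user for r in ratings})
num_items = len({r.item for r in ratings})

# A dense matrix needs one cell per (user, item) pair, rated or not.
dense_cells = num_users * num_items
# The triple list needs one record per actual scoring action.
triple_records = len(ratings)

print(dense_cells, triple_records)
```

The gap grows with sparsity: the number of dense cells scales with users times commodities, while the number of triples scales only with the number of scores actually given.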
Step 2, initially, a user scoring matrix is initialized and the scores of the users are assumed to be normally distributed. We first calculate the mean $\mu_\alpha$ of the historical scores of each commodity, as shown in formula (1):
$$\mu_\alpha = \frac{1}{|U_\alpha|} \sum_{i \in U_\alpha} r_{i\alpha} \tag{1}$$
In formula (1), $\mu_\alpha$ represents the mean of the historical scores of commodity $\alpha$, $U_\alpha$ represents the set of users who purchased commodity $\alpha$, $|U_\alpha|$ represents the number of users who purchased commodity $\alpha$, and $r_{i\alpha}$ represents the score given by user $i$ to commodity $\alpha$.
Step 3, calculating the variance $\sigma_\alpha^2$ of the commodity historical scores. Based on the variance formula, the calculation is as follows:
$$\sigma_\alpha^2 = \frac{1}{|U_\alpha|} \sum_{i \in U_\alpha} \left( r_{i\alpha} - \mu_\alpha \right)^2 \tag{2}$$
In formula (2), $U_\alpha$ represents the set of users who purchased commodity $\alpha$, $|U_\alpha|$ represents the number of users who purchased commodity $\alpha$, $r_{i\alpha}$ represents the score given by user $i$ to commodity $\alpha$, and $\mu_\alpha$ represents the mean of the historical scores of commodity $\alpha$. Fig. 2(c) shows the mean and the variance of the commodity historical scores obtained by formulas (1) and (2) for the small matrix.
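A minimal sketch of the per-commodity statistics follows, assuming the population form of the variance (division by the number of raters) and returning the standard deviation, its square root, for use in the later normalization. The function name `item_stats` and the sample triples are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def item_stats(triples):
    """Map each commodity to (mean, standard deviation) of its scores.

    Assumes population variance (divide by the number of raters).
    """
    by_item = defaultdict(list)
    for _user, item, score in triples:
        by_item[item].append(score)

    stats = {}
    for item, scores in by_item.items():
        mu = sum(scores) / len(scores)
        var = sum((s - mu) ** 2 for s in scores) / len(scores)
        stats[item] = (mu, sqrt(var))
    return stats


# Hypothetical (user, commodity, score) triples.
ratings = [(0, 0, 5), (1, 0, 3), (2, 0, 4), (0, 1, 1), (1, 1, 5)]
stats = item_stats(ratings)
print(stats[1])  # (3.0, 2.0)
```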
Step 4, now that the mean and the variance of the commodity historical scores are available, the scoring matrix is standardized by the Z-score method to obtain $r'_{i\alpha}$:
$$r'_{i\alpha} = \frac{r_{i\alpha} - \mu_\alpha}{\sigma_\alpha} \tag{3}$$
In formula (3), $r'_{i\alpha}$ is the standardized commodity score, $r_{i\alpha}$ represents the score given by user $i$ to commodity $\alpha$, $\mu_\alpha$ represents the mean of the historical scores of commodity $\alpha$, and $\sigma_\alpha$ represents the standard deviation of the historical scores of commodity $\alpha$, that is, the square root of the variance. Because the upper and lower limits of the score scale can be ignored after normalization, scores on a 1-5 scale and scores on a 1-10 scale can both be normalized, which makes it convenient to process more data. Fig. 2(d) is the result of normalizing the original matrix (b) by formula (3), and is referred to as matrix B.
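The normalization can be sketched as below. The function name `z_score` is hypothetical, and mapping a zero spread to 0.0 (every rater gave the same score) is an added assumption that the source does not address.

```python
def z_score(score, mu, sigma):
    """Standardize one score: (score - mu) / sigma.

    Assumption: if sigma is 0 (all raters agree), return 0.0
    instead of dividing by zero.
    """
    if sigma == 0:
        return 0.0
    return (score - mu) / sigma


# The same formula applies to a 1-5 scale and to a 1-10 scale.
print(z_score(5, 3.0, 2.0))   # 1.0
print(z_score(10, 6.0, 4.0))  # 1.0
```

Both example scores sit exactly one standard deviation above their commodity mean, so they map to the same normalized value despite coming from different scales.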
Step 5, based on the user-object standardized scoring matrix B obtained in step 4, the normalized scores of each object $\alpha$ are expected to follow the standard normal distribution N(0, 1).
Step 6, according to the analysis in step 5, the scores of the users are divided into two categories: if the normalized score is in [-1, 1], the score is a correct score, otherwise it is an incorrect score; [-1, 1] is the score interval of this embodiment. This is recorded as:
$$\delta_{i\alpha} = \begin{cases} 1, & -1 \le r'_{i\alpha} \le 1 \\ 0, & \text{otherwise} \end{cases} \tag{4}$$
wherein $\delta_{i\alpha} = 1$ indicates that user $i$ scored commodity $\alpha$ correctly, and $\delta_{i\alpha} = 0$ indicates that user $i$ scored commodity $\alpha$ incorrectly. In this way, the rationality of the user's scores is digitized to obtain the scoring rationality matrix C of fig. 2(e).
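The interval test of step 6 can be sketched as follows; `is_correct` and the sample normalized scores are hypothetical names and values chosen for illustration.

```python
def is_correct(z, lower=-1.0, upper=1.0):
    """Return 1 if the normalized score lies in [lower, upper], else 0."""
    return 1 if lower <= z <= upper else 0


# Hypothetical normalized scores keyed by (user, commodity).
normalized = {(0, 0): 0.4, (0, 1): -1.7, (1, 0): 1.0, (1, 1): 2.3}
delta = {key: is_correct(z) for key, z in normalized.items()}
print(delta)
```

The interval is closed, so a score exactly on the boundary (here 1.0) still counts as correct.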
Step 7, calculating the accuracy $\eta_i$ of the user's scores using the result of step 6:
$$\eta_i = \frac{s_i}{s_i + f_i} \tag{5}$$
In formula (5), $\eta_i$ represents the accuracy of the scores of user $i$, $s_i$ represents the number of times user $i$ scored correctly, and $f_i$ represents the number of times user $i$ scored incorrectly. Fig. 2(f) is the result of counting the number of correct scores and the number of incorrect scores, and fig. 2(g) is the result of calculating the accuracy of the user's scores.
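A short sketch of the accuracy calculation, taking the 0/1 correctness indicators of one user as input. The function name `accuracy` is hypothetical, and the sketch assumes the user has given at least one score.

```python
def accuracy(deltas):
    """Share of correct scores among all scores of one user.

    `deltas` holds one 0/1 indicator per score; it must be non-empty.
    """
    s = sum(deltas)       # number of correct scores
    f = len(deltas) - s   # number of incorrect scores
    return s / (s + f)


print(accuracy([1, 0, 1, 1]))  # 0.75
```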
Step 8, calculating the sum $d_i$ of the distances between the user's scores and the correct interval:
$$d_i = \frac{\sum_{\alpha \in O_i} \left| r'_{i\alpha} - bd \right| + 0.001}{f_i + 1} \tag{6}$$
In formula (6), bd represents the boundary of the scoring interval, i.e. 1 or -1, according to the interval used to determine scoring correctness in step 6. When $r'_{i\alpha} \le -1$, bd = -1; when $r'_{i\alpha} \ge 1$, bd = 1; otherwise bd = $r'_{i\alpha}$, so the distance is 0 when the score is correct and only the incorrect scores contribute. $d_i$ represents the sum of the distances of the user's scores from the correct interval, $O_i$ represents the set of commodities scored by user $i$, and $f_i$ represents the number of times the user scored incorrectly. Adding 0.001 to the numerator prevents a zero denominator when the reciprocal of $d_i$ is taken, which is convenient for calculation; adding 1 to the denominator prevents division by zero when the user has no incorrect score. Calculating $d_i$ gives the degree of deviation of the user. The d matrix of fig. 2(h) is the result obtained.
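The distance calculation can be sketched as follows, under the reading that the numerator sums the overshoot of each out-of-interval normalized score beyond the nearer boundary, plus 0.001, and the denominator is the number of incorrect scores plus 1. The function name `distance_sum` and its parameter names are hypothetical.

```python
def distance_sum(z_scores, bound=1.0, eps=0.001):
    """Average overshoot of one user's normalized scores beyond [-bound, bound].

    Scores inside the interval contribute 0. `eps` keeps the result
    non-zero; the +1 avoids division by zero when nothing is incorrect.
    """
    total = 0.0
    wrong = 0
    for z in z_scores:
        if z > bound:
            total += z - bound
            wrong += 1
        elif z < -bound:
            total += -bound - z
            wrong += 1
    return (total + eps) / (wrong + 1)


print(distance_sum([0.2, -0.9]))  # 0.001
```

A user whose scores all fall inside the interval gets the floor value 0.001, which keeps a later reciprocal finite.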
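Step 9, calculating the range $V_i$ of the user's scores:
$$V_i = \max_r t_{ir} - \min_r t_{ir} \tag{7}$$
In formula (7), $t_{ir}$ represents the number of times user $i$ gave the score $r$, $\max_r t_{ir}$ represents the maximum of these counts over the score values, and $\min_r t_{ir}$ represents the minimum.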
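A sketch of the range calculation follows. The text does not say whether score values that a user never gave count as zero, so this sketch assumes a fixed 1-5 scale in which unused values contribute a count of 0; the function name `score_count_range` and the `levels` parameter are hypothetical.

```python
from collections import Counter


def score_count_range(scores, levels=(1, 2, 3, 4, 5)):
    """Difference between the largest and smallest per-value score counts.

    Assumption: every value in `levels` is counted, with 0 for unused values.
    """
    counts = Counter(scores)
    per_level = [counts.get(level, 0) for level in levels]
    return max(per_level) - min(per_level)


print(score_count_range([5, 5, 5, 1]))     # 3
print(score_count_range([1, 2, 3, 4, 5]))  # 0
```

Under this assumption, a user who concentrates on one score value gets a large range, while a user who spreads scores evenly over the scale gets a range near 0.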
Step 10, finally, the reputation $R_i$ of the user is obtained:
In formula (8), $d_i$ represents the sum of the distances between the user's scores and the scoring intervals, $\eta_i$ represents the accuracy of the user's scores, and $V_i$ represents the range of the user's scores. $R_i$ is the reputation of the user.
Step 11, verifying the algorithm. In the experiment, L extreme water army users and random water army users are artificially generated and marked; the reputation values of all users are calculated by formula (8) and sorted from small to large; the first L users, which have the lowest reputation values, are taken and compared with the artificially generated water army users; and the proportion of marked water army users found among these L users, namely the recall rate (Recall), is counted. The specific formula is as follows:
$$R_c(L) = \frac{d'(L)}{d} \tag{9}$$
In formula (9), $R_c(L)$ represents the recall rate, $d'(L)$ represents the number of marked artificial water army users recognized when the list length is L, and $d$ represents the number of artificial water army users that were set. It is apparent from the formula that $d'(L)$ lies in [0, d], so $R_c(L)$ lies in [0, 1]; a higher value means that more artificial water army users are recognized, i.e. a better effect.
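The recall calculation can be sketched as below. The function name `recall_at` and the toy reputations are hypothetical; ties in reputation are left in dictionary order here, which is a simplification.

```python
def recall_at(reputation, marked, top_l):
    """Fraction of marked users found among the top_l lowest reputations."""
    ranked = sorted(reputation, key=lambda user: reputation[user])
    lowest = ranked[:top_l]
    hits = sum(1 for user in lowest if user in marked)
    return hits / len(marked)


# Hypothetical reputations; users 1 and 2 are the marked water army.
reputation = {0: 0.9, 1: 0.1, 2: 0.5, 3: 0.2, 4: 0.8}
marked = {1, 2}
print(recall_at(reputation, marked, 2))  # 0.5
print(recall_at(reputation, marked, 3))  # 1.0
```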
Meanwhile, in order to verify the reliability of the experimental results more thoroughly, the AUC (area under the ROC curve) is also used in the experiment as an index of algorithm quality. The AUC can be interpreted as the probability that the reputation of a randomly selected water army user is lower than that of a randomly selected non-water-army user. The reputation of each artificial water army user is compared in turn with the reputation of each other user, the number of times N' that it is lower than the reputation of the normal user and the number of times N'' that it is equal to the reputation of the normal user are counted, and the AUC is calculated as:
$$AUC = \frac{N' + 0.5\,N''}{N} \tag{10}$$
In formula (10), $N = N_1 \times N_2$, where $N_1$ is the number of artificial water army users and $N_2$ is the number of other users. When the reputations of the artificial water army users are all equal to those of the normal users, so that selection is effectively random, the AUC is 0.5; when the reputations of the artificial water army users are all lower than those of the normal users, the AUC is 1; the AUC should therefore lie in [0.5, 1]. The higher the AUC, the more often the reputation of an artificial water army user is the lower one, that is, the higher the identification degree of the algorithm and the stronger its robustness.
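A sketch of the pairwise AUC count follows, assuming the usual form in which ties receive half weight. The function name `auc` and the sample reputations are hypothetical.

```python
def auc(water_army_reps, normal_reps):
    """Pairwise AUC: share of (water army, normal) pairs ranked correctly.

    A pair counts 1 if the water army reputation is lower,
    0.5 if the two are equal, and 0 otherwise.
    """
    lower = 0
    equal = 0
    for w in water_army_reps:
        for n in normal_reps:
            if w < n:
                lower += 1
            elif w == n:
                equal += 1
    total_pairs = len(water_army_reps) * len(normal_reps)
    return (lower + 0.5 * equal) / total_pairs


print(auc([0.1, 0.5], [0.5, 0.9, 0.3]))  # 0.75
```

This exhaustive comparison costs N1 x N2 steps; it is shown for clarity rather than efficiency.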
The validity of the method is verified on real data. The data sets used in the experiment were an Amazon data set, a Netflix data set and a Movielens data set. The Amazon data set address is https://snap.stanford.edu/data/#Amazon, the Netflix data set address is www.netflixprize.com, and the Movielens data set address is www.grouplens.org.
The network characteristics of each data set are shown in table 1. From left to right, the columns of the table are: the number of users M, the number of commodities N, the number of scores I, the average user degree <Ku> (the average number of commodities scored by each user), the average commodity degree <Ko> (the average number of users who scored each commodity), and the data sparsity S (the number of scores / (the number of users × the number of commodities)).
TABLE 1 network characteristics of three data sets
| Data set | M | N | I | <Ku> | <Ko> | S |
| Movielens | 943 | 1682 | 100000 | 106 | 59 | 0.06305 |
| Netflix | 5000 | 17768 | 3496614 | 699 | 169 | 0.03936 |
| Amazon | 9000 | 514105 | 1304872 | 145 | 3 | 0.0039 |
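The columns of table 1 can be derived from the triple data as sketched below. The function name `network_stats`, the dictionary keys and the toy triples are hypothetical; the values printed are for the toy data only, not for the three data sets.

```python
def network_stats(triples):
    """Compute M, N, I, <Ku>, <Ko> and S from (user, commodity, score) triples."""
    users = {user for user, _item, _score in triples}
    items = {item for _user, item, _score in triples}
    m, n, count = len(users), len(items), len(triples)
    return {
        "M": m,                   # number of users
        "N": n,                   # number of commodities
        "I": count,               # number of scores
        "Ku": count / m,          # average scores per user
        "Ko": count / n,          # average scores per commodity
        "S": count / (m * n),     # sparsity
    }


# Hypothetical toy triples.
triples = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (2, 1, 2)]
table_row = network_stats(triples)
print(table_row)
```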
The recall rate on the three data sets of table 1 is plotted in fig. 3. It can be seen that, for both the random and the extreme water army groups, the method of the present invention performed significantly differently on the three data sets. In addition, comparing graphs (a) and (b), the method of the invention is more effective on the extreme water army group than on the random water army group.
The recall rate of the method of the invention is compared with that of other reputation evaluation methods (GR, IGR, DR, IBR) on different data sets in fig. 4, wherein the abscissa L represents the L selected users with the lowest reputation and the ordinate is the recall rate. Figs. 4(a), (b) and (c) show extreme water army groups: the invention performs better than the other four methods and is more robust on sparse data sets. Figs. 4(d), (e) and (f) show random water army groups: our method performs significantly better than the other algorithms, especially on the Amazon data set, and is very stable in screening out the random water army.
The recall rate of each method as a function of the size of the water army group is shown in fig. 5. For extreme water army groups, the recall rate of our method hardly changes as the water army group grows, while the recall rates of both GR and IGR decrease to different extents (a disadvantage of the group-partition approach used by GR and IGR). For random water army groups, the growth trend of all methods is consistent, but our method grows faster at the early stage and performs better.
Figs. 6(a) and (b) show the AUC of each method for extreme water army groups and random water army groups of different sizes. In (a) it can be seen that the AUC of the IBR method and that of the method of the invention are consistent, whereas the AUC of both GR and IGR decreases significantly (consistent with the phenomenon in fig. 5). In (b) it is clear that the AUC of every algorithm decreases slightly as the random water army group grows, but our method still performs stably and better than the other methods.
Through the above analysis, the present invention has the following advantages: first, tests on three data sets show that the algorithm balances the screening of extreme malicious users and random malicious users well, with good accuracy and robustness; second, the algorithm has slightly better screening capability than the other algorithms, whether the data set contains more malicious users or fewer.
The above embodiments are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modifications made on the basis of the technical scheme according to the technical idea of the present invention fall within the protection scope of the present invention.