E-commerce water army identification method based on the rating range
Technical Field
The invention belongs to the field of information technology and relates to complex network analysis, specifically to a reputation evaluation method for e-commerce systems based on information entropy and user scoring characteristics, and in particular to a method for computing user reputation on a weighted user-commodity bipartite graph.
Background
With the rapid development of e-commerce, a large volume of commodity transactions increasingly relies on a reliable reputation system to give items reasonable and meaningful scores. Current reputation scoring systems, however, face several challenges. The most common is distortion caused by arbitrary user scoring, in particular organized water army groups (paid raters) that deliberately raise or lower the scores of a specific merchant and thereby seriously mislead consumers. Water army groups do not score according to objective facts; they are numerous and well concealed, and they produce abnormal commodity scores that lead consumers to misjudge the value of a commodity. They disturb the normal order of e-commerce platforms, damage the interests of platforms and consumers, and do considerable harm to the development of e-commerce. Establishing an efficient and reliable reputation system for users and commodities, one that identifies organized malicious users and assigns reasonable scores to commodities, therefore has both theoretical significance and substantial social and economic value.
A reputation scoring system computes a user's reputation from the user's historical scoring data, quantifying the user's influence on commodity scores. Although water army groups are well concealed, analysis of historical scoring data shows that their scoring behavior differs from that of normal users. The most common types are extreme water army groups and random water army groups. An extreme water army group consists of users who tend to give the lowest or the highest score; a random water army group scores commodities at random in order to disturb the ratings. Many reputation evaluation algorithms have been developed in recent years to deal with these two types.
Based on the correlation between user reputation and commodity quality, P. Laureti et al. and Zhou Yan-Bo et al. proposed the IR (Iterative Refinement) and CR (Correlation-based Ranking) algorithms, respectively. The core of these algorithms is the difference between commodity quality and user score: when a user's scoring behavior deviates from commodity quality, the user's reputation falls below that of a normal user. These algorithms perform well against extreme water army groups but poorly against random water army groups. Gao Jianan et al. therefore proposed the GR (Group-based Ranking) and IGR (Iterative Group-based Ranking) algorithms based on the idea of grouping. These algorithms classify users whose scoring deviates from the behavior of the majority as water army. Although they perform well against both random and extreme water army groups, they do not consider the case in which water army users make up the majority of a commodity's raters; normal users are then judged to be the water army instead, and the algorithm fails. In addition to the above, there are reputation algorithms based on the assumption of a normal or Beta distribution, such as the DR (deviation-based spam-filtering) algorithm proposed by Lee Daekyung et al., the BR (Bayesian Ranking) algorithm proposed by Wu Ying-Ying et al., and the IBR (Iterative Balance Ranking) algorithm proposed by Wu Leilei et al., which takes user preference into account. Such algorithms, however, only fit data that satisfy their assumptions and are therefore of limited applicability. Related experiments also show that they perform poorly and lack robustness when the data are large and sparse.
Disclosure of Invention
The invention aims to provide a range-based e-commerce water army identification method that identifies both extreme and random water army groups well, largely retains its identification capability on big data, and is robust and extensible.
In order to achieve the above purpose, the solution of the invention is:
a range-based e-commerce water army identification method comprising the following steps:
step 1, defining a triple data structure G = (i, j, k) whose elements represent the user, the commodity and the score, respectively;
step 2, initializing every user's reputation to the same value and calculating the initial commodity quality from these reputations;
step 3, calculating the deviation between each user's scores and the commodity quality according to a formula derived from information entropy;
step 4, calculating the rating range of each user so as to distinguish the scoring behavior of normal users from that of the water army;
step 5, calculating the reputation of each user from the deviation obtained in step 3 and the rating range obtained in step 4;
step 6, substituting the user reputations obtained in step 5 into the commodity quality formula to obtain the updated commodity quality, and repeating steps 3-5 to obtain new user reputations;
step 7, calculating the total change in user reputation; if it is larger than the reputation change threshold, substituting the new user reputations into the commodity quality formula and repeating steps 3-6; if it is smaller than the threshold, stopping the iteration;
step 8, sorting the resulting user reputations and selecting the N users with the lowest reputation as the water army, where N is a set value.
The specific content of step 2 is as follows: the commodity quality is calculated according to the following formula:
in the above formula, $Q_\alpha$ denotes the quality of commodity $\alpha$, $U_\alpha$ denotes the set of users who purchased commodity $\alpha$, $r_{i\alpha}$ denotes the score given by user $i$ to commodity $\alpha$, and $R_i$ denotes the reputation of user $i$. The reputation of each user is initialized to 1, and the commodity quality calculated at this point is the commodity quality in the initial state.
The specific process of the step 3 is as follows:
step 31, assume that user $i$ has given $m$ commodities the scores $G_i=\{g_{i1},g_{i2},\dots,g_{im}\}$ and that the qualities of these $m$ commodities are $Q=\{q_1,q_2,\dots,q_m\}$; the element-wise absolute difference of the two vectors is then:
$$D_i(G_i,Q)=|G_i-Q|=\{d_{i1},d_{i2},\dots,d_{im}\}$$
wherein $d_{im}$ is the difference between the user's score for a purchased commodity and the quality of that commodity, and $D_i(G_i,Q)$ denotes these differences over all commodities the user has purchased;
step 32, calculating the differences of each user as in step 31 and classifying them: $n$ intervals are defined according to the rating levels 1 to $n$, and each $d_{im}$ is assigned to the interval corresponding to its value; the average difference and the proportion of each interval are then computed for each user, the proportion being calculated as:
$$p(n_{ij})=\frac{L_{ij}}{L_i}$$
wherein $p(n_{ij})$ denotes the proportion of difference interval $j$ for user $i$, $L_{ij}$ denotes the number of differences of user $i$ that fall in interval $j$, and $L_i$ denotes the total number of score differences of user $i$;
step 33, calculating the deviation between the user's scores and the commodity quality, following the form of the information entropy calculation, according to the following formula:
$$DH_i=\sum_{j=1}^{n}p(n_{ij})\,\bar{d}_{ij}$$
wherein $DH_i$ denotes the deviation of user $i$, $p(n_{ij})$ denotes the proportion of interval $j$, and $\bar{d}_{ij}$ denotes the average difference of interval $j$; the larger $DH_i$ is, the more the user's scoring behavior deviates from the commodity quality and the lower the user's reputation.
In step 4, the number of times each user has given each rating level is counted first; the smallest count is then subtracted from the largest count, and rating levels the user has never given are excluded from the range calculation; finally the result is normalized, the calculation formula being as follows:
wherein $\zeta_i$ denotes the rating range of user $i$, $r_{max}$ denotes the count of the most frequently given rating level, and $r_{min}$ denotes the count of the least frequently given rating level; the smaller $\zeta_i$ is, the less clear the user's scoring preference, and such a user has a lower reputation than a user with a clear scoring preference.
In step 5, the reputation of the user i is calculated according to the following formula:
wherein $R_i$ denotes the reputation of user $i$, $\zeta_i$ denotes how pronounced the user's scoring preference is, and $DH_i$ denotes the deviation of the user's scores from the commodity quality.
In step 7, the calculation formula of the user reputation change sum Δ is:
wherein $R_i'$ denotes the new reputation of user $i$ in the iterative calculation and $|U|$ denotes the number of users in the user set $U$.
In step 8, the user reputations are sorted in ascending order using bubble sort. Users with the same reputation are ordered by user number, smaller numbers first. The user numbers of the first N users are recorded as the water army group identified by the invention.
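The selection in step 8 can be sketched as follows. This is a minimal illustration in Python that uses the built-in `sorted` in place of bubble sort (the resulting order is the same); the function name and the sample reputations are hypothetical.

```python
def lowest_reputation_users(reputation, n):
    """Return the n user numbers with the lowest reputation.

    reputation: dict mapping user number -> reputation.
    Ties are broken by user number, smaller numbers first.
    """
    ordered = sorted(reputation.items(), key=lambda kv: (kv[1], kv[0]))
    return [user for user, _ in ordered[:n]]


# Hypothetical reputations; users 1 and 3 are tied at 0.2.
sample = {3: 0.2, 1: 0.2, 2: 0.9, 4: 0.05}
suspected = lowest_reputation_users(sample, 3)
print(suspected)  # [4, 1, 3]
```

Sorting on the pair (reputation, user number) gives the tie-breaking rule directly, without a second pass over the tied users.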
In step 7, the reputation change threshold is set to $10^{-6}$.
The method further comprises verification using the recall rate:
a certain number of extreme water army users and random water army users are artificially created and marked; the L users with the lowest reputation are then obtained according to steps 1-8, and the proportion of the marked users that appear among these L users is counted, which is the recall rate:
$$R_c(L)=\frac{d'(L)}{d}$$
wherein $R_c(L)$ denotes the recall rate, $d'(L)$ denotes the number of water army users identified among the L lowest-reputation users, and $d$ denotes the total number of water army users; clearly $R_c(L)$ lies in $[0,1]$, and a higher value indicates better identification.
The method also uses the AUC to evaluate performance: a randomly selected water army user is compared with a randomly selected normal user N times, and the number of times N' that the water army user's reputation is lower than the normal user's and the number of times N'' that the two reputations are equal are counted:
$$AUC=\frac{N'+0.5\,N''}{N}$$
wherein a higher AUC indicates a higher probability that a normal user's reputation exceeds that of a water army user, i.e., better performance of the method.
After the scheme is adopted, compared with existing reputation algorithms, the invention starts from the scoring characteristics of the water army: a user whose scoring matches these characteristics is very likely to be judged a water army user. In addition, to take each user's individual scoring behavior into account, the invention uses an information entropy formula to measure the deviation between the user's scores and the commodity quality. Thus, when a user both matches the water army scoring characteristics and scores in a way that deviates from commodity quality, the user is likely to be a member of a water army group.
The method improves the identification of both extreme and random water army groups and solves the failure of group-based algorithms when the water army group is too large. It also incorporates the idea of algorithms based on the correlation between user reputation and commodity quality into the reputation calculation, so that it can evaluate commodity quality as well.
The method has the advantages of a small memory footprint, high precision, strong generality and ease of understanding, and can be applied to e-commerce recommendation systems, online water army identification and related detection tasks.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a demonstration of the method of the present invention on small matrix data;
wherein (a) shows the scoring data stored as a matrix, (b) shows the result of commodity quality initialization, (c) shows the matrix of differences between user scores and commodity quality, (d) shows the number of times each user's differences fall in each difference interval, (e) shows the frequency of each difference interval derived from (d), (f) shows the average value of each difference interval for each user, (g) shows the reciprocal of the deviation value between each user's scores and the commodity quality, (h) shows the reputation of each user, and (i) shows the rating range of each user obtained from the scoring matrix; after the user reputations are obtained, the commodity quality in (b) is updated and the process from (b) to (h) is repeated until the reputation change of all users is less than $10^{-6}$;
FIG. 3 shows recall rate versus L for the present invention on different data sets;
wherein (a) shows the extreme water army group and (b) shows the random water army group, the number of water army users is 50, and the abscissa is the number L of lowest-reputation users selected;
FIG. 4 shows recall rate versus L on different data sets for the present invention and other reputation evaluation methods, with 50 water army users and L ranging from 0 to 250;
wherein (a), (b) and (c) show the recall rate under an extreme water army group, and (d), (e) and (f) show the recall rate under a random water army group;
FIG. 5 shows recall rate versus the proportion of water army users on different data sets for the present invention and other reputation evaluation methods, where L equals the number of water army users by default;
FIG. 6 shows AUC versus the proportion of water army users on the Netflix data set for the present invention and other reputation evaluation methods;
wherein (a) shows the extreme water army group and (b) shows the random water army group.
Detailed Description
The technical solution and the advantages of the present invention will be described in detail with reference to the accompanying drawings.
As shown in FIG. 1, the invention provides a range-based method for identifying e-commerce water armies, the idea of which is as follows:
(1) Introduce a triple array to store the user scoring data. To address large-scale data storage in an e-commerce system, a triple data structure G = (i, j, k) is defined whose elements represent the user, the commodity and the score. One instance of the triple represents one scoring action, and an array of triples covers all scoring actions;
(2) Initialize the commodity quality. Each commodity has an intrinsic quality, which a scoring system estimates from user evaluations; because malicious scorers exist, however, analysis based on buyers' scores alone is not sufficient. The invention therefore ties commodity quality to user reputation when calculating commodity quality;
(3) Calculate the deviation of each user's scores from the commodity quality. The deviation is quantified using a variant of the information entropy formula;
(4) Distinguish the scoring characteristics of normal users and the water army. The scoring behavior of normal users and the scoring characteristics of the water army differ objectively. Unlike other reputation evaluation algorithms, the method starts from the scoring behavior of water army groups and uses the range of rating counts to measure the difference between normal users and the water army;
(5) Calculate the user reputation. The reputation of a user is obtained from the user's quantified scoring characteristics and the degree to which the user's scores deviate from the commodity quality;
(6) Iterate to obtain stable user reputations. Commodity quality is tied to user reputation, so to obtain more stable reputations and commodity quality, a reputation change threshold δ is set ($10^{-6}$ in this method), and iteration stops when the sum of user reputation changes is smaller than δ;
(7) Select the water army. The users are sorted in ascending order of calculated reputation, and the N users with the lowest reputation, where N equals the number of artificially created water army users, are taken as the water army and compared with the artificially created ones to obtain the identification accuracy of the method.
Specifically, the embodiment comprises the following steps, which are described with reference to the small matrix test data in FIG. 2 for ease of understanding:
Step 1, introduce a triple array to store the user scoring data. A triple data structure G = (i, j, k) is defined whose elements represent the user, the commodity and the score. In the big data environment of an e-commerce system, the commodities a user interacts with are few relative to the total number of commodities. Compared with storing the scoring behavior directly in a matrix, a triple array therefore avoids the space wasted on sparse data and greatly reduces memory use. For ease of illustration, however, FIG. 2(a) stores the user scoring information as a matrix.
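The triple storage can be sketched as follows. This is a minimal illustration in Python, not a prescribed layout; the `Rating` type, the index dictionaries, and the sample values are assumptions made for the example.

```python
from collections import defaultdict
from typing import NamedTuple


class Rating(NamedTuple):
    """One scoring action: the triple (i, j, k) = (user, commodity, score)."""
    user: int
    item: int
    score: int


# Hypothetical sample data: each triple records one scoring action.
ratings = [
    Rating(0, 0, 5), Rating(0, 2, 4),
    Rating(1, 0, 1), Rating(1, 1, 5),
    Rating(2, 2, 3),
]

# Index the triples by user and by commodity. Memory grows with the number
# of scores, not with users x commodities as a dense matrix would.
by_user = defaultdict(list)
by_item = defaultdict(list)
for r in ratings:
    by_user[r.user].append(r)
    by_item[r.item].append(r)

dense_cells = len(by_user) * len(by_item)  # cells a full matrix would need
stored = len(ratings)                      # triples actually stored
print(dense_cells, stored)  # 9 5
```

Even in this tiny example the matrix would hold 9 cells for 5 scores; on sparse real data the gap is far larger.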
Step 2, initialize the commodity quality. Each commodity has an intrinsic quality, which a scoring system estimates from user evaluations; because malicious scorers exist, however, analysis based on buyers' scores alone is not sufficient. The invention therefore ties commodity quality to user reputation when calculating commodity quality, and the calculation is shown in formula (1):
in the above formula, $Q_\alpha$ denotes the quality of commodity $\alpha$, $U_\alpha$ denotes the set of users who purchased commodity $\alpha$, $R_i$ denotes the reputation of user $i$, and $r_{i\alpha}$ denotes the score given by user $i$ to commodity $\alpha$; initially the reputation of each user is 1. FIG. 2(b) shows the result of commodity quality initialization for the small matrix.
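A minimal sketch of the quality calculation follows. It assumes that formula (1) is the reputation-weighted average of the scores a commodity received, a form commonly used by iterative reputation algorithms. The formula itself is not reproduced in the text above, so this form, the function name, and the sample data are all assumptions.

```python
def commodity_quality(ratings, reputation):
    """Reputation-weighted mean score per commodity (ASSUMED form of formula (1)).

    ratings: iterable of (user, item, score) triples.
    reputation: dict mapping user -> reputation R_i.
    Commodities whose raters all have zero reputation are skipped.
    """
    weighted_sum = {}
    weight_total = {}
    for user, item, score in ratings:
        w = reputation[user]
        weighted_sum[item] = weighted_sum.get(item, 0.0) + w * score
        weight_total[item] = weight_total.get(item, 0.0) + w
    return {item: weighted_sum[item] / weight_total[item]
            for item in weighted_sum if weight_total[item] > 0}


# Hypothetical data: three users score commodity "a", one scores "b".
ratings = [(0, "a", 5), (1, "a", 1), (2, "a", 3), (0, "b", 4)]
initial_reputation = {0: 1.0, 1: 1.0, 2: 1.0}
quality = commodity_quality(ratings, initial_reputation)
print(quality)  # {'a': 3.0, 'b': 4.0}
```

With every reputation equal to 1, the weighted average reduces to the plain mean of the scores, which is the initial state described in step 2.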
Step 3, calculate the deviation between each user's scores and the commodity quality. Based on the information entropy formula, the deviation is calculated as follows:
suppose that user $i$ has given $m$ commodities the scores $G_i=\{g_{i1},g_{i2},\dots,g_{im}\}$ and that the qualities of these $m$ commodities are $Q=\{q_1,q_2,\dots,q_m\}$; the element-wise absolute difference of the two vectors is then:
$$D_i(G_i,Q)=|G_i-Q|=\{d_{i1},d_{i2},\dots,d_{im}\} \tag{2}$$
in formula (2), note that $d_{im}$ is the difference between the user's score for a purchased commodity and the quality of that commodity. Formula (2) is not evaluated for commodities the user has not purchased. Under this precondition, $D_i(G_i,Q)$ denotes the differences between the user's scores for purchased commodities and the quality of those commodities; FIG. 2(c) shows the difference matrix between user scores and commodity quality.
Formula (2) is evaluated for each user, and the resulting differences are then classified. $n$ intervals are defined according to the rating levels 1 to $n$, and each $d_{im}$ is placed in the interval corresponding to its value. Each user thus has $n$ score difference intervals. Finally, the average difference and the proportion of each interval are computed for each user, the proportion being calculated as:
$$p(n_{ij})=\frac{L_{ij}}{L_i} \tag{3}$$
in formula (3), $p(n_{ij})$ denotes the proportion of difference interval $j$ for user $i$, $L_{ij}$ denotes the number of differences of user $i$ that fall in interval $j$, and $L_i$ denotes the total number of score differences of user $i$. The results are shown in FIG. 2(d) and (e), where FIG. 2(d) shows the number of times a user's score differences fall in each interval.
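The classification of differences can be sketched as follows. The text does not state the interval boundaries, so this sketch assumes that interval j (counted from 0) holds differences in [j, j + 1) and that larger values fall in the last interval; the function name and the sample scores are likewise illustrative.

```python
def interval_stats(diffs, n_levels):
    """Bin absolute differences; return (proportions, interval means).

    ASSUMPTION: interval j (0-based) holds differences in [j, j + 1);
    values at or above n_levels - 1 go to the last interval.
    """
    if not diffs:
        raise ValueError("the user has no scores")
    counts = [0] * n_levels
    sums = [0.0] * n_levels
    for d in diffs:
        j = min(int(d), n_levels - 1)
        counts[j] += 1
        sums[j] += d
    total = len(diffs)
    proportions = [c / total for c in counts]  # p(n_ij) = L_ij / L_i
    means = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    return proportions, means


user_scores = [5, 1, 4, 3]            # g_i1..g_im for one hypothetical user
item_quality = [4.5, 3.5, 3.75, 3.0]  # q_1..q_m for the same commodities
diffs = [abs(g - q) for g, q in zip(user_scores, item_quality)]  # formula (2)
proportions, means = interval_stats(diffs, n_levels=5)
print(proportions)  # [0.75, 0.0, 0.25, 0.0, 0.0]
print(means)        # [0.25, 0.0, 2.5, 0.0, 0.0]
```

Three of the four differences fall in the first interval and one in the third, which gives the proportions 0.75 and 0.25.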
After classification, the deviation between the user's scores and the commodity quality is calculated following the form of the information entropy calculation, as shown in formula (4):
$$DH_i=\sum_{j=1}^{n}p(n_{ij})\,\bar{d}_{ij} \tag{4}$$
in formula (4), $DH_i$ denotes the deviation of user $i$, $p(n_{ij})$ denotes the proportion of interval $j$, and $\bar{d}_{ij}$ denotes the average difference of interval $j$. The larger $DH_i$ is, the more the user's scoring behavior deviates from the commodity quality and the lower the user's reputation. FIG. 2(g) shows the reciprocal of the product of (e) and (f) for each user, i.e., the reciprocal of the deviation value.
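Reading the deviation as the proportion-weighted sum of the interval averages, which is what the description of FIG. 2(g) suggests, the calculation can be sketched as follows. The function name and the sample values are illustrative assumptions.

```python
def deviation(proportions, interval_means):
    """DH_i as the sum over intervals of proportion times average difference.

    ASSUMPTION: this weighted-sum form follows the description of FIG. 2(g),
    which multiplies the frequency matrix (e) by the average matrix (f).
    """
    return sum(p * d for p, d in zip(proportions, interval_means))


# Hypothetical user: 75% of differences average 0.25, 25% average 2.5.
proportions = [0.75, 0.0, 0.25, 0.0, 0.0]
interval_means = [0.25, 0.0, 2.5, 0.0, 0.0]
dh = deviation(proportions, interval_means)
inverse_dh = 1.0 / dh if dh > 0 else float("inf")  # the quantity in FIG. 2(g)
print(dh)  # 0.8125
```

A user whose scores match the commodity quality exactly would have a deviation of 0, so the reciprocal needs a guard such as the one above.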
Step 4, distinguish the scoring behavior of normal users from that of the water army. The invention uses the range statistic to measure how pronounced a user's scoring preference is. The number of times each user has given each rating level is counted first; the smallest count is then subtracted from the largest count, and rating levels the user has never given are excluded from the range calculation. Finally the result is normalized, the calculation formula being as follows:
wherein $\zeta_i$ denotes the rating range of user $i$, $r_{max}$ denotes the count of the most frequently given rating level, and $r_{min}$ denotes the count of the least frequently given rating level. A smaller $\zeta_i$ means the user has no clear scoring preference; such a user is considered untrustworthy, so the final reputation will be lower than that of users with a clear scoring preference. FIG. 2(i) shows the rating range of each user in the small matrix.
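The range of rating counts can be sketched as follows. The counting and the exclusion of unused rating levels follow the text; the normalization, here a division by the user's total number of scores, is an assumption, since the normalizing term is not given above. The function name and sample scores are hypothetical.

```python
from collections import Counter


def rating_count_range(scores):
    """Range of rating-level counts for one user, scaled to [0, 1).

    Rating levels the user never gave are excluded, as the text specifies.
    ASSUMPTION: the range is normalized by the user's number of scores.
    """
    if not scores:
        raise ValueError("the user has no scores")
    counts = Counter(scores)       # only levels that were actually given
    r_max = max(counts.values())   # count of the most frequent level
    r_min = min(counts.values())   # count of the least frequent level
    return (r_max - r_min) / len(scores)


focused = rating_count_range([5, 5, 5, 5, 4, 3])              # prefers 5
uniform = rating_count_range([1, 2, 3, 4, 5, 1, 2, 3, 4, 5])  # no preference
print(focused, uniform)  # 0.5 0.0
```

A user who spreads scores evenly over the levels, as a random scorer tends to, gets a range near 0, while a user with a clear preference gets a larger value.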
Step 5, calculate the user reputation. Based on formulas (4) and (5), the reputation is calculated as follows:
in formula (6), $R_i$ denotes the reputation of user $i$, $\zeta_i$ denotes how pronounced the user's scoring preference is, and $DH_i$ denotes the deviation between the user's scores and the commodity quality. Clearly, the larger $\zeta_i$ and the smaller $DH_i$, the more trustworthy the user.
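One way to combine the two quantities, consistent with the stated monotonic behavior, is the ratio of the rating range to the deviation. This ratio is only an assumed stand-in for formula (6), which is not reproduced above; the function name, the small constant `eps`, and the sample values are also assumptions.

```python
def reputation(zeta, dh, eps=1e-12):
    """ASSUMED form of formula (6): reputation = zeta / DH.

    Larger zeta (clearer preference) raises the reputation; larger DH
    (more deviation from commodity quality) lowers it. `eps` only
    prevents division by zero when DH is exactly 0.
    """
    return zeta / (dh + eps)


# Hypothetical users: same preference strength, different deviation.
consistent = reputation(zeta=0.5, dh=0.5)
deviating = reputation(zeta=0.5, dh=2.0)
# Same deviation, weaker preference.
no_preference = reputation(zeta=0.1, dh=0.5)
print(consistent > deviating, consistent > no_preference)  # True True
```

Any other combination that increases with the range and decreases with the deviation would preserve the ordering shown here.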
Step 6, iterate to obtain stable user reputations. As formula (1) shows, commodity quality is tied to user reputation; to obtain more stable reputations and commodity quality, a reputation change threshold is set, and iteration stops when the sum Δ of user reputation changes is smaller than the threshold, as shown in the following formula:
wherein $R_i'$ denotes the latest reputation of user $i$ in the iterative calculation and $|U|$ denotes the number of users in the user set $U$. FIG. 2(h) shows the final result for the small matrix after iteration with formulas (6) and (7) has finished.
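The stopping rule can be sketched as follows. The change measure used here, the mean absolute change over all users, is an assumed form of formula (7). The function names, the `max_rounds` safety limit, and the toy update, which stands in for the real quality, deviation and range calculations, are all hypothetical.

```python
def iterate_until_stable(initial, update, threshold=1e-6, max_rounds=1000):
    """Repeat `update` until the reputation change drops below `threshold`.

    initial: dict user -> reputation.
    update: function mapping such a dict to a new one (steps 2-5).
    ASSUMPTION: the change is the mean absolute difference over all users.
    """
    current = dict(initial)
    for rounds in range(1, max_rounds + 1):
        new = update(current)
        delta = sum(abs(new[u] - current[u]) for u in current) / len(current)
        current = new
        if delta < threshold:
            return current, rounds
    return current, max_rounds


def toy_update(rep):
    """Toy stand-in: halve each reputation's distance to 0.8 every round."""
    return {u: 0.5 * (r + 0.8) for u, r in rep.items()}


final, rounds = iterate_until_stable({"u1": 1.0, "u2": 1.0}, toy_update)
print(rounds, round(final["u1"], 4))
```

Passing the update step in as a function keeps the convergence loop separate from the particular quality and reputation formulas.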
Step 7, verify the algorithm. In the experiment, a certain number of extreme and random water army users are artificially created and marked. The user reputations are sorted in ascending order, the L users with the lowest reputation after running the method are taken as the water army and compared with the artificially created ones, and the proportion of the marked users that appear among these L users is counted; this is the recall rate (Recall). The calculation is shown in formula (8):
$$R_c(L)=\frac{d'(L)}{d} \tag{8}$$
in formula (8), $R_c(L)$ denotes the recall rate, $d'(L)$ denotes the number of water army users identified by the algorithm among the L lowest-reputation users, and $d$ denotes the total number of water army users. Clearly $R_c(L)$ lies in $[0,1]$, and a higher value indicates better identification.
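The recall calculation can be sketched as follows: the marked users found among the L lowest-reputation users are counted and divided by the total number of marked users. The function name and the sample data are hypothetical, and the set of marked users is assumed to be non-empty.

```python
def recall_at_l(reputation, marked, l):
    """R_c(L) = d'(L) / d for the l lowest-reputation users.

    reputation: dict user -> reputation.
    marked: non-empty set of artificially created water army users.
    """
    ranked = sorted(reputation, key=lambda u: (reputation[u], u))
    found = sum(1 for u in ranked[:l] if u in marked)  # d'(L)
    return found / len(marked)                         # divided by d


# Hypothetical reputations; users 1, 3 and 4 are the marked water army.
reputation = {1: 0.1, 2: 0.9, 3: 0.2, 4: 0.8, 5: 0.3}
marked = {1, 3, 4}
print(recall_at_l(reputation, marked, 2))  # 0.6666666666666666
print(recall_at_l(reputation, marked, 4))  # 1.0
```

Recall can only grow as L increases, which is why the recall curves in the figures rise with L.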
In addition, to further support the reliability of the experimental results, the AUC (area under the ROC curve) is used as a second metric for evaluating the algorithm. The AUC can be interpreted as the probability that a randomly selected water army user has a lower reputation than a randomly selected normal user. In the experiment, a randomly selected water army user is compared with a normal user N times, and the number of times N' that the water army user's reputation is lower and the number of times N'' that the two reputations are equal are counted, as shown in formula (9):
$$AUC=\frac{N'+0.5\,N''}{N} \tag{9}$$
as formula (9) shows, a higher AUC means a higher probability that a normal user's reputation exceeds that of a water army user, and therefore better algorithm performance.
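The sampled AUC can be sketched as follows, using the standard form (N' + 0.5 N'') / N. The function name, the fixed random seed (added so that runs are repeatable), and the sample reputations are assumptions made for the example.

```python
import random


def sampled_auc(reputation, marked, n_comparisons, seed=0):
    """Estimate the AUC from random (water army, normal) comparisons.

    Counts N' (water army reputation lower) and N'' (equal) over
    n_comparisons draws and returns (N' + 0.5 * N'') / N.
    """
    rng = random.Random(seed)
    spam = [u for u in reputation if u in marked]
    normal = [u for u in reputation if u not in marked]
    lower = equal = 0
    for _ in range(n_comparisons):
        s = reputation[rng.choice(spam)]
        h = reputation[rng.choice(normal)]
        if s < h:
            lower += 1
        elif s == h:
            equal += 1
    return (lower + 0.5 * equal) / n_comparisons


# Hypothetical case in which every water army user ranks below every normal user.
reputation = {"s1": 0.1, "s2": 0.2, "n1": 0.7, "n2": 0.9}
auc = sampled_auc(reputation, {"s1", "s2"}, n_comparisons=1000)
print(auc)  # 1.0
```

An AUC of 0.5 corresponds to reputations that do not separate the two groups at all, and 1.0 to perfect separation.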
The validity of the method is verified on real data. The data sets used in the experiment are two MovieLens data sets and one Netflix data set. The MovieLens data sets are available at https://grouplens.org/datasets/movielens/, and the Netflix data set at http:// pan.basic.com/s/1 dDtmbW 9.
The network characteristics of each data set are shown in Table 1. From left to right, the columns are: number of users (M), number of commodities (N), number of scores (I), average user degree <Ku> (average number of commodities scored per user), average commodity degree <KO> (average number of scores received per commodity), and data sparsity S (number of scores / (number of users × number of commodities)).
Table 1. Network characteristics of the three data sets
| Data set | M | N | I | <Ku> | <KO> | S |
|---|---|---|---|---|---|---|
| MovieLens | 943 | 1682 | 100000 | 106 | 59 | 0.06305 |
| MovieLens_100 | 7120 | 130642 | 1048575 | 147 | 8 | 0.00113 |
| Netflix | 5000 | 17768 | 3496614 | 699 | 169 | 0.03936 |
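The sparsity column can be checked from the other columns, taking sparsity as the number of scores divided by the number of users times the number of commodities. The short sketch below recomputes it from the values in Table 1; the function name is illustrative.

```python
datasets = {
    # name: (users M, commodities N, scores I), taken from Table 1
    "MovieLens": (943, 1682, 100000),
    "MovieLens_100": (7120, 130642, 1048575),
    "Netflix": (5000, 17768, 3496614),
}


def sparsity(users, items, scores):
    """Fraction of the user x commodity matrix that holds a score."""
    return scores / (users * items)


for name, (m, n, i) in datasets.items():
    print(f"{name}: S = {sparsity(m, n, i):.5f}")
# MovieLens: S = 0.06305
# MovieLens_100: S = 0.00113
# Netflix: S = 0.03936
```

The recomputed values agree with the S column, which shows how small a share of the possible user-commodity pairs actually carries a score.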
The recall rate of the present invention as a function of L on the three data sets of Table 1 is shown in FIG. 3. The method performs noticeably differently on the three data sets for both random and extreme water army groups. As FIG. 3(a) and (b) show, it performs better on MovieLens and Netflix than on MovieLens_100. Comparing (a) and (b), the method is also more effective against extreme water army groups than against random ones.
The performance of the present method and other reputation evaluation methods (GR, IGR, IB) on the different data sets is shown in FIG. 4, where the abscissa is the number L of lowest-reputation users selected and the ordinate is the corresponding recall rate. Overall, for extreme water army groups the recall rate of the present method is greater than or equal to that of the other methods. For random water army groups the method outperforms the others on all three data sets; on the large data set in particular, GR, IGR and IB perform poorly while the present method remains stable, which makes it better suited to identifying water armies in the big data of e-commerce systems.
Recall rate as a function of water army group size is shown for each method in FIG. 5. For extreme water army groups, the recall rate of the present method hardly changes as the group grows, whereas GR and IGR both decline to varying degrees. For random water army groups, the recall rate of every method improves somewhat as the group grows, and the present method leads the others both in recall rate and in the rate at which the recall rate increases.
FIG. 6(a) and (b) show the AUC of each method for extreme and random water army groups of different sizes. In (a), the AUC of the IB method and of the present method are consistent, while GR and IGR both decline markedly. In (b), the AUC of each algorithm does not change significantly as the random water army group grows, and the AUC of the present method is higher than that of the other methods.
The above analysis shows that the present invention has the following advantages: first, it has a clear advantage over other methods in identifying water armies in big data; second, it is more robust and is less affected by the water army; finally, it is easy to explain and understand. In addition, the method ran more efficiently than the other methods in the experiments, and subsequent work shows that the user rating range calculated by formula (5) can be combined with most current reputation evaluation methods and improves their identification of water army groups to varying degrees, so it is generally applicable.
The above embodiments only illustrate the technical idea of the present invention and do not limit its scope of protection; any modification made to the technical scheme in accordance with the technical idea of the present invention falls within the scope of protection of the present invention.