CN111275526B - E-commerce water army identification method based on range difference - Google Patents

Publication number
CN111275526B
Authority
CN
China
Prior art keywords
user
credit
users
quality
commodity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010065827.6A
Other languages
Chinese (zh)
Other versions
CN111275526A (en)
Inventor
孙宏亮
梁楷平
卜湛
曹杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Lalamy Information Technology Co ltd
Original Assignee
Nanjing University of Finance and Economics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Finance and Economics
Priority to CN202010065827.6A
Publication of CN111275526A
Application granted
Publication of CN111275526B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/06 Buying, selling or leasing transactions
    • G06Q30/0601 Electronic shopping [e-shopping]
    • G06Q30/0609 Buyer or seller confidence or verification
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0282 Rating or review of business operators or products

Abstract

The invention discloses a range-based method for identifying e-commerce water army users, comprising the following steps: calculating the commodity quality in the initial state, and calculating the deviation between each user's ratings and the commodity quality; calculating the rating range of each user; calculating the reputation of each user from the deviation and the rating range; substituting the user reputations into the commodity quality formula to obtain updated commodity quality and, from it, new user reputations; calculating the sum of the user reputation changes and, while this sum is greater than the reputation change threshold, substituting the new reputations into the commodity quality formula again, stopping the iteration once the sum falls below the threshold; and finally selecting the N users with the lowest reputation as the water army. The method identifies both extreme and random water army groups well, largely retains its identification capability on large data sets, and is robust and extensible.

Description

E-commerce water army identification method based on range difference
Technical Field
The invention belongs to the field of information technology and relates to complex network analysis. It concerns a reputation evaluation method based on information entropy and user rating characteristics in e-commerce systems, and in particular a method for calculating user reputation on a weighted user-commodity bipartite graph.
Background
With the rapid development of e-commerce, a large volume of commodity transactions increasingly relies on a reliable reputation system to give items reasonable and meaningful scores. Current reputation scoring systems, however, face many challenges. The most common is distortion caused by arbitrary user ratings; in particular, organized water army groups deliberately raise or lower the scores of specific merchants, seriously misleading consumers' decisions. Water army groups do not rate according to objective facts, are numerous and well concealed, and cause abnormal commodity scores, so that consumers misjudge the value of commodities. They disturb the normal order of e-commerce platforms, damage the interests of platforms and consumers, and do considerable harm to the development of e-commerce. How to build an efficient and reliable reputation system for users and commodities, one that identifies organized malicious users and gives commodities reasonable scores, is therefore of profound theoretical significance and significant social and economic value.
A reputation scoring system calculates a user's reputation from the user's historical rating data by quantifying the user's effect on commodities. Although water army groups are well concealed, analysis of historical rating data shows that their rating behavior differs from that of normal users. The most common are extreme water army groups and random water army groups. An extreme water army group consists of users who tend to give the lowest or the highest score; a random water army group rates commodities at random in order to disturb the scores. Many reputation evaluation algorithms targeting these two types of groups have been developed in recent years.
Based on the correlation between user reputation and commodity quality, P. Laureti et al. and Zhou Yan-Bo et al. proposed the IR (Iterative Refinement) and CR (Correlation-based Ranking) algorithms, respectively. The core of these algorithms is to measure the difference between commodity quality and user ratings: when a user's rating behavior deviates from commodity quality, the user's reputation is lower than that of a normal user. These algorithms perform well against extreme water army groups but poorly against random water army groups. Gao Jianan et al. therefore proposed the GR (Group-based Ranking) and IGR (Iterative Group-based Ranking) algorithms, based on the idea of grouping. These algorithms classify users who deviate from the majority rating behavior as the water army. Although they perform well in identifying both random and extreme water army groups, they do not consider that, when water army users make up the majority of a commodity's raters, normal users are instead classified as the water army and the algorithm fails. In addition, there are reputation algorithms based on distributional assumptions such as the normal or Beta distribution, for example the DR (Deviation-based spam-filtering) algorithm proposed by Lee Daekyung et al., the BR (Bayesian Ranking) algorithm proposed by Wu Ying-Ying et al., and the IBR (Iterative Balance Ranking) algorithm proposed by Wu Leilei et al., which takes user preference into account. Such algorithms only fit data that satisfy their assumptions and are therefore quite limited. Related experiments show that they perform poorly and lack robustness on large and sparse data.
Disclosure of Invention
The invention aims to provide a range-based e-commerce water army identification method that identifies both extreme and random water army groups well, largely retains its identification capability on large data sets, and is robust and extensible.
To achieve the above purpose, the solution of the invention is:
A range-based e-commerce water army identification method, comprising the following steps:
step 1, defining a triple data structure G = (i, j, k), whose elements represent the user, the commodity and the score, respectively;
step 2, initializing the reputation of every user to the same value in the initial state, and calculating the commodity quality in the initial state based on these reputations;
step 3, calculating the deviation between each user's ratings and the commodity quality according to the information entropy formula;
step 4, calculating the rating range of each user, thereby distinguishing the rating behavior of normal users from that of the water army;
step 5, calculating the reputation of each user from the deviation obtained in step 3 and the rating range obtained in step 4;
step 6, substituting the user reputations obtained in step 5 into the commodity quality formula to obtain the corresponding commodity quality, and repeating steps 3-5 to obtain new user reputations;
step 7, calculating the sum of the user reputation changes; while this sum is greater than the reputation change threshold, substituting the new user reputations into the commodity quality formula to obtain the corresponding commodity quality and repeating steps 3-6; stopping the iteration once the sum is smaller than the reputation change threshold;
step 8, sorting the resulting user reputations and selecting the N users with the lowest reputation as the water army, where N is a preset value.
Specifically, in step 2 the commodity quality is calculated according to the following formula:
$$Q_\alpha = \frac{\sum_{i \in U_\alpha} R_i \, r_{i\alpha}}{\sum_{i \in U_\alpha} R_i}$$
In the above formula, $Q_\alpha$ denotes the quality of commodity $\alpha$, $U_\alpha$ denotes the set of users who purchased commodity $\alpha$, $r_{i\alpha}$ denotes the rating given by user $i$ to commodity $\alpha$, and $R_i$ denotes the reputation of user $i$. The reputation of every user is initialized to 1, and the commodity quality calculated at this point is the commodity quality in the initial state.
The specific process of the step 3 is as follows:
step 31, suppose that user $i$ has rated $m$ commodities with scores $G_i = \{g_{i1}, g_{i2}, \ldots, g_{im}\}$ and that the qualities of these $m$ commodities are $Q = \{q_1, q_2, \ldots, q_m\}$; the absolute value of the difference of the two vectors is then:
$$D_i(G_i, Q) = |G_i - Q| = \{d_{i1}, d_{i2}, \ldots, d_{im}\}$$
where $d_{im}$ is the difference between user $i$'s rating of the $m$-th purchased commodity and the quality of that commodity, and $D_i(G_i, Q)$ represents the differences between the user's ratings of the purchased commodities and the commodity qualities;
step 32, calculating the differences of each user as in step 31 and classifying them: $n$ intervals are defined according to the rating levels 1 to $n$, and each $d_{im}$ is placed in the interval corresponding to its value; the average difference and the proportion of each interval are then computed for each user, the proportion being calculated as:
$$p(n_{ij}) = \frac{L_{ij}}{L_i}$$
where $p(n_{ij})$ denotes the proportion of difference interval $j$ for user $i$, $L_{ij}$ denotes the number of differences of user $i$ that fall in interval $j$, and $L_i$ denotes the total number of rating differences of user $i$;
step 33, following the form of the information entropy calculation, calculating the deviation between the user's ratings and the commodity quality according to the following formula:
Figure BDA0002375940170000041
where $DH_i$ denotes the deviation of user $i$, $p(n_{ij})$ denotes the proportion of interval $j$, and
Figure BDA0002375940170000042
denotes the average difference of interval $j$; the larger $DH_i$, the more the user's rating behavior deviates from the commodity quality, and the lower the user's reputation.
In step 4, the number of times each user has given each rating level is counted first; the count of the least-used rating level is then subtracted from the count of the most-used rating level, and rating levels the user has never given are excluded from the range calculation; finally the result is normalized. The calculation formula is:
Figure BDA0002375940170000043
where $\zeta_i$ denotes the rating range of user $i$, $r_{max}$ denotes the number of times the most frequent rating level was given, and $r_{min}$ denotes the number of times the least frequent rating level was given; the smaller $\zeta_i$, the less pronounced the user's rating preference, and the lower the user's reputation compared with a user who has a clear rating preference.
In step 5, the reputation of user $i$ is calculated according to the following formula:
Figure BDA0002375940170000044
where $R_i$ denotes the reputation of user $i$, $\zeta_i$ denotes how pronounced the user's rating preference is, and $DH_i$ denotes the deviation between the user's ratings and the commodity quality.
In step 7, the sum of the user reputation changes $\Delta$ is calculated as:
Figure BDA0002375940170000045
where $R_i'$ denotes the new reputation of user $i$ in the iterative computation and $|U|$ denotes the number of users in the user set $U$.
In step 8, the user reputations are sorted in ascending order using bubble sort. Users with the same reputation are ordered by user number, with smaller numbers placed before larger ones. The user numbers of the first N users are recorded as the water army group identified by the invention.
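The selection in step 8 can be sketched in Python as follows. This is a minimal illustration under two assumptions: the built-in `sorted` stands in for the bubble sort named above (the resulting order is the same), and ties are broken by placing smaller user numbers first. The function name `lowest_reputation_users` and the sample reputations are hypothetical.

```python
def lowest_reputation_users(reputation, n):
    """Return the n user numbers with the lowest reputation.

    Sorting by (reputation, user number) puts low reputations first and,
    among equal reputations, smaller user numbers first (assumed tie rule).
    """
    ranked = sorted(reputation.items(), key=lambda kv: (kv[1], kv[0]))
    return [user for user, _ in ranked[:n]]


# Hypothetical reputations keyed by user number.
reputation = {3: 0.9, 1: 0.2, 2: 0.2, 4: 0.05}
suspects = lowest_reputation_users(reputation, 3)
print(suspects)
```

Users 1 and 2 share a reputation here, so the tie rule decides their relative order in the result.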
In step 7, the reputation change threshold is set to $10^{-6}$.
The method further comprises verification using the recall rate:
a certain number of extreme water army users and random water army users are artificially generated and labeled; the L users with the lowest reputation are then obtained according to steps 1-8, and the number of labeled users among these L users is divided by the total number of labeled water army users, which gives the recall rate:
$$R_c(L) = \frac{d'(L)}{d}$$
where $R_c(L)$ denotes the recall rate, $d'(L)$ denotes the number of water army users identified among the $L$ selected users, and $d$ denotes the total number of water army users; clearly $R_c(L)$ lies in $[0, 1]$, and a higher value indicates better identification.
The method also uses the AUC to evaluate performance: a randomly selected water army user is compared with a randomly selected normal user N times, counting the number of times N' that the water army user's reputation is lower than the normal user's and the number of times N'' that the two reputations are equal:
$$AUC = \frac{N' + 0.5\,N''}{N}$$
where a higher AUC indicates a higher probability that a normal user's reputation exceeds that of a water army user, i.e. better performance of the method.
With the above scheme, and in contrast to existing reputation algorithms, the invention starts from the rating characteristics of the water army: a user whose ratings match these characteristics is very likely to be judged a water army user. To take the user's rating behavior into account at the same time, the invention uses an information entropy formula to measure the deviation between the user's ratings and the commodity quality. Thus, when a user matches the water army rating characteristics and the user's individual rating behavior deviates from the commodity quality, the user is likely to be a member of a water army group.
The method improves the identification of extreme and random water army groups and solves the failure of group-based algorithms when the water army group is too large. In addition, the invention incorporates the idea of algorithms based on the correlation between user reputation and commodity quality into the reputation calculation, so that it can also assess commodity quality.
The method has a small memory footprint, high accuracy and strong generality, is easy to understand, and can be applied to e-commerce recommendation systems, online water army identification and detection, and similar fields.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 demonstrates the method of the invention on small matrix data;
wherein (a) shows the rating data stored as a matrix, (b) shows the result of commodity quality initialization, (c) shows the matrix of differences between user ratings and commodity quality, (d) shows the number of times each user's differences fall in each difference interval, (e) shows the frequency of each difference interval derived from matrix (d), (f) shows the average value of each difference interval for each user, (g) shows the reciprocal of the deviation between each user's ratings and the commodity quality, (h) shows the reputation of each user, and (i) shows the rating range of each user obtained from the rating matrix; after the user reputations are obtained, the commodity quality in (b) is updated and the process from (b) to (h) is repeated until the reputation changes of all users are less than $10^{-6}$, at which point the iteration stops;
FIG. 3 shows the recall rate of the invention as a function of L on different data sets;
wherein (a) shows extreme water army groups and (b) shows random water army groups; the number of water army users is 50, and the abscissa is the number L of lowest-reputation users selected;
FIG. 4 compares the recall rate of the invention with that of other reputation evaluation methods as a function of L on different data sets, with 50 water army users and L ranging from 0 to 250;
wherein (a), (b) and (c) show the recall rate under extreme water army groups, and (d), (e) and (f) show the recall rate under random water army groups;
FIG. 5 compares the recall rate of the invention with that of other reputation evaluation methods as a function of the proportion of water army users on different data sets, with L equal to the number of water army users by default;
FIG. 6 compares the AUC of the invention with that of other reputation evaluation methods as a function of the proportion of water army users on the Netflix data set;
wherein (a) shows extreme water army groups and (b) shows random water army groups.
Detailed Description
The technical solution and the advantages of the present invention will be described in detail with reference to the accompanying drawings.
As shown in FIG. 1, the invention provides a range-based e-commerce water army identification method, the idea of which is as follows:
(1) Introduce a triple array to store the user rating data. To address the storage of large data in e-commerce systems, a triple data structure G = (i, j, k) is defined, whose elements represent the user, the commodity and the score, respectively. One triple represents one rating event, and an array of triples covers all rating events;
(2) Initialize the commodity quality. Each commodity has an intrinsic quality, which in a rating system is formed from user evaluations; because of malicious raters, however, analyzing buyers' scores alone is not sufficient. For this reason, the invention ties commodity quality to user reputation when calculating commodity quality;
(3) Calculate the deviation of each user's ratings from the commodity quality. The deviation is quantified with a variant of the information entropy formula;
(4) Distinguish the rating characteristics of normal users and the water army. Differences between the rating behavior of normal users and the rating characteristics of the water army objectively exist. Unlike other reputation evaluation algorithms, the method starts from the rating behavior of water army groups and uses the range of rating counts to measure the difference between the rating characteristics of normal users and the water army;
(5) Calculate the user reputation. The reputation of a user is obtained from the user's quantified rating characteristics and the deviation between the user's ratings and the commodity quality;
(6) Iterate to obtain stable user reputations. Because commodity quality is tied to user reputation, a reputation change threshold δ is set ($10^{-6}$ in this method) to obtain more stable reputations and commodity qualities, and the iteration stops when the sum of user reputation changes is smaller than δ;
(7) Select the water army. The users are sorted by calculated reputation in ascending order, and the N lowest-reputation users are selected as the water army, where N equals the number of artificially generated water army users; comparing them with the artificially generated water army gives the identification accuracy of the method.
Specifically, the embodiment mainly comprises the following steps, which for ease of understanding are described with reference to the small-matrix test data in FIG. 2:
Step 1, introduce a triple array to store the user rating data. A triple data structure G = (i, j, k) is defined, whose elements represent the user, the commodity and the score, respectively. In the big data environment of an e-commerce system, the commodities a user has interacted with are few relative to the total number of commodities. Compared with storing the rating behavior directly in a matrix, a triple array therefore avoids the space wasted by sparse data and greatly reduces memory usage. For ease of illustration, however, FIG. 2(a) stores the user rating information as a matrix.
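The triple storage can be illustrated with a short Python sketch. The `Rating` type, the two index dictionaries and the toy data are hypothetical names and values chosen for illustration; the description only specifies the triple (user, commodity, score).

```python
from collections import defaultdict
from typing import NamedTuple


class Rating(NamedTuple):
    user: int    # i
    item: int    # j
    score: int   # k


# Hypothetical toy data: each triple records one rating event.
ratings = [
    Rating(0, 0, 5), Rating(0, 2, 4),
    Rating(1, 0, 1), Rating(1, 1, 5),
    Rating(2, 2, 3),
]

# Index the triples by user and by commodity so that later steps
# never need a dense user x commodity matrix.
by_user = defaultdict(list)
by_item = defaultdict(list)
for r in ratings:
    by_user[r.user].append(r)
    by_item[r.item].append(r)

stored = len(ratings)                       # triples actually stored
dense_cells = len(by_user) * len(by_item)   # cells a full matrix would need
print(stored, dense_cells)
```

Even in this toy case the triple list holds fewer entries than the dense matrix has cells, and the gap widens as the data become sparser.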
Step 2, initialize the commodity quality. Each commodity has an intrinsic quality, which in a rating system is formed from user evaluations; because of malicious raters, however, analyzing buyers' scores alone is not sufficient. For this reason, the invention ties commodity quality to user reputation when calculating commodity quality, as shown in formula (1):
$$Q_\alpha = \frac{\sum_{i \in U_\alpha} R_i \, r_{i\alpha}}{\sum_{i \in U_\alpha} R_i} \tag{1}$$
In the above formula, $Q_\alpha$ denotes the quality of commodity $\alpha$, $U_\alpha$ denotes the set of users who purchased commodity $\alpha$, $R_i$ denotes the reputation of user $i$, and $r_{i\alpha}$ denotes the rating given by user $i$ to commodity $\alpha$; in the initial state the reputation of every user is 1. FIG. 2(b) shows the commodity quality initialization result for the small matrix.
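The sketch below shows one way to compute commodity quality in Python. It assumes that formula (1) is the reputation-weighted mean of the ratings each commodity received; this form is consistent with the symbol definitions above but is an assumption of this example. The function name `commodity_quality` and the toy ratings are hypothetical.

```python
from collections import defaultdict


def commodity_quality(ratings, reputation):
    """Reputation-weighted mean rating per commodity (assumed form of formula (1)).

    ratings: iterable of (user, commodity, score) triples.
    reputation: dict mapping user -> reputation R_i.
    """
    weighted_sum = defaultdict(float)
    weight_total = defaultdict(float)
    for user, item, score in ratings:
        w = reputation[user]
        weighted_sum[item] += w * score
        weight_total[item] += w
    # Commodities whose raters all have zero reputation are skipped.
    return {item: weighted_sum[item] / weight_total[item]
            for item in weighted_sum if weight_total[item] > 0}


ratings = [(0, 0, 5), (1, 0, 1), (0, 1, 4), (2, 1, 2)]

# Initial state: every user has reputation 1, so quality is the plain mean.
initial = commodity_quality(ratings, {0: 1.0, 1: 1.0, 2: 1.0})

# After user 0 gains reputation, commodity quality moves toward user 0's scores.
updated = commodity_quality(ratings, {0: 3.0, 1: 1.0, 2: 1.0})
print(initial, updated)
```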
Step 3, calculate the deviation between each user's ratings and the commodity quality. Based on the information entropy formula, the deviation is calculated as follows:
suppose that user $i$ has rated $m$ commodities with scores $G_i = \{g_{i1}, g_{i2}, \ldots, g_{im}\}$ and that the qualities of these $m$ commodities are $Q = \{q_1, q_2, \ldots, q_m\}$; the absolute value of the difference of the two vectors is then:
$$D_i(G_i, Q) = |G_i - Q| = \{d_{i1}, d_{i2}, \ldots, d_{im}\} \tag{2}$$
Note that in formula (2), $d_{im}$ is the difference between the user's rating of a purchased commodity and the quality of that commodity. Formula (2) is not evaluated for commodities the user has not purchased. On this premise, $D_i(G_i, Q)$ represents the differences between the user's ratings of the purchased commodities and the commodity qualities; FIG. 2(c) shows the matrix of differences between user ratings and commodity quality.
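Formula (2) can be sketched in Python as follows. The function name `rating_differences` and the sample values are hypothetical; the only logic taken from the text is the absolute difference, restricted to commodities the user actually rated.

```python
def rating_differences(user_ratings, quality):
    """Absolute difference |g - q| for each commodity the user rated.

    user_ratings: dict commodity -> score given by this user.
    quality: dict commodity -> current commodity quality.
    Commodities the user has not rated are not evaluated.
    """
    return {item: abs(score - quality[item])
            for item, score in user_ratings.items()}


quality = {0: 3.0, 1: 4.5, 2: 2.0}
user_ratings = {0: 5, 2: 1}      # this user did not rate commodity 1
diffs = rating_differences(user_ratings, quality)
print(diffs)
```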
The differences of each user are calculated by formula (2) and then classified. $n$ intervals are defined according to the rating levels 1 to $n$, and each $d_{im}$ is placed in the interval corresponding to its value. Each user thus has $n$ rating-difference intervals. Finally, the average difference and the proportion of each interval are computed for each user, the proportion being calculated as:
$$p(n_{ij}) = \frac{L_{ij}}{L_i} \tag{3}$$
In formula (3), $p(n_{ij})$ denotes the proportion of difference interval $j$ for user $i$, $L_{ij}$ denotes the number of differences of user $i$ that fall in interval $j$, and $L_i$ denotes the total number of rating differences of user $i$. The results are shown in FIG. 2(d) and (e), where FIG. 2(d) shows the number of times a user's rating differences fall in each interval.
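A short Python sketch of the interval proportions follows. The text does not state the interval boundaries, so this example assumes that interval j (counted from 0) covers differences in [j, j+1), with larger values clamped into the last interval; that binning rule, the function names and the sample differences are assumptions made for illustration.

```python
from collections import Counter


def interval_index(d, n_levels):
    """Assumed binning: interval j holds differences in [j, j+1); clamp at the top."""
    return min(int(d), n_levels - 1)


def interval_proportions(diffs, n_levels):
    """Proportion L_ij / L_i of a user's differences falling in each interval."""
    counts = Counter(interval_index(d, n_levels) for d in diffs)
    total = len(diffs)
    return [counts.get(j, 0) / total for j in range(n_levels)]


# Hypothetical differences for one user on a 5-level rating scale.
diffs = [0.2, 0.7, 1.5, 3.9]
props = interval_proportions(diffs, 5)
print(props)
```

The proportions of one user always sum to 1, which makes them directly usable as the weights in the entropy-style deviation of the next step.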
After classification, the deviation between the user's ratings and the commodity quality is calculated following the form of the information entropy calculation, as shown in formula (4):
Figure BDA0002375940170000092
In formula (4), $DH_i$ denotes the deviation of user $i$, $p(n_{ij})$ denotes the proportion of interval $j$, and
Figure BDA0002375940170000094
denotes the average difference of interval $j$. The larger $DH_i$, the more the user's rating behavior deviates from the commodity quality and the lower the user's reputation. FIG. 2(g) shows the reciprocal of the product of matrices (e) and (f), i.e. the reciprocal of the deviation.
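The following Python sketch illustrates one possible reading of the deviation. Because FIG. 2(g) is described as the reciprocal of the product of the frequency matrix (e) and the interval-mean matrix (f), the example assumes that the deviation is the sum over intervals of proportion times mean difference. This is an assumption; the function name `deviation` and the sample rows are hypothetical.

```python
def deviation(proportions, interval_means):
    """Assumed deviation: sum over intervals of proportion x mean difference.

    proportions: one row of the frequency matrix, as in FIG. 2(e).
    interval_means: the matching row of interval averages, as in FIG. 2(f).
    Empty intervals (proportion 0) contribute nothing.
    """
    return sum(p * m for p, m in zip(proportions, interval_means) if p > 0)


# Hypothetical user whose ratings stay close to commodity quality.
consistent = deviation([1.0, 0.0, 0.0, 0.0, 0.0],
                       [0.25, 0.0, 0.0, 0.0, 0.0])

# Hypothetical user whose ratings are often far from commodity quality.
erratic = deviation([0.5, 0.0, 0.25, 0.25, 0.0],
                    [0.25, 0.0, 2.5, 3.8, 0.0])
print(consistent, erratic)
```

Under this reading, and if each interval mean is the arithmetic mean of the differences in that interval, the weighted sum equals the user's overall mean absolute difference, since each term is (count / total) times (sum / count).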
Step 4, distinguish the rating behavior of normal users and the water army. The invention uses the range statistic to measure how pronounced a user's rating preference is. The number of times each user has given each rating level is counted first; the count of the least-used rating level is then subtracted from the count of the most-used rating level, and rating levels the user has never given are excluded from the range calculation. Finally the result is normalized; the calculation formula is:
Figure BDA0002375940170000093
where $\zeta_i$ denotes the rating range of user $i$, $r_{max}$ denotes the number of times the most frequent rating level was given, and $r_{min}$ denotes the number of times the least frequent rating level was given. If $\zeta_i$ is small, the user has no clear rating preference; such a user is considered untrustworthy, so the final reputation is lower than that of a user with a clear rating preference. FIG. 2(i) shows the rating range of each user in the small matrix.
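The rating range can be sketched in Python as follows. The counting of rating levels and the exclusion of unused levels follow the text; the normalization is not recoverable from the text, so this example assumes division by the user's total number of ratings. The function name `rating_range` and the sample score lists are hypothetical.

```python
from collections import Counter


def rating_range(scores):
    """Range of rating-level counts, normalized by the number of ratings.

    Only levels the user actually gave appear in the Counter, so unused
    levels are excluded from the range. Dividing by len(scores) is an
    assumed normalization.
    """
    counts = Counter(scores)
    r_max = max(counts.values())   # count of the most frequent level
    r_min = min(counts.values())   # count of the least frequent level
    return (r_max - r_min) / len(scores)


# Hypothetical user with a clear preference for one level.
focused = rating_range([5, 5, 5, 5, 1])

# Hypothetical user who spreads ratings evenly over all levels.
uniform = rating_range([1, 2, 3, 4, 5, 1, 2, 3, 4, 5])
print(focused, uniform)
```

A user who spreads ratings evenly, as a random rater would, ends up with a range of zero under this sketch.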
Step 5, calculate the user reputation. From formulas (4) and (5), the reputation is calculated as follows:
Figure BDA0002375940170000101
In formula (6), $R_i$ denotes the reputation of user $i$, $\zeta_i$ denotes how pronounced the user's rating preference is, and $DH_i$ denotes the deviation between the user's ratings and the commodity quality. Clearly, the larger $\zeta_i$ and the smaller $DH_i$, the more trustworthy the user.
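A minimal Python sketch of a reputation that grows with the rating range and shrinks with the deviation is given below. It assumes the simple ratio of the two quantities, which matches the stated monotonic behavior and the use of the reciprocal deviation in FIG. 2(g), but the exact form of formula (6) is an assumption here. The function name `reputation` and the guard value `eps` are hypothetical.

```python
def reputation(zeta, dh, eps=1e-12):
    """Assumed reputation: rating range divided by deviation.

    eps guards against division by zero when a user's ratings match
    commodity quality exactly; it is an illustrative choice.
    """
    return zeta / max(dh, eps)


# Clear preference and small deviation: more trustworthy.
trusted = reputation(zeta=0.6, dh=0.3)

# Weak preference and large deviation: less trustworthy.
suspect = reputation(zeta=0.1, dh=1.5)
print(trusted, suspect)
```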
Step 6, iterate to obtain stable user reputations. As formula (1) shows, commodity quality is tied to user reputation; to obtain more stable reputations and commodity qualities, a reputation change threshold is set, and the iteration stops when the sum of user reputation changes $\Delta$ is smaller than the threshold, as shown in the following formula:
Figure BDA0002375940170000102
where $R_i'$ denotes the latest reputation in the iterative computation and $|U|$ denotes the number of users in the user set. FIG. 2(h) shows the final result for the small matrix after iterating formulas (6) and (7) to completion.
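The stopping rule can be sketched as a generic Python loop. Two things are assumptions of this example: the change measure is taken to be the mean absolute difference between old and new reputations over all users, and the `update` callback is a hypothetical stand-in for one pass of quality, deviation and reputation recomputation. The toy update used in the demonstration is not the method's update; it is only chosen so that convergence is easy to follow.

```python
def iterate_until_stable(initial, update, threshold=1e-6, max_iter=1000):
    """Repeat `update` until the mean absolute reputation change drops below threshold.

    initial: dict user -> starting reputation.
    update: callable taking the current reputations and returning new ones.
    Returns the final reputations and the number of passes performed.
    """
    current = dict(initial)
    for step in range(1, max_iter + 1):
        new = update(current)
        delta = sum(abs(new[u] - current[u]) for u in current) / len(current)
        current = new
        if delta < threshold:
            return current, step
    return current, max_iter


def toy_update(rep):
    # Stand-in update with fixed point 2.0; each pass halves the remaining error.
    return {u: 0.5 * r + 1.0 for u, r in rep.items()}


final, steps = iterate_until_stable({"a": 1.0, "b": 1.0}, toy_update)
print(final, steps)
```

The `max_iter` cap is a safety net for updates that do not converge, which a real reputation update is not guaranteed to do.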
Step 7, verify the algorithm. In the experiments, a certain number of extreme and random water army users are artificially generated and labeled. The user reputations are sorted in ascending order, the L users with the lowest reputation after running the method are taken as the water army and compared with the artificially generated water army, and the number of labeled users among these L users is divided by the total number of labeled water army users, giving the recall rate (Recall), as shown in formula (8):
$$R_c(L) = \frac{d'(L)}{d} \tag{8}$$
In formula (8), $R_c(L)$ denotes the recall rate, $d'(L)$ denotes the number of water army users identified by the algorithm among the $L$ selected users, and $d$ denotes the total number of water army users. Clearly $R_c(L)$ lies in $[0, 1]$, and a higher value indicates better identification.
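The recall computation can be sketched in Python as follows. The function name `recall` and the sample ranking are hypothetical; the ratio itself follows the definitions of the identified count and the total water army count given above.

```python
def recall(ranked_users, labelled, top_l):
    """Share of labelled water army users found among the top_l lowest-reputation users.

    ranked_users: user numbers sorted by ascending reputation.
    labelled: set of user numbers that were planted as water army.
    """
    found = len(set(ranked_users[:top_l]) & labelled)
    return found / len(labelled)


# Hypothetical ranking (lowest reputation first) and planted users.
ranked = [7, 2, 9, 4, 1, 5]
labelled = {2, 4, 8}

r2 = recall(ranked, labelled, 2)
r4 = recall(ranked, labelled, 4)
print(r2, r4)
```

Because the denominator is the number of planted users rather than L, the value can only stay the same or rise as L grows.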
In addition, to further demonstrate the reliability of the experimental results, the AUC (area under the ROC curve) is used as a second metric of algorithm quality. The AUC can be interpreted as the probability that the reputation of a randomly selected water army user is lower than that of a randomly selected normal user. In the experiments, a randomly selected water army user is compared with a normal user N times, counting the number of times N' that the water army user's reputation is lower than the normal user's and the number of times N'' that the two are equal, as shown in formula (9):
$$AUC = \frac{N' + 0.5\,N''}{N} \tag{9}$$
As formula (9) shows, the higher the AUC, the higher the probability that a normal user's reputation exceeds that of a water army user, and the better the algorithm performs.
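The sampled AUC estimate can be sketched in Python as below, assuming the common form in which a lower water army reputation counts 1 and a tie counts 0.5. The function name `sampled_auc`, the default number of comparisons and the fixed seed are hypothetical choices for illustration.

```python
import random


def sampled_auc(reputation, water_army, n_comparisons=10000, seed=0):
    """Estimate AUC by comparing random (water army, normal) user pairs.

    Counts how often the water army user's reputation is lower (n_lower)
    or equal (n_equal), then returns (n_lower + 0.5 * n_equal) / N.
    """
    rng = random.Random(seed)
    spam = [u for u in reputation if u in water_army]
    normal = [u for u in reputation if u not in water_army]
    n_lower = n_equal = 0
    for _ in range(n_comparisons):
        s = reputation[rng.choice(spam)]
        n = reputation[rng.choice(normal)]
        if s < n:
            n_lower += 1
        elif s == n:
            n_equal += 1
    return (n_lower + 0.5 * n_equal) / n_comparisons


# Perfect separation: every water army user ranks below every normal user.
separated = sampled_auc({0: 0.1, 1: 0.2, 2: 0.8, 3: 0.9}, {0, 1})

# No information: all reputations equal, so every comparison is a tie.
uninformative = sampled_auc({0: 0.5, 1: 0.5, 2: 0.5, 3: 0.5}, {0, 1})
print(separated, uninformative)
```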
The validity of the method is verified on real data. The experiments use two MovieLens data sets and one Netflix data set. The MovieLens data sets are available at https://grouplens.org/datasets/movielens/, and the Netflix data set at http:// pan.basic.com/s/1 dDtmbW 9.
The network characteristics of each data set are shown in table 1, and the data fields in each column in the table represent from left to right: number of users, number of products, number of scores, user average degree (average number of products evaluated by each user), product average degree (average number of purchased products per product), data sparsity degree (number of scores/(number of users))
TABLE 1 Network characteristics of the three data sets
Data set        M      N        I         <Ku>   <KO>   S
MovieLens       943    1682     100000    106    59     0.06305
MovieLens_100   7120   130642   1048575   147    8      0.00113
Netflix         5000   17768    3496614   699    169    0.03936
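The derived columns of Table 1 can be recomputed from M, N and I with a short Python sketch. It assumes <Ku> = I/M, <KO> = I/N and S = I/(M × N); the function name `network_stats` is hypothetical.

```python
# (M users, N products, I ratings) copied from Table 1
datasets = {
    "MovieLens": (943, 1682, 100000),
    "MovieLens_100": (7120, 130642, 1048575),
    "Netflix": (5000, 17768, 3496614),
}


def network_stats(m, n, i):
    """Average user degree, average product degree and sparsity."""
    return {
        "user_degree": i / m,      # <Ku>: ratings per user
        "item_degree": i / n,      # <KO>: ratings per product
        "sparsity": i / (m * n),   # S: filled share of the user-product matrix
    }


for name, (m, n, i) in datasets.items():
    s = network_stats(m, n, i)
    print(f"{name}: <Ku>={s['user_degree']:.0f} "
          f"<KO>={s['item_degree']:.0f} S={s['sparsity']:.5f}")
```

Under these assumptions the recomputed values agree with Table 1 except for the Netflix average product degree, which comes out near 197 rather than the tabulated 169; the text does not explain that difference.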
The recall rate of the method of the present invention as a function of L on the three data sets of Table 1 is shown in FIG. 3. The method performs noticeably differently across the three data sets for both the random and the extreme water army groups. From FIG. 3(a) and (b), the method performs better on the MovieLens and Netflix data sets than on the MovieLens_100 data set. In addition, comparing panels (a) and (b), the method is more effective against the extreme water army group than against the random water army group.
The performance of the method of the present invention and of other reputation evaluation methods (GR, IGR, IB) on the different data sets is shown in FIG. 4, where the abscissa is the number L of lowest-reputation users selected and the ordinate is the corresponding recall rate. Overall, in identifying extreme water army groups, the recall rate of the method of the invention is greater than or equal to that of the other methods. In identifying random water army groups, the method outperforms the other methods on all three data sets; in particular, on the large data set GR, IGR and IB perform poorly while the identification effect of the method remains stable, so the method is better suited to water army identification on the large-scale data of e-commerce systems.
The recall rate of each method as a function of the water army group size is shown in FIG. 5. For extreme water army groups, the recall rate hardly changes as the water army group grows, except for the GR and IGR methods, whose recall rates decrease to different degrees. For random water army groups, the recall rate of every method improves to some extent as the group grows. The method of the present invention is superior to the other methods both in recall rate and in the rate at which the recall rate increases.
FIG. 6(a) and (b) show the AUC of each method under different sizes of the extreme water army group and the random water army group, respectively. In (a), the AUC of the IB method and that of the method of the invention remain consistent, while the AUC of GR and IGR drops significantly. In (b), the AUC of each algorithm does not change significantly as the random water army group grows, and the AUC of the method of the present invention is better than that of the other methods.
Through the above analysis, the present invention has the following advantages: first, compared with other methods, it has a marked advantage in water army identification on big data; second, it is more robust and is little affected by the water army; finally, it is easy to explain and understand, that is, highly interpretable. In addition, the method ran more efficiently than the other methods during the experiments, and subsequent work shows that the user rating range calculated by formula (5) can be applied to most current reputation evaluation methods, improving their identification of water army groups to different degrees, so it is generally applicable.
The above embodiments are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modifications made on the basis of the technical scheme according to the technical idea of the present invention fall within the protection scope of the present invention.

Claims (6)

1. A range-based e-commerce water army identification method, characterized by comprising the following steps:
step 1, defining a triple data structure G = (i, j, k), whose elements respectively represent users, commodities and scores;
step 2, initializing the reputation of every user to the same value in the initial state, and calculating the commodity quality in the initial state based on the reputation;
the commodity quality is calculated according to the following formula:
Figure FDA0002881770550000011
in the above formula, Q_α denotes the quality of commodity α, U_α denotes the set of users who purchased commodity α, r_iα denotes the score given by user i to commodity α, and R_i denotes the reputation of user i; the reputation of every user is initialized to 1, and the commodity quality calculated at this point is the commodity quality in the initial state;
step 3, calculating the deviation between each user's scores and the commodity quality according to the information entropy formula; the specific process is as follows:
step 31, assuming that the scores of user i for m commodities are G_i = {g_i1, g_i2, ..., g_im} and the qualities of the m commodities are Q = {q_1, q_2, ..., q_m}, the absolute value of the difference of the two vectors is:
$$D_i(G_i, Q) = |G_i - Q| = \{d_{i1}, d_{i2}, \dots, d_{im}\}$$
wherein d_im is the difference between the score given by user i to the m-th purchased commodity and the quality of that commodity, and D_i(G_i, Q) represents the differences between the user's scores for the purchased commodities and the qualities of those commodities;
step 32, calculating the differences of each user as in step 31 and classifying them: n intervals are divided according to the rating levels 1 to n, each difference d_im is assigned to the corresponding interval according to its value, and the average difference and the proportion of each interval are then computed for each user, the proportion being calculated as follows:
$$p(n_{ij}) = \frac{L_{ij}}{L_i}$$
wherein p(n_ij) denotes the proportion of difference interval j of user i, L_ij denotes the number of differences of user i falling in interval j, and L_i denotes the total number of score differences of user i;
step 33, calculating the deviation between the user's scores and the commodity quality according to the following formula, following the calculation of information entropy:
Figure FDA0002881770550000021
wherein DH_i denotes the deviation of user i, p(n_ij) denotes the proportion of interval j, and the symbol shown in Figure FDA0002881770550000022 represents the average difference of interval j; the larger DH_i is, the more the user's behavior deviates from the commodity quality and the lower the user's reputation;
step 4, calculating the rating range of each user, thereby distinguishing the rating behavior of normal users from that of the water army;
counting the number of times each user has given each rating level, and subtracting the count of the least frequently given rating level from the count of the most frequently given rating level; a rating level that the user has never given is not included in the range calculation; finally, normalizing the data, the calculation formula being as follows:
Figure FDA0002881770550000023
wherein ζ_i denotes the rating range of user i, r_max denotes the count of the most frequently given rating level, and r_min denotes the count of the least frequently given rating level; the smaller ζ_i is, the less pronounced the user's rating preference, and the reputation of such a user is lower than that of a user with an obvious rating preference;
step 5, calculating the reputation of each user based on the deviation obtained in step 3 and the rating range obtained in step 4;
the reputation of user i is calculated according to the following formula:
Figure FDA0002881770550000024
wherein R_i denotes the reputation of user i, ζ_i indicates how pronounced the user's rating preference is, and DH_i denotes the deviation of the user's scores from the commodity quality;
step 6, substituting the user reputations obtained in step 5 into the commodity quality formula to obtain the corresponding commodity qualities, and repeating steps 3 to 5 to obtain new user reputations;
step 7, calculating the total change in user reputation; if it is greater than the reputation change threshold, substituting the new user reputations into the commodity quality formula to obtain the corresponding commodity qualities and repeating steps 3 to 6 until the total change in user reputation is smaller than the reputation change threshold; if it is smaller than the reputation change threshold, stopping the iteration;
step 8, sorting the obtained user reputations and selecting the N users with the lowest reputation as the water army, wherein N is a set value.
2. The range-based e-commerce water army identification method of claim 1, wherein: in step 7, the total user reputation change Δ is calculated as:
Figure FDA0002881770550000031
wherein R_i' denotes the new reputation of user i in the iterative calculation and |U| denotes the number of users in the user set U.
3. The range-based e-commerce water army identification method of claim 2, wherein: the reputation change threshold is set to 10^-6.
4. The range-based e-commerce water army identification method of claim 1, wherein: in step 8, the user reputations are sorted in ascending order using bubble sort; users with the same reputation are ordered by user number, with larger user numbers placed before smaller ones.
5. The range-based e-commerce water army identification method of claim 1, wherein: the method further comprises verification using the recall rate:
a certain number of extreme-scoring water army groups and random water army groups are each artificially created and labeled, the L users with the lowest reputation are then obtained according to steps 1 to 8, and the proportion of labeled users among these L users is counted, namely the recall rate:
$$R_c(L) = \frac{d'(L)}{d}$$
wherein R_c(L) represents the recall rate, d'(L) represents the number of water army users identified in the list of length L, and d represents the total number of water army users; clearly R_c(L) lies in [0, 1], and a higher value indicates a better identification effect.
6. The range-based e-commerce water army identification method of claim 1, wherein: the method further uses the AUC to judge its quality: a randomly selected water army user is compared with a randomly selected normal user N times, and the number of times N' that the water army user's reputation is lower than the normal user's and the number of times N'' that the two reputations are equal are counted:
$$\mathrm{AUC} = \frac{N' + 0.5\,N''}{N}$$
wherein a higher AUC indicates a higher probability that a normal user's reputation is higher than that of a water army user, that is, better performance of the method.
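Steps 2, 31, 32 and 4 of claim 1 can be illustrated with a rough Python sketch. Several points are assumptions rather than the claimed formulas: the commodity quality is taken to be the reputation-weighted mean score, the differences are binned into n equal-width intervals over [0, n-1], and the rating range is returned without the normalization of the claimed formula. All function names are hypothetical.

```python
from collections import Counter, defaultdict


def commodity_quality(ratings, reputation):
    """ratings: list of (user, item, score) triples; reputation: user -> R_i.
    ASSUMPTION: quality is the reputation-weighted mean score of each item."""
    num, den = defaultdict(float), defaultdict(float)
    for user, item, score in ratings:
        num[item] += reputation[user] * score
        den[item] += reputation[user]
    return {item: num[item] / den[item] for item in num if den[item] > 0}


def interval_proportions(ratings, quality, user, n_levels=5):
    """Return (p, mean_diff): share L_ij / L_i and mean difference per interval."""
    diffs = [abs(s - quality[it]) for u, it, s in ratings if u == user]
    width = (n_levels - 1) / n_levels  # ASSUMPTION: equal-width bins on [0, n-1]
    bins = defaultdict(list)
    for d in diffs:
        bins[min(int(d / width), n_levels - 1)].append(d)
    p = {j: len(v) / len(diffs) for j, v in bins.items()}
    mean_diff = {j: sum(v) / len(v) for j, v in bins.items()}
    return p, mean_diff


def raw_rating_range(ratings, user):
    """Count of the most used level minus count of the least used level.
    Levels the user never gave are absent from the counter, so they are
    excluded. The normalization step of the claim is NOT applied here."""
    counts = Counter(s for u, _, s in ratings if u == user)
    if not counts:
        return 0
    return max(counts.values()) - min(counts.values())


ratings = [("a", "x", 5), ("a", "y", 5), ("a", "z", 4),
           ("b", "x", 1), ("b", "y", 3), ("b", "z", 5)]
reputation = {"a": 1.0, "b": 1.0}          # every user starts with reputation 1
quality = commodity_quality(ratings, reputation)
p, mean_diff = interval_proportions(ratings, quality, "a")
print(quality, p, mean_diff)
print(raw_rating_range(ratings, "a"), raw_rating_range(ratings, "b"))
```

In this toy data user "a" gives the score 5 twice and the score 4 once, so the unnormalized range is 1, whereas user "b" uses three levels once each and gets 0, matching the idea that a user without a clear rating preference has a smaller range.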
CN202010065827.6A 2020-01-20 2020-01-20 E-commerce water force identification method based on range difference Active CN111275526B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010065827.6A CN111275526B (en) 2020-01-20 2020-01-20 E-commerce water force identification method based on range difference

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010065827.6A CN111275526B (en) 2020-01-20 2020-01-20 E-commerce water force identification method based on range difference

Publications (2)

Publication Number Publication Date
CN111275526A CN111275526A (en) 2020-06-12
CN111275526B true CN111275526B (en) 2021-04-13

Family

ID=71001813

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010065827.6A Active CN111275526B (en) 2020-01-20 2020-01-20 E-commerce water force identification method based on range difference

Country Status (1)

Country Link
CN (1) CN111275526B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112270463A (en) * 2020-10-16 2021-01-26 北京师范大学珠海校区 Multi-party data aggregation processing method and device, electronic equipment and medium
CN112312169B (en) * 2020-11-20 2022-09-30 广州欢网科技有限责任公司 Method and equipment for checking program scoring validity
CN113674045B (en) * 2021-04-14 2022-08-26 南京财经大学 E-commerce water force identification method based on interval segmentation

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103745001B (en) * 2014-01-24 2016-10-05 福州大学 A kind of product comment spam person's detecting system
CN105630801A (en) * 2014-10-30 2016-06-01 国际商业机器公司 Method and apparatus for detecting deviated user
CN105469279A (en) * 2015-11-24 2016-04-06 杭州师范大学 Commodity quality evaluation method and apparatus thereof
CN109460508B (en) * 2018-10-10 2021-10-15 浙江大学 Efficient spam comment user group detection method

Also Published As

Publication number Publication date
CN111275526A (en) 2020-06-12

Similar Documents

Publication Publication Date Title
CN111275526B (en) E-commerce water force identification method based on range difference
TWI662421B (en) Community division method and device based on feature matching network
US10861054B2 (en) Heuristic customer clustering
US20160342963A1 (en) Tree pathway analysis for signature inference
CN104517052B (en) Invasion detection method and device
CN110991474A (en) Machine learning modeling platform
CN109948724A (en) A kind of electric business brush single act detection method based on improvement LOF algorithm
CN107633444A (en) Commending system noise filtering methods based on comentropy and fuzzy C-means clustering
CN113568368B (en) Self-adaptive determination method for industrial control data characteristic reordering algorithm
CN110532429B (en) Online user group classification method and device based on clustering and association rules
CN109977299B (en) Recommendation algorithm fusing project popularity and expert coefficient
CN112437053A (en) Intrusion detection method and device
US11734312B2 (en) Feature transformation and missing values
CN110765364A (en) Collaborative filtering method based on local optimization dimension reduction and clustering
CN111311276B (en) Identification method and device for abnormal user group and readable storage medium
CN109508350B (en) Method and device for sampling data
CN111242647B (en) Method for identifying malicious user based on E-commerce comment
CN113674045B (en) E-commerce water force identification method based on interval segmentation
CN117035983A (en) Method and device for determining credit risk level, storage medium and electronic equipment
CN111914930A (en) Density peak value clustering method based on self-adaptive micro-cluster fusion
CN111144430B (en) Card-keeping number identification method and device based on genetic algorithm
CN115082135B (en) Method, device, equipment and medium for identifying online time difference
CN107423319B (en) Junk web page detection method
CN112632219B (en) Method and device for intercepting junk short messages
Yang et al. Measuring scorecard performance

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220719

Address after: 230000 Room 203, building 2, phase I, e-commerce Park, Jinggang Road, Shushan Economic Development Zone, Hefei City, Anhui Province

Patentee after: Hefei Jiuzhou Longteng scientific and technological achievement transformation Co.,Ltd.

Address before: 210023 No. 3 Wenyuan Road, Qixia District, Nanjing City, Jiangsu Province

Patentee before: NANJING University OF FINANCE AND ECONOMICS

TR01 Transfer of patent right

Effective date of registration: 20221008

Address after: 511466 Room 1504, No. 5, Wangjiang Second Street, Huangge Town, Nansha District, Guangzhou, Guangdong Province (office only)

Patentee after: Guangzhou Lalamy Information Technology Co.,Ltd.

Address before: 230000 Room 203, building 2, phase I, e-commerce Park, Jinggang Road, Shushan Economic Development Zone, Hefei City, Anhui Province

Patentee before: Hefei Jiuzhou Longteng scientific and technological achievement transformation Co.,Ltd.
