CN113674045B - E-commerce water army identification method based on interval segmentation

Info

Publication number
CN113674045B
Authority
CN
China
Legal status
Active
Application number
CN202110401328.4A
Other languages
Chinese (zh)
Other versions
CN113674045A (en)
Inventor
孙宏亮
刘国鑫
丁俊杰
钱子杰
卜湛
曹杰
Current Assignee
Beijing Sifang Hengtong Network Information Co ltd
Original Assignee
Nanjing University of Finance and Economics
Application filed by Nanjing University of Finance and Economics
Priority to CN202110401328.4A
Publication of CN113674045A
Application granted
Publication of CN113674045B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/06 Buying, selling or leasing transactions
    • G06Q30/0601 Electronic shopping [e-shopping]
    • G06Q30/0609 Buyer or seller confidence or verification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/11 Complex mathematical operations for solving equations, e.g. nonlinear equations, general mathematical optimization problems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Pure & Applied Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Business, Economics & Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Algebra (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Operations Research (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an e-commerce water army identification method based on interval segmentation, comprising the following steps: standardizing the raw scores with the Z-score method; calculating a correct scoring interval for each commodity, where scores inside the interval are considered reasonable and scores outside it unreasonable; calculating the accuracy and the range of each user's scores; calculating the sum of the distances from each of the user's scores to the correct scoring interval of the corresponding commodity, and combining it with the accuracy and range of the user's scores to obtain the user's reputation; and finally selecting the N users with the lowest reputation as the water army. The method was tested on three data sets (MovieLens, Netflix and Amazon). The results show that it performs well in calculating user reputation and identifying the water army, with high accuracy, robustness and scalability.

Description

E-commerce water army identification method based on interval segmentation
Technical Field
The invention belongs to the field of computer network information technology and involves the precise analysis of big data. It relates to a method for evaluating user reputation from the scoring behaviour of users in an e-commerce system, specifically a method that computes user reputation by analysing a weighted user-commodity bipartite graph.
Background
With the development of the economy and of network technology, buying goods over the internet has become more convenient and more affordable, which has greatly promoted e-commerce. Commodity scores are an important criterion when users choose what to buy, so a reliable scoring system is needed. Current reputation scoring systems have many problems. The most common is score deviation caused by users who score randomly or maliciously; in particular, a water army may be hired to post favourable or unfavourable scores on a large scale in order to raise or lower the score of a commodity, which can seriously mislead consumers. Water army groups are large, their scores are not based on objective facts, and they are well concealed among a large number of normal users, so they distort commodity scores. This harms e-commerce platforms and seriously disrupts the normal order of online trading. It is therefore important to build a stable and reliable user and commodity reputation system: an algorithm that can eliminate malicious users and evaluate the real quality of a commodity would greatly benefit e-commerce and society as a whole.
A user-commodity reputation system needs a large amount of user rating data. The quality of a commodity is estimated by quantifying each user's influence on it, and the reputation of each user is computed at the same time. Although the water army is well concealed, a good algorithm can analyse its scoring records, find how it differs from normal users, and screen it out. There are two typical kinds of water army group: the random water army group and the extreme water army group. The random water army group scores commodities at random regardless of their quality, for example because it knows too little about them. The extreme water army group gives commodities the highest or the lowest score in order to disrupt their normal scoring. To screen out these two typical groups, many reputation evaluation algorithms have emerged in recent years, supported by extensive experiments.
Correlation-based methods: Laureti et al. proposed an iterative refinement (IR) method in which the reputation of a user is inversely proportional to the difference between his scores and the quality of the corresponding objects; user reputation and commodity quality are computed iteratively until they become stable. Zhou et al. proposed a correlation-based ranking (CR) method that is robust against malicious user attacks, in which a user's reputation is determined by the correlation coefficient between his scores and the estimated quality of the objects. Liao et al. further improved the CR method by introducing a reputation redistribution process and two penalty factors.
Group-based methods: Gao et al. proposed a group-based ranking (GR) method and later, in 2015, an iterative group-based ranking (IGR) method that adds an iterative step to GR. Iteration has been widely used in subsequent studies. Wu et al. also applied iteration in their method for eliminating the bias of user scores (IBR), which divides user bias into three categories (negative, positive and no influence) and corrects commodity quality by the users' bias. Although bias-based algorithms work well against attacks by extreme users, they bring no significant improvement for random users.
Methods based on a specific distribution assumption: Lee Daekyung et al. proposed a deviation-based method (DR) for screening random malicious users, and Wuying-Ying et al. proposed the Bayesian ranking (BR) algorithm based on a Beta distribution assumption. These algorithms perform poorly and lack robustness when the data are large and sparse.
Disclosure of Invention
The invention aims to provide an e-commerce water army identification method based on interval segmentation that identifies both extreme and random water army groups well, largely retains its identification ability on big data, and is robust and scalable.
To identify the e-commerce water army more accurately, the solution of the invention is as follows:
An e-commerce water army identification method based on interval segmentation comprises the following steps:
step 1, defining a triple data structure G = {i, α, r}, whose elements represent the user, the commodity and the score respectively;
step 2, assuming that user scores follow a normal distribution, and calculating the mean and the variance of the historical scores of each commodity;
step 3, calculating, for each user, the accuracy of the user's scores, the sum of the distances from each score to the corresponding scoring interval, and the range of the user's scores;
in step 3, the accuracy of the user's scores is calculated as follows:
step A31, standardizing the user scores with the Z-score method;
step A32, classifying the standardized scores and determining a scoring interval, where scores inside the interval are considered correct and scores outside it incorrect;
step A33, counting the numbers of correct and incorrect scores of the user, and from them calculating the accuracy of the user's scores;
step 4, combining the accuracy and range of the user's scores with the sum of the distances from each score to the corresponding scoring interval to obtain the reputation of the user;
step 5, sorting the users by reputation and selecting the N users with the lowest reputation as the water army, where N is a set value.
In step 2, the mean $\mu_\alpha$ of the historical scores of commodity $\alpha$ is calculated according to the following formula:
$$\mu_\alpha = \frac{1}{|U_\alpha|} \sum_{i \in U_\alpha} r_{i\alpha}$$
where $U_\alpha$ is the set of users who purchased commodity $\alpha$, $|U_\alpha|$ is the number of users who purchased commodity $\alpha$, and $r_{i\alpha}$ is the score given by user $i$ to commodity $\alpha$;
the variance $\sigma_\alpha$ of the historical scores of commodity $\alpha$ (in square-root form, i.e. the standard deviation that the Z-score divides by) is calculated according to the following formula:
$$\sigma_\alpha = \sqrt{\frac{1}{|U_\alpha|} \sum_{i \in U_\alpha} \left(r_{i\alpha} - \mu_\alpha\right)^2}$$
where $U_\alpha$ is the set of users who purchased commodity $\alpha$, $|U_\alpha|$ is the number of users who purchased commodity $\alpha$, $r_{i\alpha}$ is the score given by user $i$ to commodity $\alpha$, and $\mu_\alpha$ is the mean of the historical scores of commodity $\alpha$.
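As a rough illustration of step 2, the following Python sketch groups (user, commodity, score) triples by commodity and computes the per-commodity mean and spread. The function name `item_stats` and the sample triples are hypothetical, and the spread is assumed to be the population standard deviation (dividing by the number of raters), which the text does not state explicitly.

```python
from collections import defaultdict
from math import sqrt


def item_stats(triples):
    """Return {commodity: (mean, std)} from (user, commodity, score) triples.

    Assumption: std is the population standard deviation, i.e. the squared
    deviations are divided by the number of raters, not by that number minus 1.
    """
    scores_by_item = defaultdict(list)
    for _user, item, score in triples:
        scores_by_item[item].append(score)

    stats = {}
    for item, scores in scores_by_item.items():
        mean = sum(scores) / len(scores)
        variance = sum((s - mean) ** 2 for s in scores) / len(scores)
        stats[item] = (mean, sqrt(variance))
    return stats


# Hypothetical sample data: three users, two commodities.
sample_triples = [
    ("u1", "a", 5), ("u2", "a", 3), ("u3", "a", 4),
    ("u1", "b", 2), ("u2", "b", 2),
]
sample_stats = item_stats(sample_triples)
```

Commodity "b" has identical scores, so its spread is zero; any later division by the spread has to handle that case separately.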
The specific content of step A31 is:
the standardized score of commodity $\alpha$ is calculated according to the following formula:
$$r'_{i\alpha} = \frac{r_{i\alpha} - \mu_\alpha}{\sigma_\alpha}$$
where $r_{i\alpha}$ is the score given by user $i$ to commodity $\alpha$, $\mu_\alpha$ is the mean of the historical scores of commodity $\alpha$, and $\sigma_\alpha$ is the variance of the historical scores of commodity $\alpha$.
The specific content of step A33 is: counting the number of correct scores $s_i$ and the number of incorrect scores $f_i$ of user $i$, and then calculating the accuracy of the user's scores with the following formula:
$$\eta_i = \frac{s_i}{s_i + f_i}$$
where $\eta_i$ is the accuracy of the scores of user $i$.
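A minimal sketch of step A33, assuming the user's scores have already been standardized and that the scoring interval is the closed interval [-1, 1]. The function name `accuracy` and the sample values are illustrative only.

```python
def accuracy(normalized_scores, low=-1.0, high=1.0):
    """Return (s, f, eta): correct count, incorrect count and accuracy.

    A standardized score counts as correct when it lies inside [low, high].
    """
    if not normalized_scores:
        raise ValueError("the user has no scores")
    s = sum(1 for z in normalized_scores if low <= z <= high)
    f = len(normalized_scores) - s
    return s, f, s / (s + f)


# Hypothetical standardized scores of one user.
s_i, f_i, eta_i = accuracy([0.2, -1.0, 1.5, -2.3])
```

Here 0.2 and -1.0 fall inside the interval and 1.5 and -2.3 fall outside it, so the user has two correct and two incorrect scores.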
In step 3, the sum of the distances from each score of the user to the corresponding scoring interval is calculated as:
$$d_i = \frac{\sum_{\alpha \in O_i} \left| r'_{i\alpha} - bd \right| + 0.001}{f_i + 1}$$
where $bd$ is the boundary of the scoring interval determined in step A32, i.e. $-1$ or $1$: when $r'_{i\alpha} \le -1$, $bd = -1$; when $r'_{i\alpha} \ge 1$, $bd = 1$; otherwise $bd = r'_{i\alpha}$, so a correct score contributes 0; $d_i$ is the sum of the distances from each score of user $i$ to the corresponding scoring interval; $O_i$ is the set of commodities scored by user $i$; and $f_i$ is the number of incorrect scores of user $i$.
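The distance term can be sketched as follows. This is an interpretation rather than the patent's exact formula: it assumes the numerator is the summed absolute distance to the nearer boundary plus 0.001 and the denominator is the number of incorrect scores plus 1, as the detailed description of formula (6) suggests. The name `distance_sum` is hypothetical.

```python
def distance_sum(normalized_scores, low=-1.0, high=1.0):
    """Distance of a user's standardized scores to the interval [low, high].

    Assumed form: (sum of distances + 0.001) / (incorrect count + 1).
    Scores inside the interval contribute nothing.
    """
    total = 0.0
    incorrect = 0
    for z in normalized_scores:
        if z < low:
            total += low - z      # distance to the lower boundary
            incorrect += 1
        elif z > high:
            total += z - high     # distance to the upper boundary
            incorrect += 1
    return (total + 0.001) / (incorrect + 1)


# Hypothetical users: one with two outlying scores, one with none.
d_mixed = distance_sum([0.2, -1.0, 1.5, -2.3])
d_clean = distance_sum([0.0, 0.5])
```

With no incorrect scores the result is the small constant 0.001, which keeps a later reciprocal of the distance finite.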
In step 3, the range of the user's scores is calculated as:
$$V_i = \max_r t_{ir} - \min_r t_{ir}$$
where $V_i$ is the range of the scores of user $i$ and $t_{ir}$ is the number of times user $i$ gave the score $r$.
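A small sketch of the range calculation. It assumes a 5-point scale and that a score value the user never gave counts as zero occurrences; neither assumption is stated in the text. The name `score_range` is illustrative.

```python
from collections import Counter


def score_range(scores, levels=(1, 2, 3, 4, 5)):
    """Range of the per-level rating counts of one user.

    Assumption: levels the user never used count as 0 occurrences.
    """
    counts = Counter(scores)
    per_level = [counts.get(level, 0) for level in levels]
    return max(per_level) - min(per_level)


# Hypothetical users: one concentrated on the top score, one spread evenly.
v_concentrated = score_range([5, 5, 5, 5, 1])
v_even = score_range([1, 2, 3, 4, 5])
```

Under these assumptions a user who piles scores onto one value gets a large range, while an evenly spread user gets a range of 0.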
In step 4, the user reputation calculation formula is:
Figure BDA0003020452150000044
where $R_i$ is the reputation of user $i$, $d_i$ is the sum of the distances from each score of user $i$ to the corresponding scoring interval, $\eta_i$ is the accuracy of the scores of user $i$, and $V_i$ is the range of the scores of user $i$.
In step 5, the user reputations are sorted in ascending order with bubble sort; users with the same reputation are ordered by user number, with smaller numbers placed before larger ones.
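The ordering rule can be sketched with Python's built-in sort in place of bubble sort (a substitution made here for brevity; the resulting order is the same). The reputation values and the name `lowest_reputation_users` are hypothetical.

```python
def lowest_reputation_users(reputation, n):
    """Return the n user numbers with the lowest reputation.

    Ties in reputation are broken by user number, smaller numbers first.
    """
    ranked = sorted(reputation.items(), key=lambda pair: (pair[1], pair[0]))
    return [user for user, _value in ranked[:n]]


# Hypothetical reputations keyed by user number; users 2 and 3 are tied.
reputation = {3: 0.4, 1: 0.9, 2: 0.4, 4: 0.1}
suspects = lowest_reputation_users(reputation, 3)
```

Sorting on the pair (reputation, user number) makes the tie-break explicit instead of relying on input order.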
Recall is used to verify the performance of the method:
a certain number of random-scoring and extreme-scoring water army users are simulated and marked; the L users with the lowest reputation are obtained according to steps 1-5, and the proportion of the marked users that appear among these L users, i.e. the recall rate, is calculated:
$$R_c(L) = \frac{d'(L)}{d}$$
where $R_c(L)$ is the recall rate, $d'(L)$ is the number of marked artificial water army users among the selected L users, and $d$ is the number of artificial water army users that were set; the higher the value of $R_c(L)$, the more artificial water army users are recognized and the better the effect.
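The recall measure is simple enough to write down directly. In this sketch `ranked_users` is assumed to be already sorted from lowest to highest reputation, and all identifiers are hypothetical.

```python
def recall_at_l(ranked_users, marked_users, l):
    """Fraction of marked (artificial) users found among the first l users."""
    marked = set(marked_users)
    found = sum(1 for user in ranked_users[:l] if user in marked)
    return found / len(marked)


# Hypothetical ranking (lowest reputation first) and marked water army users.
ranking = ["s1", "u7", "s2", "u3", "s3"]
marked = ["s1", "s2", "s3", "s4"]
recall_3 = recall_at_l(ranking, marked, 3)
recall_5 = recall_at_l(ranking, marked, 5)
```

Two of the four marked users appear in the first three positions and three of them in the first five.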
The accuracy of the user reputations calculated in steps 1-5 is verified with the AUC: the reputation of each artificial water army user is compared in turn with the reputation of each other user, counting the number of times $N'$ that it is lower than the normal user's reputation and the number of times $N''$ that it is equal, and the AUC is calculated as:
$$AUC = \frac{N' + 0.5\,N''}{N}$$
where $N = N_1 \times N_2$, $N_1$ is the number of artificial water army users and $N_2$ is the number of other users;
when the reputations of the artificial water army are indistinguishable from those of normal users, the AUC is 0.5; when the reputations of the artificial water army are all lower than those of normal users, the AUC is 1; the AUC therefore lies in the interval [0.5, 1]. The higher the AUC, the more often the artificial water army has the lower reputation, i.e. the better it is identified.
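The pairwise AUC described above can be sketched as a double loop over every (water army user, normal user) pair. The function name `pairwise_auc` and the reputation values are illustrative assumptions.

```python
def pairwise_auc(spammer_reputations, normal_reputations):
    """AUC = (N' + 0.5 * N'') / (N1 * N2) over all spammer/normal pairs.

    N' counts pairs where the spammer's reputation is lower,
    N'' counts pairs where the two reputations are equal.
    """
    lower = 0
    equal = 0
    for s in spammer_reputations:
        for n in normal_reputations:
            if s < n:
                lower += 1
            elif s == n:
                equal += 1
    total_pairs = len(spammer_reputations) * len(normal_reputations)
    return (lower + 0.5 * equal) / total_pairs


# Hypothetical reputations: 5 pairs are "lower", 1 pair is "equal".
auc_value = pairwise_auc([0.1, 0.5], [0.5, 0.8, 0.9])
```

The loop is quadratic in the number of users; for large data sets a rank-based computation would be the usual alternative.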
With the above scheme, the invention has the following characteristics:
(1) the algorithm assumes that commodity scores follow a normal distribution and standardizes them with the Z-score, which divides the scores of each commodity into intervals and allows the accuracy and range of each user's scores to be calculated; finally, the sum of the distances from the user's scores to the scoring intervals of the corresponding commodities is calculated and combined with the accuracy and range to obtain the user's reputation. The larger a user's scoring distance and the lower the scoring accuracy, the more likely the user is a member of a water army group;
(2) tests of the method show that it performs well in calculating user reputation and filtering malicious users. In addition, the invention integrates a specific distribution assumption into the reputation calculation. Even if some water army users already exist in the data set, the artificially generated random water army group can still be screened out well;
(3) the method involves no iterative process, has low time complexity and is easy to extend, and it can be applied to fields such as fraud detection and water army identification.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is an example of the method of the present invention on small matrix data;
wherein (a) represents the original weighted bipartite graph, (b) the user-object scoring matrix, (c) the variance and mean of (b), (d) the new scoring matrix obtained by standardizing the matrix in (b) with the Z-score, (e) the scoring rationality matrix, which indicates whether each score in the new scoring matrix is correct, (f) the rationality statistics matrix, (g) the user scoring accuracy matrix, (h) the user distance matrix, (i) the user scoring count matrix, (j) the range matrix, and (k) the user reputation matrix;
FIG. 3 shows the recall rate of the present invention as a function of L on different data sets;
wherein (a) represents the extreme water army group and (b) the random water army group, the number of water army users is 100, and the abscissa is the number L of lowest-reputation users selected;
FIG. 4 compares the present invention with other reputation evaluation methods: recall rate as a function of L on different data sets, with 100 water army users and L ranging from 0 to 300;
wherein (a), (b) and (c) show the recall rate of each algorithm for the extreme water army group, and (d), (e) and (f) show the recall rate of each algorithm for the random water army group; p (the proportion of water army users among all users) ranges from 0 to 50 percent;
FIG. 5 compares the present invention with other reputation evaluation methods: recall rate as a function of the proportion of water army users on different data sets, with L equal to the number of water army users by default;
FIG. 6 compares the present invention with other reputation evaluation methods: AUC as a function of the proportion of water army users on the Netflix data set;
wherein (a) represents the extreme water army group and (b) the random water army group.
Detailed Description
The technical solution of the present invention will be described in detail below with reference to the accompanying drawings.
As shown in FIG. 1, the invention provides an e-commerce water army identification method based on interval segmentation, whose idea is as follows:
(1) Data are stored as triples. Because the user scoring matrix is sparse, a triple data structure G = {i, α, r} is defined, where i, α and r represent the user, the commodity and the score in turn. One triple records one scoring action by one user, so that all scoring actions are covered with a small amount of storage;
(2) The mean and variance of the historical scores of each commodity are calculated. Commodity scores are assumed to follow a normal distribution, and the mean and variance are calculated to prepare for standardization;
(3) The scoring matrix is standardized with the Z-score method, using the mean and variance of the historical scores, for further processing;
(4) User scores are classified. The standardized scores follow the standard normal distribution, for which 68.268949% of the area under the density curve lies within one standard deviation of the mean. If a standardized score lies in [-1, 1] it is a correct score; otherwise it is an incorrect score;
(6) The accuracy of the user's scores is calculated. According to the rationality of each score, the numbers of correct and incorrect scores of the user are counted and the accuracy of the user's scores is calculated;
(7) The sum of the user's scoring distances is calculated. Based on step (2), the standardized scores are used to calculate the sum of the distances from the user's scores to the correct scoring intervals. The larger this sum, the less reliable the user's scores;
(8) The range of the user's scores is calculated. Based on step (2), the range is calculated with the range formula; it is one of the indicators of user reputation;
(9) The user reputation is calculated. Because a user whose scores are unreasonable may still have a small sum of scoring distances, the sum must be corrected with the accuracy and range of the user's scores; the sum of distances combined with the accuracy and range serves as the reputation of the user;
(10) The water army is selected. The reputation values of all users obtained in step (9) are sorted in ascending order. If the number of artificially created water army users is N, the first N users are taken as the water army and compared with the actual artificial water army to obtain the identification accuracy of the method.
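The 68.268949% figure quoted in item (4) can be checked numerically. For a standard normal variable, the probability of falling within k standard deviations of the mean is erf(k / sqrt(2)); the helper name `normal_mass_within` is illustrative only.

```python
from math import erf, sqrt


def normal_mass_within(k):
    """Probability that a standard normal variable lies in [-k, k]."""
    return erf(k / sqrt(2.0))


share_within_one_sigma = normal_mass_within(1.0)
```

The same helper gives the share for any other interval half-width, which is useful if a scoring interval other than [-1, 1] is tried.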
Specifically, the embodiment mainly includes the following steps, which for ease of understanding are described together with the small-matrix test data in FIG. 2:
Step 1, data are stored as triples. Because the user scoring matrix is sparse, storing it as a matrix takes too much space, so a triple data structure G = {i, α, r} is defined whose elements represent the user, the commodity and the score in turn. One triple records one scoring action by one user, so all scoring actions are covered with a small amount of storage. For ease of illustration, however, FIG. 2(b) stores the user rating information as a matrix.
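To make the storage argument concrete, the sketch below keeps ratings as (user, commodity, score) triples and builds two lookup indexes from them. The layout, the helper name `build_indexes` and the sample data are assumptions for illustration, not the patent's data format.

```python
from collections import defaultdict

# Hypothetical ratings: (user, commodity, score).
ratings = [(0, 0, 5), (0, 2, 3), (1, 1, 4), (2, 2, 1)]


def build_indexes(triples):
    """Build per-user and per-commodity views of the same triples."""
    by_user = defaultdict(dict)
    by_item = defaultdict(dict)
    for user, item, score in triples:
        by_user[user][item] = score
        by_item[item][user] = score
    return by_user, by_item


by_user, by_item = build_indexes(ratings)

# A dense matrix would need one cell per (user, commodity) pair.
dense_cells = len(by_user) * len(by_item)
stored_triples = len(ratings)
```

In this toy case the triples hold 4 entries where a dense 3 x 3 matrix would hold 9 cells; the gap widens quickly as the data become sparser.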
Step 2, the user scoring matrix is initialized and user scores are assumed to be normally distributed. The mean $\mu_\alpha$ of the historical scores of each commodity is calculated first, as shown in formula (1):
$$\mu_\alpha = \frac{1}{|U_\alpha|} \sum_{i \in U_\alpha} r_{i\alpha} \tag{1}$$
In formula (1), $\mu_\alpha$ is the mean of the historical scores of commodity $\alpha$, $U_\alpha$ is the set of users who purchased commodity $\alpha$, $|U_\alpha|$ is the number of users who purchased commodity $\alpha$, and $r_{i\alpha}$ is the score given by user $i$ to commodity $\alpha$.
Step 3, the variance $\sigma_\alpha$ of the historical scores of the commodity is calculated (in square-root form, i.e. the standard deviation that the Z-score divides by). Based on the variance formula:
$$\sigma_\alpha = \sqrt{\frac{1}{|U_\alpha|} \sum_{i \in U_\alpha} \left(r_{i\alpha} - \mu_\alpha\right)^2} \tag{2}$$
In formula (2), $U_\alpha$ is the set of users who purchased commodity $\alpha$, $|U_\alpha|$ is the number of users who purchased commodity $\alpha$, $r_{i\alpha}$ is the score given by user $i$ to commodity $\alpha$, and $\mu_\alpha$ is the mean of the historical scores of commodity $\alpha$. FIG. 2(c) shows the mean and variance of the historical scores obtained with formulas (1) and (2) on the small matrix.
Step 4, with the mean and variance of the commodity historical scores available, the scoring matrix is standardized with the Z-score method to obtain $r'_{i\alpha}$:
$$r'_{i\alpha} = \frac{r_{i\alpha} - \mu_\alpha}{\sigma_\alpha} \tag{3}$$
In formula (3), $r'_{i\alpha}$ is the standardized commodity score, $r_{i\alpha}$ is the score given by user $i$ to commodity $\alpha$, $\mu_\alpha$ is the mean of the historical scores of commodity $\alpha$, and $\sigma_\alpha$ is the variance of the historical scores of commodity $\alpha$. After standardization the upper and lower limits of the rating scale no longer matter: scores on a 1-5 scale and on a 1-10 scale can both be standardized, which makes it easier to process more data. FIG. 2(d) shows the result of standardizing the original matrix (b) with formula (3); it is referred to as matrix B.
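The remark about rating scales can be illustrated with a short sketch: standardizing a set of scores and a linearly rescaled copy of them gives the same Z-scores. The helper `z_scores`, the sample data and the zero-spread fallback (returning 0 for every score) are assumptions made for this example.

```python
from math import sqrt


def z_scores(scores):
    """Standardize one commodity's scores (population standard deviation).

    Assumption: if all scores are identical the spread is 0, and every
    standardized score is taken to be 0.
    """
    mean = sum(scores) / len(scores)
    std = sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))
    if std == 0:
        return [0.0] * len(scores)
    return [(s - mean) / std for s in scores]


five_point = [1, 3, 4, 5, 2]
ten_point = [2 * s for s in five_point]  # same opinions on a doubled scale

z_five = z_scores(five_point)
z_ten = z_scores(ten_point)
```

Because the mean and the spread both scale with the ratings, the two standardized lists agree up to floating-point rounding.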
Step 5, based on the standardized user-object scoring matrix B obtained in step 4, the standardized scores of an object $O_\alpha$ are expected to follow the standard normal distribution N(0, 1).
Step 6, according to the analysis in step 5, user scores are divided into two categories: if the standardized score lies in [-1, 1] it is a correct score, otherwise it is an incorrect score; [-1, 1] is the scoring interval of this embodiment. This is recorded as:
$$\delta_{i\alpha} = \begin{cases} 1, & -1 \le r'_{i\alpha} \le 1 \\ 0, & \text{otherwise} \end{cases} \tag{4}$$
where $\delta_{i\alpha} = 1$ indicates that user $i$ scored commodity $\alpha$ correctly and $\delta_{i\alpha} = 0$ indicates that user $i$ scored commodity $\alpha$ incorrectly. In this way scoring rationality is expressed numerically, giving the scoring rationality matrix C of FIG. 2(e).
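Steps 2 to 6 can be walked through on a toy matrix. The data below are invented for illustration and are not the values of FIG. 2; `None` marks a commodity the user did not rate, the population standard deviation is assumed, and a commodity with zero spread is treated as having all scores correct.

```python
from math import sqrt

# Rows are users, columns are commodities; None means "not rated".
toy_ratings = [
    [5, 3, None],
    [4, 3, 2],
    [1, 3, 2],
    [5, None, 5],
]


def rationality_matrix(matrix, low=-1.0, high=1.0):
    """Return a matrix of 1 (correct), 0 (incorrect) or None (not rated)."""
    n_items = len(matrix[0])
    result = [[None] * n_items for _ in matrix]
    for a in range(n_items):
        column = [row[a] for row in matrix if row[a] is not None]
        mean = sum(column) / len(column)
        std = sqrt(sum((s - mean) ** 2 for s in column) / len(column))
        for i, row in enumerate(matrix):
            if row[a] is None:
                continue
            z = 0.0 if std == 0 else (row[a] - mean) / std
            result[i][a] = 1 if low <= z <= high else 0
    return result


toy_c = rationality_matrix(toy_ratings)
```

The third user's score of 1 on the first commodity and the fourth user's score of 5 on the third commodity fall outside the interval and are marked 0.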
Step 7, the accuracy $\eta_i$ of the user's scores is calculated from the result of step 6:
$$\eta_i = \frac{s_i}{s_i + f_i} \tag{5}$$
In formula (5), $\eta_i$ is the accuracy of the scores of user $i$, $s_i$ is the number of correct scores of user $i$, and $f_i$ is the number of incorrect scores of user $i$. FIG. 2(f) shows the counts of correct and incorrect scores, and FIG. 2(g) shows the resulting accuracy of the user's scores.
Step 8, the sum $d_i$ of the distances between the user's scores and the correct interval is calculated:
$$d_i = \frac{\sum_{\alpha \in O_i} \left| r'_{i\alpha} - bd \right| + 0.001}{f_i + 1} \tag{6}$$
In formula (6), $bd$ is the boundary of the scoring interval used in step 6 to decide whether a score is correct, i.e. 1 or -1: when $r'_{i\alpha} \le -1$, $bd = -1$; when $r'_{i\alpha} \ge 1$, $bd = 1$; otherwise $bd = r'_{i\alpha}$, so the distance is 0 when the score is correct and only the incorrect scores contribute; $d_i$ is the sum of the distances of the user's scores from the correct interval, $O_i$ is the set of commodities scored by user $i$, and $f_i$ is the number of incorrect scores of the user. The 0.001 added to the numerator prevents a zero denominator when the reciprocal of $d_i$ is taken; the 1 added to the denominator prevents division by zero when the user has no incorrect scores. $d_i$ measures how far the user deviates. The d matrix of FIG. 2(h) is the result.
Step 9, the range $V_i$ of the user's scores is calculated:
$$V_i = \max_r t_{ir} - \min_r t_{ir} \tag{7}$$
In formula (7), $\max_r t_{ir}$ is the largest and $\min_r t_{ir}$ the smallest number of times the user gave any one score value $r$.
Step 10, finally, the reputation $R_i$ of the user is obtained:
Figure BDA0003020452150000102
In formula (8), $d_i$ is the sum of the distances between the user's scores and the scoring intervals, $\eta_i$ is the accuracy of the user's scores, and $V_i$ is the range of the user's scores. $R_i$ is the reputation of the user.
Step 11, the algorithm is verified. In the experiment, extreme and random water army groups are created artificially and marked. The reputation values of all users are calculated with formula (8) and sorted in ascending order, the L users with the lowest reputation are taken and compared with the artificial water army, and the proportion of the marked water army found among these L users, i.e. the recall rate (Recall), is counted. The formula is:
$$R_c(L) = \frac{d'(L)}{d} \tag{9}$$
In formula (9), $R_c(L)$ is the recall rate, $d'(L)$ is the number of marked artificial water army users recognized among the first L users, and $d$ is the number of artificial water army users that were set. Since $d'(L)$ lies in $[0, d]$, $R_c(L)$ lies in $[0, 1]$; a higher value means that more artificial water army users are recognized, i.e. a better effect.
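The artificial groups used for verification could be generated along the following lines. This is only a sketch: the user naming scheme, the number of ratings per fake user, the 1-5 scale and the function name `make_water_army` are all assumptions, since the text does not specify how the artificial users were built.

```python
import random


def make_water_army(n_fake, items, ratings_per_user, kind, rng, scale=(1, 5)):
    """Generate (user, commodity, score) triples for artificial users.

    kind == "extreme": every score is the lowest or the highest value.
    kind == "random":  every score is drawn uniformly from the scale.
    """
    low, high = scale
    triples = []
    for k in range(n_fake):
        user = f"fake_{kind}_{k}"
        for item in rng.sample(items, ratings_per_user):
            if kind == "extreme":
                score = rng.choice((low, high))
            else:
                score = rng.randint(low, high)
            triples.append((user, item, score))
    return triples


rng = random.Random(0)            # fixed seed so runs are repeatable
catalogue = list(range(50))       # hypothetical commodity identifiers
extreme_triples = make_water_army(3, catalogue, 10, "extreme", rng)
random_triples = make_water_army(3, catalogue, 10, "random", rng)
```

The generated triples would be appended to the real ratings and their user identifiers kept as the marked set against which recall is measured.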
To verify the reliability of the experimental results further, the AUC (area under the ROC curve) is also used as an index of algorithm quality. The AUC can be interpreted as the probability that the reputation of a randomly selected water army user is lower than that of a randomly selected normal user. The reputation of each artificial water army user is compared in turn with the reputations of the other users, the number of times $N'$ that it is lower than a normal user's reputation and the number of times $N''$ that it is equal are counted, and the AUC is calculated as:
$$AUC = \frac{N' + 0.5\,N''}{N} \tag{10}$$
In formula (10), $N = N_1 \times N_2$, where $N_1$ is the number of artificial water army users and $N_2$ is the number of other users. When the reputations of the artificial water army are indistinguishable from those of normal users, the AUC is 0.5; when the reputations of the artificial water army are all lower than those of normal users, the AUC is 1; the AUC should therefore lie in [0.5, 1]. The higher the AUC, the more often the artificial water army has the lower reputation, i.e. the better the algorithm identifies it and the more robust it is.
The validity of the method is verified on real data. The data sets used in the experiments are an Amazon data set, a Netflix data set and a Movielens data set. The Amazon data set is available at https://snap.stanford.edu/data/#Amazon, the Netflix data set at www.netflixprize.com, and the Movielens data set at www.grouplens.org.
The network characteristics of each data set are shown in Table 1. From left to right the columns are: the number of users M, the number of commodities N, the number of scores I, the average user degree <Ku> (the average number of commodities scored by each user), the average commodity degree <Ko> (the average number of users who scored each commodity), and the data sparsity S (number of scores / (number of users × number of commodities)).
TABLE 1 Network characteristics of the three data sets
Data set    M      N        I         <Ku>   <Ko>   S
Movielens   943    1682     100000    106    59     0.06305
Netflix     5000   17768    3496614   699    169    0.03936
Amazon      9000   514105   1304872   145    3      0.0039
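As a rough illustration of how the quantities in Table 1 follow from the raw scores, the sketch below derives them from a list of (user, product, score) triples. The function name `network_characteristics` and the dictionary keys are hypothetical, and the sketch assumes each (user, product) pair appears at most once.

```python
def network_characteristics(ratings):
    """Compute M, N, I, <Ku>, <Ko> and S from (user, product, score) triples.

    Assumes each (user, product) pair occurs at most once in `ratings`.
    """
    ratings = list(ratings)
    if not ratings:
        raise ValueError("ratings must not be empty")
    users = {user for user, _product, _score in ratings}
    products = {product for _user, product, _score in ratings}
    m, n, i = len(users), len(products), len(ratings)
    return {
        "M": m,            # number of users
        "N": n,            # number of products
        "I": i,            # number of scores
        "Ku": i / m,       # average number of products scored per user
        "Ko": i / n,       # average number of users scoring each product
        "S": i / (m * n),  # data sparsity
    }
```

For the Movielens row, 100000 / (943 × 1682) ≈ 0.06305 and 100000 / 943 ≈ 106, in agreement with the table.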
The recall rates on the three data sets of Table 1 are plotted in FIG. 3. It can be seen that the method of the invention performs significantly differently on the three data sets for both the random and the extreme water army groups. In addition, comparing graphs (a) and (b) shows that the method of the invention is more effective against the extreme water army group than against the random water army group.
The recall rates of the method of the invention and of other reputation evaluation methods (GR, IGR, DR, IBR) on the different data sets are compared in FIG. 4, where the abscissa L is the number of lowest-reputation users selected and the ordinate is the recall rate. FIGS. 4(a), (b) and (c) show the extreme water army groups: the method of the invention outperforms the other four methods and is more robust on sparse data sets. FIGS. 4(d), (e) and (f) show the random water army groups: the method of the invention performs significantly better than the other algorithms, especially on the Amazon data set, and is very stable in screening out random water army accounts.
The recall rate of each method as a function of the water army group size is shown in FIG. 5. For the extreme water army groups, the recall rate hardly changes as the water army group grows, whereas the recall rates of the GR and IGR methods both decrease to different extents (a drawback of the group-partition approach that GR and IGR adopt). For the random water army groups, all methods follow the same growth trend, but the method of the invention grows faster early on and performs better.
FIGS. 6(a) and (b) show the AUC of each method for extreme and random water army groups of different sizes. In (a), the AUC of the IBR method and of the method of the invention is consistent, whereas GR and IGR both decrease markedly (consistent with the behavior in FIG. 5). In (b), the AUC of every algorithm decreases slightly as the random water army group grows, but the method of the invention remains stable and outperforms the other methods.
The above analysis shows that the invention has the following advantages. First, tests on three data sets show that the algorithm strikes a good balance between screening extreme malicious users and random malicious users, with good accuracy and robustness. Second, the algorithm screens slightly better than the other algorithms both on data sets with many malicious users and on data sets with few.
The above embodiments are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modifications made on the basis of the technical scheme according to the technical idea of the present invention fall within the protection scope of the present invention.

Claims (6)

1. An e-commerce water army identification method based on interval segmentation is characterized by comprising the following steps:
step 1, defining a triple data structure G = {i, α, r}, whose elements respectively represent users, commodities and scores;
step 2, assuming that the scores of the users obey normal distribution, and calculating the mean value and the variance of the historical scores of the commodities;
step 3, respectively calculating the accuracy of the user's scores, the sum of the distances from each of the user's scores to the corresponding score interval, and the range of the user's scores;
in the step 3, the method for calculating the accuracy of the user score includes:
step A31, standardizing the user score by using a Z-score method;
step A32, classifying the scores after the standardization treatment, and determining a score interval, wherein the scores in the score interval are considered to be correct scores, otherwise, the scores are considered to be incorrect scores;
step A33, counting the number of times the user scores correctly and incorrectly, and thereby calculating the accuracy of the user's scores;
step 4, combining the accuracy and range of the user's scores with the sum of the distances from each of the user's scores to the corresponding score interval to obtain the reputation of the user;
in step 4, the user reputation calculation formula is:
Figure FDA0003720493490000011
wherein R_i represents the reputation of user i, d_i represents the sum of the distances from each score of user i to the corresponding score interval, η_i represents the accuracy of user i's scores, and v_i represents the range of user i's scores;
step 5, sorting the user credit, and selecting the first N users with the lowest credit as water army, wherein N is a set value;
the specific content of step A33 is as follows: counting the number of times s_i that user i scores correctly and the number of times f_i that user i scores incorrectly, and then calculating the accuracy of the user's scores using the following formula:
$$\eta_i = \frac{s_i}{s_i + f_i}$$
wherein η_i represents the accuracy of user i's scores;
in step 3, the formula for calculating the sum of the distances from each score of the user to the corresponding score interval is as follows:
Figure FDA0003720493490000021
wherein bd represents the boundary of the score interval determined in step A32, that is, −1 or 1: when r'_iα ≤ −1, bd = −1; when r'_iα ≥ 1, bd = 1; otherwise bd = r'_iα; d_i represents the sum of the distances from each score of user i to the corresponding score interval; O_i represents the set of commodities α scored by user i; f_i represents the number of times user i scores incorrectly;
in step 3, the formula for calculating the range of the user score is as follows:
Figure FDA0003720493490000022
wherein v_i represents the range of user i's scores, and t_ir represents the number of times user i gave the score r.
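Steps A32 and A33 and the per-score distance to the score interval can be illustrated with the short Python sketch below. It is a sketch under stated assumptions, not the claimed formulas: the score interval is taken to be the closed interval [−1, 1] on standardized scores, the accuracy is taken to be s_i / (s_i + f_i), and the function name `score_accuracy_and_distances` is hypothetical. How the distances are aggregated into d_i, how the range v_i is computed, and how the reputation R_i is formed are given by the formula images and are not reproduced here.

```python
def score_accuracy_and_distances(z_scores, lower=-1.0, upper=1.0):
    """Classify one user's standardized scores and measure distance to the interval.

    Returns (accuracy, incorrect_count, distances). Assumes the score interval
    is the closed interval [lower, upper] and accuracy = correct / total.
    """
    correct = 0
    distances = []
    for z in z_scores:
        if lower <= z <= upper:
            correct += 1
            bd = z          # inside the interval: distance is zero
        elif z < lower:
            bd = lower      # below the interval: nearest boundary is the lower one
        else:
            bd = upper      # above the interval: nearest boundary is the upper one
        distances.append(abs(z - bd))
    incorrect = len(z_scores) - correct
    accuracy = correct / len(z_scores) if z_scores else 0.0
    return accuracy, incorrect, distances
```

For example, a user whose standardized scores are 0.0, −1.5, 2.0 and 1.0 has two correct and two incorrect scores, and only the two incorrect scores contribute non-zero distances (0.5 and 1.0).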
2. The interval segmentation-based e-commerce water army identification method of claim 1, wherein: in step 2, the mean value μ_α of the historical scores of commodity α is calculated according to the following formula:
$$\mu_\alpha = \frac{1}{|U_\alpha|} \sum_{i \in U_\alpha} r_{i\alpha}$$
wherein U_α represents the set of users who purchased commodity α, |U_α| represents the number of users who purchased commodity α, and r_iα represents the score given by user i to commodity α;
the variance σ_α of the historical scores of commodity α is calculated according to the following formula:
Figure FDA0003720493490000024
wherein U_α represents the set of users who purchased commodity α, |U_α| represents the number of users who purchased commodity α, r_iα represents the score given by user i to commodity α, and μ_α represents the mean of the historical scores of commodity α.
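The per-commodity statistics of step 2 can be sketched in Python as follows. The sketch assumes the population form of the variance (dividing by |U_α|), since |U_α| is the only count defined in the claim; the formula image may differ. The function name `product_stats` is hypothetical.

```python
from collections import defaultdict


def product_stats(ratings):
    """Return {product: (mean, variance)} from (user, product, score) triples.

    Assumption: population variance, i.e. the squared deviations are
    divided by the number of users who scored the product.
    """
    scores_by_product = defaultdict(list)
    for _user, product, score in ratings:
        scores_by_product[product].append(score)

    stats = {}
    for product, scores in scores_by_product.items():
        mean = sum(scores) / len(scores)
        variance = sum((s - mean) ** 2 for s in scores) / len(scores)
        stats[product] = (mean, variance)
    return stats
```

A commodity with a single score gets a variance of zero under this convention, so any later division by its spread needs a guard.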
3. The interval segmentation-based e-commerce water army identification method of claim 1, wherein: the specific content of step A31 is as follows:
the standardized score of commodity α is calculated according to the following formula:
$$r'_{i\alpha} = \frac{r_{i\alpha} - \mu_\alpha}{\sigma_\alpha}$$
wherein r_iα represents the score given by user i to commodity α, μ_α represents the mean of the historical scores of commodity α, and σ_α represents the variance of the historical scores of commodity α.
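A minimal Python sketch of the Z-score standardization in step A31 follows. Two points are assumptions of the sketch rather than statements of the claim: the spread argument `sigma` is treated as a standard deviation, as is conventional for a Z-score (the claim text calls σ_α a variance), and a commodity with zero spread is mapped to 0.0. The function name `standardize` is hypothetical.

```python
def standardize(score, mean, sigma):
    """Z-score of one score relative to a commodity's historical scores.

    `sigma` is treated as the standard deviation (an assumption); when it
    is zero, every score equals the mean, so 0.0 is returned by convention.
    """
    if sigma == 0:
        return 0.0
    return (score - mean) / sigma
```

With a mean of 3 and a spread of 2, a score of 5 maps to 1.0 and a score of 1 maps to −1.0, which are exactly the interval boundaries used in step A32.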
4. The interval segmentation-based e-commerce water army identification method of claim 1, wherein: in step 5, the user reputations are sorted from small to large using the bubble sort method; users with the same reputation are ordered by user number, with larger user numbers placed ahead of smaller user numbers.
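The sorting of step 5 can be sketched as a bubble sort in Python. The tie-break direction used here (larger user number first) is one reading of the claim text and should be treated as an assumption, as should the function name `bubble_sort_by_reputation` and the (user_number, reputation) tuple layout.

```python
def bubble_sort_by_reputation(users):
    """Sort (user_number, reputation) pairs, lowest reputation first.

    Ties are broken by user number, larger number first (assumed reading).
    Returns a new list; the input is left unchanged.
    """
    ordered = list(users)

    def comes_before(a, b):
        if a[1] != b[1]:
            return a[1] < b[1]   # lower reputation first
        return a[0] > b[0]       # tie: larger user number first

    for end in range(len(ordered) - 1, 0, -1):
        swapped = False
        for j in range(end):
            if comes_before(ordered[j + 1], ordered[j]):
                ordered[j], ordered[j + 1] = ordered[j + 1], ordered[j]
                swapped = True
        if not swapped:
            break  # already sorted, stop early
    return ordered
```

The first N entries of the returned list, `bubble_sort_by_reputation(users)[:n]`, are then the candidates flagged as water army. Bubble sort is quadratic in the number of users; any comparison sort with the same ordering key would give the same result.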
5. The interval segmentation-based e-commerce water army identification method of claim 1, wherein: the recall rate is used to evaluate performance:
simulating a certain number of random-scoring water army groups and extreme-scoring water army groups and marking them, obtaining the L users with the lowest reputation by calculation according to steps 1-5, and calculating the proportion of the marked users that appear among these L users, namely the recall rate:
$$R_L = \frac{d_L}{d}$$
wherein R_L represents the recall rate, d_L represents the number of marked artificial water army accounts among the selected L users, and d represents the total number of artificial water army accounts that were set; a higher value of R_L means that more of the artificial water army is identified and the effect is better.
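The recall rate can be computed with a few lines of Python, sketched below under the assumption that R_L = d_L / d. The function name `recall_at_l` and its arguments are hypothetical; `ranked_users` is assumed to be ordered from lowest to highest reputation.

```python
def recall_at_l(ranked_users, labelled_spammers, l):
    """Fraction of the planted water army found among the l lowest-reputation users.

    `ranked_users` must already be sorted from lowest to highest reputation.
    """
    if not labelled_spammers:
        raise ValueError("at least one labelled account is required")
    spammers = set(labelled_spammers)
    d_l = sum(1 for user in ranked_users[:l] if user in spammers)  # d_L
    return d_l / len(spammers)                                     # d_L / d
```

With three planted accounts, two of which fall in the three lowest-reputation positions, the recall at L = 3 is 2/3.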
6. The interval segmentation-based e-commerce water army identification method of claim 1, wherein: the accuracy of the user reputations calculated in steps 1-5 is verified using the AUC: the reputation of each artificial water army account is compared in turn with the reputations of the other users, the number of times N' that it is lower than a normal user's reputation and the number of times N'' that it is equal to a normal user's reputation are counted, and the AUC is calculated according to the following formula:
$$\mathrm{AUC} = \frac{N' + 0.5\,N''}{N}$$
wherein N = N_1 × N_2, N_1 is the number of artificial water army accounts, and N_2 is the number of other users;
when the reputations of the artificial water army are indistinguishable from those of randomly chosen normal users, the AUC is 0.5; when the reputations of the artificial water army are all lower than those of normal users, the AUC is 1; the AUC therefore lies in the interval [0.5, 1]; a higher AUC means that the artificial water army's reputation is lower in more of the comparisons, that is, the identification is better.
CN202110401328.4A 2021-04-14 2021-04-14 E-commerce water force identification method based on interval segmentation Active CN113674045B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110401328.4A CN113674045B (en) 2021-04-14 2021-04-14 E-commerce water force identification method based on interval segmentation

Publications (2)

Publication Number Publication Date
CN113674045A CN113674045A (en) 2021-11-19
CN113674045B true CN113674045B (en) 2022-08-26

Family

ID=78538076

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110401328.4A Active CN113674045B (en) 2021-04-14 2021-04-14 E-commerce water force identification method based on interval segmentation

Country Status (1)

Country Link
CN (1) CN113674045B (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11126679B2 (en) * 2018-02-08 2021-09-21 Arizona Board Of Regents On Behalf Of Arizona State University Systems and methods for detecting pathogenic social media accounts without content or network structure
CN109636181A (en) * 2018-12-11 2019-04-16 北京首汽智行科技有限公司 A kind of user credit divides calculation method and system
CN111275526B (en) * 2020-01-20 2021-04-13 南京财经大学 E-commerce water force identification method based on range difference
CN111242647B (en) * 2020-01-20 2021-04-13 南京财经大学 Method for identifying malicious user based on E-commerce comment

Also Published As

Publication number Publication date
CN113674045A (en) 2021-11-19

Similar Documents

Publication Publication Date Title
TWI662421B (en) Community division method and device based on feature matching network
CN111275526B (en) E-commerce water force identification method based on range difference
CN110930218B (en) Method and device for identifying fraudulent clients and electronic equipment
CN108648038B (en) Credit frying and malicious evaluation identification method based on subgraph mining
CN109948724A (en) A kind of electric business brush single act detection method based on improvement LOF algorithm
CN114168761B (en) Multimedia data pushing method and device, electronic equipment and storage medium
CN110992041A (en) Individual behavior hypersphere construction method for online fraud detection
CN112437053A (en) Intrusion detection method and device
CN111353529A (en) Mixed attribute data set clustering method for automatically determining clustering center
Pugazhenthi et al. Selection of optimal number of clusters and centroids for k-means and fuzzy c-means clustering: A review
CN113674045B (en) E-commerce water force identification method based on interval segmentation
CN108470065A (en) A kind of determination method and device of exception comment text
CN117035983A (en) Method and device for determining credit risk level, storage medium and electronic equipment
Santos et al. Bayesian Method with Clustering Algorithm for Credit Card Transaction Fraud Detection.
CN111914930A (en) Density peak value clustering method based on self-adaptive micro-cluster fusion
CN112288561A (en) Internet financial fraud behavior detection method based on DBSCAN algorithm
CN111311276A (en) Abnormal user group identification method, identification device and readable storage medium
CN107423319B (en) Junk web page detection method
Liao et al. Accumulative Time Based Ranking Method to Reputation Evaluation in Information Networks
CN111242647B (en) Method for identifying malicious user based on E-commerce comment
CN106779843A (en) A kind of competing method and apparatus for closing relationship analysis of trade company based on customer group's feature
CN108280766B (en) Transaction behavior risk identification method and device
CN113111935A (en) Same transaction subject judgment method based on transaction data real-time clustering in bulk commodity electronic commerce market
CN116029753A (en) Reputation ordering method based on user rating mode and rating deviation
KR20200113397A (en) Method of under-sampling based ensemble for data imbalance problem

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230506

Address after: 230000 Room 203, building 2, phase I, e-commerce Park, Jinggang Road, Shushan Economic Development Zone, Hefei City, Anhui Province

Patentee after: Hefei Jiuzhou Longteng scientific and technological achievement transformation Co.,Ltd.

Address before: 210023 No. 3 Wenyuan Road, Qixia District, Nanjing City, Jiangsu Province

Patentee before: NANJING University OF FINANCE AND ECONOMICS

TR01 Transfer of patent right

Effective date of registration: 20230529

Address after: 605, Floor 6, Building 1, Yard 3, Boda Road, Chaoyang District, Beijing, 100020

Patentee after: Beijing Sifang Hengtong Network Information Co.,Ltd.

Address before: 230000 Room 203, building 2, phase I, e-commerce Park, Jinggang Road, Shushan Economic Development Zone, Hefei City, Anhui Province

Patentee before: Hefei Jiuzhou Longteng scientific and technological achievement transformation Co.,Ltd.
