Background
In the Internet era, user ratings play an increasingly important role, especially in e-commerce. Many e-commerce platforms (such as Taobao and Jingdong) therefore provide rating systems in which users can score commodities. Because the scores a commodity receives influence the purchasing decisions of other users, it is important that the scores be truthful and valid. However, rating systems also contain malicious users whose scores are unreliable. One kind is the paid poster (a so-called "water army") hired by a merchant to give high scores to that merchant's goods and low scores to competitors' goods; the other kind scores at random. The scores of these malicious users distort the rating system, so a method is needed to screen out malicious users, reduce the influence of malicious scores, and make the ratings more reliable.
To screen out malicious users, each user can be assigned a reputation score according to the user's rating behavior, and users with low reputation scores are treated as malicious users. There are various methods for determining the reputation score, and the choice of method determines the quality of the algorithm.
Many algorithms for detecting malicious users have been proposed. One class is based on commodity quality. For example, the Correlation-based Ranking (CR) algorithm estimates the quality of each commodity from its scores, calculates the correlation between a user's scores and the estimated quality, derives the user's reputation from that correlation, and regards users with low reputation as malicious. Other algorithms based on commodity quality include the IR (Iterative Ranking) algorithm, the RR (Iterative Ranking) algorithm, and the IBM (Iterative Balance Model) algorithm. However, users' scores may differ from the real quality of a commodity and cannot stand in for it, so algorithms based on commodity quality are not entirely reasonable.
Another class is group-based, such as the GR (Group-based Ranking) algorithm, in which a user who gives a commodity the same score as most other users is considered more trustworthy and receives a higher reputation score. The IGR (Iterative Group-based Ranking) algorithm adds the iterative idea of the IR algorithm to the GR algorithm, which improves its effectiveness, but there is still room for improvement.
A further class is based on the distribution of a user's scores, such as the DR (development-based Ranking) algorithm, which assumes that a user's scores follow a normal distribution, and the BR (Bayesian reporting) algorithm, which assumes that they follow a beta distribution. However, score distributions differ from user to user and do not strictly follow any single distribution, so the stability of these algorithms is poor.
Disclosure of Invention
The invention aims to provide a method for identifying malicious users based on E-commerce comments, which can improve the accuracy of screening the malicious users and increase the stability of calculation in the identification process.
In order to achieve the above purpose, the solution of the invention is:
a method for identifying malicious users based on E-commerce comments comprises the following steps:
step 1, constructing triplets $G[U_i, O_\alpha, \omega_s]$ for storing rating data, wherein $U_i$ represents user $i$, $i \in \{1, \dots, m\}$, and $m$ is the number of users; $O_\alpha$ represents commodity $\alpha$, $\alpha \in \{1, \dots, n\}$, and $n$ is the number of commodities; and $\omega_s$ represents the score $s$;
step 2, initializing the reputation of all users to 1;
step 3, calculating the weighted group size of each commodity under each score, wherein the weight is the reputation of each user who rated the commodity;
step 4, calculating the proportion matrix of each group within each commodity according to the weighted group sizes obtained in step 3;
step 5, mapping the proportion matrix obtained in step 4, whose rows and columns correspond to commodities and scores respectively, into a matrix whose rows and columns correspond to users and commodities respectively;
step 6, calculating, for each user, the average value and the standard deviation of the proportions of the groups to which the user belongs, and calculating the standard deviation of the user's scores;
step 7, calculating the user reputation from the quantities calculated in step 6;
step 8, repeating steps 3-7 with the user reputation obtained in step 7, and calculating the difference between the user reputations of the two successive rounds; if the difference is greater than a threshold, repeating steps 3-7 with the new user reputation, and ending the iteration once the change in user reputation is less than the threshold;
step 9, sorting the finally obtained user reputations, and taking the L users with the lowest reputations as malicious users.
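In the above step 3, the weighted group size $\Lambda_{s\alpha}$ of commodity $\alpha$ at score $s$ is calculated according to the following formula:
$$\Lambda_{s\alpha} = \sum_{i:\,\omega_{i\alpha} = s} R_i$$
wherein $\omega_{i\alpha}$ is the score given by user $i$ to commodity $\alpha$, and $R_i$ is the reputation of a user $i$ who gave commodity $\alpha$ the score $s$.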
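In the step 4, the proportion matrix of each group within each commodity is calculated according to the following formula:
$$\Lambda^{*}_{s\alpha} = \frac{\Lambda_{s\alpha}}{\sum_{s'} \Lambda_{s'\alpha}}$$
wherein $\Lambda_{s\alpha}$ is the weighted group size of commodity $\alpha$ at score $s$, and the sum in the denominator runs over all scores $s'$ of commodity $\alpha$.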
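The specific method of the step 5 is as follows:
$$A'_{i\alpha} = \Lambda^{*}_{s\alpha}, \quad \text{if } \omega_{i\alpha} = s$$
wherein, when the score of user $i$ for commodity $\alpha$ is $s$, the entry $\Lambda^{*}_{s\alpha}$ of the commodity-score proportion matrix is mapped into the user-commodity proportion matrix as $A'_{i\alpha}$; if user $i$ has not scored commodity $\alpha$, $A'_{i\alpha}$ has no value.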
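In the step 6, the average value of the proportions of the groups to which user $i$ belongs is calculated according to the following formula:
$$\mu(A'_i) = \frac{\sum_{\alpha} A'_{i\alpha}}{k_i}$$
wherein $k_i$ is the degree of user $i$, and the sum runs over the commodities rated by user $i$.
In the step 6, the standard deviation of the proportions of the groups to which user $i$ belongs is calculated according to the following formula:
$$\sigma(A'_i) = \sqrt{\frac{\sum_{\alpha} \left(A'_{i\alpha} - \mu(A'_i)\right)^2}{k_i}}$$
wherein $k_i$ is the degree of user $i$.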
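In the step 6, the specific process of calculating the standard deviation of the user's scores is as follows:
first, the average score of each user is calculated:
$$\mu(\omega_i) = \frac{\sum_{\alpha} \omega_{i\alpha}}{k_i}$$
wherein $\omega_{i\alpha}$ is the score given by user $i$ to commodity $\alpha$, and $k_i$ is the degree of user $i$;
the standard deviation of the scores is then calculated:
$$\sigma(\omega_i) = \sqrt{\frac{\sum_{\alpha} \left(\omega_{i\alpha} - \mu(\omega_i)\right)^2}{k_i}}$$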
In step 7, the user reputation is calculated according to the following formula:
wherein $\mu(A'_i)$, $\sigma(A'_i)$ and $\sigma(\omega_i)$ are, respectively, the average value of the proportions of the groups to which user $i$ belongs, the standard deviation of those proportions, and the standard deviation of the user's scores.
In step 8, the difference between the user reputations of two successive rounds is calculated as follows:
$$\Delta = |R - R'| = \frac{\sum_{i} (R_i - R'_i)^2}{m}$$
wherein $\Delta$ is the difference, $R$ and $R'$ are the user reputations obtained in the two rounds, and $R_i$ and $R'_i$ are the corresponding two reputations of user $i$.
With the above scheme, the standard deviation of a user's scores is added to the evaluation criterion, on the basis that the scores of normal users are generally concentrated: the standard deviation of a normal user's scores is relatively small, while that of a malicious user's scores is relatively large. Adding the standard deviation of the scores to the denominator therefore increases the reputation scores of normal users and reduces those of malicious users. As a result, the method improves the accuracy of detection both for extreme malicious users and for randomly scoring malicious users, and its performance remains stable and robust as the number of malicious users and the data volume increase. The method can be applied to detecting malicious users of e-commerce websites and to improving e-commerce rating systems.
Detailed Description
The technical solution and the advantages of the present invention will be described in detail with reference to the accompanying drawings.
The invention provides a method for identifying malicious users based on E-commerce comments, which mainly comprises the following four stages:
firstly, storing the rating data as triplets; compared with an array, triplet storage saves space when the data volume is large and the rating data is sparse;
secondly, calculating the weighted group size of each commodity under each score, with user reputation as the weight, and calculating the proportion of each score group within each commodity;
then, obtaining each user's reputation from the average value and standard deviation of the proportions of the groups to which the user belongs and from the standard deviation of the user's scores, and iterating the calculation until the reputations stabilize; in this stage, including the standard deviation of the user's scores in the reputation calculation raises the reputation of normal users, lowers that of malicious users, and so improves the accuracy of detecting malicious users;
and finally, taking the L users with the lowest reputations as malicious users.
As shown in fig. 1, the specific process of the present invention is as follows:
Step 1, constructing triplets $G[U_i, O_\alpha, \omega_s]$ for storing rating data, wherein $U_i$ represents user $i$, $O_\alpha$ represents commodity $\alpha$, and $\omega_s$ represents the score $s$.
Here $i$ is the index of the user, $i \in \{1, \dots, m\}$, where $m$ is the number of users; $\alpha$ is the index of the commodity, $\alpha \in \{1, \dots, n\}$, where $n$ is the number of commodities. Each triplet stores one rating. The degree of a user is denoted $k_i$, and the degree of a commodity is denoted $k_\alpha$.
Step 2, initializing the reputation of all users to 1, that is, all users are given the same weight at the start.
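As an illustration of the triplet storage in step 1, the following is a minimal sketch, not the implementation of the invention. The name `Rating` and the toy data are hypothetical; the degrees $k_i$ and $k_\alpha$ are obtained simply by counting triplets.

```python
from collections import Counter
from typing import NamedTuple

class Rating(NamedTuple):
    """One triplet G[U_i, O_alpha, omega_s] (hypothetical container name)."""
    user: int   # i, index of user U_i
    item: int   # alpha, index of commodity O_alpha
    score: int  # s, the score omega_s

# Hypothetical toy data: one triplet per rating actually given,
# so unrated (user, commodity) pairs take no storage.
ratings = [
    Rating(1, 1, 5), Rating(1, 2, 4),
    Rating(2, 1, 5), Rating(2, 2, 1),
    Rating(3, 2, 4),
]

# Degree of a user (k_i): number of commodities the user has rated.
user_degree = Counter(r.user for r in ratings)
# Degree of a commodity (k_alpha): number of users who rated it.
item_degree = Counter(r.item for r in ratings)
```

With 3 users and 2 commodities a dense array would hold 6 cells, whereas the list above holds only the 5 ratings that exist; the saving grows as the data becomes sparser.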
Step 3, calculating the weighted group size of each commodity under each score, wherein the weight is the reputation of each user who rated the commodity; the calculation is shown in formula (1):
$$\Lambda_{s\alpha} = \sum_{i:\,\omega_{i\alpha} = s} R_i \qquad (1)$$
wherein $\omega_{i\alpha}$ is the score given by user $i$ to commodity $\alpha$ and $R_i$ is the reputation of a user $i$ who gave commodity $\alpha$ the score $s$; the group size is defined as the sum of the reputations of all users who gave commodity $\alpha$ the score $s$.
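The weighted group size of step 3 can be sketched as follows. This is an illustrative example under assumptions: the function name `weighted_group_sizes`, the plain `(user, item, score)` tuples, and the toy data are hypothetical and do not come from the original text.

```python
from collections import defaultdict

def weighted_group_sizes(ratings, reputation):
    """Return {(item, score): sum of reputations of users who gave that score}."""
    sizes = defaultdict(float)
    for user, item, score in ratings:
        # Each rater contributes their current reputation to the group
        # formed by the commodity and the score they gave it.
        sizes[(item, score)] += reputation[user]
    return dict(sizes)

# Hypothetical toy data: (user, item, score) triplets.
ratings = [(1, 1, 5), (2, 1, 5), (3, 1, 1), (1, 2, 4), (2, 2, 1), (3, 2, 4)]
reputation = {1: 1.0, 2: 1.0, 3: 1.0}  # step 2: all reputations start at 1
sizes = weighted_group_sizes(ratings, reputation)
```

With all reputations equal to 1, each group size is just the number of users in the group; in later iterations users with low reputation contribute less.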
Step 4, calculating the proportion matrix of each group within each commodity according to the weighted group sizes obtained in step 3, as shown in formula (2):
$$\Lambda^{*}_{s\alpha} = \frac{\Lambda_{s\alpha}}{\sum_{s'} \Lambda_{s'\alpha}} \qquad (2)$$
wherein the numerator of formula (2) is the weighted group size of commodity $\alpha$ under score $s$. The IGR algorithm uses the degree $k_\alpha$ of commodity $\alpha$ as the denominator; in this embodiment the denominator is the sum of the weighted group sizes over all scores of commodity $\alpha$, which works better and makes the resulting matrix a true proportion matrix, since the proportions of each commodity sum to 1.
The larger the proportion of the group into which a user's score falls, the closer that score is to the score given by the majority, and the more reliable it is.
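The normalization of step 4 can be sketched as below. The function name `group_proportions` and the input values are hypothetical, and the sketch assumes every commodity has a positive total weighted group size.

```python
def group_proportions(sizes):
    """Divide each weighted group size by the total over all scores of its commodity."""
    totals = {}
    for (item, _score), size in sizes.items():
        totals[item] = totals.get(item, 0.0) + size
    # Assumes totals[item] > 0, i.e. at least one rater with positive reputation.
    return {
        (item, score): size / totals[item]
        for (item, score), size in sizes.items()
    }

# Hypothetical weighted group sizes keyed by (item, score).
sizes = {(1, 5): 2.0, (1, 1): 1.0, (2, 4): 1.5, (2, 1): 0.5}
proportions = group_proportions(sizes)
```

Because the denominator is the sum of the weighted sizes rather than the raw degree of the commodity, the proportions of each commodity add up to 1 even when reputations are no longer all equal to 1.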
Step 5, matrix mapping:
the rows and columns of the proportion matrix obtained in step 4 correspond to commodities and scores respectively, and the matrix needs to be mapped to a matrix of users and commodities, as shown in formula (3):
$$A'_{i\alpha} = \Lambda^{*}_{s\alpha}, \quad \text{if } \omega_{i\alpha} = s \qquad (3)$$
In formula (3), when the score of user $i$ for commodity $\alpha$ is $s$, the entry $\Lambda^{*}_{s\alpha}$ of the commodity-score proportion matrix is mapped into the user-commodity proportion matrix as $A'_{i\alpha}$; if user $i$ has not scored commodity $\alpha$, $A'_{i\alpha}$ has no value.
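The mapping of step 5 amounts to a lookup per rating, as in the following sketch. The name `map_to_user_item` and the nested-dictionary layout are assumptions made for illustration; unrated commodities are simply absent, which mirrors "no value".

```python
def map_to_user_item(ratings, proportions):
    """Build {user: {item: proportion of the group the user's score falls in}}."""
    mapped = {}
    for user, item, score in ratings:
        # Look up the proportion of the (commodity, score) group this rating
        # belongs to and store it under the (user, commodity) pair.
        mapped.setdefault(user, {})[item] = proportions[(item, score)]
    return mapped

# Hypothetical toy data.
ratings = [(1, 1, 5), (2, 1, 5), (3, 1, 1)]
proportions = {(1, 5): 2.0 / 3.0, (1, 1): 1.0 / 3.0}
mapped = map_to_user_item(ratings, proportions)
```

Users 1 and 2 agree with the majority on commodity 1 and receive 2/3, while user 3 receives 1/3.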
Step 6, calculating the average value of the group proportions, as shown in formula (4):
$$\mu(A'_i) = \frac{\sum_{\alpha} A'_{i\alpha}}{k_i} \qquad (4)$$
In formula (4), the numerator is the total of the proportions over all commodities rated by the user, and the denominator is the degree of the user, that is, the number of commodities the user has rated. The larger the average value, the more often the user's scores agree with those of the majority.
The standard deviation of the proportions of the groups to which the user belongs is calculated as shown in formula (5):
$$\sigma(A'_i) = \sqrt{\frac{\sum_{\alpha} \left(A'_{i\alpha} - \mu(A'_i)\right)^2}{k_i}} \qquad (5)$$
In formula (5), $\mu(A'_i)$ is obtained from formula (4). A larger standard deviation indicates that the proportion of the groups into which the user's scores fall is unstable: sometimes the user's score matches the majority score and sometimes it deviates from it, so the user's reputation should be smaller.
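The two statistics of step 6 can be sketched as follows for a single user. The function name `mean_and_std` and the sample values are hypothetical, and dividing by $k_i$ (a population standard deviation) is an assumption based on the statement that $k_i$ is the degree of the user.

```python
import math

def mean_and_std(values):
    """Mean and population standard deviation of one user's group proportions."""
    k = len(values)  # k_i: number of commodities the user rated
    mean = sum(values) / k
    # ASSUMPTION: the variance is divided by k_i, not k_i - 1.
    variance = sum((v - mean) ** 2 for v in values) / k
    return mean, math.sqrt(variance)

# Hypothetical proportions A'_{i,alpha} for one user over three commodities.
mean, std = mean_and_std([0.8, 0.6, 0.7])
```

A user whose ratings always land in large groups gets a high mean and a low standard deviation, which is the profile of a trustworthy rater.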
Step 7, calculating the standard deviation of the user's scores:
first, the average score of each user is calculated as shown in formula (6):
$$\mu(\omega_i) = \frac{\sum_{\alpha} \omega_{i\alpha}}{k_i} \qquad (6)$$
In formula (6), $\omega_{i\alpha}$ represents the score given by user $i$ to commodity $\alpha$, the numerator is the sum of the scores given by user $i$ over all commodities the user has rated, and the denominator $k_i$ is the degree of user $i$.
The standard deviation of the scores is then calculated as shown in formula (7):
$$\sigma(\omega_i) = \sqrt{\frac{\sum_{\alpha} \left(\omega_{i\alpha} - \mu(\omega_i)\right)^2}{k_i}} \qquad (7)$$
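To illustrate why the standard deviation of a user's scores is informative, the sketch below compares a user with concentrated scores and a user with scattered scores. The function name `score_spread` and both score lists are hypothetical; `statistics.pstdev` divides by the number of values, which is assumed here to match the division by $k_i$.

```python
from statistics import fmean, pstdev

def score_spread(scores):
    """Return the mean and population standard deviation of a user's scores."""
    return fmean(scores), pstdev(scores)

# Hypothetical users: one with concentrated scores, one with scattered scores.
steady = [4, 4, 5, 4]
erratic = [1, 5, 1, 5]

steady_mean, steady_std = score_spread(steady)
erratic_mean, erratic_std = score_spread(erratic)
```

The scattered scores give a much larger standard deviation than the concentrated ones, which is the signal the method uses to lower a user's reputation.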
Step 8, calculating the user reputation:
the user reputation is calculated as shown in formula (8):
$\mu(A'_i)$, $\sigma(A'_i)$ and $\sigma(\omega_i)$ in formula (8) are calculated by formula (4), formula (5) and formula (7) respectively.
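The text states that the reputation depends on $\mu(A'_i)$, $\sigma(A'_i)$ and $\sigma(\omega_i)$ and that the standard deviation of the scores is added to the denominator, but the exact form of formula (8) is not reproduced here. The sketch below therefore ASSUMES one possible form, $\mu(A'_i) / (\sigma(A'_i) + \sigma(\omega_i))$, purely for illustration; the function name `reputation` and the small guard `eps` are also hypothetical.

```python
def reputation(mu_a, sigma_a, sigma_w, eps=1e-12):
    """Illustrative reputation score for one user.

    ASSUMPTION: the two standard deviations are summed in the denominator.
    The original formula (8) may combine them differently.
    `eps` is a hypothetical guard against a zero denominator.
    """
    return mu_a / (sigma_a + sigma_w + eps)

# Hypothetical users: a consistent rater and an erratic rater.
normal = reputation(mu_a=0.7, sigma_a=0.1, sigma_w=0.4)
erratic = reputation(mu_a=0.3, sigma_a=0.2, sigma_w=2.0)
```

Under this assumed form, a larger spread in either the group proportions or the raw scores lowers the reputation, in line with the behavior the text describes.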
Step 9, iteratively calculating the user reputation until the change $\Delta$ is less than $10^{-6}$, at which point the iteration ends; the change $\Delta$ is calculated as shown in formula (9):
$$\Delta = |R - R'| = \frac{\sum_{i} (R_i - R'_i)^2}{m} \qquad (9)$$
Step 10, sorting the user reputations from low to high, and taking the L users with the lowest reputations as malicious users.
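The stopping rule of step 9 can be sketched as a generic loop. In this sketch the callable `update` stands in for one pass of steps 3 to 8 and is supplied by the caller; the names `reputation_change` and `iterate_reputation` and the `max_iter` safety cap are hypothetical additions, not part of the original method.

```python
def reputation_change(old, new):
    """Delta of formula (9): mean squared difference between two reputation sets."""
    m = len(old)
    return sum((old[u] - new[u]) ** 2 for u in old) / m

def iterate_reputation(update, users, tol=1e-6, max_iter=1000):
    """Repeat `update` until the change in reputation falls below `tol`.

    `update` maps a {user: reputation} dict to a new one (one pass of the
    reputation calculation). `max_iter` is a hypothetical safety cap.
    """
    current = {u: 1.0 for u in users}  # step 2: all reputations start at 1
    for _ in range(max_iter):
        updated = update(current)
        if reputation_change(current, updated) < tol:
            return updated
        current = updated
    return current

# Toy update rule with fixed point 0.5, used only to exercise the loop.
result = iterate_reputation(
    lambda r: {u: 0.5 * v + 0.25 for u, v in r.items()}, users=[1, 2, 3]
)
```

The toy rule converges geometrically, so the loop stops after a handful of passes with every reputation close to 0.5.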
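The final selection can be sketched in a few lines. The function name `lowest_reputation_users` and the reputation values are hypothetical.

```python
def lowest_reputation_users(reputation, L):
    """Return the L users with the lowest reputation, lowest first."""
    ranked = sorted(reputation, key=reputation.get)
    return ranked[:L]

# Hypothetical final reputations.
final_reputation = {"u1": 1.4, "u2": 0.2, "u3": 0.9, "u4": 0.1}
suspects = lowest_reputation_users(final_reputation, L=2)
```

Here `suspects` contains "u4" and "u2", the two users with the lowest reputations.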
The validity of the method is verified on real data as follows:
the experiments use two MovieLens data sets and one Netflix data set, obtained from the following sources:
the two MovieLens data sets: https://grouplens.org/datasets/movielens/; the Netflix data set: http://pan.baidu.com/s/1dDtmbW9
The Netflix data set used in the experiments was obtained by extracting from the original data set 5000 users who each gave more than 50 ratings, and taking the rating network of these 5000 users as the experimental data set.
The number of users ($m$), the number of commodities ($n$), the number of ratings ($l$), the average user degree ($k(U)$), the average commodity degree ($k(O)$), and the data sparsity ($s = l/(mn)$) of each data set are shown in Table 1:
Table 1. Network characteristics of the data sets
| Data set | m | n | l | k(U) | k(O) | s |
|---|---|---|---|---|---|---|
| MovieLens_10 | 943 | 1682 | 100000 | 106 | 59 | 0.06305 |
| MovieLens_100 | 7120 | 130642 | 1048575 | 147 | 8 | 0.00113 |
| Netflix | 5000 | 17768 | 3496614 | 699 | 196 | 0.03936 |
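The derived columns of Table 1 follow directly from $m$, $n$ and $l$, as the sketch below shows. The function name `network_stats` is hypothetical; the average degrees are taken to be $l/m$ and $l/n$, which is an assumption consistent with the values in the table.

```python
def network_stats(m, n, l):
    """Average user degree, average commodity degree and sparsity of a rating network."""
    return {
        "k_user": l / m,          # average number of ratings per user
        "k_item": l / n,          # average number of ratings per commodity
        "sparsity": l / (m * n),  # s = l / (m n)
    }

# Values of the MovieLens_10 row of Table 1.
stats = network_stats(m=943, n=1682, l=100000)
```

For this row the sketch gives about 106.04, 59.45 and 0.06305, matching the table once the average degrees are truncated to integers.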
Recall is used as the evaluation metric in the experiments, calculated as shown in formula (10):
$$r(L) = \frac{d'(L)}{d} \qquad (10)$$
wherein $d$ is the number of artificially injected malicious users, $L$ is the number of lowest-reputation users taken as malicious, and $d'(L)$ is the number of injected malicious users among the $L$ users with the lowest reputations.
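The recall metric can be sketched as follows. The function name `recall_at_L`, the user labels and the reputation values are hypothetical.

```python
def recall_at_L(reputation, malicious, L):
    """Fraction of injected malicious users found among the L lowest reputations."""
    flagged = sorted(reputation, key=reputation.get)[:L]
    hits = sum(1 for user in flagged if user in malicious)  # d'(L)
    return hits / len(malicious)                            # d'(L) / d

# Hypothetical reputations with two injected malicious users, "a" and "e".
reputation = {"a": 0.1, "b": 0.9, "c": 0.2, "d": 1.5, "e": 0.4}
malicious = {"a", "e"}
recall_2 = recall_at_L(reputation, malicious, L=2)
recall_3 = recall_at_L(reputation, malicious, L=3)
```

With L = 2 only "a" is caught, so the recall is 0.5; widening the list to L = 3 also catches "e" and the recall reaches 1.0.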
Fig. 1 is a flowchart of the IGDR algorithm, wherein: (a) is the initial weighted bipartite network; (b) is the corresponding rating matrix, whose rows and columns represent users and commodities respectively; (c) shows the rating triplets, which the IGDR algorithm uses instead of a matrix to store the data in order to save space; (d) shows the weighted group size $\Lambda$ of each score, which is the sum of the reputations of the users in that score group of the commodity; taking the group of commodity $O_2$ with score 1 as an example, $\Lambda_{2,1} = R_2 + R_5 = 2$; (e) is the group proportion matrix $\Lambda^{*}$, obtained as the share of each group within its row of (d), again taking the group of commodity $O_2$ with score 1 as the example; (f) is the user-commodity group proportion matrix $A'$, obtained jointly from the matrix $A$ and the matrix $\Lambda^{*}$, whose rows and columns now correspond to users and commodities respectively; (g) shows the user reputations, where $R'$ denotes the reputation obtained in the first iteration. The IGDR algorithm executes (d), (e), (f) and (g) iteratively until the change in user reputation is less than $10^{-4}$, giving the final user reputation $R$, and the L users with the lowest reputations are taken as malicious users.
Fig. 2 shows the recall $r(L)$ as a function of $L$ for the GR algorithm (dotted line), the IGR algorithm (dashed line) and the IGDR algorithm (solid line) when 50 malicious users are injected. In the graph labeled "malicious", the 50 injected users are extreme malicious users: all three algorithms perform well on such users, the GR and IGR algorithms perform similarly with IGR slightly better, and the IGDR algorithm is clearly better than both. In the graph labeled "random", the 50 injected users are randomly scoring malicious users: all three algorithms perform relatively poorly on such users, and the IGDR algorithm is slightly better than the GR and IGR algorithms.
Fig. 3 shows the recall $r(L)$ as a function of $L$ for the GR algorithm (dotted line), the IGR algorithm (dashed line) and the IGDR algorithm (solid line) when 100 malicious users are injected. In the graph labeled "malicious", the 100 injected users are extreme malicious users; in the graph labeled "random", they are randomly scoring malicious users. The results are similar to those in Fig. 2, which shows that a small change in the number of malicious users does not change the relative performance of the algorithms.
The above embodiments merely illustrate the technical idea of the present invention and do not limit its protection scope; any modification made on the basis of the technical solution in accordance with the technical idea of the present invention falls within the protection scope of the present invention.