CN109919191B

CN109919191B - Clustering-based application market brush list collusion group detection method

Info

Publication number: CN109919191B
Application number: CN201910090202.2A
Authority: CN
Inventors: 何道敬; 潘梦函; 唐宗力
Original assignee: East China Normal University
Current assignee: East China Normal University
Priority date: 2019-01-30
Filing date: 2019-01-30
Publication date: 2023-05-02
Anticipated expiration: 2039-01-30
Also published as: CN109919191A

Abstract

The invention discloses a clustering-based method for detecting brush list collusion groups in an application market. The method comprises the following steps: crawling a data set; initializing a core point set; determining a reviewer suspicion score threshold; starting from any core point, finding the density-reachable samples to generate a cluster, until all core points have been visited; and outputting the cluster division result. The algorithm fully reflects the similarity among members of a collusion group, obtains a better clustering effect, and solves the problem that the parameters of traditional clustering algorithms are difficult to set in the application market.

Description

Clustering-based application market brush list collusion group detection method

Technical Field

The invention relates to a collusion group detection method, in particular to a clustering-based application market collusion group detection method.

Background

With the rapid development of smartphones, the number of mobile applications has grown at a striking rate, and mobile application markets give users a convenient and efficient way to download them. An application that ranks higher in the application market gets more exposure, and its developer is more likely to earn large profits. This has given rise to a new marketing practice in the mobile application market, the application brush list, in which an application's ranking is inflated artificially; similarly, with the explosive popularity of Taobao and Tmall, shop order brushing has become a way for merchants to fake popularity. Attackers use brush list services to promote their own mobile applications in the application market and increase their profits. Like Taobao order brushers, the people who carry out a brush list work in groups or teams, which are therefore called brush list collusion groups, and are managed centrally by brush list companies. Members of a group can imitate the behavior of normal users and evade the application market's detection algorithms, which makes detecting collusion groups and colluders challenging. How to detect collusion groups in the application market quickly and effectively is therefore an urgent problem, and solving it is important for maintaining the ecological balance of the application market and for promoting competition and innovation among application developers.

Currently, the e-commerce field has methods for detecting collusion groups that post spam reviews, and most of them use supervised machine learning, which depends heavily on a labeled data set to train a classifier. Model training requires a large number of labeled samples, which are difficult and costly to obtain, and approaches trained without adequate labeled samples have proven inaccurate. Meanwhile, comparatively little work addresses collusion group detection in the application market field. Xie Z. et al. analyze the relationships among reviewers and between reviewers and applications, build a relationship graph, and perform detection on it; Chen H. et al. generate candidate collusion groups using Frequent Itemset Mining (FIM) and then detect collusion groups by constructing a model of reviewers' ratings of applications. Such approaches, however, can only find dense collusion groups in which every group member must have reviewed all target applications. The clustering-based brush list collusion group detection method provided by the invention therefore makes full use of the similarity among members of a collusion group, obtains a better clustering effect, and solves the problem that the parameters of traditional clustering algorithms are difficult to set in the application market.

Disclosure of Invention

To address the difficulty of parameter setting in existing application market brush list collusion group detection, the invention provides a clustering-based application market brush list collusion group detection method. Briefly, the core points are determined by reviewer suspicion scores, and the neighborhood radius is determined by the similarity between reviewers. The resulting S-DBSCAN algorithm not only solves the problem that the MinPts and Eps parameters are difficult to specify in advance, but also obtains a better clustering effect than using the DBSCAN algorithm directly.

The specific technical scheme for realizing the aim of the invention is as follows:

A clustering-based application market brush list collusion group detection method comprises the following steps:

Step 1: crawling a data set from an application market and filtering it by the number of reviews per reviewer to obtain the reviewer set required for the final experiment; that is, selecting the reviewers whose number of reviews exceeds a certain threshold as the data set;

Step 2: first selecting a core point in the data set as the initial set;

Step 3: finding all core points in the data set according to the initial parameters applied to the current data point, namely the reviewer suspicion score threshold η and the threshold ε on the similarity between reviewers;

Step 4: taking any core point as a starting point, finding the density-reachable samples to generate a cluster, until all core points have been visited;

Step 5: outputting the cluster division result, which comprises the number of clusters and the detailed information of each data item in the clusters.

The application markets from which the data set is crawled in step 1 include, but are not limited to, the Apple application market.

The data features in the data set obtained by crawling in step 1 include, but are not limited to: reviewer name, review content, review rating, reviewed application, number of applications reviewed, and review word count.

The core points in step 2 are determined by the reviewer suspicion scores. The reviewer suspicion score RSS is composed of three calculated scores: the reviewer score, the review suspicion score, and the application suspicion score. The calculation formula is as follows:

where i denotes a reviewer, j a review, and k an application; RSS denotes the reviewer suspicion score; SR_i denotes the reviewer score; n_i denotes the number of reviews by reviewer i; SS_j denotes the review suspicion score; c_ij denotes the j-th review of reviewer i; m_k denotes the number of applications reviewed by reviewer i; SA_k denotes the application suspicion score; and t_ik denotes the k-th application reviewed by reviewer i.

The similarity between reviewers in step 3, namely the similarity SC_(x,y) between two reviewers x and y, is calculated by the following formula:

where S_car(x,y) denotes the application similarity, S_crr(x,y) the review similarity, and S_card(x,y) the rating similarity.

The density-reachable range in step 4 is defined by the ε-neighborhood: for x_j ∈ D, its ε-neighborhood contains the objects in data set D whose similarity to x_j is greater than the threshold ε, i.e. N_ε(x_j) = { x_i ∈ D | SC_(x_i,x_j) > ε }.

The improved S-DBSCAN algorithm is applied to the detection of collusion groups in the application market. Starting from the observation that colluders within a collusion group behave similarly, the reviewer suspicion score replaces the MinPts parameter and the similarity between two reviewers replaces the Eps parameter, which solves the difficulty of parameter setting in the traditional DBSCAN algorithm; experiments also show that the S-DBSCAN clustering algorithm obtains a better clustering effect.

Drawings

Fig. 1 is a flow chart of the present invention.

Detailed Description

The invention is described in further detail below with reference to specific embodiments and the accompanying drawing. Except for what is specifically mentioned below, the procedures, conditions, and experimental methods for carrying out the invention are general knowledge and common knowledge in the art, and the invention is not particularly limited in these respects.

The invention comprises the following steps:

Step 1: The data set is crawled from the application market and filtered by the number of reviews per reviewer to obtain the reviewer set needed for the final experiment.

Step 2: The S-DBSCAN algorithm first selects one core point in the data set as the initial set and then, starting from this set, determines the corresponding cluster.

Step 3: All core points are found according to the given initial parameters, namely the reviewer suspicion score threshold η and the similarity threshold ε between reviewers.

Step 4: Taking any core point as a starting point, the density-reachable samples are found to generate a cluster, until all core points have been visited.

The S-DBSCAN algorithm includes, but is not limited to, density-based clustering algorithms (density-based methods). It converts the MinPts and Eps parameters of the DBSCAN algorithm into similarity relations between reviewers in the application market.

The S-DBSCAN algorithm replaces the MinPts parameter with the reviewer suspicion score threshold η and replaces the Eps parameter with the similarity threshold ε between reviewers.

The invention uses the reviewer suspicion score RSS instead of the MinPts parameter. The reviewer suspicion score is composed of three calculated scores: the reviewer score, the review suspicion score, and the application suspicion score. The conversion formula is as follows:

where i denotes a reviewer, j a review, and k an application; RSS denotes the reviewer suspicion score; SR_i denotes the reviewer score; n_i denotes the number of all reviews by reviewer i; SS_j denotes the review suspicion score; c_ij denotes the j-th review of reviewer i; m_k denotes the number of applications reviewed by reviewer i; SA_k denotes the application suspicion score; and t_ik denotes the k-th application reviewed by reviewer i.

The invention uses the similarity SC_(x,y) between two reviewers x and y instead of the Eps parameter. The conversion formula is as follows:

where S_car(x,y) denotes the application similarity, S_crr(x,y) the review similarity, and S_card(x,y) the rating similarity.

In the S-DBSCAN algorithm, whether a point is a core point no longer depends on MinPts; it is determined by the reviewer suspicion score of the current data point.
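The clustering procedure described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `s_dbscan`, the label -1 for unclustered reviewers, and the choice of `>=` for the score threshold are assumptions, and the suspicion scores and the similarity function are taken as already computed inputs.

```python
from collections import deque


def s_dbscan(rss, sim, eta, eps):
    """Cluster reviewers 0..n-1.

    rss[i]    -- suspicion score of reviewer i (assumed precomputed)
    sim(i, j) -- similarity between reviewers i and j (assumed precomputed)
    eta, eps  -- suspicion score threshold and similarity threshold
    """
    n = len(rss)
    # Core points are chosen by suspicion score, not by a MinPts count.
    # Using ">=" here is an assumption; the source only names a threshold.
    core = [i for i in range(n) if rss[i] >= eta]
    labels = [-1] * n  # -1 means "not assigned to any cluster"
    cluster_id = 0
    for start in core:
        if labels[start] != -1:
            continue  # already absorbed into an earlier cluster
        labels[start] = cluster_id
        queue = deque([start])
        while queue:
            p = queue.popleft()
            for q in range(n):
                # q lies in the epsilon-neighborhood of p if sim(p, q) > eps
                if q != p and labels[q] == -1 and sim(p, q) > eps:
                    labels[q] = cluster_id
                    if rss[q] >= eta:  # only core points keep expanding
                        queue.append(q)
        cluster_id += 1
    return labels, cluster_id


# Toy data: five reviewers with a symmetric similarity matrix.
scores = [0.9, 0.8, 0.3, 0.95, 0.1]
matrix = [
    [1.0, 0.9, 0.1, 0.1, 0.1],
    [0.9, 1.0, 0.85, 0.1, 0.1],
    [0.1, 0.85, 1.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.2],
    [0.1, 0.1, 0.1, 0.2, 1.0],
]
labels, count = s_dbscan(scores, lambda a, b: matrix[a][b], eta=0.7, eps=0.8)
```

In this sketch a core point with no sufficiently similar neighbor still forms a cluster of its own; whether such singleton clusters should be reported as collusion groups is left open here.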

Examples

Taking the improved S-DBSCAN algorithm as an example, the clustering-based application market brush list collusion group detection method is described in detail as follows:

Step 1: The data set was crawled from the application market and contains 19666225 reviews and 12315366 reviewers in total. It is filtered by the number of reviews: reviewers with more than 50 reviews are selected as the final experimental data set. In total, 8853 reviewers, 818545 reviews, and 2188 applications are obtained.
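The filtering in this step can be sketched as follows. The record layout (a dictionary with a `"reviewer"` key) and the function name `filter_reviewers` are hypothetical; only the rule of keeping reviewers with more than 50 reviews comes from the text.

```python
from collections import Counter


def filter_reviewers(reviews, min_reviews=50):
    """Keep only reviews written by reviewers with more than min_reviews reviews.

    Each review is assumed to be a dict with at least a "reviewer" key.
    """
    counts = Counter(r["reviewer"] for r in reviews)
    kept = {name for name, c in counts.items() if c > min_reviews}
    return [r for r in reviews if r["reviewer"] in kept]
```

The comparison is strict (`>`), matching the phrase "exceeding 50" in the text.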

Step 2: a set of core points is initialized.

Step 3: Set the reviewer suspicion score threshold η and the similarity threshold ε between reviewers, and find all core points. Here η is chosen from 0.7, 0.8, and 0.9, and ε is chosen from 0.7, 0.8, and 0.9, so combining them gives 9 experimental results in total.
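The 9 parameter combinations can be enumerated directly; the variable names below are illustrative only.

```python
from itertools import product

etas = [0.7, 0.8, 0.9]       # candidate suspicion score thresholds
epsilons = [0.7, 0.8, 0.9]   # candidate similarity thresholds

# Every (eta, epsilon) pair, i.e. 3 x 3 = 9 experimental settings.
grid = list(product(etas, epsilons))
```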

Step 5: Output the cluster division result. The number of clusters equals the number of collusion groups. According to the experimental results, the number of collusion groups decreases as the reviewer suspicion score threshold and the reviewer similarity threshold increase: when η and ε are both 0.7, the result is 177, indicating 177 collusion groups; when η and ε are both 0.9, the result is 24, indicating 24 collusion groups.

Step 6: The invention clusters with the improved S-DBSCAN algorithm and evaluates the clustering effect with the silhouette coefficient. The silhouette coefficient can be understood simply as how similar a node is to its own cluster compared with other clusters; its value lies in [-1, 1], and a larger value indicates a better clustering effect. In the experiment, the silhouette coefficient increases gradually with ε, from an initial 0.283 to 0.763. When the traditional DBSCAN algorithm is run with different MinPts and Eps parameters, its silhouette coefficient is at most 0.453, so its clustering effect is clearly worse than that of the S-DBSCAN algorithm.
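For reference, the standard silhouette coefficient can be computed as in the sketch below. It works on a distance matrix; how the experiment turned reviewer similarities into distances is not stated, so any conversion (for example distance = 1 - similarity) would be an assumption, as is scoring singleton clusters as 0.

```python
def silhouette(dist, labels):
    """Mean silhouette coefficient for a square distance matrix and cluster labels."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette coefficient needs at least two clusters")
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist[i][j] for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```

Values near 1 mean that points sit much closer to their own cluster than to the nearest other cluster, which is the sense in which a larger coefficient indicates better clustering.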

Specifically, the transformation of the MinPts parameter is as follows. The present invention uses reviewer suspicion score RSS instead of the MinPts parameter. The calculation steps are as follows:

Step A1: Determine the reviewer score SR_i

The difference between a reviewer's rating of an application and the application's average rating is called the rating deviation, defined as:

The ratio of the number of a reviewer's positive reviews to the number of all of the reviewer's reviews is called the reviewer's positive review ratio, defined as:

The number of non-duplicate reviews posted by reviewer i is denoted nd_i. The review burst frequency is defined as:

Whether reviewer i posts duplicate reviews is recorded by the indicator dup_i: if reviewer i posts duplicate reviews, dup_i is set to 1; otherwise it is set to 0.
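The two duplicate-related quantities can be sketched as below. The text does not say how duplicates are recognized, so this sketch assumes that two reviews are duplicates when their texts match after lowercasing and whitespace normalization, and it takes nd_i to be the number of distinct review texts; both choices and the function name are assumptions.

```python
def duplicate_indicators(review_texts):
    """Return (nd_i, dup_i) for the list of one reviewer's review texts."""
    # Normalize case and whitespace before comparing (an assumption).
    normalized = [" ".join(t.lower().split()) for t in review_texts]
    nd = len(set(normalized))               # number of distinct reviews
    dup = 1 if nd < len(normalized) else 0  # 1 if any review is repeated
    return nd, dup
```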

Finally, all of the above indicators are added together to obtain the score SR_i of reviewer i, by the formula:

Step A2: Determine the review suspicion score SS_j

A near-duplicate review is called a suspicious review. The suspicion score SS_j of a review is calculated using cosine similarity: SS_j = max_{i≠j} Cosine′(review_j, review_i), subject to the existence of an application k such that (j, k), (i, k) ∈ E, where E is the edge set of the review graph and Cosine′ is a linearly scaled version of the cosine function whose value is restricted to [0, 1].
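A minimal sketch of this score follows, under stated assumptions: reviews are compared as bag-of-words count vectors, each review is a hypothetical `(text, app)` pair, and two reviews are compared only when they target the same application. The exact form of the Cosine′ scaling is not given in the text; with non-negative word counts the plain cosine already lies in [0, 1], so the sketch applies no further scaling.

```python
import math
from collections import Counter


def cosine(text_a, text_b):
    """Cosine similarity of two texts as word-count vectors, in [0, 1]."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def review_suspicion(j, reviews):
    """SS_j: highest similarity between review j and any other review of the same app.

    reviews is a list of (text, app) pairs; j indexes into it.
    """
    text_j, app_j = reviews[j]
    best = 0.0
    for i, (text_i, app_i) in enumerate(reviews):
        if i != j and app_i == app_j:
            best = max(best, cosine(text_j, text_i))
    return best
```

A review with no other review on the same application gets a score of 0 in this sketch.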

Step A3: Determine the application suspicion score SA_k

The score a suspicious application obtains within a period of time is called the burst high score, defined as:

where nd_k is the number of non-duplicate reviews of application k and n_k is the number of all reviews of application k. The shortest top-rating time interval is denoted sap_k and is calculated as follows:

The longest top-rating time interval is denoted lap_k and is calculated as follows:

where d is the time interval of each review and np_k is the number of all positive reviews of application k.

The suspicion score SA_k of application k is therefore obtained by the following formula:

Step A4: Determine the reviewer suspicion score RSS

The reviewer suspicion score is determined by the reviewer score SR_i, the review suspicion score SS_j, and the application suspicion score SA_k; the calculation formula is as follows:

Specifically, the conversion of the Eps parameter is as follows. The invention uses the similarity SC_(x,y) between two reviewers x and y instead of the Eps parameter. The calculation steps are as follows:

Step B1: Calculate the application similarity S_car(x,y)

Jaccard similarity is used to measure the similarity of the application sets reviewed by two reviewers. Let M_i be the set of applications reviewed by reviewer i, whose n-th element is the n-th application reviewed by reviewer i. The similarity between the application sets reviewed by reviewer x and reviewer y can then be expressed as:

S_car(x,y) = |M_x ∩ M_y| / |M_x ∪ M_y|
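As a small illustration, the Jaccard similarity of two reviewers' application sets can be computed as follows. The function name and the convention of returning 0 when both sets are empty are assumptions.

```python
def application_similarity(apps_x, apps_y):
    """Jaccard similarity of the application sets reviewed by two reviewers."""
    mx, my = set(apps_x), set(apps_y)
    union = mx | my
    # Size of the intersection divided by the size of the union.
    return len(mx & my) / len(union) if union else 0.0
```

For example, reviewers who reviewed {a, b, c} and {b, c, d} share two of four distinct applications, giving a similarity of 0.5.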

Step B2: Calculate the review similarity S_crr(x,y)

The review similarity is defined as the sum of the numbers of applications of the same developer reviewed by the two reviewers, divided by the sum of each reviewer's total number of reviews. Let B be the set of developers of the applications a reviewer has reviewed, let P_(b∈B,i) denote the set of applications of developer b reviewed by reviewer i, and let P_i denote all applications reviewed by reviewer i. The review similarity of reviewer x and reviewer y is then calculated as follows:

Step B3: Calculate the rating similarity S_card(x,y)

The root mean square deviation is used to calculate the rating similarity of reviewer x and reviewer y. Let C denote the set of applications reviewed by both reviewer x and reviewer y. The rating similarity of reviewer x and reviewer y is then calculated as follows:
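The root mean square deviation over the commonly reviewed applications can be sketched as below. Only the deviation itself is computed; how it is mapped to the similarity S_card(x,y) is not recoverable from the text and is deliberately left out. The dictionary layout (application to rating) and the function name are assumptions.

```python
import math


def rating_rmsd(ratings_x, ratings_y):
    """Root mean square deviation of two reviewers' ratings.

    ratings_x and ratings_y map application -> rating. Only applications
    reviewed by both reviewers (the set C) are compared. Returns None
    when the reviewers share no application.
    """
    common = ratings_x.keys() & ratings_y.keys()
    if not common:
        return None
    squared = sum((ratings_x[a] - ratings_y[a]) ** 2 for a in common)
    return math.sqrt(squared / len(common))
```

A smaller deviation means the two reviewers rate their shared applications more alike.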

Step B4: Calculate the similarity SC_(x,y) of reviewer x and reviewer y by the following formula:

Claims

1. A clustering-based application market brush list collusion group detection method, characterized by comprising the following steps:

step 2: firstly, selecting a core point in a data set as an initial set;

step 3: finding all core points in the data set according to the initial parameters of the current core point, namely the reviewer suspicion score threshold η and the threshold ε on the similarity between reviewers;

step 5: outputting a cluster division result, wherein the cluster division result comprises the number of clusters and detailed information of each data in the clusters; wherein:

the reviewer suspicion score RSS is composed of three calculated scores: the reviewer score, the review suspicion score, and the application suspicion score; the calculation formula is as follows:

where i denotes a reviewer, j a review, and k an application; RSS denotes the reviewer suspicion score; SR_i denotes the reviewer score; n_i denotes the number of reviews by reviewer i; SS_j denotes the review suspicion score; c_ij denotes the j-th review of reviewer i; m_k denotes the number of applications reviewed by reviewer i; SA_k denotes the application suspicion score; and t_ik denotes the k-th application reviewed by reviewer i;

the similarity between reviewers, namely the similarity SC_(x,y) between two reviewers x and y, is calculated by the following formula:

2. The clustering-based application market brush list collusion group detection method of claim 1, wherein the application markets from which the data set is crawled in step 1 comprise the Apple application market.

3. The clustering-based application market brush list collusion group detection method of claim 1, wherein the data features in the data set obtained by crawling in step 1 comprise: reviewer name, review content, review rating, reviewed application, number of applications reviewed, and review word count.

4. The cluster-based application market brush list collusion group detection method of claim 1, wherein the core point is determined by a reviewer suspicion score.

5. The clustering-based application market brush list collusion group detection method of claim 1, wherein the density-reachable range is defined by the ε-neighborhood: for x_j ∈ D, its ε-neighborhood contains the objects in data set D whose similarity to x_j is greater than the threshold ε, i.e. N_ε(x_j) = { x_i ∈ D | SC_(x_i,x_j) > ε }.