CN109919191B - Clustering-based application market brush list collusion group detection method - Google Patents
- Publication number
- CN109919191B (application CN201910090202.2A)
- Authority
- CN
- China
- Prior art keywords
- comment
- reviewer
- similarity
- application market
- cluster
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a clustering-based method for detecting brush list collusion groups in application markets. The method comprises the following steps: crawling a data set; initializing a core point set; determining a reviewer suspicion score threshold; starting from any core point, finding all density-reachable samples to generate a cluster, and repeating until all core points have been visited; and outputting the cluster division result. The algorithm fully exploits the similarity among members of a collusion group, obtains a better clustering effect, and solves the difficulty of setting the parameters of traditional clustering algorithms in the application market.
Description
Technical Field
The invention relates to a collusion group detection method, in particular to a clustering-based application market collusion group detection method.
Background
With the rapid development of smartphones, the number of mobile applications has grown at a striking rate, and mobile application markets give users a convenient and efficient way to download them. An application that ranks higher in an application market gets more exposure, so its developer is more likely to earn large profits. This has given rise to a new marketing practice in the mobile application market, application "list brushing", that is, manipulating an application's ranking. The practice parallels shop "brushing" on Taobao and Tmall, which spread with the boom of those platforms as a way for merchants to fake popularity. Attackers use list brushing to promote their own mobile applications in the application market for greater profit. Like Taobao brushers, they work in groups or teams, called brush list collusion groups, which are managed centrally by list-brushing companies. Members of a group can imitate the behavior of normal users and evade the detection algorithms of the application market, which makes detecting collusion groups and colluders challenging. Quickly and effectively detecting collusion groups in the application market is therefore an urgent need, and it is important for keeping the application market ecosystem balanced and for promoting competition and innovation among application developers.
Collusion groups that post spam comments are already detected in the e-commerce field, mostly with supervised machine learning methods, which depend heavily on a labeled data set to train a classifier. Model training requires a large number of labeled samples, which are difficult and costly to obtain, and classifiers trained without enough labeled samples have proven inaccurate. Collusion group detection in the application market field has received comparatively little study. Xie Z et al. analyze the relationships among reviewers and between reviewers and applications and build a relationship graph for detection; Chen H et al. generate candidate collusion groups with Frequent Itemset Mining (FIM) and then detect collusion groups by building a model of reviewers' scores for applications. Such approaches, however, can only find dense collusion groups in which every member has commented on all target applications. The clustering-based brush list collusion group detection method provided by the invention makes full use of the similarity among members of a collusion group, obtains a better clustering effect, and solves the difficulty of setting the parameters of traditional clustering algorithms in the application market.
Disclosure of Invention
To address the parameter-setting problem in existing application market brush list collusion group detection, the invention provides a clustering-based application market brush list collusion group detection method. In brief, core points are determined by reviewer suspicion scores, and the neighborhood radius is determined by the similarity between reviewers. The resulting algorithm, S-DBSCAN, removes the need to specify the MinPts and Eps parameters in advance, which is difficult, and also obtains a better clustering effect than using the DBSCAN algorithm directly.
The specific technical scheme for achieving the aim of the invention is as follows:
A clustering-based application market brush list collusion group detection method comprises the following steps:
Step 1: crawl a data set from an application market and filter it by the number of comments per reviewer to obtain the reviewer set required for the final experiment; that is, select the reviewers whose number of comments exceeds a certain threshold as the data set;
Step 2: select a core point in the data set as the initial set;
Step 3: find all core points in the data set according to the given initial parameters, namely the reviewer suspicion score threshold η and the inter-reviewer similarity threshold ε;
Step 4: starting from any core point, find the density-reachable samples to generate a cluster, and repeat until all core points have been visited;
Step 5: output the cluster division result, which comprises the number of clusters and the detailed information of each data item in the clusters.
The application markets crawled in step 1 include, but are not limited to, the Apple application market.
The data features in the data set crawled in step 1 include, but are not limited to: reviewer name, comment content, comment score, the application commented on, the number of applications commented on, and the comment word count.
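The filtering in step 1 can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the record layout (dictionaries with a `reviewer` key), the function name `filter_active_reviewers`, and the toy threshold are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List


def filter_active_reviewers(
    comments: List[Dict[str, str]], min_comments: int
) -> List[Dict[str, str]]:
    """Keep only comments written by reviewers with more than `min_comments` comments.

    The "reviewer" key is a hypothetical field name for the crawled reviewer name.
    """
    counts = Counter(c["reviewer"] for c in comments)
    active = {name for name, n in counts.items() if n > min_comments}
    return [c for c in comments if c["reviewer"] in active]


# Toy data: "alice" has 3 comments, "bob" has 1.
toy_comments = [
    {"reviewer": "alice", "app": "app1"},
    {"reviewer": "alice", "app": "app2"},
    {"reviewer": "bob", "app": "app1"},
    {"reviewer": "alice", "app": "app3"},
]
kept = filter_active_reviewers(toy_comments, min_comments=2)
kept_reviewers = {c["reviewer"] for c in kept}
```

With a threshold of 2, only the reviewer with three comments survives the filter in this toy data.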
The core points in step 2 are determined by the reviewer suspicion scores. The reviewer suspicion score RSS consists of three calculated parts: a reviewer score, a comment suspicion score, and an application suspicion score. In the calculation formula, i denotes a reviewer, j a comment, and k an application; RSS is the reviewer suspicion score; $SR_i$ is the reviewer score; $n_i$ is the number of comments of reviewer i; $SS_j$ is the comment suspicion score; $c_{ij}$ is the j-th comment of reviewer i; $m_k$ is the number of applications commented on by reviewer i; $SA_k$ is the application suspicion score; and $t_{ik}$ is the k-th application commented on by reviewer i.
The similarity between reviewers in step 3 is the similarity $SC_{(x,y)}$ between two reviewers x and y. In its calculation formula, $S_{car(x,y)}$ denotes the application similarity, $S_{crr(x,y)}$ the comment similarity, and $S_{card(x,y)}$ the scoring similarity.
The density-reachable range in step 4 is defined by the ε-neighborhood: for $x_j \in D$, its ε-neighborhood contains the objects in data set D whose similarity to $x_j$ is greater than ε, i.e. $N_\varepsilon(x_j) = \{x_i \in D \mid SC_{(x_i,x_j)} > \varepsilon\}$.
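The ε-neighborhood described above can be sketched as a small helper. This is an illustration under stated assumptions: the function name `epsilon_neighborhood`, the callable `similarity`, and the toy similarity values are hypothetical, and a reviewer is taken to have similarity 1.0 with itself.

```python
from typing import Callable, Hashable, Iterable, Set


def epsilon_neighborhood(
    x_j: Hashable,
    dataset: Iterable[Hashable],
    similarity: Callable[[Hashable, Hashable], float],
    eps: float,
) -> Set[Hashable]:
    """Return every object in `dataset` whose similarity to x_j is greater than eps."""
    return {x_i for x_i in dataset if similarity(x_i, x_j) > eps}


# Toy pairwise similarities between three reviewers (made-up values).
toy_sc = {
    frozenset(("a", "b")): 0.85,
    frozenset(("a", "c")): 0.40,
    frozenset(("b", "c")): 0.75,
}


def toy_similarity(x: str, y: str) -> float:
    # Assumption for the sketch: a reviewer is fully similar to itself.
    return 1.0 if x == y else toy_sc[frozenset((x, y))]


neighbors = epsilon_neighborhood("a", ["a", "b", "c"], toy_similarity, eps=0.7)
```

Unlike a distance-based Eps radius, a larger ε here means a stricter neighborhood, because the comparison is on similarity rather than distance.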
The improved S-DBSCAN algorithm is applied to collusion group detection in the application market. Starting from the observation that colluders within a collusion group behave similarly, the reviewer suspicion score replaces the MinPts parameter and the similarity between two reviewers replaces the Eps parameter, which solves the difficulty of parameter setting in the traditional DBSCAN algorithm. Experiments also show that the S-DBSCAN clustering algorithm obtains a better clustering effect.
Drawings
Fig. 1 is a flow chart of the present invention.
Detailed Description
The invention is described in further detail below with reference to specific embodiments and the accompanying drawing. Except for what is specifically described below, the procedures, conditions, and experimental methods for carrying out the invention are common general knowledge in the art, and the invention is not particularly limited in these respects.
The invention comprises the following steps:
Step 1: Crawl the data set from the application market and filter it by the number of comments per reviewer to obtain the reviewer set needed for the final experiment.
Step 2: The S-DBSCAN algorithm first selects one core point in the data set as the initial set and then, starting from this set, determines the corresponding cluster.
Step 3: Find all core points according to the given initial parameters, namely the reviewer suspicion score threshold η and the inter-reviewer similarity threshold ε.
Step 4: Starting from any core point, find the density-reachable samples to generate a cluster, and repeat until all core points have been visited.
Step 5: Output the cluster division result, which comprises the number of clusters and the detailed information of each data item in the clusters.
The S-DBSCAN algorithm belongs to, but is not limited to, the family of density-based clustering methods. It converts the MinPts and Eps parameters of the DBSCAN algorithm into similarity relations between reviewers in the application market.
The S-DBSCAN algorithm replaces the MinPts parameter with the reviewer suspicion score threshold η and replaces the Eps parameter with the inter-reviewer similarity threshold ε.
The invention uses the reviewer suspicion score RSS in place of the MinPts parameter. The reviewer suspicion score consists of three calculated parts: a reviewer score, a comment suspicion score, and an application suspicion score. In the conversion formula, i denotes a reviewer, j a comment, and k an application; RSS is the reviewer suspicion score; $SR_i$ is the reviewer score; $n_i$ is the number of all comments of reviewer i; $SS_j$ is the comment suspicion score; $c_{ij}$ is the j-th comment of reviewer i; $m_k$ is the number of applications commented on by reviewer i; $SA_k$ is the application suspicion score; and $t_{ik}$ is the k-th application commented on by reviewer i.
The invention uses the similarity $SC_{(x,y)}$ between two reviewers x and y in place of the Eps parameter. In the conversion formula, $S_{car(x,y)}$ denotes the application similarity, $S_{crr(x,y)}$ the comment similarity, and $S_{card(x,y)}$ the scoring similarity.
In the S-DBSCAN algorithm, whether a data point is a core point no longer depends on the MinPts count; it is determined by the reviewer suspicion score of the current data point.
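The clustering loop described in steps 2 to 5 can be sketched as below. This is a hedged reading of the text, not the patented implementation: it assumes a reviewer is a core point when its suspicion score is at least η, that two reviewers are neighbors when their similarity exceeds ε, and that clusters grow only through core points, as in ordinary DBSCAN. The names `s_dbscan`, `rss`, and `similarity`, and all toy values, are hypothetical.

```python
from collections import deque
from typing import Callable, Dict, List, Set


def s_dbscan(
    rss: Dict[str, float],
    similarity: Callable[[str, str], float],
    eta: float,
    eps: float,
) -> List[Set[str]]:
    """Cluster reviewers: core points have RSS >= eta, neighbors have similarity > eps."""
    reviewers = list(rss)
    core = {r for r in reviewers if rss[r] >= eta}  # assumed core-point rule
    assigned: Set[str] = set()
    clusters: List[Set[str]] = []
    for seed in reviewers:
        if seed not in core or seed in assigned:
            continue
        cluster = {seed}
        assigned.add(seed)
        queue = deque([seed])
        while queue:  # expand through density-reachable reviewers
            current = queue.popleft()
            for other in reviewers:
                if other in assigned or similarity(current, other) <= eps:
                    continue
                cluster.add(other)
                assigned.add(other)
                if other in core:  # only core points keep expanding the cluster
                    queue.append(other)
        clusters.append(cluster)
    return clusters


# Toy input (made-up scores and similarities).
toy_rss = {"a": 0.9, "b": 0.8, "c": 0.3, "d": 0.95, "e": 0.2}
toy_pairs = {frozenset(("a", "b")): 0.9, frozenset(("b", "c")): 0.8}


def toy_similarity(x: str, y: str) -> float:
    return toy_pairs.get(frozenset((x, y)), 0.1)


toy_clusters = s_dbscan(toy_rss, toy_similarity, eta=0.7, eps=0.7)
```

In the toy data, reviewer `c` joins the first cluster as a border point through `b`, reviewer `d` forms a cluster of its own, and reviewer `e` is left unclustered. A real pipeline would probably also discard single-member clusters, since a group needs more than one member; that post-filter is not stated in the text.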
Examples
Taking the improved S-DBSCAN algorithm as an example, the clustering-based application market brush list collusion group detection method is described in detail as follows:
Step 1: A data set of 19666225 comments from 12315366 reviewers in total was crawled from the application market. It was filtered by number of comments: reviewers with more than 50 comments were selected as the final experimental data set, yielding 8853 reviewers, 818545 comments, and 2188 applications.
Step 2: a set of core points is initialized.
Step 3: Set the reviewer suspicion score threshold η and the inter-reviewer similarity threshold ε, and find all core points. Here η takes the values 0.7, 0.8, and 0.9 and ε takes the values 0.7, 0.8, and 0.9, giving 9 parameter combinations and therefore 9 experimental results.
Step 4: Starting from any core point, find the density-reachable samples to generate a cluster, and repeat until all core points have been visited.
Step 5: Output the cluster division result. The number of clusters equals the number of collusion groups. According to the experimental results, the number of collusion groups decreases as the reviewer suspicion score threshold and the inter-reviewer similarity threshold increase: when η and ε are both 0.7, the result is 177 clusters, indicating 177 collusion groups; when η and ε are both 0.9, the result is 24 clusters, indicating 24 collusion groups.
Step 6: The invention clusters with the improved S-DBSCAN algorithm and evaluates the clustering effect with the silhouette coefficient. The silhouette coefficient can be understood simply as how similar a node is to its own cluster compared with other clusters; its value lies in [-1, 1], and a larger value indicates a better clustering effect. In the experiment, the silhouette coefficient increases gradually as ε increases, from an initial 0.283 to 0.763. With the traditional DBSCAN algorithm, clustering under different MinPts and Eps parameters gives a silhouette coefficient of at most 0.453, so its clustering effect is clearly worse than that of the S-DBSCAN algorithm.
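The silhouette evaluation in step 6 can be sketched in plain Python. This is only an illustration: the text does not say which distance was used, so the sketch assumes distance = 1 − similarity, and the function name `silhouette_coefficient` and the toy values are hypothetical.

```python
from typing import Callable, List, Sequence


def silhouette_coefficient(
    clusters: Sequence[Sequence[str]],
    distance: Callable[[str, str], float],
) -> float:
    """Mean silhouette value over all points; requires at least two clusters."""
    if len(clusters) < 2:
        raise ValueError("the silhouette coefficient needs at least two clusters")
    scores: List[float] = []
    for idx, cluster in enumerate(clusters):
        for p in cluster:
            if len(cluster) == 1:
                scores.append(0.0)  # usual convention for single-member clusters
                continue
            # a: mean distance to the other members of the same cluster
            a = sum(distance(p, q) for q in cluster if q != p) / (len(cluster) - 1)
            # b: smallest mean distance to the members of any other cluster
            b = min(
                sum(distance(p, q) for q in other) / len(other)
                for j, other in enumerate(clusters)
                if j != idx
            )
            denom = max(a, b)
            scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy similarities (made-up); distance is assumed to be 1 - similarity.
toy_pairs = {frozenset(("a", "b")): 0.9, frozenset(("c", "d")): 0.8}


def toy_distance(x: str, y: str) -> float:
    return 1.0 - toy_pairs.get(frozenset((x, y)), 0.1)


score = silhouette_coefficient([["a", "b"], ["c", "d"]], toy_distance)
```

For the two tight toy clusters the score is close to 0.83, that is, well inside the [-1, 1] range and near the "good clustering" end.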
Specifically, the MinPts parameter is converted as follows. The invention uses the reviewer suspicion score RSS in place of the MinPts parameter. The calculation steps are as follows:
Step A1: Determine the reviewer score $SR_i$
The reviewer score is built from the following indicators. The score deviation is the difference between the score a reviewer gives an application and the average score of that application. The positive review ratio is the number of the reviewer's positive comments divided by the number of all the reviewer's comments. $nd_i$ is the number of non-duplicate comments posted by reviewer i. The comment burst frequency is a further indicator. The duplicate-comment flag $dup_i$ records whether reviewer i has posted repeated comments: $dup_i$ is set to 1 if reviewer i has posted repeated comments and to 0 otherwise.
Finally, all the indicators are added up to obtain the score $SR_i$ of reviewer i.
Step A2: Determine the comment suspicion score $SS_j$
A near-duplicate comment is regarded as a suspicious comment. The suspicion score $SS_j$ of a comment is calculated with cosine similarity: $SS_j = \max_{i \neq j} \mathrm{Cosine}'(review_j, review_i)$, subject to the existence of an application k such that $(j,k), (i,k) \in E$, where E is the edge set of the comment graph and Cosine′ is a linearly scaled version of the cosine function that limits its value to $[0,1]$.
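The comment suspicion score can be sketched with a bag-of-words cosine similarity. The text representation is an assumption: the source does not say how comments are vectorized. With non-negative word counts the cosine already lies in [0, 1], so no extra scaling is applied here. The names `cosine` and `comment_suspicion_scores` and the toy comments are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple


def cosine(text_a: str, text_b: str) -> float:
    """Cosine similarity of whitespace-tokenized word-count vectors."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(count * vb[word] for word, count in va.items())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def comment_suspicion_scores(comments: List[Tuple[str, str]]) -> List[float]:
    """For each (app_id, text) comment, the highest cosine similarity to any
    other comment on the same application (0.0 if there is none)."""
    scores = []
    for j, (app_j, text_j) in enumerate(comments):
        best = 0.0
        for i, (app_i, text_i) in enumerate(comments):
            if i != j and app_i == app_j:
                best = max(best, cosine(text_j, text_i))
        scores.append(best)
    return scores


toy_comments = [
    ("app1", "great app love it"),
    ("app1", "great app love it"),
    ("app1", "crashes on startup"),
    ("app2", "great app love it"),
]
toy_scores = comment_suspicion_scores(toy_comments)
```

The two identical comments on `app1` score 1.0, while the same text posted on `app2` scores 0.0 because no other comment shares that application, which mirrors the same-application condition in the formula.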
Step A3: Determine the application suspicion score $SA_k$
A burst of high scores received by a suspicious application within a period of time is called the explosive high score. Here $nd_k$ is the number of all non-duplicate comments of application k, and $n_k$ is the number of all comments of application k. The shortest high-scoring time interval is denoted $sap_k$.
From these, the suspicion score $SA_k$ of application k is obtained.
Step A4: Determine the reviewer suspicion score RSS
The reviewer suspicion score is determined by the reviewer score $SR_i$, the comment suspicion score $SS_j$, and the application suspicion score $SA_k$.
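The formula that combines the three parts is not reproduced in this text, so the following is only one plausible sketch. It assumes that the comment and application suspicion scores are averaged over the reviewer's comments and applications (suggested by the counts $n_i$ and $m_k$ in the notation) and that the three parts are then averaged with equal weight. Both choices, and the name `reviewer_suspicion_score`, are assumptions.

```python
from typing import Sequence


def reviewer_suspicion_score(
    sr_i: float,
    comment_scores: Sequence[float],
    app_scores: Sequence[float],
) -> float:
    """Assumed combination: equal-weight mean of SR_i, mean SS_j, and mean SA_k."""
    mean_ss = sum(comment_scores) / len(comment_scores)  # over the reviewer's comments
    mean_sa = sum(app_scores) / len(app_scores)  # over the applications commented on
    return (sr_i + mean_ss + mean_sa) / 3.0


# Toy values (made-up): one reviewer with two comments on three applications.
toy_rss = reviewer_suspicion_score(0.9, [1.0, 0.8], [0.6, 0.8, 0.7])
```

With this normalization the score stays in [0, 1] whenever the three parts do, which makes it directly comparable with thresholds such as η = 0.7.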
Specifically, the Eps parameter is converted as follows. The invention uses the similarity $SC_{(x,y)}$ between two reviewers x and y in place of the Eps parameter. The calculation steps are as follows:
Step B1: Calculate the application similarity $S_{car(x,y)}$
Jaccard similarity is used to measure the similarity of the application sets commented on by two reviewers. Let $M_i$ denote the set of applications commented on by reviewer i, whose n-th element is the n-th application commented on by reviewer i. The similarity between the application sets of reviewer x and reviewer y can then be expressed as: $S_{car(x,y)} = \frac{|M_x \cap M_y|}{|M_x \cup M_y|}$
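The Jaccard similarity of two reviewers' application sets is straightforward to sketch. The function name `application_similarity` and the toy sets are hypothetical; returning 0.0 for two empty sets is a convention chosen for the example.

```python
from typing import Set


def application_similarity(apps_x: Set[str], apps_y: Set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    union = apps_x | apps_y
    return len(apps_x & apps_y) / len(union) if union else 0.0


# Toy application sets: two shared applications out of four distinct ones.
toy_app_sim = application_similarity({"a", "b", "c"}, {"b", "c", "d"})
```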
Step B2: Calculate the comment similarity $S_{crr(x,y)}$
The comment similarity is defined as the sum of the numbers of applications that the two reviewers have commented on for the same developer, divided by the sum of the total numbers of comments of the two reviewers. Let B be the set of developers of the applications commented on by a reviewer, let $P_{b \in B, i}$ denote the set of applications of developer b commented on by reviewer i, and let $P_i$ denote all applications that reviewer i has commented on. The comment similarity of reviewer x and reviewer y is calculated from these quantities.
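Because the formula itself is missing from this text, the sketch below follows only the verbal definition, under one interpretation: for every developer that both reviewers have commented on, count each reviewer's applications from that developer, and divide the total by the combined number of applications the two reviewers commented on. The interpretation, the name `comment_similarity`, and the input layout (developer mapped to a set of applications) are assumptions.

```python
from typing import Dict, Set


def comment_similarity(
    apps_by_dev_x: Dict[str, Set[str]],
    apps_by_dev_y: Dict[str, Set[str]],
) -> float:
    """Share of both reviewers' applications that belong to developers they have in common."""
    shared_devs = apps_by_dev_x.keys() & apps_by_dev_y.keys()
    numerator = sum(len(apps_by_dev_x[b]) + len(apps_by_dev_y[b]) for b in shared_devs)
    total = sum(len(apps) for apps in apps_by_dev_x.values()) + sum(
        len(apps) for apps in apps_by_dev_y.values()
    )
    return numerator / total if total else 0.0


# Toy data: both reviewers commented on applications of "dev1" only.
toy_x = {"dev1": {"a", "b"}, "dev2": {"c"}}
toy_y = {"dev1": {"a"}, "dev3": {"d", "e"}}
toy_comment_sim = comment_similarity(toy_x, toy_y)
```

In the toy data three of the six commented applications come from the shared developer, so the score is 0.5.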
Step B3: Calculate the scoring similarity $S_{card(x,y)}$
The root mean square deviation is used to calculate the scoring similarity of reviewer x and reviewer y. Let C denote the set of applications commented on by both reviewer x and reviewer y; the scoring similarity of reviewer x and reviewer y is calculated over C.
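A possible reading of the scoring similarity is sketched below. The text names the root mean square deviation over the commonly commented applications but does not show how it is turned into a similarity, so the sketch assumes scores on a 1 to 5 scale and maps the deviation to [0, 1] as 1 − RMSD / 4. The scale, the mapping, and the name `scoring_similarity` are assumptions.

```python
import math
from typing import Dict


def scoring_similarity(
    scores_x: Dict[str, float],
    scores_y: Dict[str, float],
    max_gap: float = 4.0,  # assumed largest possible score difference (5 - 1)
) -> float:
    """1 - RMSD / max_gap over the applications scored by both reviewers."""
    common = scores_x.keys() & scores_y.keys()  # the set C in the text
    if not common:
        return 0.0
    rmsd = math.sqrt(
        sum((scores_x[app] - scores_y[app]) ** 2 for app in common) / len(common)
    )
    return 1.0 - rmsd / max_gap


# Toy scores: the reviewers agree on app1 and disagree strongly on app2.
toy_scores_x = {"app1": 5, "app2": 5}
toy_scores_y = {"app1": 5, "app2": 1, "app3": 3}
toy_score_sim = scoring_similarity(toy_scores_x, toy_scores_y)
```

Identical scores on every common application give 1.0, and the strongest possible disagreement on every common application gives 0.0 under this mapping.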
Step B4: Calculate the similarity $SC_{(x,y)}$ of reviewer x and reviewer y from the application similarity, the comment similarity, and the scoring similarity.
Claims (5)
1. A clustering-based application market brush list collusion group detection method, characterized by comprising the following steps:
step 1: crawling a data set from an application market and filtering it by the number of comments per reviewer to obtain the reviewer set required for the final experiment; that is, selecting the reviewers whose number of comments exceeds a certain threshold as the data set;
step 2: selecting a core point in the data set as an initial set;
step 3: finding all core points in the data set according to the initial parameters, namely the reviewer suspicion score threshold η and the inter-reviewer similarity threshold ε;
step 4: starting from any core point, finding the density-reachable samples to generate a cluster, until all core points have been visited;
step 5: outputting a cluster division result, which comprises the number of clusters and the detailed information of each data item in the clusters; wherein:
the reviewer suspicion score RSS consists of three calculated parts, namely a reviewer score, a comment suspicion score, and an application suspicion score; in its calculation formula, i denotes a reviewer, j a comment, and k an application; RSS is the reviewer suspicion score; $SR_i$ is the reviewer score; $n_i$ is the number of comments of reviewer i; $SS_j$ is the comment suspicion score; $c_{ij}$ is the j-th comment of reviewer i; $m_k$ is the number of applications commented on by reviewer i; $SA_k$ is the application suspicion score; and $t_{ik}$ is the k-th application commented on by reviewer i;
the similarity between reviewers is the similarity $SC_{(x,y)}$ between two reviewers x and y; in its calculation formula, $S_{car(x,y)}$ denotes the application similarity, $S_{crr(x,y)}$ the comment similarity, and $S_{card(x,y)}$ the scoring similarity.
2. The clustering-based application market brush list collusion group detection method of claim 1, wherein the application market crawled in step 1 comprises the Apple application market.
3. The clustering-based application market brush list collusion group detection method of claim 1, wherein the data features in the data set crawled in step 1 comprise: reviewer name, comment content, comment score, the application commented on, the number of applications commented on, and the comment word count.
4. The clustering-based application market brush list collusion group detection method of claim 1, wherein the core point is determined by the reviewer suspicion score.
5. The clustering-based application market brush list collusion group detection method of claim 1, wherein the density-reachable range is defined by the ε-neighborhood: for $x_j \in D$, its ε-neighborhood contains the objects in data set D whose similarity to $x_j$ is greater than ε, i.e. $N_\varepsilon(x_j) = \{x_i \in D \mid SC_{(x_i,x_j)} > \varepsilon\}$.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910090202.2A CN109919191B (en) | 2019-01-30 | 2019-01-30 | Clustering-based application market brush list collusion group detection method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109919191A (en) | 2019-06-21 |
CN109919191B (en) | 2023-05-02 |
Family
ID=66961032
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910090202.2A Active CN109919191B (en) | 2019-01-30 | 2019-01-30 | Clustering-based application market brush list collusion group detection method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109919191B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8561184B1 (en) * | 2010-02-04 | 2013-10-15 | Adometry, Inc. | System, method and computer program product for comprehensive collusion detection and network traffic quality prediction |
CN106294105A (en) * | 2015-05-22 | 2017-01-04 | 深圳市腾讯计算机系统有限公司 | Brush amount tool detection method and apparatus |
CN106682058A (en) * | 2016-08-08 | 2017-05-17 | 腾讯科技(深圳)有限公司 | Screening method, device and system of application programs |
CN107239694A (en) * | 2017-05-27 | 2017-10-10 | 武汉大学 | A kind of Android application permissions inference method and device based on user comment |
CN107391548A (en) * | 2017-04-06 | 2017-11-24 | 华东师范大学 | A kind of Mobile solution market brush list user's group detection method and its system |
CN107808093A (en) * | 2016-09-09 | 2018-03-16 | 长沙有干货网络技术有限公司 | A kind of Android malware family clustering method of Behavior-based control |
Non-Patent Citations (2)
Title |
---|
"Release Planning of Mobile Apps Based on User Reviews"; Lorenzo Villarroel et al.; IEEE; 2016-12-31; pp. 14-24 * |
"Toward Detecting Collusive Ranking Manipulation Attackers in Mobile App Markets"; Hao Chen et al.; ACM; 2017-12-31; pp. 58-70 * |
Also Published As
Publication number | Publication date |
---|---|
CN109919191A (en) | 2019-06-21 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||