CN109919191B - Clustering-based application market brush list collusion group detection method - Google Patents

Clustering-based application market brush list collusion group detection method

Info

Publication number
CN109919191B
Authority
CN
China
Prior art keywords
comment
reviewer
similarity
application market
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910090202.2A
Other languages
Chinese (zh)
Other versions
CN109919191A (en)
Inventor
何道敬
潘梦函
唐宗力
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Normal University
Priority to CN201910090202.2A
Publication of CN109919191A
Application granted
Publication of CN109919191B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a clustering-based application market brush list collusion group detection method, which comprises the following steps: crawling a data set; initializing a core point set; determining a reviewer suspicion score threshold; starting from any core point, finding the density-reachable samples to generate a cluster, until all core points have been visited; and outputting the cluster division result. In the disclosed method, the algorithm fully reflects the similarity among the members of a collusion group, obtains a better clustering effect, and solves the problem that the parameters of traditional clustering algorithms are difficult to set in the application market.

Description

Clustering-based application market brush list collusion group detection method
Technical Field
The invention relates to a collusion group detection method, in particular to a clustering-based application market collusion group detection method.
Background
With the rapid development of smartphones, the number of mobile applications has grown at a striking rate, and mobile application markets give users a convenient and efficient way to download them. An application that ranks higher in the application market gets more exposure, and its developer is more likely to earn large profits. A new marketing practice has therefore emerged in the mobile application market: the application "brush list", that is, manipulation of application rankings. It parallels the shop brush-order practice that, with the boom of Taobao and Tmall, became a way for merchants to fake popularity. Attackers use brush-list services to promote their own mobile applications in the application market for greater profit. Like Taobao brushers, brush-list workers operate in groups or teams that are uniformly managed by brush-list companies, and such a group is therefore called a brush list collusion group. The members of a group can imitate the behavior of normal users to evade the detection algorithms of the application market, which makes detecting collusion groups and colluders challenging. Quickly and effectively detecting collusion groups in the application market is therefore an urgent need, and it matters for keeping the application market ecosystem balanced and for promoting competition and innovation among application developers.
Currently, methods exist in the e-commerce field for detecting collusion groups that post spam comments. Most of them adopt supervised machine learning, which depends heavily on a labeled data set to train a classifier. Model training requires a large number of labeled samples, which are difficult and costly to obtain, and a classifier trained without enough labeled samples has proven inaccurate. Meanwhile, collusion group detection has received relatively little attention in the application market field. Xie Z. et al. detect collusion groups by analyzing the relationships among reviewers and between reviewers and applications and then building a relationship graph; Chen H. et al. generate candidate collusion groups using Frequent Itemset Mining (FIM) and then detect collusion groups by modeling the scores reviewers give to applications. Such approaches, however, can only find dense collusion groups in which every group member must have commented on all target applications. The clustering-based brush list collusion group detection method provided by the invention instead makes full use of the similarity among the members of a collusion group, obtains a better clustering effect, and solves the problem that the parameters of traditional clustering algorithms are difficult to set in the application market.
Disclosure of Invention
Aiming at the problem of parameter setting in existing application market brush list collusion group detection, the invention provides a clustering-based application market brush list collusion group detection method. Briefly, core points are determined by reviewer suspicion scores, and the neighborhood radius is determined by the similarity between reviewers. The resulting algorithm, S-DBSCAN, removes the need to specify the MinPts and Eps parameters in advance, which is difficult, and also obtains a better clustering effect than using the DBSCAN algorithm directly.
The specific technical scheme for realizing the aim of the invention is as follows:
A clustering-based application market brush list collusion group detection method comprises the following steps:
Step 1: crawl a data set from an application market and filter it by the number of comments per reviewer to obtain the reviewer set required for the final experiment; that is, select the reviewers whose number of comments exceeds a certain threshold as the data set;
Step 2: first select one core point in the data set as the initial set;
Step 3: find all core points in the data set according to the initial parameters, namely the reviewer suspicion score threshold η and the threshold ε on the similarity between reviewers;
Step 4: taking any core point as the starting point, find the density-reachable samples to generate a cluster, until all core points have been visited;
Step 5: output the cluster division result, which comprises the number of clusters and the detailed information of each data item in each cluster.
The application markets crawled in step 1 include, but are not limited to, the Apple application market.
The data features in the data set crawled in step 1 include, but are not limited to: reviewer name, comment content, comment score, application commented on, number of applications commented on, and number of words in the comment.
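The filtering in step 1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the record field names (`reviewer`, `app`, `score`, `content`) and the function name `filter_reviewers` are hypothetical, and the crawling itself is left out.

```python
from collections import Counter


def filter_reviewers(reviews, min_reviews):
    """Keep only the comments of reviewers with more than `min_reviews` comments."""
    counts = Counter(r["reviewer"] for r in reviews)
    kept = {name for name, n in counts.items() if n > min_reviews}
    return [r for r in reviews if r["reviewer"] in kept]


# Toy records using a subset of the listed features (field names are hypothetical).
reviews = [
    {"reviewer": "a", "app": "app1", "score": 5, "content": "great"},
    {"reviewer": "a", "app": "app2", "score": 5, "content": "great"},
    {"reviewer": "a", "app": "app3", "score": 4, "content": "nice"},
    {"reviewer": "b", "app": "app1", "score": 2, "content": "crashes"},
]
filtered = filter_reviewers(reviews, min_reviews=2)
```

With `min_reviews=2`, only the three comments of reviewer `a` remain; the threshold value itself is a free parameter of the method.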
The core points in step 2 are determined by reviewer suspicion scores. The reviewer suspicion score RSS consists of three calculated parts: a reviewer score, a comment suspicion score, and an application suspicion score. The calculation formula is as follows:
Figure BDA0001963024820000021
where i represents a reviewer, j represents a comment, and k represents an application; RSS represents the reviewer suspicion score; SR_i represents the reviewer score; n_i represents the number of comments of reviewer i; SS_j represents the comment suspicion score; c_ij represents the j-th comment of reviewer i; m_k represents the number of applications commented on by reviewer i; SA_k represents the application suspicion score; and t_ik represents the k-th application commented on by reviewer i.
The similarity between reviewers in step 3, namely the similarity SC(x,y) between two reviewers x and y, is calculated as follows:
Figure BDA0001963024820000022
where S_car(x,y) represents the application similarity, S_crr(x,y) represents the comment similarity, and S_card(x,y) represents the scoring similarity.
The density-reachable range in step 4 is defined as the ε-neighborhood: for x_j ∈ D, its ε-neighborhood contains the objects in data set D whose similarity to x_j is greater than the threshold ε, i.e.
Figure BDA0001963024820000023
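A minimal sketch of the ε-neighborhood, assuming the pairwise reviewer similarities are already available as a symmetric matrix. The function name `epsilon_neighborhood` is hypothetical, and excluding x_j from its own neighborhood is an assumption of this sketch rather than something the text specifies.

```python
def epsilon_neighborhood(j, similarity, eps):
    """Return the indices i != j whose similarity to point j is greater than eps."""
    return [i for i in range(len(similarity)) if i != j and similarity[i][j] > eps]


# Toy symmetric similarity matrix for three reviewers.
similarity = [
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.75],
    [0.2, 0.75, 1.0],
]
neighbors_of_1 = epsilon_neighborhood(1, similarity, eps=0.7)
```

Unlike the distance radius Eps in DBSCAN, the condition here is a lower bound on similarity, so raising ε shrinks the neighborhood.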
The improved S-DBSCAN algorithm is applied to collusion group detection in the application market. Starting from the observation that the colluders within a collusion group behave similarly, it replaces the MinPts parameter with the reviewer suspicion score and the Eps parameter with the similarity between two reviewers. This solves the difficulty of parameter setting in the traditional DBSCAN algorithm, and experiments also show that the S-DBSCAN clustering algorithm obtains a better clustering effect.
Drawings
Fig. 1 is a flow chart of the present invention.
Detailed Description
The invention is described in further detail below with reference to specific embodiments and the accompanying drawing. Except for the content specifically mentioned below, the procedures, conditions, experimental methods, and the like for carrying out the invention are general knowledge and common knowledge in the art, and the invention is not particularly limited in these respects.
The invention comprises the following steps:
Step 1: a data set is crawled from the application market and filtered by the number of comments per reviewer to obtain the reviewer set needed for the final experiment.
Step 2: the S-DBSCAN algorithm first selects one core point in the data set as the initial set and then determines the corresponding cluster starting from that set.
Step 3: all core points are found according to the given initial parameters, namely the reviewer suspicion score threshold η and the similarity threshold ε between reviewers.
Step 4: taking any core point as the starting point, the density-reachable samples are found to generate a cluster, until all core points have been visited.
Step 5: the cluster division result is output, comprising the number of clusters and the detailed information of each data item in each cluster.
The S-DBSCAN algorithm belongs to, but is not limited to, the density-based clustering algorithms (density-based methods). It converts the MinPts and Eps parameters of the DBSCAN algorithm into similarity relations between reviewers in the application market.
The S-DBSCAN algorithm replaces the MinPts parameter with the reviewer suspicion score threshold η and replaces the Eps parameter with the similarity threshold ε between reviewers.
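The five steps can be sketched as a DBSCAN-style expansion in which the core-point test and the neighborhood test are swapped as described. This is an illustrative sketch under stated assumptions, not the patented implementation: the function name `s_dbscan` is hypothetical, a reviewer is assumed to be a core point when its suspicion score is at least η, a neighbor is any reviewer with similarity greater than ε, and only core points continue the expansion.

```python
from collections import deque


def s_dbscan(rss, similarity, eta, eps):
    """Cluster reviewers; returns one label per reviewer (-1 = not in any cluster)."""
    n = len(rss)
    # Assumption: a core point is a reviewer whose suspicion score reaches eta.
    is_core = [score >= eta for score in rss]
    labels = [-1] * n
    cluster_id = 0
    for start in range(n):
        if not is_core[start] or labels[start] != -1:
            continue  # start a cluster only from an unvisited core point
        labels[start] = cluster_id
        queue = deque([start])
        while queue:
            p = queue.popleft()
            for q in range(n):
                if q == p or labels[q] != -1 or similarity[p][q] <= eps:
                    continue
                labels[q] = cluster_id
                if is_core[q]:
                    queue.append(q)  # only core points keep expanding the cluster
        cluster_id += 1
    return labels


# Toy input: suspicion scores and a symmetric similarity matrix for five reviewers.
rss = [0.9, 0.8, 0.3, 0.95, 0.1]
similarity = [
    [1.0, 0.85, 0.1, 0.2, 0.0],
    [0.85, 1.0, 0.8, 0.1, 0.0],
    [0.1, 0.8, 1.0, 0.3, 0.0],
    [0.2, 0.1, 0.3, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
]
labels = s_dbscan(rss, similarity, eta=0.7, eps=0.7)
num_clusters = len({label for label in labels if label >= 0})
```

In this toy run reviewers 0, 1, and 2 form one cluster, reviewer 3 is an isolated core point that forms its own cluster, and reviewer 4 is left unclustered. Whether an isolated core point should count as a collusion group is a design choice the sketch does not settle.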
The invention uses the reviewer suspicion score RSS instead of the MinPts parameter. The reviewer suspicion score consists of three calculated parts: a reviewer score, a comment suspicion score, and an application suspicion score. The conversion formula is as follows:
Figure BDA0001963024820000031
where i represents a reviewer, j represents a comment, and k represents an application; RSS represents the reviewer suspicion score; SR_i represents the reviewer score; n_i represents the number of all comments of reviewer i; SS_j represents the comment suspicion score; c_ij represents the j-th comment of reviewer i; m_k represents the number of applications commented on by reviewer i; SA_k represents the application suspicion score; and t_ik represents the k-th application commented on by reviewer i.
The invention uses the similarity SC(x,y) between two reviewers x and y instead of the Eps parameter. The conversion formula is as follows:
Figure BDA0001963024820000041
where S_car(x,y) represents the application similarity, S_crr(x,y) represents the comment similarity, and S_card(x,y) represents the scoring similarity.
In the S-DBSCAN algorithm, core points are no longer determined by the MinPts count but by the reviewer suspicion score of the current data point.
Examples
Taking the improved S-DBSCAN algorithm as an example, the clustering-based application market brush list collusion group detection method of the invention is described specifically as follows:
Step 1: a data set of 19666225 comments and 12315366 reviewers in total was crawled from the application market. It was filtered by number of comments, and the reviewers with more than 50 comments were selected as the final experimental data set, yielding 8853 reviewers, 818545 comments, and 2188 applications.
Step 2: a set of core points is initialized.
Step 3: the reviewer suspicion score threshold η and the similarity threshold ε between reviewers are set, and all core points are found. Here η is chosen from 0.7, 0.8, 0.9 and ε is chosen from 0.7, 0.8, 0.9, giving 9 combinations and hence 9 experimental results in total.
Step 4: taking any core point as the starting point, the density-reachable samples are found to generate a cluster, until all core points have been visited.
Step 5: the cluster division result is output. The number of clusters equals the number of collusion groups. According to the experimental results, as the reviewer suspicion score threshold and the similarity threshold between reviewers increase, the number of collusion groups decreases: when η and ε are both 0.7, the experimental result is 177, indicating 177 collusion groups; when η and ε are both 0.9, the experimental result is 24, indicating 24 collusion groups.
Step 6: the invention clusters with the improved S-DBSCAN algorithm and evaluates the clustering effect with the silhouette coefficient. The silhouette coefficient can be understood simply as how similar a node is to its own cluster compared with other clusters; its value lies in [-1, 1], and a larger value indicates a better clustering effect. In the experiment, the silhouette coefficient increases gradually with ε, from an initial 0.283 to 0.763. With the traditional DBSCAN algorithm, clustering under different MinPts and Eps parameters yields a silhouette coefficient of at most 0.453, so its clustering effect is clearly worse than that of the S-DBSCAN algorithm.
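The evaluation in step 6 can be illustrated with a plain implementation of the standard silhouette coefficient. The sketch assumes that the distance between two reviewers is taken as 1 minus their similarity, which the text does not state, and the function name `silhouette` is hypothetical. Points in singleton clusters receive a score of 0, following the usual convention.

```python
def silhouette(distance, labels):
    """Mean silhouette coefficient for a distance matrix and cluster labels."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(distance[i][j] for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(distance[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy similarities: two tight pairs that are dissimilar to each other.
similarity = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
# Assumption: distance = 1 - similarity.
distance = [[1.0 - s for s in row] for row in similarity]
score = silhouette(distance, [0, 0, 1, 1])
```

To reproduce a sweep like the 9 combinations in step 3, this function would be called once per (η, ε) pair on the clustered reviewers only, after dropping any reviewers that were not assigned to a cluster.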
Specifically, the MinPts parameter is converted as follows. The invention uses the reviewer suspicion score RSS instead of the MinPts parameter. The calculation steps are as follows:
Step A1: determine the reviewer score SR_i
The difference between the score a reviewer gives an application and the average score of that application is called the score deviation, defined as:
Figure BDA0001963024820000051
Figure BDA0001963024820000052
The ratio of the number of a reviewer's positive comments to the number of all of that reviewer's comments is called the reviewer's positive comment ratio, defined as:
Figure BDA0001963024820000053
The number of non-duplicate comments posted by reviewer i is denoted nd_i, and the comment burst frequency is defined as:
Figure BDA0001963024820000054
Whether reviewer i has posted duplicate comments is recorded by the duplicate-comment indicator dup_i: if reviewer i has posted duplicate comments, dup_i is set to 1; otherwise it is set to 0.
Finally, all of the above indicators are added up to obtain the score SR_i of reviewer i:
Figure BDA0001963024820000055
Step A2: determine the comment suspicion score SS_j
A near-duplicate comment is called a suspicious comment. The suspicion score SS_j of a comment is calculated using cosine similarity: SS_j = max_{i≠j} Cosine′(review_j, review_i), subject to the existence of an application k such that (j, k), (i, k) ∈ E, where E is the edge set of the comment graph and Cosine′ is a linearly scaled version of the cosine function that limits its value to [0, 1].
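A small sketch of step A2 under explicit assumptions: comments are compared as bag-of-words count vectors, for which the cosine is already within [0, 1] (so no extra scaling is applied here), and the edge condition is read as "comment i and comment j are attached to the same application k". The function names `cosine` and `comment_suspicion_scores` are hypothetical.

```python
import math
from collections import Counter


def cosine(text_a, text_b):
    """Cosine similarity of word-count vectors; lies in [0, 1] for counts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def comment_suspicion_scores(comments):
    """comments: list of (app, text) pairs. Returns SS_j for each comment j."""
    scores = []
    for j, (app_j, text_j) in enumerate(comments):
        # Compare only against other comments posted on the same application.
        others = [
            cosine(text_j, text_i)
            for i, (app_i, text_i) in enumerate(comments)
            if i != j and app_i == app_j
        ]
        scores.append(max(others, default=0.0))
    return scores


comments = [
    ("app1", "great app love it"),
    ("app1", "great app love it"),
    ("app1", "keeps crashing on launch"),
    ("app2", "great app love it"),
]
ss = comment_suspicion_scores(comments)
```

The two identical comments on `app1` score 1.0, while the identical text on `app2` scores 0.0 because no other comment shares its application.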
Step A3: determine the application suspicion score SA_k
The score that a suspicious application obtains within a period of time is called the explosive high score, defined as:
Figure BDA0001963024820000056
where nd_k is the number of all non-duplicate comments of application k and n_k is the number of all comments of application k. The shortest top-scoring time interval is denoted sap_k and is calculated as follows:
Figure BDA0001963024820000057
Figure BDA0001963024820000058
The longest top-scoring time interval is denoted lap_k and is calculated as follows:
Figure BDA0001963024820000059
where d is the time interval between comments and np_k is the number of all positive comments of application k.
The suspicion score SA_k of application k is therefore calculated as follows:
Figure BDA00019630248200000510
Step A4: determine the reviewer suspicion score RSS
The reviewer suspicion score is determined by the reviewer score SR_i, the comment suspicion score SS_j, and the application suspicion score SA_k. The calculation formula is as follows:
Figure BDA0001963024820000061
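The formula image for RSS is not reproduced in this text, so the following is only a sketch of one plausible combination. It assumes that the comment suspicion scores are averaged over the reviewer's n_i comments, that the application suspicion scores are averaged over the applications the reviewer commented on, and that the three parts are averaged with equal weight so the result stays in [0, 1] when each part does. The equal weighting and the function name `reviewer_suspicion_score` are assumptions, not the patent's formula.

```python
def reviewer_suspicion_score(sr_i, comment_scores, app_scores):
    """Combine SR_i, the SS_j of a reviewer's comments, and the SA_k of their apps.

    Assumption: equal-weight average of the three parts.
    """
    mean_ss = sum(comment_scores) / len(comment_scores)  # averaged over n_i comments
    mean_sa = sum(app_scores) / len(app_scores)  # averaged over commented apps
    return (sr_i + mean_ss + mean_sa) / 3


rss_i = reviewer_suspicion_score(
    sr_i=0.9,
    comment_scores=[1.0, 0.8],
    app_scores=[0.6, 0.8, 0.7],
)
```

A reviewer would then be treated as a core point by comparing `rss_i` with the threshold η.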
Specifically, the Eps parameter is converted as follows. The invention uses the similarity SC(x,y) between two reviewers x and y instead of the Eps parameter. The calculation steps are as follows:
Step B1: calculate the application similarity S_car(x,y)
Jaccard similarity is used to measure the similarity of the application sets reviewed by two reviewers. Define
Figure BDA0001963024820000062
where M_i is the set of applications reviewed by reviewer i, and
Figure BDA0001963024820000063
is the n-th application reviewed by reviewer i. The similarity between the application sets reviewed by reviewer x and reviewer y can thus be expressed as:
Figure BDA0001963024820000064
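Jaccard similarity is the size of the intersection of two sets divided by the size of their union. A minimal sketch for two reviewers' application sets follows; the function name `application_similarity` is hypothetical, and returning 0.0 for two empty sets is a convention chosen here.

```python
def application_similarity(apps_x, apps_y):
    """Jaccard similarity |M_x ∩ M_y| / |M_x ∪ M_y| of two application sets."""
    union = apps_x | apps_y
    return len(apps_x & apps_y) / len(union) if union else 0.0


s_car = application_similarity({"a", "b", "c"}, {"b", "c", "d"})
```

Here the reviewers share two of four distinct applications, so the similarity is 0.5.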
Step B2: calculate the comment similarity S_crr(x,y)
The comment similarity is defined as the sum of the numbers of applications from the same developer that the two reviewers commented on, divided by the sum of the numbers of all comments of each reviewer. Let B be the set of developers of the applications commented on by a reviewer, let P_{b∈B,i} denote the set of applications of developer b commented on by reviewer i, and let P_i denote all applications commented on by reviewer i. The comment similarity of reviewer x and reviewer y is then calculated as follows:
Figure BDA0001963024820000065
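Since the formula image is not reproduced here, the sketch below follows only the verbal definition. It assumes that B is restricted to developers whose applications both reviewers commented on, that each reviewer comments on an application once (so application counts stand in for comment counts), and that the mapping `developer_of` from application to developer is available. All names are hypothetical.

```python
def comment_similarity(apps_x, apps_y, developer_of):
    """Share of two reviewers' commented apps that come from developers both reviewed."""

    def by_developer(apps):
        groups = {}
        for app in apps:
            groups.setdefault(developer_of[app], set()).add(app)
        return groups

    px, py = by_developer(apps_x), by_developer(apps_y)
    shared = px.keys() & py.keys()  # developers reviewed by both x and y
    numerator = sum(len(px[b]) + len(py[b]) for b in shared)
    denominator = len(apps_x) + len(apps_y)
    return numerator / denominator if denominator else 0.0


developer_of = {"a1": "devA", "a2": "devA", "b1": "devB", "c1": "devC"}
s_crr = comment_similarity({"a1", "b1"}, {"a2", "c1"}, developer_of)
```

The two reviewers commented on different applications, but one application each comes from `devA`, so 2 of their 4 commented applications count toward the similarity.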
Step B3: calculate the scoring similarity S_card(x,y)
The root mean square deviation is used to calculate the scoring similarity of reviewer x and reviewer y. Let C denote the set of applications commented on by both reviewer x and reviewer y; the scoring similarity of reviewer x and reviewer y is then calculated as follows:
Figure BDA0001963024820000066
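The exact formula is not reproduced in this text. The sketch below computes the root mean square deviation over the commonly commented set C and, as an assumption, turns it into a similarity by dividing by the rating range (4 for a 1 to 5 star scale) and subtracting from 1. The normalization, the default `rating_range`, and the function name `scoring_similarity` are not from the source.

```python
import math


def scoring_similarity(ratings_x, ratings_y, rating_range=4.0):
    """1 - RMSD / rating_range over the apps both reviewers rated (assumed form)."""
    common = ratings_x.keys() & ratings_y.keys()  # the set C
    if not common:
        return 0.0  # no commonly rated application
    rmsd = math.sqrt(
        sum((ratings_x[a] - ratings_y[a]) ** 2 for a in common) / len(common)
    )
    return 1.0 - rmsd / rating_range


ratings_x = {"a": 5, "b": 4, "c": 1}
ratings_y = {"a": 5, "b": 2, "d": 3}
s_card = scoring_similarity(ratings_x, ratings_y)
```

Only applications `a` and `b` are shared, with score differences 0 and 2, so the deviation is the square root of 2 and the similarity is about 0.646.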
Step B4: calculate the similarity SC(x,y) of reviewer x and reviewer y with the following formula:
Figure BDA0001963024820000067

Claims (5)

1. A clustering-based application market brush list collusion group detection method, characterized by comprising the following steps:
Step 1: crawling a data set from an application market, and filtering it by the number of comments per reviewer to obtain the reviewer set required for the final experiment; that is, selecting the reviewers whose number of comments exceeds a certain threshold as the data set;
Step 2: first selecting one core point in the data set as the initial set;
Step 3: finding all core points in the data set according to the initial parameters of the current core point, namely the reviewer suspicion score threshold η and the threshold ε on the similarity between reviewers;
Step 4: taking any core point as the starting point, finding the density-reachable samples to generate a cluster, until all core points have been visited;
Step 5: outputting the cluster division result, which comprises the number of clusters and the detailed information of each data item in each cluster; wherein:
the reviewer suspicion score RSS consists of three calculated parts: a reviewer score, a comment suspicion score, and an application suspicion score; the calculation formula is as follows:
Figure FDA0003919636420000011
wherein i represents a reviewer, j represents a comment, and k represents an application; RSS represents the reviewer suspicion score; SR_i represents the reviewer score; n_i represents the number of comments of reviewer i; SS_j represents the comment suspicion score; c_ij represents the j-th comment of reviewer i; m_k represents the number of applications commented on by reviewer i; SA_k represents the application suspicion score; and t_ik represents the k-th application commented on by reviewer i;
the similarity between reviewers, namely the similarity SC(x,y) between two reviewers x and y, is calculated as follows:
Figure FDA0003919636420000012
wherein S_car(x,y) represents the application similarity, S_crr(x,y) represents the comment similarity, and S_card(x,y) represents the scoring similarity.
2. The clustering-based application market brush list collusion group detection method of claim 1, wherein the application market crawled in step 1 comprises the Apple application market.
3. The clustering-based application market brush list collusion group detection method of claim 1, wherein the data features in the data set crawled in step 1 comprise: reviewer name, comment content, comment score, application commented on, number of applications commented on, and number of words in the comment.
4. The clustering-based application market brush list collusion group detection method of claim 1, wherein the core point is determined by a reviewer suspicion score.
5. The clustering-based application market brush list collusion group detection method of claim 1, wherein the density-reachable range is defined as the ε-neighborhood: for x_j ∈ D, its ε-neighborhood contains the objects in data set D whose similarity to x_j is greater than the threshold ε, i.e.
Figure FDA0003919636420000021
CN201910090202.2A 2019-01-30 2019-01-30 Clustering-based application market brush list collusion group detection method Active CN109919191B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910090202.2A CN109919191B (en) 2019-01-30 2019-01-30 Clustering-based application market brush list collusion group detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910090202.2A CN109919191B (en) 2019-01-30 2019-01-30 Clustering-based application market brush list collusion group detection method

Publications (2)

Publication Number Publication Date
CN109919191A CN109919191A (en) 2019-06-21
CN109919191B true CN109919191B (en) 2023-05-02

Family

ID=66961032

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910090202.2A Active CN109919191B (en) 2019-01-30 2019-01-30 Clustering-based application market brush list collusion group detection method

Country Status (1)

Country Link
CN (1) CN109919191B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8561184B1 (en) * 2010-02-04 2013-10-15 Adometry, Inc. System, method and computer program product for comprehensive collusion detection and network traffic quality prediction
CN106294105A (en) * 2015-05-22 2017-01-04 深圳市腾讯计算机系统有限公司 Brush amount tool detection method and apparatus
CN106682058A (en) * 2016-08-08 2017-05-17 腾讯科技(深圳)有限公司 Screening method, device and system of application programs
CN107239694A (en) * 2017-05-27 2017-10-10 武汉大学 A kind of Android application permissions inference method and device based on user comment
CN107391548A (en) * 2017-04-06 2017-11-24 华东师范大学 A kind of Mobile solution market brush list user's group detection method and its system
CN107808093A (en) * 2016-09-09 2018-03-16 长沙有干货网络技术有限公司 A kind of Android malware family clustering method of Behavior-based control

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Release Planning of Mobile Apps Based on User Reviews";Lorenzo Villarroel etal.;《IEEE》;20161231;第14-24页 *
"Toward Detecting Collusive Ranking Manipulation Attackers in Mobile App Markets";Hao Chen etal.;《ACM》;20171231;第58-70页 *

Also Published As

Publication number Publication date
CN109919191A (en) 2019-06-21

Similar Documents

Publication Publication Date Title
Zhang et al. Event detection and popularity prediction in microblogging
Grant et al. Distance-based measures of inconsistency
CN104391942A (en) Short text characteristic expanding method based on semantic atlas
CN111159395A (en) Chart neural network-based rumor standpoint detection method and device and electronic equipment
CN105005589A (en) Text classification method and text classification device
Gui et al. A community discovery algorithm based on boundary nodes and label propagation
Cao et al. Data mining for business applications
JP2009093650A (en) Selection of tag for document by paragraph analysis of document
CN107423820B (en) Knowledge graph representation learning method combined with entity hierarchy categories
CN106202032A (en) A kind of sentiment analysis method towards microblogging short text and system thereof
CN110347897B (en) Microblog network emotion community identification method based on event detection
CN105740404A (en) Label association method and device
CN111191099B (en) User activity type identification method based on social media
Ruan et al. GADM: Manual fake review detection for O2O commercial platforms
Tian et al. An improved method for functional similarity analysis of genes based on gene ontology
CN112463976A (en) Knowledge graph construction method taking crowd sensing task as center
CN110489565A (en) Based on the object root type design method and system in domain knowledge map ontology
Xin et al. An overlapping semantic community detection algorithm base on the ARTs multiple sampling models
CN113422761A (en) Malicious social user detection method based on counterstudy
CN110826315B (en) Method for identifying timeliness of short text by using neural network system
CN108509588B (en) Lawyer evaluation method and recommendation method based on big data
CN109919191B (en) Clustering-based application market brush list collusion group detection method
CN109492924B (en) Influence evaluation method based on second order of self and behavior value of microblog user
Babko-Malaya et al. Characterizing communities of practice in emerging science and technology fields
CN115438274A (en) False news identification method based on heterogeneous graph convolutional network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant