CN109919191A

CN109919191A - A clustering-based method for detecting brush list (ranking manipulation) collusion groups in application markets

Info

Publication number: CN109919191A
Application number: CN201910090202.2A
Authority: CN
Inventors: 何道敬; 潘梦函; 唐宗力
Original assignee: East China Normal University
Current assignee: East China Normal University
Priority date: 2019-01-30
Filing date: 2019-01-30
Publication date: 2019-06-21
Anticipated expiration: 2039-01-30
Also published as: CN109919191B

Abstract

The invention discloses a clustering-based method for detecting brush list collusion groups in application markets. The method comprises the following steps: crawling a data set; initializing the core point set; determining the commentator suspicion score threshold; starting from any core point, finding the density-reachable samples and generating a cluster, until all core points have been visited; and outputting the cluster partition result. The disclosed algorithm fully exploits the similarity among the members of a collusion group, obtains a better clustering effect, and solves the problem that the parameters of traditional clustering algorithms are difficult to set in application markets.

Description

A clustering-based method for detecting brush list collusion groups in application markets

Technical field

The present invention relates to a method for detecting brush list collusion groups, and specifically to a clustering-based method for detecting brush list collusion groups in application markets.

Background technique

With the rapid development of smartphones, the number of mobile applications is growing at an astonishing rate, and mobile application markets give users a convenient and effective way to download them. The higher an application ranks in an application market, the more exposure it receives and the more likely its developer is to earn large profits. This has given rise to a new marketing practice in mobile application markets: brush list, that is, artificially boosting an application's ranking. A similar practice appeared earlier in e-commerce, where, as Taobao and Tmall shops became popular, faking orders became a way for merchants to inflate their apparent popularity. Attackers use brush list to promote their mobile applications in the application market and seek greater profit. Like those who fake orders on Taobao, brush list persons mostly work in groups, which are therefore called brush list collusion groups and are managed centrally by brush list companies. Group members can imitate the behavior of normal users to evade the detection algorithms of the application market, which makes detecting brush list collusion groups and brush list persons challenging. How to detect brush list collusion groups in application markets quickly and effectively is therefore an urgent problem, and solving it is of great significance for maintaining the ecological balance of the application market and for promoting competition and innovation among application developers.

At present, the e-commerce field already has approaches for detecting comment spam collusion groups, most of which use supervised machine learning. One characteristic of these approaches is that they depend heavily on labeled data sets to train a classifier. Model training requires a large number of labeled samples, which are difficult and costly to obtain, and methods trained without enough labeled samples have proved insufficiently accurate. Meanwhile, relatively little work addresses brush list collusion group detection in the application market field. Xie Z et al. analyze the relationships between commentators, and between commentators and applications, and then build a relationship graph for detection. Chen H et al. generate candidate brush list collusion groups using frequent itemset mining (FIM) and then detect brush list collusion groups by building a commentator-application scoring model. However, such methods can only find dense brush list collusion groups, in which every group member must have commented on all target applications. The clustering-based brush list collusion group detection method proposed by the present invention therefore makes full use of the similarity among collusion group members, obtains a better clustering effect, and solves the problem that the parameters of traditional clustering algorithms are difficult to set in application markets.

Summary of the invention

The purpose of the present invention is to address the parameter-setting problem in existing application market brush list collusion group detection by proposing a clustering-based application market brush list collusion group detection method. The method uses S-DBSCAN, an algorithm improved from the original DBSCAN algorithm, in which the determination of core points no longer depends on the count MinPts but on the commentator suspicion score of the current data point. In short, core points are determined by the commentator suspicion score, and the neighborhood radius is determined by the similarity between commentators. The algorithm removes the difficulty of specifying the MinPts and Eps parameters in advance, and experiments show that S-DBSCAN obtains a better clustering effect than using DBSCAN directly.

The specific technical solution for achieving the object of the invention is:

A clustering-based method for detecting brush list collusion groups in application markets, the method comprising the following steps:

Step 1: crawl a data set from an application market and filter it by limiting the number of comments per commentator, so as to obtain the final commentator set needed for the experiment; commentators whose number of comments exceeds a certain threshold are chosen as the data set;

Step 2: first select any core point in the data set as the initial set;

Step 3: according to the initial parameters, i.e., the commentator suspicion score threshold η and the inter-commentator similarity threshold ε, find all core points in the data set;

Step 4: starting from any core point, find the samples that are density-reachable from it and generate a cluster, until all core points have been visited;

Step 5: output the cluster partition result, which includes the number of clusters and the detailed information of each data item in each cluster.

The data set in step 1 is crawled from sources including, but not limited to, the Apple application market.

The data features in the data set obtained by crawling in step 1 include, but are not limited to: commentator name, comment content, comment score, commented application, number of applications commented on, and comment word count.

In step 2, core points are determined by the commentator suspicion score. The commentator suspicion score RSS is composed of three partial scores: the commentator score, the comment suspicion score, and the application suspicion score. The calculation formula is as follows:

Wherein, i denotes a commentator, j denotes a comment, and k denotes an application; RSS denotes the commentator suspicion score; SR_i denotes the commentator score; n_i denotes the total number of comments of commentator i; SS_j denotes the comment suspicion score; c_ij denotes the j-th comment of commentator i; m_k denotes the total number of applications commented on by commentator i; SA_k denotes the application suspicion score; and t_ik denotes the k-th application commented on by commentator i.

The similarity between commentators in step 3 is the similarity SC_(x,y) between two commentators x and y, calculated as follows:

Wherein, S_car(x,y) denotes the application similarity, S_crr(x,y) denotes the comment similarity, and S_card(x,y) denotes the scoring similarity.

The density reachability in step 4 relies on the ε-neighborhood: for x_j ∈ D, the ε-neighborhood contains the objects in data set D whose similarity to x_j is greater than the threshold ε, i.e., N_ε(x_j) = { x_i ∈ D | SC_(x_i,x_j) > ε }.

The present invention applies the improved S-DBSCAN algorithm to the detection of brush list collusion groups in application markets. Starting from the observation that brush list persons within a collusion group behave similarly, it replaces the MinPts parameter with the commentator suspicion score and the Eps parameter with the similarity between two commentators. This not only solves the difficulty of parameter tuning in the traditional DBSCAN algorithm; experiments also show that the S-DBSCAN clustering algorithm obtains a better clustering effect.

Brief description of the drawings

Fig. 1 is a flow chart of the present invention.

Specific embodiment

The present invention is described in further detail below with reference to specific embodiments and the accompanying drawing. Except for what is specifically mentioned below, the procedures, conditions, and experimental methods for implementing the invention are common general knowledge in the art, and the present invention places no special restrictions on them.

The present invention comprises the following steps:

Step 1: crawl a data set from an application market and filter it by limiting the number of comments per commentator, so as to obtain the final commentator set needed for the experiment.

Step 2: the S-DBSCAN algorithm first selects any core point in the data set as the initial set, and then determines the corresponding cluster starting from it.

Step 3: according to the given initial parameters, i.e., the commentator suspicion score threshold η and the inter-commentator similarity threshold ε, find all core points.

Step 4: starting from any core point, find the samples that are density-reachable from it and generate a cluster, until all core points have been visited.

The S-DBSCAN algorithm is, without limitation, a density-based clustering method (Density-based Method). It ties the MinPts and Eps parameters of the DBSCAN algorithm to the similarity between commentators in the application market.

The S-DBSCAN algorithm replaces the MinPts parameter with the commentator suspicion score threshold η, and the Eps parameter with the inter-commentator similarity threshold ε.

The present invention replaces the MinPts parameter with the commentator suspicion score RSS. The commentator suspicion score is composed of three partial scores: the commentator score, the comment suspicion score, and the application suspicion score. The conversion formula is as follows:

The present invention replaces the Eps parameter with the similarity SC_(x,y) between two commentators x and y. The conversion formula is as follows:

Wherein, S_car(x,y) denotes the application similarity, S_crr(x,y) denotes the comment similarity, and S_card(x,y) denotes the scoring similarity.

In the S-DBSCAN algorithm, the determination of core points no longer depends on the count MinPts, but on the commentator suspicion score of the current data point.
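The clustering loop described in steps 2 to 4 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `s_dbscan`, the precomputed score dictionary `rss`, the similarity callable `sim`, and the use of `>=` for the core-point test are all assumptions made for illustration. The suspicion score threshold is `eta` and the similarity threshold is `epsilon`.

```python
from collections import deque

def s_dbscan(points, rss, sim, eta, epsilon):
    """Cluster `points` (hashable ids); unlabeled points are treated as noise.

    rss:     dict id -> commentator suspicion score (assumed precomputed)
    sim:     function (x, y) -> similarity in [0, 1] (assumed symmetric)
    eta:     suspicion score threshold, used in place of MinPts
    epsilon: similarity threshold, used in place of Eps
    """
    # Core points: suspicion score reaches eta (assumption: >=).
    core = [p for p in points if rss[p] >= eta]
    labels = {}                      # id -> cluster index
    cluster_id = 0
    for seed in core:
        if seed in labels:
            continue                 # already absorbed into an earlier cluster
        labels[seed] = cluster_id
        queue = deque([seed])        # holds core points still to be expanded
        while queue:
            p = queue.popleft()
            # Neighborhood of p: unlabeled points whose similarity exceeds epsilon.
            for q in points:
                if q in labels or q == p or sim(p, q) <= epsilon:
                    continue
                labels[q] = cluster_id
                if rss[q] >= eta:    # only core points keep expanding
                    queue.append(q)
        cluster_id += 1
    return labels
```

In this sketch a core point with no sufficiently similar neighbor forms a cluster of its own; whether such singletons should count as collusion groups is a design choice the source does not settle.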

Embodiment

The clustering-based application market brush list collusion group detection method, which is based on the improved S-DBSCAN algorithm, is described in detail below:

Step 1: crawl a data set from the application market, totaling 19,666,225 comments and 12,315,366 commentators. Filter by limiting the number of comments per commentator, selecting the commentators with more than 50 comments as the final experimental data set. This yields a commentator set of 8,853, a comment set of 818,545, and an application set of 2,188.
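The filtering in step 1 can be sketched as follows. The tuple layout `(commentator, application, rating, text)` and the function name are hypothetical and chosen only for illustration; the default of 50 follows the threshold stated in the embodiment.

```python
from collections import Counter

def filter_active_commentators(comments, min_comments=50):
    """Keep only comments written by commentators with more than
    `min_comments` comments.

    comments: iterable of (commentator, application, rating, text) tuples;
    this record layout is an assumption made for the example.
    """
    comments = list(comments)
    counts = Counter(c[0] for c in comments)            # comments per commentator
    keep = {name for name, n in counts.items() if n > min_comments}
    return [c for c in comments if c[0] in keep]
```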

Step 2: initialize the core point set.

Step 3: set the commentator suspicion score threshold η and the inter-commentator similarity threshold ε, and find all core points. η is chosen from 0.7, 0.8, 0.9 and ε is chosen from 0.7, 0.8, 0.9, giving 9 experimental results in total over all combinations.
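The nine parameter combinations can be enumerated with a short sketch; the constant and function names below are illustrative only.

```python
from itertools import product

ETA_VALUES = (0.7, 0.8, 0.9)       # commentator suspicion score thresholds
EPSILON_VALUES = (0.7, 0.8, 0.9)   # inter-commentator similarity thresholds

def parameter_grid(etas=ETA_VALUES, epsilons=EPSILON_VALUES):
    """Return every (eta, epsilon) pair to be evaluated."""
    return list(product(etas, epsilons))

grid = parameter_grid()
```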

Step 5: output the cluster partition result. The number of clusters equals the number of brush list collusion groups. The experimental results show that as the commentator suspicion score threshold and the inter-commentator similarity threshold increase, the number of brush list collusion groups decreases: when η and ε are both 0.7, the result is 177, indicating 177 brush list collusion groups; when η and ε are both 0.9, the result is 24, indicating 24 brush list collusion groups.

Step 6: the present invention clusters with the improved S-DBSCAN algorithm and evaluates clustering quality with the silhouette coefficient. The silhouette coefficient can be understood simply as how similar a node is to its own cluster compared with other clusters; its value lies in [-1, 1], and a larger value indicates a better clustering effect. In the experiments the silhouette coefficient increases gradually with ε, rising from an initial 0.283 to 0.763. When the traditional DBSCAN algorithm is run with different MinPts and Eps parameters, its silhouette coefficient reaches at most 0.453, so its clustering effect is significantly worse than that of the S-DBSCAN algorithm.
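For reference, the standard silhouette coefficient can be computed as in the sketch below. The source does not state which dissimilarity was used; one plausible choice, assumed here and not confirmed by the source, is `dist(x, y) = 1 - similarity(x, y)`. The function name and the convention of scoring singleton clusters as 0 are also assumptions.

```python
def silhouette_coefficient(labels, dist):
    """Mean silhouette value over all clustered points.

    labels: dict point -> cluster id (at least two clusters required)
    dist:   function (x, y) -> non-negative dissimilarity
    """
    clusters = {}
    for p, c in labels.items():
        clusters.setdefault(c, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(p, members):
        return sum(dist(p, q) for q in members) / len(members)

    scores = []
    for p, c in labels.items():
        own = [q for q in clusters[c] if q != p]
        if not own:
            scores.append(0.0)       # convention for singleton clusters
            continue
        a = mean_dist(p, own)        # cohesion: mean distance within own cluster
        b = min(mean_dist(p, m) for k, m in clusters.items() if k != c)  # nearest other cluster
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```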

Specifically, the MinPts parameter is converted as follows. The present invention replaces the MinPts parameter with the commentator suspicion score RSS. The calculation steps are as follows:

Step A1: determine the commentator score SR_i

The difference between the score a commentator gives an application and the average score of that application is called the rating deviation, defined as:

The ratio of the number of a commentator's positive comments to the number of all of the commentator's comments is called the commentator positive comment ratio, defined as:

The number of non-duplicate comments issued by commentator i is defined as nd_i, and the comment burst frequency is defined as:

Whether commentator i has issued duplicate comments is called the duplicate comment indicator, defined as dup_i: if commentator i has issued duplicate comments, dup_i is set to 1; otherwise it is set to 0.

Finally, the above indicators are summed to obtain the suspicion score SR_i of commentator i:

Step A2: determine the comment suspicion score SS_j

A comment that is nearly a duplicate of another comment is called a suspicious comment. The suspicion score SS_j of a comment is calculated using cosine similarity: SS_j = max_{i≠j} Cosine′(review_j, review_i), subject to there existing an application k such that (j, k), (i, k) ∈ E, where E is the set of edges in the comment graph and Cosine′ is a linearly scaled form of the Cosine function that restricts its value to [0, 1].
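A minimal sketch of this score is given below, under stated assumptions: comments are represented as word-count vectors, for which the cosine already lies in [0, 1], so no further linear scaling is applied; the edge condition is read as "comment i and comment j are on the same application"; and the function names and the `(application, text)` layout are hypothetical.

```python
import math
from collections import Counter

def cosine(text_a, text_b):
    """Cosine similarity of word-count vectors; in [0, 1] for count vectors."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

def comment_suspicion_scores(comments):
    """comments: list of (application, text). Returns one score per comment.

    Each score is the highest cosine similarity between that comment and any
    other comment on the same application, or 0.0 if there is none.
    """
    scores = []
    for j, (app_j, text_j) in enumerate(comments):
        best = 0.0
        for i, (app_i, text_i) in enumerate(comments):
            if i != j and app_i == app_j:
                best = max(best, cosine(text_j, text_i))
        scores.append(best)
    return scores
```

The double loop is quadratic in the number of comments; for a data set of the size reported in the embodiment, grouping comments by application first would be the obvious refinement.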

Step A3: determine the application suspicion score SA_k

A suspicious application obtaining very high scores within a period of time is called explosive high scoring, defined as:

Wherein, nd_k is the number of non-duplicate comments of application k, and n_k is the total number of comments of application k. The shortest high-score time interval is defined as sap_k and is calculated as follows:

The longest high-score time interval is defined as lap_k and is calculated as follows:

Wherein, d is the time interval of each comment, and np_k is the number of all positive comments of application k.

The suspicion score SA_k of application k is then calculated as:

Step A4: determine the commentator suspicion score RSS

The commentator suspicion score is determined by the commentator score SR_i, the comment suspicion score SS_j, and the application suspicion score SA_k. The calculation formula is as follows:

Specifically, the Eps parameter is converted as follows. The present invention replaces the Eps parameter with the similarity SC_(x,y) between two commentators x and y. The calculation steps are as follows:

Step B1: calculate the application similarity S_car(x,y)

The similarity of the application sets commented on by two commentators is measured using Jaccard similarity. Define M_i as the set of applications commented on by commentator i, whose n-th element is the n-th application commented on by commentator i. The similarity between the application sets commented on by commentator x and commentator y can therefore be expressed as:
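The standard Jaccard similarity of two sets is the size of their intersection divided by the size of their union. A minimal sketch applying it to two commentators' application sets follows; the function name and the value returned for two empty sets are assumptions made for illustration.

```python
def app_similarity(apps_x, apps_y):
    """Jaccard similarity |X ∩ Y| / |X ∪ Y| of two application sets."""
    x, y = set(apps_x), set(apps_y)
    union = x | y
    if not union:
        return 0.0   # convention chosen here for two empty sets
    return len(x & y) / len(union)
```

For example, two commentators who commented on {a, b, c} and {b, c, d} share two of four distinct applications, giving a similarity of 0.5.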

Step B2: calculate the comment similarity S_crr(x,y)

The comment similarity is defined as the sum of the numbers of applications from the same developer commented on by the two commentators, divided by the sum of the total numbers of comments of the two commentators. Let B be the set of developers of the applications commented on by the commentators; P_(b∈B,i) denotes the set of applications of developer b commented on by commentator i, and P_i denotes all applications commented on by commentator i. The comment similarity of commentator x and commentator y is then calculated as follows:

Step B3: calculate the scoring similarity S_card(x,y)

The scoring similarity of commentator x and commentator y is calculated using the root-mean-square deviation. Let C denote the set of applications commented on by both commentator x and commentator y; the scoring similarity of commentator x and commentator y is then calculated as follows:
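The root-mean-square deviation over the common set C can be sketched as below. How the source maps the deviation to a similarity is not recoverable from the text, so `rating_similarity` uses a purely hypothetical mapping, `1 - RMSD / max_gap`, with `max_gap = 4.0` assuming ratings on a 1 to 5 scale. All names are illustrative.

```python
import math

def rating_rmsd(ratings_x, ratings_y):
    """Root-mean-square deviation over applications rated by both commentators.

    ratings_x, ratings_y: dict application -> rating.
    Returns None if the two commentators share no application.
    """
    common = ratings_x.keys() & ratings_y.keys()   # the set C
    if not common:
        return None
    sq = sum((ratings_x[a] - ratings_y[a]) ** 2 for a in common)
    return math.sqrt(sq / len(common))

def rating_similarity(ratings_x, ratings_y, max_gap=4.0):
    """Hypothetical mapping of the deviation to [0, 1]: 1 - RMSD / max_gap."""
    rmsd = rating_rmsd(ratings_x, ratings_y)
    return 0.0 if rmsd is None else 1.0 - rmsd / max_gap
```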

Step B4: calculate the similarity SC_(x,y) of commentator x and commentator y, as follows:

Claims

1. A clustering-based method for detecting brush list collusion groups in application markets, characterized in that the method comprises the following steps:

Step 1: crawl a data set from an application market and filter it by limiting the number of comments per commentator, so as to obtain the final commentator set needed for the experiment; commentators whose number of comments exceeds a certain threshold are chosen as the data set;

Step 2: first select any core point in the data set as the initial set;

Step 3: according to the initial parameters, i.e., the commentator suspicion score threshold η and the inter-commentator similarity threshold ε, find all core points in the data set;

Step 4: starting from any core point, find the samples that are density-reachable from it and generate a cluster, until all core points have been visited;

Step 5: output the cluster partition result, which includes the number of clusters and the detailed information of each data item in each cluster.

2. The clustering-based method for detecting brush list collusion groups in application markets according to claim 1, characterized in that the data set in step 1 is crawled from sources including, but not limited to, the Apple application market.

3. The clustering-based method for detecting brush list collusion groups in application markets according to claim 1, characterized in that the data features in the data set obtained by crawling in step 1 include, but are not limited to: commentator name, comment content, comment score, commented application, number of applications commented on, and comment word count.

4. The clustering-based method for detecting brush list collusion groups in application markets according to claim 1, characterized in that in step 2 core points are determined by the commentator suspicion score.

5. The clustering-based method for detecting brush list collusion groups in application markets according to claim 1, characterized in that the commentator suspicion score RSS is composed of three partial scores: the commentator score, the comment suspicion score, and the application suspicion score; the calculation formula is as follows:

Wherein, i denotes a commentator, j denotes a comment, and k denotes an application; RSS denotes the commentator suspicion score; SR_i denotes the commentator score; n_i denotes the total number of comments of commentator i; SS_j denotes the comment suspicion score; c_ij denotes the j-th comment of commentator i; m_k denotes the total number of applications commented on by commentator i; SA_k denotes the application suspicion score; and t_ik denotes the k-th application commented on by commentator i.

6. The clustering-based method for detecting brush list collusion groups in application markets according to claim 1, characterized in that the similarity between commentators in step 3 is the similarity SC_(x,y) between two commentators x and y, calculated as follows:

Wherein, S_car(x,y) denotes the application similarity, S_crr(x,y) denotes the comment similarity, and S_card(x,y) denotes the scoring similarity.

7. The clustering-based method for detecting brush list collusion groups in application markets according to claim 1, characterized in that the density reachability in step 4 relies on the ε-neighborhood: for x_j ∈ D, the ε-neighborhood contains the objects in data set D whose similarity to x_j is greater than the threshold ε, i.e., N_ε(x_j) = { x_i ∈ D | SC_(x_i,x_j) > ε }.