CN109919191A - A clustering-based application market brush list collusion group detection method - Google Patents

A clustering-based application market brush list collusion group detection method

Info

Publication number
CN109919191A
Authority
CN
China
Prior art keywords
commentator
cluster
comment
application market
detection method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910090202.2A
Other languages
Chinese (zh)
Other versions
CN109919191B (en)
Inventor
何道敬
潘梦函
唐宗力
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Normal University
Priority to CN201910090202.2A
Publication of CN109919191A
Application granted
Publication of CN109919191B
Legal status: Active

Abstract

The invention discloses a clustering-based method for detecting brush list (ranking manipulation) collusion groups in application markets. The method comprises the following steps: crawling a data set; initializing the core point set; determining the commentator suspicion score threshold; starting from any core point, finding the density-reachable samples and generating a cluster, until all core points have been visited; and outputting the cluster partition result. The disclosed algorithm fully exploits the similarity among the members of a collusion group, obtains a better clustering effect, and solves the problem that the parameters of traditional clustering algorithms are difficult to set in application markets.

Description

A clustering-based application market brush list collusion group detection method
Technical field
The present invention relates to a method for detecting brush list collusion groups, and specifically to a clustering-based method for detecting brush list collusion groups in application markets.
Background art
With the rapid development of smartphones, the number of mobile applications is growing at an astonishing rate, and mobile application markets give users a convenient and effective way to download them. The higher an application ranks in an application market, the more exposure it receives and the more likely its developer is to earn a large profit. This has given rise to a new marketing practice in mobile application markets: application brush list, that is, artificially inflating an application's ranking. The practice has a precedent in e-commerce, where, with the popularity of Taobao and Tmall, placing fake orders for shops became a way for merchants to feign popularity. Attackers use brush list techniques to promote their mobile applications in the application market and seek greater profit. Like Taobao fake-order workers, brush list workers mostly operate in groups, known as brush list collusion groups, which are managed centrally by brush list companies. Group members imitate the behavior of normal users to evade the detection algorithms of the application market, which makes it challenging to detect brush list collusion groups and individual brush list workers. How to detect brush list collusion groups in application markets quickly and effectively is therefore an urgent problem, and solving it is of great significance for maintaining the ecological balance of the application market and for promoting competition and innovation among application developers.
At present, the e-commerce field has methods for detecting comment spam collusion groups, most of which use supervised machine learning. One characteristic of these methods is their heavy dependence on labeled data sets for training classifiers. Model training requires a large number of labeled samples, which are difficult and costly to obtain, and methods trained without enough labeled samples have proved insufficiently accurate. Meanwhile, there has been relatively little work on detecting brush list collusion groups in application markets. Xie Z et al. analyze the relationships between commentators and between commentators and applications, and then build a relationship graph for detection. Chen H et al. generate candidate brush list collusion groups using frequent itemset mining (FIM) and then detect collusion groups by building a commentator-application scoring model. However, such methods can only find dense brush list collusion groups in which every group member must have commented on all target applications. The clustering-based brush list collusion group detection method proposed by the present invention therefore makes full use of the similarity among collusion group members, obtains a better clustering effect, and solves the problem that the parameters of traditional clustering algorithms are difficult to set in application markets.
Summary of the invention
The object of the present invention is to address the parameter setting problem in existing application market brush list collusion group detection by proposing a clustering-based detection method. The method uses the S-DBSCAN algorithm, an improvement on the original DBSCAN algorithm in which the determination of a core point no longer depends on the MinPts count but on the commentator suspicion score of the current data point. In short, core points are determined by the commentator suspicion score, and the neighborhood radius is determined by the similarity between commentators. The algorithm removes the difficulty of specifying the MinPts and Eps parameters in advance, and experiments show that S-DBSCAN obtains a better clustering effect than using DBSCAN directly.
The specific technical solution for achieving the object of the invention is as follows:
A clustering-based application market brush list collusion group detection method, comprising the following steps:
Step 1: crawl a data set from the application market and filter it by the number of comments per commentator to obtain the final set of commentators needed for the experiment; commentators whose number of comments exceeds a certain threshold are selected as the data set;
Step 2: first select any core point in the data set as the initial set;
Step 3: according to the initial parameters, namely the commentator suspicion score threshold η and the inter-commentator similarity threshold ε, find all core points in the data set;
Step 4: starting from any core point, find the samples that are density-reachable from it and generate a cluster, until all core points have been visited;
Step 5: output the cluster partition result, which includes the number of clusters and the details of each data point in each cluster.
In step 1, the data set is crawled from application markets including, but not limited to, the Apple application market.
In step 1, the data features in the crawled data set include, but are not limited to: commentator name, comment content, comment rating, commented application, number of commented applications, and comment word count.
In step 2, core points are determined by the commentator suspicion score. The commentator suspicion score RSS is composed of three component scores: the commentator score, the comment suspicion score, and the application suspicion score. The calculation formula is as follows:
where i denotes a commentator, j a comment, and k an application; RSS denotes the commentator suspicion score; SR_i denotes the commentator score; n_i denotes the total number of comments by commentator i; SS_j denotes the comment suspicion score; c_ij denotes the j-th comment of commentator i; m_k denotes the total number of applications commented on by commentator i; SA_k denotes the application suspicion score; and t_ik denotes the k-th application commented on by commentator i.
The similarity between commentators in step 3 is the similarity SC(x,y) between two commentators x and y. The calculation formula is as follows:
where Scar(x,y) denotes the application similarity, Scrr(x,y) the comment similarity, and Scard(x,y) the rating similarity.
In step 4, density reachability requires defining the ε-neighborhood: for x_j ∈ D, the ε-neighborhood contains the objects in data set D whose similarity to x_j is greater than the threshold ε, i.e. $N_\varepsilon(x_j) = \{x_i \in D \mid SC_{(x_i,x_j)} > \varepsilon\}$.
The present invention applies the improved S-DBSCAN algorithm to the detection of brush list collusion groups in application markets. Starting from the observation that brush list workers within a collusion group behave similarly, it replaces the MinPts parameter with the commentator suspicion score and the Eps parameter with the similarity between two commentators. This not only solves the difficulty of parameter tuning in the traditional DBSCAN algorithm; experiments also show that the S-DBSCAN clustering algorithm obtains a better clustering effect.
Brief description of the drawings
Fig. 1 is a flow chart of the present invention.
Specific embodiments
The present invention is described in further detail below with reference to specific embodiments and the drawing. Except for what is specifically mentioned below, the procedures, conditions, and experimental methods for implementing the invention are common general knowledge in the art, and the invention places no special restrictions on them.
The present invention comprises the following steps:
Step 1: crawl a data set from the application market and filter it by the number of comments per commentator to obtain the final set of commentators needed for the experiment.
Step 2: the S-DBSCAN algorithm first selects any core point in the data set as the initial set and then, starting from it, determines the corresponding cluster.
Step 3: according to the given initial parameters, namely the commentator suspicion score threshold η and the inter-commentator similarity threshold ε, find all core points.
Step 4: starting from any core point, find the samples that are density-reachable from it and generate a cluster, until all core points have been visited.
Step 5: output the cluster partition result, which includes the number of clusters and the details of each data point in each cluster.
The S-DBSCAN algorithm belongs to, but is not limited to, density-based clustering algorithms (density-based methods). It converts the MinPts and Eps parameters of the DBSCAN algorithm into quantities tied to the similarity between commentators in the application market.
The S-DBSCAN algorithm uses the commentator suspicion score threshold η in place of the MinPts parameter and the inter-commentator similarity threshold ε in place of the Eps parameter.
The present invention replaces the MinPts parameter with the commentator suspicion score RSS. Specifically, the MinPts parameter is converted as follows. The commentator suspicion score is composed of three component scores: the commentator score, the comment suspicion score, and the application suspicion score. The conversion formula is as follows:
where i denotes a commentator, j a comment, and k an application; RSS denotes the commentator suspicion score; SR_i denotes the commentator score; n_i denotes the total number of comments by commentator i; SS_j denotes the comment suspicion score; c_ij denotes the j-th comment of commentator i; m_k denotes the total number of applications commented on by commentator i; SA_k denotes the application suspicion score; and t_ik denotes the k-th application commented on by commentator i.
The present invention replaces the Eps parameter with the similarity SC(x,y) between two commentators x and y. Specifically, the Eps parameter is converted as follows. The conversion formula is as follows:
where Scar(x,y) denotes the application similarity, Scrr(x,y) the comment similarity, and Scard(x,y) the rating similarity.
In the S-DBSCAN algorithm, the determination of a core point no longer depends on the MinPts count; it is determined by the commentator suspicion score of the current data point.
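The clustering loop described above can be sketched in Python as follows. This is only an illustrative sketch, not the patented implementation: the function name `s_dbscan`, the use of `>=` for the core-point test against η, the strict `>` test against ε, and the handling of border points are all assumptions, since the text only states that the suspicion score determines core points and the similarity determines the neighborhood.

```python
from collections import deque
from typing import Callable, Dict, Hashable, List, Set


def s_dbscan(
    points: List[Hashable],
    suspicion: Dict[Hashable, float],
    similarity: Callable[[Hashable, Hashable], float],
    eta: float,
    eps: float,
) -> List[Set[Hashable]]:
    """Group points into clusters grown from high-suspicion core points."""
    # Assumption: a point is a core point when its suspicion score reaches eta.
    core = {p for p in points if suspicion[p] >= eta}

    def neighbours(p: Hashable) -> List[Hashable]:
        # Similarity-based neighbourhood: points whose similarity to p exceeds eps.
        return [q for q in points if q != p and similarity(p, q) > eps]

    visited: Set[Hashable] = set()
    clusters: List[Set[Hashable]] = []
    for seed in points:
        if seed not in core or seed in visited:
            continue
        cluster = {seed}
        visited.add(seed)
        queue = deque([seed])
        while queue:
            p = queue.popleft()
            for q in neighbours(p):
                if q in visited:
                    continue
                visited.add(q)
                cluster.add(q)
                if q in core:
                    # Only core points keep expanding the cluster.
                    queue.append(q)
        clusters.append(cluster)
    return clusters
```

Under these assumptions, a core point with no sufficiently similar neighbour forms a cluster of its own; whether such singletons should count as collusion groups is not stated in the text.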
Embodiment
The clustering-based application market brush list collusion group detection method, based on the improved S-DBSCAN algorithm, is described in detail below:
Step 1: crawl a data set from the application market, totalling 19,666,225 comments and 12,315,366 commentators. The data are filtered by the number of comments per commentator: commentators with more than 50 comments are selected as the final experimental data set. This yields a commentator set of 8,853, a comment set of 818,545, and an application set of 2,188.
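The filtering in step 1 can be sketched as below. The record layout (a list of dictionaries with a `"commentator"` key) and the function name are hypothetical; only the "more than 50 comments" rule comes from the text.

```python
from collections import Counter


def filter_active_commentators(comments, min_comments=50):
    """Keep only comments written by commentators with more than min_comments comments."""
    counts = Counter(c["commentator"] for c in comments)
    keep = {name for name, n in counts.items() if n > min_comments}
    return [c for c in comments if c["commentator"] in keep]
```

The strict comparison mirrors the wording "more than 50"; the remaining comments and the applications they refer to would then form the comment set and application set.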
Step 2: initialize the core point set.
Step 3: set the commentator suspicion score threshold η and the inter-commentator similarity threshold ε, and find all core points. η takes the values 0.7, 0.8, and 0.9, and ε takes the values 0.7, 0.8, and 0.9; their combinations give 9 experimental results in total.
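The nine threshold combinations can be enumerated with a small helper such as the following sketch (the names `ETA_VALUES`, `EPS_VALUES`, and `parameter_grid` are illustrative, not from the source):

```python
from itertools import product

ETA_VALUES = (0.7, 0.8, 0.9)  # commentator suspicion score thresholds
EPS_VALUES = (0.7, 0.8, 0.9)  # inter-commentator similarity thresholds


def parameter_grid(etas=ETA_VALUES, epsilons=EPS_VALUES):
    """Return every (eta, eps) pair to be evaluated."""
    return list(product(etas, epsilons))
```

Each pair would be passed to one clustering run, so the three values of each threshold yield the 9 experiments mentioned above.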
Step 4: starting from any core point, find the samples that are density-reachable from it and generate a cluster, until all core points have been visited.
Step 5: output the cluster partition result. The number of clusters equals the number of brush list collusion groups. The experimental results show that as the commentator suspicion score threshold and the inter-commentator similarity threshold increase, fewer brush list collusion groups are found: when η and ε are both 0.7 the result is 177, indicating 177 brush list collusion groups; when η and ε are both 0.9 the result is 24, indicating 24 brush list collusion groups.
Step 6: the present invention clusters with the improved S-DBSCAN algorithm and evaluates the clustering quality with the silhouette coefficient. The silhouette coefficient can be understood simply as how similar a node is to its own cluster compared with other clusters; its value lies in [-1, 1], and a larger value indicates a better clustering effect. In the experiment the silhouette coefficient increases gradually with ε, from an initial 0.283 to 0.763. By contrast, clustering with the traditional DBSCAN algorithm under different MinPts and Eps parameters reaches a silhouette coefficient of at most 0.453, a clustering effect clearly below that of the S-DBSCAN algorithm.
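For reference, the standard silhouette coefficient can be computed as in the sketch below. The text does not say which distance was used in the experiment; one plausible choice, assumed here, would be 1 − SC(x,y). The function name and the convention of scoring singleton clusters as 0 are also assumptions.

```python
def silhouette_coefficient(clusters, distance):
    """Mean silhouette over all points; clusters is a list of lists of unique ids."""
    if len(clusters) < 2:
        raise ValueError("the silhouette coefficient needs at least two clusters")
    scores = []
    for idx, cluster in enumerate(clusters):
        for p in cluster:
            if len(cluster) == 1:
                scores.append(0.0)  # common convention for singleton clusters
                continue
            # a: mean distance to the other members of the same cluster
            a = sum(distance(p, q) for q in cluster if q != p) / (len(cluster) - 1)
            # b: smallest mean distance to the members of any other cluster
            b = min(
                sum(distance(p, q) for q in other) / len(other)
                for j, other in enumerate(clusters)
                if j != idx
            )
            scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)
```

With a similarity function `sc`, a call might look like `silhouette_coefficient(groups, lambda x, y: 1 - sc(x, y))`.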
Specifically, the MinPts parameter is converted as follows. The present invention replaces the MinPts parameter with the commentator suspicion score RSS. The calculation steps are as follows:
Step A1: determine the commentator score SR_i
The difference between the rating a commentator gives an application and that application's average rating is called the rating deviation. The ratio of a commentator's positive comments to all of the commentator's comments is called the commentator's positive comment ratio. The number of non-duplicate comments posted by commentator i is denoted nd_i. The comment burst frequency of commentator i is a further indicator. Whether commentator i has posted duplicate comments is called duplicate posting and is denoted dup_i: dup_i is set to 1 if commentator i has posted duplicate comments, and to 0 otherwise.
Finally, all of the above indicators are summed to obtain the suspicion score SR_i of commentator i.
Step A2: determine the comment suspicion score SS_j
A comment that is a near-duplicate of another comment is called a suspicious comment. The comment suspicion score SS_j is calculated with cosine similarity: $SS_j = \max_{i \neq j} \mathrm{Cosine}'(review_j, review_i)$, subject to there being an application k such that (j, k), (i, k) ∈ E, where E is the set of edges in the comment graph and Cosine' is a linearly scaled form of the Cosine function that restricts its value to [0, 1];
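A minimal sketch of this score is shown below, assuming a bag-of-words representation split on whitespace (the source does not specify how comments are vectorized). With non-negative term counts the cosine already lies in [0, 1], so no additional linear scaling is applied here. The function names and the `(app_id, text)` input layout are hypothetical.

```python
import math
from collections import Counter


def cosine(text_a, text_b):
    """Cosine similarity of two texts using whitespace-token counts."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def comment_suspicion(comments):
    """comments: list of (app_id, text). Returns one SS_j value per comment."""
    scores = []
    for j, (app_j, text_j) in enumerate(comments):
        best = 0.0
        for i, (app_i, text_i) in enumerate(comments):
            # Only compare with other comments posted on the same application.
            if i != j and app_i == app_j:
                best = max(best, cosine(text_j, text_i))
        scores.append(best)
    return scores
```

A comment with no other comment on the same application receives 0 in this sketch, which is a design choice rather than something the text specifies.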
Step A3: determine the application suspicion score SA_k
A suspicious application receiving very high ratings within a period of time is called an explosive high rating, defined as: where nd_k is the number of non-duplicate comments on application k and n_k is the total number of comments on application k. The shortest high-rating time interval is defined as sap_k and is calculated as follows:
The longest high-rating time interval is defined as lap_k and is calculated as follows:
where d is the time interval of each comment and np_k is the number of all positive comments of commentator k.
This gives the formula for the suspicion score SA_k of application k:
Step A4: determine the commentator suspicion score RSS
The commentator suspicion score is determined by the commentator score SR_i, the comment suspicion score SS_j, and the application suspicion score SA_k. The calculation formula is as follows:
Specifically, the Eps parameter is converted as follows. The present invention replaces the Eps parameter with the similarity SC(x,y) between two commentators x and y. The calculation steps are as follows:
Step B1: calculate the application similarity Scar(x,y)
The similarity between the sets of applications that two commentators have commented on is measured using the Jaccard similarity. Let M_i denote the set of applications commented on by commentator i, whose n-th element is the n-th application commented on by commentator i. The similarity between the application sets of commentator x and commentator y can then be expressed as: $Scar_{(x,y)} = \frac{|M_x \cap M_y|}{|M_x \cup M_y|}$
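The Jaccard similarity of two application sets can be sketched as follows (the function name is illustrative, and returning 0 for two empty sets is an assumption):

```python
def app_similarity(apps_x, apps_y):
    """Jaccard similarity of the application sets of two commentators."""
    sx, sy = set(apps_x), set(apps_y)
    union = sx | sy
    return len(sx & sy) / len(union) if union else 0.0
```

For example, two commentators who commented on {a, b, c} and {b, c, d} share 2 of 4 distinct applications.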
Step B2: calculate the comment similarity Scrr(x,y)
The comment similarity is defined as the sum of the numbers of applications from the same developers that the two commentators commented on, divided by the sum of all comment counts of the two commentators. Let B be the set of developers of the applications a commentator has commented on; P_{b∈B,i} denotes the set of applications of developer b that commentator i has commented on, and P_i denotes all applications that commentator i has commented on. The comment similarity of commentator x and commentator y is then calculated as follows:
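Because the formula itself is not reproduced in this text, the sketch below is one possible reading of the verbal definition: count, for both commentators, the applications whose developer appears in both commentators' histories, and divide by the total number of applications each commented on. The function name and the `app id -> developer id` input layout are hypothetical.

```python
def comment_similarity(apps_x, apps_y):
    """apps_x / apps_y map app id -> developer id for one commentator each."""
    total = len(apps_x) + len(apps_y)
    if total == 0:
        return 0.0
    # Developers that both commentators have commented on.
    shared_devs = set(apps_x.values()) & set(apps_y.values())
    hits = sum(1 for dev in apps_x.values() if dev in shared_devs)
    hits += sum(1 for dev in apps_y.values() if dev in shared_devs)
    return hits / total
```

Under this reading the value lies in [0, 1] and equals 1 only when every application either commentator commented on comes from a shared developer.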
Step B3: calculate the rating similarity Scard(x,y)
The rating similarity of commentator x and commentator y is calculated using the root-mean-square deviation. Let C denote the set of applications that both commentator x and commentator y have commented on. The rating similarity of commentator x and commentator y is then calculated as follows:
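A sketch of a rating similarity based on the root-mean-square deviation is given below. How the deviation is mapped to a similarity is not shown in this text; the mapping 1 − RMSD / max_gap, with max_gap = 4 for a 1 to 5 star scale, is an assumption made purely for illustration, as are the function name and the `app id -> rating` input layout.

```python
import math


def rating_similarity(ratings_x, ratings_y, max_gap=4.0):
    """ratings_x / ratings_y map app id -> star rating for one commentator each."""
    common = ratings_x.keys() & ratings_y.keys()  # the set C of jointly rated apps
    if not common:
        return 0.0
    rmsd = math.sqrt(
        sum((ratings_x[a] - ratings_y[a]) ** 2 for a in common) / len(common)
    )
    # Assumed mapping: identical ratings give 1, maximally different ratings give 0.
    return 1.0 - rmsd / max_gap
```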
Step B4: calculate the similarity SC(x,y) of commentator x and commentator y. The calculation formula is as follows:

Claims (7)

1. A clustering-based application market brush list collusion group detection method, characterized in that the method comprises the following steps:
Step 1: crawl a data set from the application market and filter it by the number of comments per commentator to obtain the final set of commentators needed for the experiment; commentators whose number of comments exceeds a certain threshold are selected as the data set;
Step 2: first select any core point in the data set as the initial set;
Step 3: according to the initial parameters, namely the commentator suspicion score threshold η and the inter-commentator similarity threshold ε, find all core points in the data set;
Step 4: starting from any core point, find the samples that are density-reachable from it and generate a cluster, until all core points have been visited;
Step 5: output the cluster partition result, which includes the number of clusters and the details of each data point in each cluster.
2. The clustering-based application market brush list collusion group detection method according to claim 1, characterized in that in step 1 the data set is crawled from application markets including, but not limited to, the Apple application market.
3. The clustering-based application market brush list collusion group detection method according to claim 1, characterized in that in step 1 the data features in the crawled data set include, but are not limited to: commentator name, comment content, comment rating, commented application, number of commented applications, and comment word count.
4. The clustering-based application market brush list collusion group detection method according to claim 1, characterized in that in step 2 core points are determined by the commentator suspicion score.
5. The clustering-based application market brush list collusion group detection method according to claim 1, characterized in that the commentator suspicion score RSS is composed of three component scores: the commentator score, the comment suspicion score, and the application suspicion score; the calculation formula is as follows:
where i denotes a commentator, j a comment, and k an application; RSS denotes the commentator suspicion score; SR_i denotes the commentator score; n_i denotes the total number of comments by commentator i; SS_j denotes the comment suspicion score; c_ij denotes the j-th comment of commentator i; m_k denotes the total number of applications commented on by commentator i; SA_k denotes the application suspicion score; and t_ik denotes the k-th application commented on by commentator i.
6. The clustering-based application market brush list collusion group detection method according to claim 1, characterized in that the similarity between commentators in step 3 is the similarity SC(x,y) between two commentators x and y; the calculation formula is as follows:
where Scar(x,y) denotes the application similarity, Scrr(x,y) the comment similarity, and Scard(x,y) the rating similarity.
7. The clustering-based application market brush list collusion group detection method according to claim 1, characterized in that in step 4 density reachability requires defining the ε-neighborhood: for x_j ∈ D, the ε-neighborhood contains the objects in data set D whose similarity to x_j is greater than the threshold ε, i.e. $N_\varepsilon(x_j) = \{x_i \in D \mid SC_{(x_i,x_j)} > \varepsilon\}$.
CN201910090202.2A 2019-01-30 2019-01-30 Clustering-based application market brush list collusion group detection method Active CN109919191B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910090202.2A CN109919191B (en) 2019-01-30 2019-01-30 Clustering-based application market brush list collusion group detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910090202.2A CN109919191B (en) 2019-01-30 2019-01-30 Clustering-based application market brush list collusion group detection method

Publications (2)

Publication Number Publication Date
CN109919191A true CN109919191A (en) 2019-06-21
CN109919191B CN109919191B (en) 2023-05-02

Family

ID=66961032

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910090202.2A Active CN109919191B (en) 2019-01-30 2019-01-30 Clustering-based application market brush list collusion group detection method

Country Status (1)

Country Link
CN (1) CN109919191B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8561184B1 (en) * 2010-02-04 2013-10-15 Adometry, Inc. System, method and computer program product for comprehensive collusion detection and network traffic quality prediction
CN106294105A (en) * 2015-05-22 2017-01-04 深圳市腾讯计算机系统有限公司 Brush amount tool detection method and apparatus
CN106682058A (en) * 2016-08-08 2017-05-17 腾讯科技(深圳)有限公司 Screening method, device and system of application programs
CN107239694A (en) * 2017-05-27 2017-10-10 武汉大学 A kind of Android application permissions inference method and device based on user comment
CN107391548A (en) * 2017-04-06 2017-11-24 华东师范大学 A kind of Mobile solution market brush list user's group detection method and its system
CN107808093A (en) * 2016-09-09 2018-03-16 长沙有干货网络技术有限公司 A kind of Android malware family clustering method of Behavior-based control

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8561184B1 (en) * 2010-02-04 2013-10-15 Adometry, Inc. System, method and computer program product for comprehensive collusion detection and network traffic quality prediction
CN106294105A (en) * 2015-05-22 2017-01-04 深圳市腾讯计算机系统有限公司 Brush amount tool detection method and apparatus
CN106682058A (en) * 2016-08-08 2017-05-17 腾讯科技(深圳)有限公司 Screening method, device and system of application programs
CN107808093A (en) * 2016-09-09 2018-03-16 长沙有干货网络技术有限公司 A kind of Android malware family clustering method of Behavior-based control
CN107391548A (en) * 2017-04-06 2017-11-24 华东师范大学 A kind of Mobile solution market brush list user's group detection method and its system
CN107239694A (en) * 2017-05-27 2017-10-10 武汉大学 A kind of Android application permissions inference method and device based on user comment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HAO CHEN ETAL.: ""Toward Detecting Collusive Ranking Manipulation Attackers in Mobile App Markets"", 《ACM》 *
LORENZO VILLARROEL ETAL.: ""Release Planning of Mobile Apps Based on User Reviews"", 《IEEE》 *

Also Published As

Publication number Publication date
CN109919191B (en) 2023-05-02

Similar Documents

Publication Publication Date Title
CN105373597B (en) The user collaborative filtered recommendation method merging based on k medoids item cluster and partial interest
CN104063801B (en) A kind of moving advertising recommend method based on cluster
WO2018014610A1 (en) C4.5 decision tree algorithm-based specific user mining system and method therefor
CN109299380B (en) Exercise personalized recommendation method based on multi-dimensional features in online education platform
CN107451619A (en) A kind of small target detecting method that confrontation network is generated based on perception
CN108764366A (en) Feature selecting and cluster for lack of balance data integrate two sorting techniques
CN109190033A (en) A kind of user's friend recommendation method and system
CN109902235A (en) User preference based on bat optimization clusters Collaborative Filtering Recommendation Algorithm
CN107368856A (en) Clustering method and device, the computer installation and readable storage medium storing program for executing of Malware
CN109903086A (en) A kind of similar crowd's extended method, device and electronic equipment
CN102902981A (en) Violent video detection method based on slow characteristic analysis
CN106789338B (en) Method for discovering key people in dynamic large-scale social network
CN108132964A (en) A kind of collaborative filtering method to be scored based on user item class
CN106803039A (en) The homologous decision method and device of a kind of malicious file
CN110275910A (en) A kind of oversampler method of unbalanced dataset
CN108305181A (en) The determination of social influence power, information distribution method and device, equipment and storage medium
CN113037410A (en) Channel identification method, device, transmission method, transmission equipment, base station and medium
CN105678047A (en) Wind field characterization method with empirical mode decomposition noise reduction and complex network analysis combined
CN109525577A (en) Malware detection method based on HTTP behavior figure
CN109871770A (en) Property ownership certificate recognition methods, device, equipment and storage medium
CN111382278A (en) Social network construction method and system based on space-time trajectory
CN106780258A (en) A kind of method for building up and device of minor crime decision tree
CN105574183A (en) App (application) recommendation method based on collaborative filtering recommendation algorithm-KNN (K-nearest neighbor) classification algorithm
CN107566389A (en) A kind of imitation URL link fishing domain name recognition methods based on C4.5 decision trees
Shao et al. Community detection via local dynamic interaction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant